briCoding

The ticket looked fine until I started building

I've been on the receiving end of shipped tickets with gaps no one catches until I'm elbow-deep in implementation. I built a skill to find them first.

For five years, across every team I’ve worked on, I’ve watched the same nightmare materialize: a ticket lands in the sprint, we refine it as a team, and all seems well. Then, within a few hours of picking it up, I’m blocked on something the ticket never mentioned and I never clocked.

This isn’t a competence problem. The tickets are fine. Refinement happened and felt complete. But planning mode and implementation mode ask different questions, and only one of them has a developer in the room. By the time someone picks up the ticket, the gaps are invisible to everyone who already signed off.


The gaps

After enough of these, you start to recognize the categories. These are the ones I keep hitting:

  • Edge cases. The ticket covers the happy path. Error states, empty states, concurrent actions: none of it is there.
  • Missing designs. Implementation starts, then you realize the spec stops before the component you’re building. There’s a mockup for the main view but nothing for the confirmation modal or the error state, or maybe a design for a certain configuration is missing.
  • Copy. The button on the design has the copy “Confirm donation {donationAmount}”, but nobody asked what happens at $1,000,000.99, what happens when it gets localized into Dutch (we always joke that Dutch is extra wordy, which makes it a good stress test for copy), what the empty state message is, or what the error toast reads. It becomes the dev’s problem.
  • API schema. The shape was assumed, not confirmed. You build to a contract that doesn’t exist yet, then find out the endpoint returns something different.
  • Acceptance criteria. No definition of done. You ship, PM says “that’s not what I meant.”

None of these are hard to fix once someone points them out. The problem is that nobody points them out until you’re mid-build and the sprint clock is running.


Why “just write better tickets” doesn’t work

The instinct is to push quality upstream. Better templates, more required fields, stricter definitions of ready.

It helps at the margins. It doesn’t solve the core problem.

The person writing the ticket hasn’t tried to build the thing yet. They’re thinking about what the feature does, not what a developer will need to know to implement it. They can’t see the gaps because the gaps only become visible from an implementation perspective. And by definition, that perspective doesn’t exist until someone starts the work.

This is why the same teams that have detailed ticket templates still produce tickets with missing information. The template ensures the right sections exist. It can’t ensure the right details are in them.

You need something that can read the ticket from the perspective of someone about to implement it, before that person actually starts.


The skill

I built an agent skill that does exactly this, and it’s on GitHub. It takes a ticket (title, description, acceptance criteria, whatever’s included) and runs a preflight check from the perspective of a senior engineer about to pick it up. The output is a structured list of gaps, each with a category and severity.

It’s not a linter. It’s not checking whether required fields are filled in. It’s reading the ticket the way a developer would and asking: what’s going to block me?
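To make "a structured list of gaps, each with a category and severity" concrete, here is a minimal sketch of what such a report could look like as data. The `Gap` and `Severity` names, the three severity levels, and the sample gaps are my assumptions for illustration, not the skill's actual output format.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical levels; the skill's real scale may differ.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Gap:
    category: str      # e.g. "copy", "API schema", "edge case"
    severity: Severity
    description: str   # specific and named, not "add more detail"


def by_severity(gaps: list[Gap]) -> list[Gap]:
    """Return gaps ordered most severe first, keeping every gap."""
    return sorted(gaps, key=lambda g: g.severity.value, reverse=True)


# Made-up report for illustration only.
report = [
    Gap("copy", Severity.MEDIUM, "No error toast copy specified"),
    Gap("API schema", Severity.HIGH, "Response shape is assumed, not confirmed"),
    Gap("edge case", Severity.LOW, "Empty state is not described"),
]

for gap in by_severity(report):
    print(f"[{gap.severity.name}] {gap.category}: {gap.description}")
```

The sketch only reorders the list. It deliberately drops nothing, so low-severity gaps stay visible.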

What this looks like in practice depends on the team:

  • Manual. A dev invokes the skill on a ticket before picking it up. Takes thirty seconds. If gaps come back, they raise them before starting work instead of discovering them mid-sprint.
  • Automated hook. The skill runs at issue creation time and posts a comment with flagged gaps. The author sees them immediately and can resolve them before the ticket even hits the backlog.
  • Handoff ritual. PM and dev run it together during sprint planning or refinement as a review step. Gaps surface in a conversation where both sides can resolve them on the spot.

No prescriptive “do it this way.” Each setup fits a different workflow. The point is the same: gaps surface before implementation starts, not after.
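For the automated hook setup, the part that doesn't depend on your tracker is turning the flagged gaps into a comment body. Here is a rough sketch, assuming the gaps arrive as dictionaries with `category`, `severity`, and `description` keys. That shape and the `render_comment` name are hypothetical, and posting the comment is left to whatever integration your tracker provides.

```python
def render_comment(gaps: list[dict[str, str]]) -> str:
    """Build a plain-text comment body listing flagged gaps for the ticket author."""
    if not gaps:
        return "Preflight check: no gaps flagged."
    lines = [f"Preflight check: {len(gaps)} gap(s) flagged before implementation.", ""]
    for gap in gaps:
        # One line per gap so the author can answer each one individually.
        lines.append(f"- {gap['category']} ({gap['severity']}): {gap['description']}")
    return "\n".join(lines)


# Made-up example input, assumed to come from a run of the skill.
example = [
    {"category": "copy", "severity": "medium",
     "description": "No error toast copy specified"},
    {"category": "acceptance criteria", "severity": "high",
     "description": "No definition of done"},
]
print(render_comment(example))
```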


What I learned testing it

I didn’t want to ship this on vibes. If the skill is going to tell developers their tickets have problems, it needs to be right. Specific gaps that are actually there, not generic checklist noise.

So I built an evaluation dataset: real GitHub issues, labeled by quality, with annotated ground truth for what gaps should and shouldn’t be identified. Then I ran the skill against them and scored the output.
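The scoring step can be sketched as plain set arithmetic. This assumes each gap the skill reports has already been matched to a ground-truth label, which is a separate judgment step not shown here. The label names and the `score` function are made up for illustration and are not the repo's actual evaluation code.

```python
def score(found: set[str], expected: set[str], should_not_flag: set[str]) -> dict[str, float]:
    """Score one run against annotated ground truth for a single issue."""
    hits = found & expected
    # Convention assumed here: nothing expected, or nothing found, scores 1.0.
    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(found) if found else 1.0
    return {
        "recall": recall,
        "precision": precision,
        "forbidden_hits": float(len(found & should_not_flag)),
    }


# Hypothetical labels for one issue.
expected = {"error-copy", "api-shape", "empty-state", "done-criteria"}
should_not_flag = {"naming-nit"}
found = {"error-copy", "api-shape", "empty-state", "naming-nit"}

result = score(found, expected, should_not_flag)
print(result)  # recall 0.75, precision 0.75, forbidden_hits 1.0
```

Under this scoring, any reported gap outside the annotated set counts against precision, even when it is a useful observation the annotator did not anticipate.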

A few findings worth sharing:

The skill finds real gaps. On issues labeled as bad or borderline, the skill achieved 0.95 recall across 12 real GitHub issues. It caught nearly every gap in the ground truth. More importantly, the gaps it surfaced were specific and named. Not “consider adding more detail” but “no error message specified for the failed upload state” or “PostgREST returns 200 even when zero rows match the filter. This is by design, not a bug.”

It generalizes across ticket types without prompt changes. The initial dataset was all bug reports. I expanded to feature requests, dependency-blocked tickets, API contract issues, and even a Rust compiler tracking issue with 50 linked sub-issues. The skill handled all of them. On the tracking issue, it correctly said “this is not an implementation ticket” and assigned Low risk instead of trying to find gaps that weren’t there.

The skill caught things the annotator missed. When evaluating a VS Code tooltip feature request, the skill pointed out that VS Code already has a Hover Provider API that covers most of what the reporter was asking for. When evaluating a Supabase “update returns 200 but nothing updates” bug, it flagged that a SELECT RLS policy could be hiding the updated row, something I hadn’t annotated as a ground-truth gap but that is genuinely the kind of thing that would block a developer. This happened consistently. The skill surfaced domain-specific insights that went beyond what a human annotator anticipated.

Filtering down to blockers made things worse. I tested a second prompt version that added a constraint: only surface a gap if a developer literally cannot begin or complete implementation without resolving it. Recall dropped from 0.92 to 0.68. The constraint made the skill skip concrete, obvious gaps while doubling down on deeper semantic analysis. The simpler prompt was better.

The dataset exposed a framing problem. The “good” issues, the ones I expected to come back clean, still generated gaps in every run. At first that looked like a precision failure. But when reviewing the gaps, they were defensible. The issues were well-written bug reports for a triager. They were not ready-to-implement tickets for a developer. A triager needs to know if the bug is real and reproducible. A developer needs to know what exactly to build, what done looks like, and what might block them mid-work. Those are different questions, and no prompt tweak resolves a dataset that conflates them.


What’s next

The skill works. The evaluation framework works. The dataset covers five genres across twelve issues and six repos. The skill is published on GitHub. Drop the SKILL.md into your agent’s skills directory and try it on your own tickets. The repo includes a benchmark dataset of 12 real GitHub issues if you want to test it yourself.

The remaining rough edge is precision. The skill generates 5-7 gaps per issue, and about 2 are typically lower-priority observations rather than true blockers. Many of those “false positives” are genuinely useful, but they dilute the signal. A ranking instruction that surfaces only the highest-impact gaps would likely close this.

Risk calibration is mostly there. 10 out of 12 issues got the right risk level, including correctly scoring a tracking issue as Low. The remaining miss is that a single “risk level” conflates two signals: “this ticket will block a developer” and “this feature is inherently hard to build.” Separating those into readiness (can work begin?) and implementation risk (how hard is it once gaps are resolved?) would make the output sharper.
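Here is a small sketch of what splitting the two signals could look like. The `Assessment` type, its field names, and the three-level scale are my assumptions about a possible future output shape, not something the skill produces today.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass(frozen=True)
class Assessment:
    ready: bool                 # readiness: can work begin?
    implementation_risk: Level  # how hard is it once gaps are resolved?


def summarize(a: Assessment) -> str:
    """Report both signals side by side instead of one blended risk level."""
    state = "ready to start" if a.ready else "blocked on open gaps"
    return f"{state}; implementation risk {a.implementation_risk.value}"


# A well-specified but hard ticket, and a vague but easy one.
print(summarize(Assessment(ready=True, implementation_risk=Level.HIGH)))
print(summarize(Assessment(ready=False, implementation_risk=Level.LOW)))
```

The two example tickets would plausibly land on the same single risk level today, even though they call for different actions: one needs time, the other needs answers.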

After five years of watching the same pattern repeat, having something that catches it before a developer starts coding feels like the right tool in the right place.