An agent opened this pull request. Nobody asked it to.

22 Jun

There is a version of the AI and engineering conversation that is pure hype, and a version that is pure caution. I am trying to live in neither. We use AI heavily at Voyfai, and over the last few months I have automated almost the entire path from a problem in production to a pull request that fixes it. A human still approves and merges at the end. Last week we merged our first fully autonomous pull requests: ones an agent opened on its own, from noticing the problem to writing the fix, with no human starting the work.

We have leaned on AI to help write code for a long time, like everyone has. The new part is not that an agent can write a fix. It is that nobody told it to. I also want to be honest about why a human still approves at the end, because it is not the reason people usually give.

First, what the loop actually does

It runs in five stages.

It starts in observability. An agent reads our production telemetry in Datadog and looks for what changed: regressions, slow paths, error patterns, the kind of thing you would want a sharp on-call engineer to catch. It is not just matching on the word 'error.' It is looking for the shape of a problem.

When it finds something worth acting on, it writes it up as a Jira ticket, with the finding, the evidence, and enough context that the next stage can act on it.
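As a minimal sketch of what such a ticket could carry, here is one possible shape in Python. The `Finding` class and its field names are hypothetical, chosen for illustration; this does not describe the actual ticket schema and does not call the Jira API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Finding:
    """Hypothetical shape of what the observability agent hands over."""
    summary: str          # the finding, in one line
    evidence: list[str]   # metrics, queries, or links that support it
    context: str          # what the next stage needs in order to act
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def to_ticket_body(self) -> str:
        """Render the finding as plain text for a ticket description."""
        lines = [self.summary, "", "Evidence:"]
        lines += [f"- {item}" for item in self.evidence]
        lines += ["", "Context:", self.context]
        return "\n".join(lines)

    def is_actionable(self) -> bool:
        """Without evidence or context the next agent has nothing to work on."""
        return bool(self.summary.strip() and self.evidence and self.context.strip())
```

The point of `is_actionable` is that a finding with no evidence attached should never become a ticket in the first place.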

A second agent picks up one of those tickets at a time, clones the repository fresh, reproduces the problem, and implements a fix in code. It opens a pull request the same way one of us would.

Then a third agent works through the review feedback on that pull request. Our setup already runs automated reviewers, Copilot and Codex, and humans leave comments too. The agent reads both, and for the comments that are clearly mechanical it commits the fix, one change at a time.

Then it stops, and a human approves and merges.

Here is the kind of problem it is built for. Latency on one of our endpoints starts creeping up after a deploy. Nothing pages yet, because it is still inside the limits. Anomaly detection can flag that kind of drift, but we want action, not just a flag. The first agent notices the drift and writes a ticket: which endpoint, when it started, how big the change is, and the deploy it lines up with. The second agent picks it up, reproduces it, finds an N+1 query that a recent change introduced, and opens a pull request that batches the call. Copilot flags a missing null check and the agent adds it. A human asks for a clearer variable name and the agent renames it. By the time an engineer opens that pull request, the diff is small, the description explains the cause, and the review comments are already handled. The only thing left is to decide whether the fix is right, and merge it. That whole path, from the first sign of drift to a reviewed pull request, happened without anyone being pulled away from what they were doing.
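The first step in that example, noticing drift that stays under the paging limit, can be sketched roughly like this. It is an illustration under stated assumptions, not a description of the agent itself: the function name, the 15% threshold, and the choice of medians are all hypothetical, and real samples would come from telemetry rather than lists passed in by hand.

```python
from statistics import median


def latency_drift(before_ms, after_ms, page_limit_ms, min_ratio=1.15):
    """Compare latency samples from either side of a deploy.

    Returns a dict describing the drift if the median rose by at least
    `min_ratio` while staying under the paging limit, otherwise None.
    """
    if not before_ms or not after_ms:
        return None
    base = median(before_ms)
    current = median(after_ms)
    if base <= 0:
        return None
    ratio = current / base
    # At or above the limit is the pager's job; this only flags the quiet creep below it.
    if current >= page_limit_ms or ratio < min_ratio:
        return None
    return {"baseline_ms": base, "current_ms": current, "ratio": round(ratio, 2)}
```

A check like this only answers "did something change." Deciding whether the change is worth a ticket, and which deploy it lines up with, is the part that needs more than a threshold.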

Now the honest part

That human at the end is not there because I believe a person must always have the final say. The story where the machine does the typing and the human keeps the noble judgment is becoming a comfortable one, and I buy it only up to a point. The human is there for two plainer reasons.

The first is that we are not ready to remove them yet. Trust in a system like this has to be earned with evidence, not granted because a demo looked good. A loop that opens pull requests against our production code can do real damage if it is confidently wrong, and confidently wrong is exactly what these systems are good at, the moment you stop watching them. So we watch, and we are not yet at the point where I would stop.

The second reason is the one I find more interesting. Right now, the human reviewer is also our measurement. Every approval and every rejection is a data point on the question I actually care about: is this loop solving real problems, and solving them well enough that a person would have shipped the same change? What I watch is simple. Does the human merge it as it is, or do they push more commits on top first? When they reject it, why: was the diagnosis wrong, the fix wrong, or just not how we would have done it? How often does the loop open a pull request that goes nowhere? Those numbers tell me whether we are building something that genuinely solves problems, or something that produces plausible-looking pull requests that quietly waste everyone's time. Until that data gets boring, until the human is approving almost everything almost unchanged, we do not really know the loop is good. The reviewer is partly there to tell us when they are no longer needed.
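Those questions reduce to a handful of counts. A rough sketch, assuming each pull request is recorded with an outcome label and, for rejections, a reason; the `ReviewOutcome` class and all label names are hypothetical, chosen for illustration:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewOutcome:
    # One of: "merged_as_is", "merged_with_changes", "rejected", "abandoned".
    outcome: str
    # For rejections only: "diagnosis_wrong", "fix_wrong", or "style".
    reason: Optional[str] = None


def summarize(outcomes):
    """Return the share of each outcome and a count of rejection reasons."""
    total = len(outcomes)
    if total == 0:
        return {"rates": {}, "rejection_reasons": {}}
    counts = Counter(o.outcome for o in outcomes)
    reasons = Counter(
        o.reason for o in outcomes if o.outcome == "rejected" and o.reason
    )
    rates = {name: count / total for name, count in counts.items()}
    return {"rates": rates, "rejection_reasons": dict(reasons)}
```

In these terms, "the data gets boring" would mean the `merged_as_is` rate sitting close to 1.0 for long enough that nobody is surprised by it.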

So I do not think of that final human step as the destination. I think of it as scaffolding. The direction we are moving is to automate review itself and take engineers out of the business of reviewing routine changes entirely. The loop already only touches a well-scoped class of problems, the ones with a clear signal and a clear fix. The ambiguous ones, where the right answer might be 'change the product' rather than 'change the code,' start with a human and will stay that way. What I want left on an engineer's plate is exactly that: the hard, ambiguous problems, and the architectural choices about where this system should go. Not reading a diff that adds a column and re-approving the same kind of change for the hundredth time.

This is not the same as 'AI replaces engineers.' It is closer to the opposite. The routine path, find a known class of problem, write the obvious fix, clear the mechanical comments, is exactly the part of the job that was never the point. Automating it does not shrink what an engineer does. It moves them up in the value chain, to the part that is genuinely hard and genuinely human. The best engineers I have worked with were never valuable because they could type a fix for an N+1 query. They were valuable because they could tell which problems were worth solving and how the system should be shaped. That is the work I want to protect, and routine review is not it.

It also helps to remember that none of this is as new as it sounds. Automating engineering work is something we have done for decades. Tests run themselves. End-to-end suites click through the product so a person does not have to. CI and CD take a merge and push it through a dozen steps to production without anyone watching. We never called that replacing engineers, and it never was. What is actually different now is not that the work is being automated. It is that the older automation could only walk a path you defined for it in advance, step by step, and it broke the moment reality stopped matching the script. An agent is handed a situation with many possible paths, which file holds the bug, which fix is right, which checks are worth running, and it can decide which one is best. That is the real change, and it is a difference of degree along a very old line, not a break from it.

The judgment moved into the automation. That is the whole story, and it is plenty.

I want to be equally clear that we are not going to rip the human out tomorrow because it sounds bold. We remove that step the same way we do everything else: in stages, on evidence, with the failure cases thought through first. The first changes we let through without a person will be the ones where being wrong costs almost nothing and is instantly reversible, and the bar moves outward from there, slowly, only when the numbers say it should. The loop already runs on a tight leash, with narrow eligibility, one item at a time, and no irreversible action.
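One way to picture that bar is as an explicit gate. The sketch below is hypothetical: the `Change` fields and the 0 to 10 cost scale are assumptions made for illustration. Its default lets nothing through without a person, which matches the current state, where a human approves every merge.

```python
from dataclasses import dataclass


@dataclass
class Change:
    in_eligible_class: bool  # inside the narrow, well-scoped class of problems
    reversible: bool         # can be rolled back immediately
    cost_if_wrong: int       # hypothetical scale: 0 (negligible) to 10 (severe)


def may_skip_human(change: Change, max_cost: int = -1) -> bool:
    """Decide whether a change may merge without a human reviewer.

    `max_cost` is the bar. At the default of -1 nothing qualifies; raising it
    is the deliberate, evidence-driven step of moving the bar outward.
    """
    return (
        change.in_eligible_class
        and change.reversible
        and change.cost_if_wrong <= max_cost
    )
```

Keeping the bar as a single explicit number makes each loosening a visible, reviewable decision rather than something that drifts.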

The human stays exactly until the evidence says we have earned the right to remove them, and not a day before.

The interesting question, then, is not 'where is the line that always needs a person.' It is 'how fast can we move that line, responsibly, toward the point where a person only shows up for the problems and the architecture.' That is a design problem and a discipline problem at once, and getting it right matters far more than the code any single agent writes.

There is a catch in all of this, and it is the subject of the next post. Once your machines open their own pull requests, and your engineers are already opening far more than before, you are producing more change than any human review process was built to handle. Which is the other reason taking the human out of routine review stopped being a someday idea and became a problem I had to solve now.

Marco Ziccardi
