When agents commit
to open source.
Five major coding agents, 110,000 pull requests, real open-source repositories. Researchers at TU Delft and UC Davis followed the trail from PR creation to code survival, and found that the story doesn't end at merge.
Merge rate is a starting
point, not an answer.
AI coding agents ship pull requests to real codebases. The question isn't whether they can. The question is what happens to that code over time, and whether anyone had measured it.
Vendors selling coding agents tend to cite benchmark performance or curated case studies. Neither tells you much about what happens when an agent opens a PR on a public open-source project it has never encountered, the maintainers decide whether to merge it, and then months pass. Does the code stick? Does it get rewritten? Does it attract more review comments than a human contributor would generate?
Prior research had studied AI-generated code in controlled settings, or looked at small samples of agentic pull requests to characterize their style and structure. What was missing was a longitudinal picture: a large enough dataset to measure not just acceptance but survival. Merge rate is the easiest thing to observe. It is not the same as code quality, and it is not the same as long-term contribution value.
The researchers also wanted to compare agents to each other. There are now multiple production-grade coding agents with meaningfully different architectures and interaction models. They might differ not just in benchmark scores but in how open-source communities actually receive and absorb their contributions.
What do autonomous coding agent contributions look like in real open-source development, across multiple agents, at scale? How do they compare on merge rate, speed, interaction patterns, and the longer-term fate of the code they write?
110,000 pull requests
across five agents.
The team scraped, labeled, and analyzed a large dataset of open-source pull requests from GitHub, identifying contributions from each of five major coding agents and comparing them to human-authored PRs across a range of behavioral signals.
The dataset included pull requests with their associated metadata: commits, inline comments, code review threads, linked issues, and the files changed. That last piece is what made longitudinal analysis possible. By tracking individual files and lines over time, the researchers could measure survival and churn rates for agent-generated versus human-authored code.
The five agents span a meaningful range. OpenAI Codex and GitHub Copilot come from closely linked organizations (GitHub is owned by Microsoft, a major OpenAI investor) but target different workflows. Claude Code is a terminal-native agent. Google Jules is integrated into the Gemini ecosystem. Devin is a fully autonomous agent marketed explicitly for end-to-end software engineering tasks. Their integration styles differ enough that variation in their real-world contribution patterns is not surprising. Establishing the specific shape of that variation is what required the data.
The researchers examined four dimensions for each agent: merge frequency relative to humans and each other; how quickly PRs were merged; which file types each agent touched; and how much developer commentary (comments and reviews) the PRs attracted. Then they layered in the longitudinal analysis: survival of code over time versus churn.
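Three of these dimensions (merge frequency, time to merge, and developer commentary) reduce to simple per-group aggregations over PR records. The sketch below illustrates the idea only; it is not the researchers' pipeline. The `PullRequest` fields, the `summarize` function, and the group labels are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class PullRequest:
    author_kind: str               # hypothetical label, e.g. "human" or an agent name
    created_at: datetime
    merged_at: Optional[datetime]  # None if the PR was never merged
    review_comments: int           # comments plus reviews left on the PR


def summarize(prs):
    """Per-group merge rate, median hours to merge, and mean comments per PR."""
    groups = defaultdict(list)
    for pr in prs:
        groups[pr.author_kind].append(pr)

    summary = {}
    for kind, items in groups.items():
        merged = [pr for pr in items if pr.merged_at is not None]
        hours = [
            (pr.merged_at - pr.created_at).total_seconds() / 3600
            for pr in merged
        ]
        summary[kind] = {
            "merge_rate": len(merged) / len(items),
            # Median is taken over merged PRs only; unmerged PRs have no merge time.
            "median_hours_to_merge": median(hours) if hours else None,
            "mean_comments": sum(pr.review_comments for pr in items) / len(items),
        }
    return summary
```

Taking the median rather than the mean keeps a handful of PRs that sat open for months from dominating the time-to-merge figure.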
Code churn occurs when committed code is deleted, replaced, or substantially rewritten within a subsequent period. High churn on merged code means the code was accepted but did not endure as written. It is distinct from a rejected PR: the contribution made it in, but it was short-lived. The researchers produced longitudinal survival and churn rate estimates for both agent-generated and human-authored code.
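One way to make that definition concrete is to treat each merged line as a record with a merge date and, if it was later deleted or rewritten, a removal date. The sketch below is a minimal illustration under that assumption, not the paper's method. The `LineRecord` type, the `churn_rate` function, and the fixed observation window are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineRecord:
    author_kind: str            # hypothetical label, e.g. "human" or an agent name
    merged_day: int             # day the line entered the main branch
    removed_day: Optional[int]  # day it was deleted or rewritten; None if still present


def churn_rate(lines, window_days, observed_until_day):
    """Fraction of merged lines deleted or rewritten within window_days of merge.

    Lines merged too recently to have a full window of history are skipped,
    so that young code is not counted as having survived.
    """
    eligible = [
        ln for ln in lines
        if ln.merged_day + window_days <= observed_until_day
    ]
    if not eligible:
        return None
    churned = sum(
        1 for ln in eligible
        if ln.removed_day is not None
        and ln.removed_day - ln.merged_day <= window_days
    )
    return churned / len(eligible)
```

The survival rate over the same window is one minus this value. Comparing groups only makes sense when the window and the observation cutoff are the same for each.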
Gets in fast.
Doesn't always stay.
Agent activity in open-source is growing. Some agents merge more reliably than humans. The median Codex PR merges in under a minute, against about 24 minutes for the median human PR. But code written by agents shows higher churn over time than code written by humans.
Across the dataset, agent contributions to open-source repositories are increasing over time. But the five agents do not behave like a homogeneous category. Claude Code and OpenAI Codex were merged at higher rates than human PRs in the sample. GitHub Copilot and Devin were merged at lower rates. Google Jules sat in the middle, with a merge rate that held roughly flat across repositories of different sizes but fell off at the highest star counts.
The variation is large enough that "how do agents perform?" is not a meaningful question. The agents are doing different things and being received differently by open-source communities.
The median human pull request in the dataset was merged in 0.4 hours, or 24 minutes. The median Codex pull request was merged in 0.5 minutes: roughly 50 times faster. Codex and Jules were the fastest-merging agents in the study. That speed almost certainly reflects the kinds of PRs these agents open rather than any special trust from maintainers: small, targeted, mechanical changes merge fast everywhere. But the gap is striking and points to real differences in how each agent scopes its work.
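The two medians are reported in different units, so the ratio is easy to misread. A quick check of the arithmetic, using only the two figures quoted above (the function name `speedup` is made up for this example):

```python
def speedup(human_median_hours, agent_median_minutes):
    """Ratio of the human median merge time to the agent median, in matching units."""
    human_median_minutes = human_median_hours * 60
    return human_median_minutes / agent_median_minutes


# 0.4 hours is 24 minutes; 24 / 0.5 is 48, which the article rounds to "roughly 50".
ratio = speedup(human_median_hours=0.4, agent_median_minutes=0.5)
print(f"{ratio:.0f}x")
```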
This is the finding that cuts against the merge-rate story. When the researchers tracked what happened to agent-authored code after it was merged, they found higher churn rates than for human-authored code over the same period. Agent-written code was more likely to be deleted, replaced, or substantially rewritten in subsequent commits.
The mechanism isn't pinpointed in the paper: it could be that agent code is more brittle, that it solves problems at a lower level of abstraction that subsequent contributors immediately improve, that it's used for low-stakes tasks where rewriting is easy, or some combination. What the data shows is the pattern, not the cause. But the pattern is consistent across all five agents.
The study measures observable behavior in public open-source repositories. It does not study proprietary codebases, where the dynamics may differ substantially. Identifying PRs as agent-authored depends on labeling strategies that may not be perfectly accurate. The churn finding is a pattern, not a causal analysis: the paper does not establish why agent code churns more, only that it does. Future work on the "why" would require deeper case analysis.
What to do with
this information.
Merge rates are what vendors publish. Churn rates are what engineering teams should measure. The two are telling different stories about the same code.
Where to go
from here.
If you want to go deeper on agent contributions to real codebases.