Software Engineering Research · AI Coding Agents

When agents commit to open source.

Five major coding agents, 110,000 pull requests, real open-source repositories. Researchers at TU Delft and UC Davis followed the trail from PR creation to code survival, and found that the story doesn't end at merge.

Core finding
Agent-authored code gets merged into open-source projects, in some cases at higher rates than human PRs, but it churns at higher rates over time. Merge rate and code longevity measure different things.

Merge rate is a starting point, not an answer.

AI coding agents ship pull requests to real codebases. The question isn't whether they can. The question is what happens to that code over time, and whether anyone had measured it.

Vendors selling coding agents tend to cite benchmark performance or curated case studies. Neither tells you much about what happens when an agent opens a PR on a public open-source project it has never encountered, the maintainers decide whether to merge it, and then months pass. Does the code stick? Does it get rewritten? Does it attract more review comments than a human contributor would generate?

Prior research had studied AI-generated code in controlled settings, or looked at small samples of agentic pull requests to characterize their style and structure. What was missing was a longitudinal picture: a large enough dataset to measure not just acceptance but survival. Merge rate is the easiest thing to observe. It is not the same as code quality, and it is not the same as long-term contribution value.

The researchers also wanted to compare agents to each other. There are now multiple production-grade coding agents with meaningfully different architectures and interaction models. They might differ not just in benchmark scores but in how open-source communities actually receive and absorb their contributions.

The question this paper asks

What do autonomous coding agent contributions look like in real open-source development, across multiple agents, at scale? How do they compare on merge rate, speed, interaction patterns, and the longer-term fate of the code they write?

110,000 pull requests across five agents.

The team scraped, labeled, and analyzed a large dataset of open-source pull requests from GitHub, identifying contributions from each of five major coding agents and comparing them to human-authored PRs across a range of behavioral signals.

Pull requests studied: 110K (open-source PRs, with commits, reviews, comments, and file changes)
Coding agents compared: 5 (Codex, Claude Code, Copilot, Jules, Devin)
Research institutions: 2 (TU Delft & University of California Davis)

The dataset included pull requests with their associated metadata: commits, inline comments, code review threads, linked issues, and the files changed. That last piece is what made longitudinal analysis possible. By tracking individual files and lines over time, the researchers could measure survival and churn rates for agent-generated versus human-authored code.

The five agents span a meaningful range. OpenAI Codex and GitHub Copilot come from closely linked companies (GitHub is owned by Microsoft, a major OpenAI investor) but target different workflows. Claude Code is a terminal-native agent. Google Jules is integrated into the Gemini ecosystem. Devin is a fully autonomous agent marketed explicitly for end-to-end software engineering tasks. Their integration styles differ enough that variation in their real-world contribution patterns is not surprising. The specific shape of that variation is what required the data.

The researchers examined four dimensions for each agent: merge frequency relative to humans and each other; how quickly PRs were merged; which file types each agent touched; and how much developer commentary (comments and reviews) the PRs attracted. Then they layered in the longitudinal analysis: survival of code over time versus churn.
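As a rough illustration of two of these dimensions, the sketch below computes a merge rate and a median comment count per contributor type. The record fields (`source`, `merged`, `comments`) and the sample values are hypothetical, chosen only for illustration; they are not the paper's schema or its results.

```python
from collections import defaultdict
from statistics import median

# Hypothetical PR records; field names and values are made up for illustration.
SAMPLE_PRS = [
    {"source": "human", "merged": True, "comments": 4},
    {"source": "human", "merged": False, "comments": 2},
    {"source": "human", "merged": True, "comments": 6},
    {"source": "human", "merged": False, "comments": 0},
    {"source": "agent", "merged": True, "comments": 1},
    {"source": "agent", "merged": True, "comments": 0},
    {"source": "agent", "merged": False, "comments": 3},
    {"source": "agent", "merged": True, "comments": 2},
]


def summarize_by_source(prs):
    """Return {source: {"merge_rate": float, "median_comments": float}}."""
    groups = defaultdict(list)
    for pr in prs:
        groups[pr["source"]].append(pr)

    summary = {}
    for source, items in groups.items():
        merged = sum(1 for pr in items if pr["merged"])
        summary[source] = {
            # Share of this source's PRs that were merged.
            "merge_rate": merged / len(items),
            # Median number of comments and reviews per PR.
            "median_comments": median(pr["comments"] for pr in items),
        }
    return summary


if __name__ == "__main__":
    for source, stats in summarize_by_source(SAMPLE_PRS).items():
        print(source, stats)
```

A real analysis would also need to control for repository size and PR scope, since a raw merge rate mixes very different kinds of change.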

What "churn" means here

Code churn is when code that was committed gets deleted, replaced, or substantially rewritten within a subsequent period. High churn on merged code means the code was accepted but did not endure as written. It is distinct from a rejected PR: the contribution made it in, but it was short-lived. The researchers produced longitudinal survival and churn rate estimates for both agent-generated and human-authored code.
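One way to make that definition concrete is sketched below. It assumes each added line is recorded with the date it was introduced and, if applicable, the date it was deleted or rewritten. The 30-day window, the record layout, and the function name are assumptions for illustration, not the paper's exact method.

```python
from datetime import date, timedelta


def churn_rate(lines, window_days):
    """Fraction of added lines removed or rewritten within `window_days`.

    `lines` is a list of (added_on, removed_on) date pairs, where
    removed_on is None for a line that is still present.
    """
    if not lines:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1
        for added_on, removed_on in lines
        if removed_on is not None and removed_on - added_on <= window
    )
    return churned / len(lines)


if __name__ == "__main__":
    # Hypothetical line histories, all added on the same day.
    start = date(2025, 1, 1)
    history = [
        (start, date(2025, 1, 10)),  # rewritten after 9 days: churned
        (start, date(2025, 3, 1)),   # rewritten after 59 days: outside window
        (start, None),               # still present
        (start, date(2025, 1, 31)),  # rewritten after 30 days: churned
    ]
    print(churn_rate(history, window_days=30))
```

This sketch ignores censoring: a line added less than `window_days` ago has not yet had a full window in which to churn, so a careful analysis would exclude such lines or use a survival-analysis estimator.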

Gets in fast. Doesn't always stay.

Agent activity in open source is growing. Some agents merge more reliably than humans. The median Codex PR merges in about half a minute, against roughly 24 minutes for the median human PR. But code written by agents shows higher churn over time than code written by humans.

Common assumption
Merge rate signals quality. If an agent's PRs get accepted, the agent is contributing meaningfully. Higher merge rate means better contributions.
What the data shows
Merge rate and code survival diverge. Agent code can earn high merge rates and still churn faster than human-authored code. Acceptance and longevity are different things.
Finding 1: Agent activity is accelerating, and agents vary significantly

Across the dataset, agent contributions to open-source repositories are increasing over time. But the five agents do not behave like a homogeneous category. Claude Code and OpenAI Codex were merged at higher rates than human PRs in the sample. GitHub Copilot and Devin were merged at lower rates. Google Jules sat in the middle, with a merge rate that held roughly flat across repositories of different sizes but fell off at the highest star counts.

The variation is large enough that "how do agents perform?" is not a meaningful question unless you name the agent. The agents are doing different things and being received differently by open-source communities.

Finding 2: Codex PRs merge in seconds, not minutes

The median human pull request in the dataset was merged in 0.4 hours, or about 24 minutes. The median Codex pull request was merged in 0.5 minutes, or about 30 seconds: roughly 50 times faster. Codex and Jules were the fastest-merging agents in the study. That speed almost certainly reflects the kinds of PRs these agents open rather than any special trust from maintainers: small, targeted, mechanical changes merge fast everywhere. But the gap is striking and points to real differences in how each agent scopes its work.
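The comparison is simple arithmetic once both medians are in the same unit. The sketch below shows that conversion, plus one way a median merge time could be computed from PR timestamps. The timestamp pairs and the function name are hypothetical; only the two reported medians come from the study.

```python
from datetime import datetime
from statistics import median

# Medians as reported in the study summary above.
HUMAN_MEDIAN_HOURS = 0.4
CODEX_MEDIAN_MINUTES = 0.5


def median_merge_minutes(prs):
    """Median time from creation to merge, in minutes.

    `prs` is a list of (created_at, merged_at) ISO 8601 strings
    without a timezone suffix (an assumed input format).
    """
    durations = [
        (datetime.fromisoformat(merged) - datetime.fromisoformat(created)).total_seconds() / 60
        for created, merged in prs
    ]
    return median(durations)


# Convert the human median to minutes, then take the ratio.
human_median_minutes = HUMAN_MEDIAN_HOURS * 60
speedup = human_median_minutes / CODEX_MEDIAN_MINUTES

if __name__ == "__main__":
    print(f"human median: {human_median_minutes:.0f} min, ratio: {speedup:.0f}x")

    # Hypothetical timestamps, for illustration only.
    sample = [
        ("2025-01-01T10:00:00", "2025-01-01T10:00:30"),
        ("2025-01-01T10:00:00", "2025-01-01T10:02:00"),
        ("2025-01-01T10:00:00", "2025-01-01T10:01:00"),
    ]
    print(median_merge_minutes(sample))
```

The ratio works out to about 48, which the text rounds to "roughly 50 times."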

OpenAI Codex
Higher merge rate than humans
Fastest median merge time (0.5 min)
Merge rate roughly similar across repositories regardless of star count. Very fast integration suggests small, targeted changes.
Claude Code
Higher merge rate than humans
Merge time not highlighted as an outlier
Consistently high acceptance rate across the sample. Merge rate pattern varies with repository popularity.
GitHub Copilot
Lower merge rate than humans
Slight inverted-U pattern by repo size
Merge rate follows an inverted-U with repository star count, similar to the human pattern.
Google Jules
Roughly flat merge rate
Fast merge time, alongside Codex
Merge rate stable across most repository sizes, falls off at the highest star counts. Fast to merge when accepted.
Devin
Lower merge rate than humans
Inverted-U pattern by repo size
Follows a similar size-dependent merge pattern to Copilot and human contributors. Lower overall acceptance rate.
All agents
More churn than human code over time
Higher activity, lower longevity
Regardless of individual merge rates, code written by all five agents churned at higher rates over time than human-authored code in the same repositories.
Finding 3: Agent code churns faster than human code, across all five agents

This is the finding that cuts against the merge-rate story. When the researchers tracked what happened to agent-authored code after it was merged, they found higher churn rates compared to human-authored code over the same period. Code that was written by agents was more likely to be deleted, replaced, or substantially rewritten in subsequent commits.

The paper does not pinpoint the mechanism. Agent code may be more brittle. It may solve problems at a lower level of abstraction, which later contributors quickly improve on. It may be used for low-stakes tasks where rewriting is easy. Or some combination may apply. What the data shows is the pattern, not the cause. But the pattern is consistent across all five agents.

Scope and limitations

The study measures observable behavior in public open-source repositories. It does not study proprietary codebases, where the dynamics may differ substantially. Identifying PRs as agent-authored depends on labeling strategies that may not be perfectly accurate. The churn finding is a pattern, not a causal analysis: the paper does not establish why agent code churns more, only that it does. Future work on the "why" would require deeper case analysis.

What to do with this information.

Merge rates are what vendors publish. Churn rates are what engineering teams should measure. The two are telling different stories about the same code.

1. For engineering teams using coding agents
Track what happens to agent-authored code after it merges, not just whether it merges. A month or a quarter after a run of agent PRs, check whether that code is still in place or has been rewritten. If churn is high, ask whether the agent is solving problems at the right level of abstraction or generating code that requires human cleanup to make maintainable.

2. For open-source maintainers
The merge speed finding for Codex (0.5 minutes versus 0.4 hours for humans) suggests a category of very small, targeted changes. That category may deserve its own review protocol. Fast-merging agent PRs may require less review per PR but more attention to whether they accumulate into patterns that create technical debt.

3. For anyone selecting a coding agent
The five agents showed meaningfully different merge rates, merge speeds, and behavioral profiles. Benchmarks and vendor demos are not a substitute for looking at how each agent's contributions are received in codebases similar to yours. The variation here is large enough that agent selection should be based on observed behavior in relevant contexts, not just capability claims.

4. For researchers and toolmakers
This paper establishes a methodology for longitudinal measurement of agent contributions. The survival and churn analysis is more informative than cross-sectional merge rates, and it's underused in both academic and practitioner evaluation of AI coding tools. Building churn tracking into standard agent evaluation frameworks would give teams a much more honest picture than acceptance rate alone.

5. A note on pace
Agent activity in open-source is growing. The data makes that clear. That growth is not inherently a problem, but it does mean maintainers will increasingly need tools and norms for evaluating contributions that weren't written by humans who will maintain them. The community infrastructure for that doesn't fully exist yet.

Where to go from here.

If you want to go deeper on agent contributions to real codebases, start with these five steps.

1. Read the paper
Popescu, R. M., Gros, D., Botocan, A., Pandita, R., Devanbu, P., & Izadi, M. (2026). Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time. TU Delft & University of California Davis. arXiv:2604.00917.

2. Read the companion study on agentic coding
A related empirical study of agentic pull requests on GitHub (arXiv:2509.14745) takes a complementary approach, looking at structural characteristics of agent-authored PRs rather than longitudinal survival. The two papers together give a fuller picture of how agent contributions differ from human ones.

3. Instrument churn in your own codebase
Most version control systems make this measurable. Tag commits or PRs by source (agent vs. human), then periodically check whether lines introduced in agent PRs are still present or have been rewritten. Even a rough quarterly audit can tell you whether agent code in your codebase is accumulating or being absorbed and replaced.

4. Look at merge conflicts specifically
The AgenticFlict dataset (arXiv:2604.03551) focuses on merge conflicts in AI coding agent pull requests. If your team is running multiple agents or mixing agent and human contributions on shared branches, conflict patterns are another behavioral signal worth understanding.

5. Evaluate agents on failure modes, not benchmarks
The authors of a related study on code review agents (arXiv:2604.03196) make the same point: industry claims and empirical reality often diverge. Set up small-scale experiments where agent PRs go through your actual review process, and look for patterns in what reviewers change. That observation is more informative than any benchmark score.
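For the churn audit suggested in step 3 above, one possible starting point with Git is to parse the output of `git blame --line-porcelain` and count how many lines in a file still trace back to agent commits. The sketch below is a minimal version of that idea, not the paper's method. The function name, the `agent_shas` set, and the sample output are hypothetical; how you tag agent commits (a label, a trailer, a bot account) is up to you.

```python
import re

# A blame header line starts with a full 40-character commit hash.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def surviving_lines(porcelain, agent_shas):
    """Count lines in `git blame --line-porcelain` output by origin.

    Returns (agent_lines, other_lines), where agent_lines are lines whose
    blamed commit is in `agent_shas` (a set of full commit hashes).
    """
    agent = other = 0
    current_sha = None
    for raw in porcelain.split("\n"):
        if raw.startswith("\t"):
            # Content lines are prefixed with a tab and belong to the
            # most recently seen header commit.
            if current_sha in agent_shas:
                agent += 1
            else:
                other += 1
        else:
            first_token = raw.split(" ", 1)[0]
            if SHA_RE.match(first_token):
                current_sha = first_token
    return agent, other


# Hypothetical, abbreviated blame output for a three-line file.
AGENT_SHA = "a" * 40
HUMAN_SHA = "b" * 40
SAMPLE_BLAME = "\n".join([
    f"{AGENT_SHA} 1 1 2",
    "author Example Agent",
    "filename app.py",
    "\tdef answer():",
    f"{AGENT_SHA} 2 2",
    "author Example Agent",
    "filename app.py",
    "\t",
    f"{HUMAN_SHA} 3 3 1",
    "author Example Human",
    "filename app.py",
    "\t    return 42",
]) + "\n"

if __name__ == "__main__":
    print(surviving_lines(SAMPLE_BLAME, {AGENT_SHA}))
```

In practice the input would come from running `git blame --line-porcelain <rev> -- <path>` for each file of interest. Comparing the surviving count with the number of lines the agent commits originally added gives a rough survival ratio. Blame attributes a line to the last commit that touched it, so even a whitespace-only edit counts as a rewrite under this measure.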