What we learned from 858 AI-generated commits — patterns, failures, and surprises
19 AI agents wrote 858 commits in under two weeks. Here's the real data: which agents shipped the most, what broke, what the PR Review agent caught, and why AI-generated code quality isn't what you'd expect.
We built Klow with 19 AI agents. No human developers. As of today, the git log shows 858 commits across backend, frontend, security, DevOps, Web3, QA, database, growth, and PR review. Every commit is real, traceable, and pushed to production.
This post is the data. Not the pitch. We pulled the actual commit history, categorized every change, and found patterns we didn't expect. Some are encouraging. Some are uncomfortable. All of them are real.
The commit distribution: security dominates
Here's the breakdown by agent role, ranked by commit count:
- Security Auditor: 123 commits (14.3% of all output)
- QA Worker: 88 commits (10.2%)
- Web3 Worker: 69 commits (8.0%)
- Database Worker: 68 commits (7.9%)
- Growth/Content: 48 commits (5.6%)
- Supervisor + PR Review + CEO: 48 commits (5.6%)
- Frontend Worker: 47 commits (5.5%)
- Backend Worker: 39 commits (4.5%)
- DevOps Worker: 39 commits (4.5%)
The Security Auditor produced more commits than any other agent — by a wide margin. That wasn't planned. The agent was given a broad directive ("find and fix vulnerabilities") and autonomously identified 95+ distinct security issues across the codebase. It audited wallet key handling, JWT authentication, CORS headers, rate limiting, input validation, CSP policies, and Stripe webhook signatures.
The surprise: security work expanded to fill the available surface area. Every new feature the Backend or Web3 agent shipped created new attack surface for the Security agent to audit. The two agents were effectively in a productive loop — one builds, the other hardens.
What the PR Review agent caught
The PR Review agent scans every commit for correctness, security, and style issues. It's caught real bugs that would have shipped to production. Three examples:
Bug 1: Missing auth on wallet endpoints
Five wallet GET endpoints — including balance, spending policy, and transaction history — were missing ownership verification. Any authenticated user could read any other user's wallet data by guessing the deployment ID. The PR Review agent flagged it. The Security agent (S-002) fixed it in the same cycle. 57 tests updated.
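The fix pattern is simple to state: knowing a deployment ID is not proof of ownership, so every wallet read must compare the record's owner against the authenticated requester. A minimal sketch, assuming a hypothetical record shape (`Deployment`, `ownerId`, and `assertWalletOwner` are illustrative names, not Klow's actual code):

```typescript
// Hypothetical shape of a wallet deployment record; field names are
// illustrative, not Klow's actual schema.
interface Deployment {
  id: string;
  ownerId: string;
}

// Ownership check every wallet GET handler runs before returning data.
// Guessing a deployment ID alone should never grant access.
function assertWalletOwner(dep: Deployment, requesterId: string): void {
  if (dep.ownerId !== requesterId) {
    // In the HTTP layer this would surface as a 403 (or 404 to avoid
    // leaking existence); a plain throw keeps the sketch self-contained.
    throw new Error("Forbidden: requester does not own this wallet");
  }
}
```

With a guard like this in place, an authenticated user probing another user's deployment ID gets an error instead of a balance.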
Bug 2: JWT tokens in URL query parameters
Two SSE endpoints accepted JWT tokens via ?token= query parameter as a fallback. JWTs in URLs appear in server access logs, browser history, and CDN cache keys. The PR Review agent flagged the pattern as a GOLDEN_RULES.md violation. Both endpoints were patched to require Authorization headers only.
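The safe pattern accepts credentials only from the `Authorization` header. A minimal sketch (the helper name and header handling are assumptions, not Klow's actual middleware):

```typescript
// Extract a bearer token from request headers only. A query-string
// fallback (?token=...) is deliberately not supported: URLs leak into
// server access logs, browser history, and CDN cache keys.
function extractBearerToken(
  headers: Record<string, string | undefined>
): string | null {
  const auth = headers["authorization"];
  if (!auth) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(auth.trim());
  return match ? match[1] : null;
}
```

One wrinkle worth noting: the browser's native `EventSource` cannot set custom headers, which is usually why SSE endpoints grow a `?token=` fallback in the first place. Common workarounds are a fetch-based SSE client that can send an `Authorization` header, or exchanging the JWT for a short-lived one-time ticket over an authenticated POST before opening the stream.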
Bug 3: Race condition in credit deduction
The credit system used separate Redis EXISTS and DECRBY commands. Two concurrent deductions could both pass the existence check and both deduct — draining an account past zero. The Security agent replaced it with an atomic Lua script that does check-and-deduct in a single Redis round-trip. Then added an hourly reconciliation job to detect drift.
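The shape of that fix looks roughly like this. The Lua script is illustrative rather than Klow's actual code, and the in-memory store below only simulates the two separate Redis commands so the race is visible without a Redis server; in production the atomicity comes from Redis executing a script as a single uninterruptible operation.

```typescript
// Illustrative Lua: Redis runs a script atomically, so nothing can
// interleave between the balance check and the DECRBY.
const DEDUCT_LUA = `
local bal = tonumber(redis.call('GET', KEYS[1]) or '-1')
if bal < tonumber(ARGV[1]) then return -1 end
return redis.call('DECRBY', KEYS[1], ARGV[1])
`;

// In-memory stand-in for the two separate commands the buggy version used.
class CreditStore {
  private balances = new Map<string, number>();
  set(id: string, n: number) { this.balances.set(id, n); }
  get(id: string): number { return this.balances.get(id) ?? 0; }
  // The buggy flow: a check...
  hasAtLeast(id: string, amount: number) { return this.get(id) >= amount; }
  // ...followed later by an unconditional deduct.
  decrBy(id: string, amount: number) { this.set(id, this.get(id) - amount); }

  // Atomic version: check and deduct in one step, as the Lua script does.
  // (Trivially atomic here because JS is single-threaded; with Redis, the
  // script execution provides the same guarantee across clients.)
  deductAtomic(id: string, amount: number): boolean {
    if (!this.hasAtLeast(id, amount)) return false;
    this.decrBy(id, amount);
    return true;
  }
}
```

Interleave two clients by hand and the bug reproduces: both pass `hasAtLeast` on a 100-credit balance before either calls `decrBy(70)`, and the account lands at -40. With `deductAtomic`, the second deduction simply fails.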
Failure patterns: what AI agents get wrong
Not everything was clean. We tracked recurring failure patterns across the codebase:
- Scope creep per turn: agents frequently tried to ship features outside their assigned task, requiring scope lock directives during the launch window
- Copy-paste inconsistency: the Frontend agent sometimes duplicated UI patterns with slight variations instead of extracting shared components
- Premature optimization: the Database agent added indexes we didn't need yet (43 index optimizations, some of them speculative)
- Stale context: agents occasionally referenced API endpoints or database fields that had been renamed by another agent in a parallel commit
- Test mocking over-reliance: QA tests sometimes mocked so aggressively that the test passed even when the underlying code was broken
The stale context problem is the most interesting. When 19 agents work in parallel, they're all operating on slightly different snapshots of the codebase. Agent A renames a field. Agent B, working from the pre-rename state, writes code referencing the old field name. TypeScript catches most of these at compile time — but not all.
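A concrete illustration of that gap (the type and field names here are invented for the example): dotted property access against a renamed field fails to compile, but dynamic string-keyed access sails through the type checker and only surfaces at runtime as an unexpected `undefined`.

```typescript
// Agent A renamed `wallet_addr` to `walletAddress` in the shared type.
interface Deployment {
  walletAddress: string;
}

const dep: Deployment = { walletAddress: "0xabc" };

// Agent B's stale code, written as a property access, is caught:
//   dep.wallet_addr
//   // error TS2339: Property 'wallet_addr' does not exist on type 'Deployment'

// But dynamic access escapes the compiler entirely and just yields
// undefined at runtime:
const stale = (dep as unknown as Record<string, unknown>)["wallet_addr"];
// stale === undefined
```

Serialization boundaries (JSON payloads, Redis values, raw SQL) behave like the dynamic case, which is exactly where the parallel-agent renames slipped through.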
The revert that taught us the most
The Frontend agent completely redesigned the landing page — new layout, animations, gradient orbs, the works. It looked impressive. Then we reverted it. The redesign prioritized visual polish over content depth, removing detailed product descriptions that actually convert visitors. The revert commit message says it all: "Restore full descriptions, mascot hero, theme colors, and content depth."
The lesson: AI agents optimize for the metric you give them. The Frontend agent was told to make the page "look like a world-class Web3 SaaS product." It did exactly that — and lost the substance in the process. The fix wasn't better AI. It was a better brief: "make it look great AND keep every word of the existing copy."
Commit velocity over time
The first 3 days produced roughly 700 commits — core infrastructure, security hardening, and the full feature set. Days 4-12 slowed to ~20 commits per day as work shifted from building to polishing, testing, and fixing edge cases discovered during QA cycles.
This mirrors human development teams. The early phase is fast and expansive. The later phase is slow and surgical. What's different: the AI agents didn't get tired, didn't context-switch, and didn't complain about bug duty. The QA agent ran 6 full end-to-end test cycles in 3 days, each producing a detailed report with pass/fail counts. A human QA team would need a week for one.
The numbers that matter
- 858 total commits from AI agents
- 1,111 tests passing in CI
- 95+ security vulnerabilities found and patched
- 43 database index optimizations
- 0 critical production incidents post-launch
- 14 PR Review findings that prevented real bugs
- 1 full landing page revert (lesson learned)
What this means for AI-generated code
AI-generated code quality isn't uniformly good or bad. It's lumpy. The Security agent produced consistently high-quality, thorough work — probably because security auditing maps well to pattern matching and systematic enumeration. The Frontend agent was more inconsistent — great at component architecture, occasionally misguided on design decisions.
The critical insight: AI agents reviewing AI agents catches more bugs than AI agents working alone. The PR Review agent, Security Auditor, and QA Worker form a quality triangle that no single agent could replicate. Multi-agent systems aren't just faster — they're more reliable because of the adversarial checking.
“The question isn't whether AI can write production code. We have 858 commits that say it can. The question is whether you've set up the right system of checks. One agent is a risk. A swarm with reviewers is a team.”
Explore the full commit history live at klow.info/live. Or deploy your own agent swarm and see what patterns emerge in your codebase. Start here: how to build a multi-agent swarm on Klow. And see the management lessons we learned: 5 lessons from building a startup with AI agents.
Try it yourself
Deploy your first AI agent in minutes. 7-day free trial, no card required.
Start free →