<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://foobarto.me//feed.xml" rel="self" type="application/atom+xml" /><link href="https://foobarto.me//" rel="alternate" type="text/html" /><updated>2026-04-23T18:48:57+00:00</updated><id>https://foobarto.me//feed.xml</id><title type="html">foobarto.me</title><subtitle>Personal site and tech blog of Bartosz Ptaszyński — notes on security, programming, technology, and the occasional detour through science.</subtitle><author><name>Bartosz Ptaszyński</name><email>bartosz@foobarto.me</email></author><entry><title type="html">AI killed Agile, long live Waterfall</title><link href="https://foobarto.me//blog/2026/ai-killed-agile-long-live-waterfall/" rel="alternate" type="text/html" title="AI killed Agile, long live Waterfall" /><published>2026-04-23T00:00:00+00:00</published><updated>2026-04-23T00:00:00+00:00</updated><id>https://foobarto.me//blog/2026/ai-killed-agile-long-live-waterfall</id><content type="html" xml:base="https://foobarto.me//blog/2026/ai-killed-agile-long-live-waterfall/"><![CDATA[<p>For twenty years, “Agile vs Waterfall” has been a rigged debate. You were either shipping biweekly with story points and retros, or you were a dinosaur writing 80-page requirements docs nobody would read. One was engineering; the other was malpractice. The jokes wrote themselves.</p>

<p>Here’s the thing. The jokes are aging badly.</p>

<p>Not because Agile was wrong — it was right for the problem it was solving. But the problem has quietly changed underneath it, and a lot of teams haven’t noticed. The economics that made Agile inevitable in 2005 don’t hold anymore. What’s coming back in its place looks, from a squint, an awful lot like the thing we all laughed at.</p>

<h2 id="why-agile-won">Why Agile won</h2>

<p>Agile won because predicting software was a disaster. You’d write a spec for a six-month project and by month two you’d discover a whole subsystem you didn’t know existed, a library that didn’t do what the docs claimed, an integration that needed three times the code you estimated. The requirements would shift under you anyway because the business didn’t really know what it wanted until it saw something running.</p>

<p>So we gave up on prediction. Short cycles. Working software over comprehensive documentation. Respond to change over following a plan. The Agile Manifesto wasn’t a methodology, it was a surrender — a graceful one. If you can’t see far, don’t plan far. Take a step, look around, take another step.</p>

<p>This was the correct response to a specific constraint: <strong>writing code was expensive, and figuring out what to write was almost free by comparison.</strong> Planning meetings, specs, design docs — those were cheap. Actually building the thing was the bottleneck. So you optimized by building the smallest useful thing, seeing what you learned, and adjusting.</p>

<p>The ratio mattered. Exploration was cheap, execution was expensive, so you kept execution tight and let exploration wander.</p>

<h2 id="what-ai-changes">What AI changes</h2>

<p>Flip the ratio.</p>

<p>With a competent coding agent, writing the code is no longer the bottleneck. You can describe a feature at breakfast and have a working implementation by lunch. Not always good, not always right, but <em>working</em>. Three implementations if you want to compare them. The “can we even build this” question, which used to eat sprints, now eats afternoons.</p>

<p>What’s expensive now? Knowing what to build. And — this is the part people are still catching up to — <em>specifying it precisely enough that the fast execution produces something coherent.</em></p>

<p>An agent will happily generate 4,000 lines of code from a vague prompt. The code will run. It will also be the wrong abstraction, ignore the three constraints you forgot to mention, and quietly invent APIs that don’t exist. The bottleneck isn’t typing anymore. It’s thinking clearly enough that a fast, literal, tireless collaborator doesn’t faithfully execute your confusion at scale.</p>

<p>This is the reversal. Exploration is now the expensive part. Execution is cheap. And when execution is cheap, the value of getting the plan right <em>before</em> you execute goes way up.</p>

<h2 id="the-return-of-the-design-doc">The return of the design doc</h2>

<p>Watch what the people building seriously with AI actually do. They’re not typing <code class="language-plaintext highlighter-rouge">please build me a web app</code> into an agent and calling it a day. They’re writing specs. Long, carefully considered specs. <code class="language-plaintext highlighter-rouge">DESIGN.md</code>. <code class="language-plaintext highlighter-rouge">PLAN.md</code>. Architecture decisions written out in prose before a single file is generated. Threat models. Sequencing docs. Capability boundaries spelled out in English because the agent needs them spelled out in English.</p>

<p>The artifact that matters has shifted. It used to be the code — the code was the truth, the spec was a lie people told themselves at the start of a project. Now the spec is where the thinking happens, and the code is a compilation target. A cheap one. Rerunnable.</p>

<p>This is Waterfall, or at least Waterfall’s ghost. Not the cartoon version where a BRD gets thrown over a wall to engineers who throw binaries over another wall to users. The underlying instinct: think carefully, write it down, then build. The reason it failed in the 90s was that the “then build” step took two years and the world moved on. The reason it’s viable now is that the “then build” step takes a weekend.</p>

<h2 id="what-waterfall-looks-like-in-2026">What “Waterfall” looks like in 2026</h2>

<p>It doesn’t look like a Gantt chart.</p>

<p>It looks like a single engineer spending three days on a design doc and four hours on implementation. It looks like spec-driven development where the spec is the repo’s most-edited file. It looks like PRs where the interesting review comment is on the architecture markdown, not on the code. It looks like product work where you iterate on <em>the description of the feature</em> through a dozen revisions before anyone writes a test.</p>

<p>Short feedback loops haven’t gone away — they’ve moved upstream. The loop used to be: ship, measure, learn, adjust the backlog. Now the loop is: spec, generate, read, adjust the spec. You can run that loop five times in an afternoon. You’re still being agile in the lowercase sense. You’re just doing it in the design phase, because the design phase is now where the uncertainty lives.</p>

<h2 id="what-this-means-for-teams">What this means for teams</h2>

<p>The uncomfortable part. A lot of Agile ceremony was scaffolding for the fact that individual engineers couldn’t be trusted to plan the whole thing — not because they were bad, but because planning the whole thing was genuinely impossible. Sprints, standups, and story points were risk-management for a world where humans wrote every line by hand and drifted in predictable ways.</p>

<p>When one person with a clear spec can produce what used to take a team a quarter, the ceremony starts looking expensive. Not useless — expensive. A standup where four people sync on yesterday’s tickets is a fine ritual when everyone’s producing at human speed. It’s a strange ritual when one of them shipped the feature overnight and the other three are still describing theirs.</p>

<p>The teams that seem to be winning right now are doing something that looks almost old-fashioned: small groups, lots of writing, careful sequencing, long thinking phases, fast execution phases. They have strong opinions about architecture before they touch a file. They push the uncertainty into the design, not the sprint.</p>

<h2 id="the-long-live-part">The long live part</h2>

<p>Waterfall isn’t literally coming back. The thing it got wrong — betting a year of work on a spec written before you knew anything — is still wrong. Specs are still usually wrong on the first pass. The difference is that the cost of being wrong has collapsed. You write the spec, generate the system, read it, realize the spec was incomplete, fix the spec, regenerate. The loop that used to take eighteen months takes an afternoon.</p>

<p>What we’re converging on doesn’t have a clean name yet. Spec-driven. Design-first. Intent-oriented. Pick your poison. It borrows Waterfall’s seriousness about thinking before building and Agile’s humility about not knowing the answer up front. It’s less a methodology than a new ratio: planning is expensive again, execution is cheap, and the practices that survive will be the ones that respect that.</p>

<p>Agile won the last war. That war is over. The code is writing itself now. Someone still has to decide what it should say.</p>]]></content><author><name>Bartosz Ptaszyński</name><email>bartosz@foobarto.me</email></author><category term="programming" /><summary type="html"><![CDATA[For twenty years, Agile vs Waterfall has been a rigged debate. You were either shipping biweekly with story points and retros, or you were a dinosaur writing 80-page requirements docs nobody would read. One was engineering; the other was malpractice. The jokes wrote themselves. Here's the thing. The jokes are aging badly.]]></summary></entry><entry><title type="html">You are holding it wrong</title><link href="https://foobarto.me//blog/2026/you-are-holding-it-wrong/" rel="alternate" type="text/html" title="You are holding it wrong" /><published>2026-04-22T00:00:00+00:00</published><updated>2026-04-22T00:00:00+00:00</updated><id>https://foobarto.me//blog/2026/you-are-holding-it-wrong</id><content type="html" xml:base="https://foobarto.me//blog/2026/you-are-holding-it-wrong/"><![CDATA[<p>There’s a growing genre of developer blog post: the AI slop rant. You’ve read them, maybe written one. Pull requests that compile and pass tests and do nothing the ticket asked for. Documentation that the submitter clearly hasn’t read. Open source maintainers burning out under a tide of drive-by contributions from people who can’t answer basic questions about the code they just submitted. Emojis in comments. Invented APIs. Four thousand lines where forty would do.</p>

<p>The rants are not wrong.</p>

<p>A recent qualitative study out of Heidelberg, the University of Melbourne, and Singapore Management University analyzed over a thousand Hacker News and Reddit posts tagged “AI slop” and found a consistent theme: developers describe the phenomenon as a tragedy of the commons, where one person’s velocity gain becomes five reviewers’ cleanup bill. Rémi Verschelde, who maintains the Godot game engine, has publicly described the influx of AI-generated contributions as draining and demoralizing. Mitchell Hashimoto, the HashiCorp founder, has built a vouching system — currently being piloted on his Ghostty project — specifically because AI tools have made it trivial to generate plausible-looking but hollow contributions. The Gentoo Linux distribution is migrating off GitHub to Codeberg. This is not a vibes problem. It is a real, measurable externality being absorbed by the people at the end of the PR queue.</p>

<p>But I want to offer an unfashionable observation. The problem isn’t the tool. It has never been the tool. Look at who’s actually producing the slop — and, more tellingly, look at who isn’t.</p>

<h2 id="the-quiet-other-side">The quiet other side</h2>

<p>Simon Willison, co-creator of the Django web framework, has been publishing his AI-assisted development workflow in public for close to three years. He recently wrote a detailed walkthrough of building a custom colophon page for his tools site: conception to deployed feature in just over seventeen minutes; total Anthropic API cost: sixty-one cents. The code was reviewed. The tests were run. He understood every line, which is why he could step in and finish the last bit by hand when the model got stuck on a GitHub Actions quirk. Nobody in the Datasette community is writing angry blog posts about how Simon’s PRs are destroying their review process.</p>

<p>Kent Beck — <em>the</em> Kent Beck, co-author of the Agile Manifesto, inventor of Extreme Programming, pioneer of test-driven development — spent a chunk of 2025 building a B+ Tree library called BPlusTree3 in Rust and Python using what he calls augmented coding. The result: production-competitive performance, with the Rust implementation matching standard library benchmarks and outperforming them on range scans. He describes the process not as letting the machine run wild but as intervening constantly — watching for warning signs, stopping the agent the moment it starts generating functionality he didn’t ask for, treating unexpected test deletions as red flags. He’s also been explicit that juniors working this way — augmented, not vibe-coding — ramp onto codebases dramatically faster than before, because the AI collapses the search space for “which library should I even use” down from hours to minutes, freeing time for actual learning.</p>

<p>Armin Ronacher, creator of Flask and previously VP of Platform at Sentry, now runs a startup called Earendil with a small team plus what he openly refers to as AI interns. His thirty-seven-minute talk on agentic coding walks through a workflow in which he ships features with Claude Code running with broad permissions inside a Docker container. He’s delegating real work, not supervising every token. His stated philosophy: keep the context system simple, keep the feedback loops observable, avoid tool sprawl, assume the agent will be lazy about whatever friction you introduce. He’s not complaining about slop. He’s also not producing it.</p>

<p>Tobias Lütke, the Shopify CEO, has a publicly visible GitHub contribution graph that spiked last autumn when coding agents crossed a real capability threshold. He’s shipping code again, for the first time in years, because an agent lets him fit real programming work into the cracks of being a CEO — including a recent autoresearch plugin built alongside a collaborator, with the agent maintaining state in a structured JSONL file across sessions.</p>

<p>There’s a developer named Lalit who had been procrastinating on a SQLite parser and linter project for years — four hundred grammar rules of tedious work that every would-be contributor bounces off — and finally shipped the prototype with Claude Code’s help. The reception in the community has been enthusiasm, not complaint.</p>

<p>You could extend this list. Andrej Karpathy. The long tail of indie developers shipping side projects that, five years ago, simply wouldn’t have existed because the cost of getting started was too high. None of these people are villains in the AI slop discourse. None of them are slowing down the people around them. Their names come up in the positive examples, not the complaints.</p>

<h2 id="the-diagnosis">The diagnosis</h2>

<p>So what’s different? It’s not the model. Simon, Kent, Armin, and the people writing angry blog posts are using roughly the same tools — Claude Code, Cursor, Codex, some mix. It’s not even the prompting technique. Most of the patterns are public; Simon has been documenting them for years.</p>

<p>The difference is that the people who ship quality AI-assisted work treat the output as <em>their</em> output. They read it. They test it. They know why every function exists and what happens when it breaks. Simon Willison has made a useful distinction: having an LLM generate every line of your code is not the same thing as vibe coding — provided you actually review, test, and understand what came out. One is using the model as a very fast typist. The other is abdicating. The word he’s landed on for the responsible version, <em>vibe engineering</em>, is ugly on purpose — it refuses the cleanness of pretending there’s a category of serious AI use that doesn’t involve serious human judgment.</p>

<p>The developer who drops four thousand lines of generated code into a PR with a ticket number and a shrug is not losing a fight with their tool. They’re losing a fight with the expectation that engineers understand what they ship. That expectation predates LLMs by about fifty years. It’s not an AI problem. It’s an accountability problem wearing an AI costume.</p>

<h2 id="why-the-backlash-is-correct-anyway">Why the backlash is correct anyway</h2>

<p>None of this is an argument that the slop complaints are wrong to be loud. They are exactly as loud as they need to be. When the cost of producing plausible-looking work collapses, the ratio of serious work to performative work gets harder to read from the outside. Reviewers, maintainers, and teammates end up absorbing the evaluation that the submitter should have done. That’s a real cost, and venting about it is how a community establishes new norms in real time.</p>

<p>But the framing — <em>this tool is ruining our craft</em> — is diagnostically off. The tool is exposing something about the craft that was already there. The person who now submits four thousand lines of generated code is, in most cases, the same person who would have submitted four hundred lines of Stack Overflow copy-paste a decade ago with slightly more friction. What’s changed is the <em>spread</em>. The gap between the best and worst practitioners on any given task has widened, because the tool amplifies whatever judgment the user brings. If you have taste, you ship in a morning what used to take a week. If you don’t, you now generate in a morning what it takes a reviewer a week to unpick.</p>

<p>The quiet pattern in all the positive examples above is the same pattern visible in any good senior engineer’s workflow for the last thirty years: clear intent going in, a short feedback loop, honest reading of the output, willingness to throw it away when it’s wrong. The people who already had those habits got a force multiplier. The people who didn’t, got exposed.</p>

<h2 id="the-boring-conclusion">The boring conclusion</h2>

<p>There’s a comfortable version of this debate where you pick a side — AI good, AI bad — and call anyone on the other one a shill or a Luddite. The actual situation is more annoying. The tool is real. The slop is real. The productivity gains are also real. The people producing high-quality AI-assisted work are not a rhetorical fiction invented by Anthropic’s marketing team; they have names and public output you can go read.</p>

<p>So the next time someone forwards you a rant about how AI is destroying code review, the right response isn’t to defend the tool. It’s to ask who wrote the PR.</p>

<p>Don’t blame the hammer. The hammer does exactly what the hand tells it to do. That’s the entire point of a hammer.</p>]]></content><author><name>Bartosz Ptaszyński</name><email>bartosz@foobarto.me</email></author><category term="ai" /><summary type="html"><![CDATA[There's a growing genre of developer blog post: the AI slop rant. You've read them, maybe written one. Pull requests that compile and pass tests and do nothing the ticket asked for. Documentation that the submitter clearly hasn't read. Open source maintainers burning out under a tide of drive-by contributions from people who can't answer basic questions about the code they just submitted. Emojis in comments. Invented APIs. Four thousand lines where forty would do. The rants are not wrong.]]></summary></entry><entry><title type="html">The threat model nobody reads</title><link href="https://foobarto.me//blog/2026/threat-model-nobody-reads/" rel="alternate" type="text/html" title="The threat model nobody reads" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://foobarto.me//blog/2026/threat-model-nobody-reads</id><content type="html" xml:base="https://foobarto.me//blog/2026/threat-model-nobody-reads/"><![CDATA[<p>Every appsec engineer has a folder. Mine lives in a Confluence space; yours might be in a git repo, a shared drive, or the bottom of someone’s laptop. Inside are threat models. Maybe a dozen of them. Some are good. Some were good once. Most were last edited during the feature’s original design review and haven’t been opened since, unless somebody wanted to reference them in a compliance questionnaire.</p>

<p>This is the dirty secret of threat modeling. It’s universally acknowledged as the highest-leverage security activity a team can do — finding a design flaw on a whiteboard costs roughly a hundredth of what it costs to fix in production — and it’s universally underused. Not because people don’t know about STRIDE. Not because they don’t have Threat Dragon or the Microsoft tool or IriusRisk. They do the exercise once, get something valuable out of it, and then the thing rots.</p>

<p>There are two separate failure modes here, and they’ve been the same two failure modes for fifteen years. It’s worth pulling them apart before talking about what AI actually changes.</p>

<h2 id="problem-one-the-first-draft-is-expensive-and-boring">Problem one: the first draft is expensive and boring</h2>

<p>A proper threat model for a non-trivial service takes somewhere between one and eight days of senior effort. That’s not a vendor statistic; that’s the range I’ve watched teams hit, and it matches the published numbers. You pull the architect, a senior dev, and someone from appsec into a room (or a call) for a few sessions. You draw the DFD. You walk every data flow through six STRIDE lenses. You write down a mitigation for each credible threat. You score them. You end up with a forty-page document.</p>

<p>The output is almost always useful. The process is almost always miserable. Developers tolerate it once, learn the shape of the questions you’ll ask, and quietly route around it the second time. The feedback loop is brutal: you spend two days in workshops, two days writing, and the team gets back a document with one hundred and forty threats in it, of which maybe twenty will ever materially change their design.</p>

<p>Nobody on the product side wakes up excited to threat-model. And — this is the part security people sometimes miss — they shouldn’t have to. The process as traditionally practiced is optimized for completeness, not for the developer’s time. That’s backwards. The scarce resource is the developer’s attention, not the threat catalog.</p>

<h2 id="problem-two-day-two">Problem two: day two</h2>

<p>Assume you solve problem one. You get a beautiful threat model committed on day zero. What happens next?</p>

<p>The service adds a new upstream dependency. Two weeks later, someone swaps the auth library. A month after that, a queue gets introduced between two services that used to talk synchronously. A quarter in, there’s a new admin endpoint that bypasses the main API gateway because someone needed a quick internal tool. Each of these changes, on its own, is a small delta. Together, over six months, they turn the threat model into a document that describes a system that no longer exists.</p>

<p>Stale threat models are worse than no threat model, because they create false confidence. Auditors see the green checkmark. Engineers see a diagram that doesn’t match their mental model of the service and conclude the whole exercise is security theater. The model’s authority collapses the first time a developer catches it being wrong about something they know cold.</p>

<p>The “threat model as code” movement — keeping the model in YAML next to the application code, version-controlling it, regenerating diagrams from the source of truth — was a real and necessary step forward. It made threat models diffable. It put them in PRs. But it didn’t solve the problem. It turns out a YAML file that nobody updates rots exactly as fast as a Confluence page that nobody updates. Putting the model in git was necessary but not sufficient.</p>
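<p>For concreteness, a threat-model-as-code file is typically a small YAML document versioned next to the application it describes. The schema below is a hypothetical illustration, not any particular tool’s format:</p>

```yaml
# threat-model.yml -- hypothetical schema, versioned next to the code it describes
service: payments-api
entry_points:
  - id: EP-1
    path: /api/v1/charge
    auth: oauth2            # behind the main gateway
trust_boundaries:
  - id: TB-1
    between: [internet, api-gateway]
threats:
  - id: T-014
    stride: spoofing
    entry_point: EP-1
    description: auth bypass via forged partner token
    mitigation: M-003       # token signature check at the gateway
    status: mitigated
```

<p>Diffable and reviewable in a PR, and exactly as prone to rot as anything else nobody updates.</p>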

<h2 id="where-the-agent-actually-helps">Where the agent actually helps</h2>

<p>Here’s the part where most blog posts on this topic get breathless, so let me be boring about it. AI does not replace the security engineer’s judgment. A model that doesn’t understand your architecture will happily generate fifty plausible-sounding threats that don’t apply to your system, and an appsec engineer who ships that catalog to the dev team has just made everything worse.</p>

<p>But there are three specific places in the threat modeling lifecycle where an agent, used by someone who knows what they’re looking at, closes a gap that tooling has not been able to close before.</p>

<p><strong>The first draft.</strong> Give an agent read access to a repository — the actual code, the IaC, the service dependencies, the existing architecture docs if any — and ask it to produce a STRIDE-ordered first pass. Not the final model. The strawman. What you get is not authoritative, but it’s a lot more grounded than a blank page at the start of a workshop. You walk into the session with a diagram that already matches the code, a list of trust boundaries the agent inferred from how services actually call each other, and forty candidate threats with proposed mitigations. Your job in the workshop shifts from “enumerate everything that could go wrong” to “confirm, correct, and prioritize.” That changes the developer experience completely. It also compresses a week of work into an afternoon. Not because the agent did the thinking — you still do the thinking — but because the mechanical parts of the exercise are no longer yours.</p>

<p><strong>The delta review.</strong> This is the interesting one. Plug an agent into your PR workflow with the threat model checked into the repo and a simple rule: on every PR, compare the diff against the current model. Flag the ones that touch a trust boundary, introduce a new data flow, change an authentication path, or add an external dependency. For everything else, stay quiet. What you get is a bot that behaves the way a senior appsec engineer would if they had infinite time and read every PR in the organization. Ninety-five percent of PRs generate no comment. The five percent that do get a comment that looks like: <em>this PR adds a new endpoint that accepts unauthenticated input from a partner network. The current threat model documents two entry points behind the gateway; this is a third. Relevant threats from the model: T-014 (auth bypass), T-022 (input validation on partner-trusted data). Consider updating §3.2.</em> That is signal. That is what dev teams will actually read.</p>
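<p>The gating logic in front of the agent can start out embarrassingly simple: a path filter derived from the checked-in model, with the expensive agent review triggered only when the filter fires. A minimal sketch in Python, where the <code class="language-plaintext highlighter-rouge">watch_paths</code> mapping is a hypothetical stand-in for whatever your checked-in model actually declares:</p>

```python
# Cheap mechanical gate in front of the agent review: flag only the PRs whose
# diff touches a path the threat model watches. The watch_paths mapping is a
# hypothetical stand-in for fields derived from the checked-in model.
import fnmatch

THREAT_MODEL = {
    "watch_paths": {               # path glob -> related threat IDs in the model
        "gateway/**": ["T-014"],
        "auth/**": ["T-014", "T-022"],
    },
}

def threats_touched(changed_files, model=THREAT_MODEL):
    """Return the set of threat IDs whose watched paths this PR touches."""
    hits = set()
    for path in changed_files:
        for pattern, threat_ids in model["watch_paths"].items():
            if fnmatch.fnmatch(path, pattern):
                hits.update(threat_ids)
    return hits

def needs_agent_review(changed_files):
    # Escalate to the agent only when the cheap filter fires; everything
    # else stays silent, which is what keeps the bot worth reading.
    return bool(threats_touched(changed_files))
```

<p>Most diffs fall through the filter untouched; the few that remain get the expensive, context-heavy look.</p>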

<p><strong>The drift check.</strong> Run the agent periodically against the live codebase and the threat model, and ask it to identify divergence. Not to generate a new model — to point at specific places where the model and the code disagree. The service talks to a Redis the model doesn’t know about. The gateway the model claims to protect two endpoints now protects five. The auth middleware has been rewritten and the assumptions the model makes about session handling no longer hold. These findings don’t auto-fix the model; they become tickets. The appsec engineer triages them during quarterly model review — which, incidentally, now takes a day instead of a week, because the divergence is pre-catalogued.</p>
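<p>The mechanical half of a drift check is just a set comparison between what the model declares and what a scan of the code observes; the agent’s job is producing a trustworthy “observed” side. A rough sketch, with a hypothetical import-to-service mapping standing in for the agent’s much richer reading of the codebase:</p>

```python
# Drift check sketch: diff the dependencies the threat model declares against
# what a crude scan of the source actually finds. Findings become tickets to
# triage, never automatic fixes. The library-to-service mapping is hypothetical.
import re

KNOWN_CLIENTS = {"redis": "redis", "psycopg2": "postgres", "stripe": "stripe"}

def observed_services(source_files):
    """Collect client-library imports as a crude proxy for runtime deps."""
    found = set()
    for text in source_files:
        for m in re.finditer(r"^\s*import\s+(\w+)", text, re.MULTILINE):
            if m.group(1) in KNOWN_CLIENTS:
                found.add(KNOWN_CLIENTS[m.group(1)])
    return found

def drift_findings(declared, observed):
    """Report divergence in both directions as human-readable findings."""
    findings = []
    for dep in sorted(observed - declared):
        findings.append(f"code talks to '{dep}' but the model doesn't mention it")
    for dep in sorted(declared - observed):
        findings.append(f"model documents '{dep}' but the code no longer uses it")
    return findings
```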

<p>None of this is conceptually novel. All three of these things were theoretically possible with rule-based static analysis. The reason they never worked in practice is that the rules didn’t generalize across architectures, and maintaining them was a full-time job of its own. An agent reading code with context handles the generalization problem cheaply enough that the economics finally line up.</p>

<h2 id="skills-as-the-encoding-of-your-specific-knowledge">Skills as the encoding of your specific knowledge</h2>

<p>The piece that makes this actually work in a real organization, rather than in a blog post, is the part where the agent knows <em>your</em> environment. Generic threat modeling advice is a commodity. What’s valuable is your org’s accumulated knowledge: the attack patterns that keep showing up in your sector, the shape of your standard mitigations, the things your compliance team actually cares about, the libraries you’ve blessed, the ones you’ve banned.</p>

<p>This is where the emerging skills pattern — a <code class="language-plaintext highlighter-rouge">SKILL.md</code> or equivalent that tells the agent how to do a specific task in your specific context — becomes genuinely useful for appsec work rather than being another vendor buzzword. Your threat-modeling skill is a living document in the same repo as the application. It encodes: the STRIDE-plus-your-additions checklist the team actually uses, the mitigation library with your approved controls, the threats that are explicitly out of scope because the platform handles them, the format you want the output in, the tone. When you iterate on your threat modeling practice — and you should, constantly — you edit the skill, and every subsequent run benefits. The skill is the place where institutional knowledge stops living in the appsec team’s heads and starts being applied at PR speed.</p>
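<p>A hypothetical fragment, to make the shape concrete (the file paths and house rules are invented for illustration):</p>

```markdown
---
name: threat-modeling
description: Draft a STRIDE threat model for a service in this repo, our way.
---

## Process
1. Read the service code, the IaC, and anything under docs/architecture/.
2. Enumerate entry points and trust boundaries; cross-check the existing model.
3. Apply STRIDE per data flow, plus our fraud-abuse checklist below.

## House rules
- Propose mitigations only from the approved controls in docs/mitigations.md.
- Threats against the identity provider itself are out of scope: platform-owned.
- Output: a markdown table with columns ID, STRIDE, Component, Threat, Mitigation.
```

<p>When the quarterly review changes how you want threats scored, the edit lands here once and every later run inherits it.</p>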

<p>You can do similar things for pen-test scoping, for SOC2 evidence collection, for triaging SAST findings. The common pattern is the same: write down the thing your team has learned to do well, give it to the agent, and let the agent apply it consistently across surface area you could never cover by hand.</p>

<p>A handful of public examples already exist if you want to see what these look like in practice. <a href="https://github.com/tiffanymwr15/Threat-Model-Skill-for-Claude-Code">tiffanymwr15/Threat-Model-Skill-for-Claude-Code</a> is a small, focused skill that walks through framework selection, asset enumeration, and STRIDE/OWASP threat identification before producing a structured report — a good starting template if you want to see the shape of a single-purpose skill. <a href="https://github.com/fr33d3m0n/threat-modeling">fr33d3m0n/threat-modeling</a> is more ambitious: an eight-phase workflow covering threat modeling, security testing, and compliance, with explicit CI/CD integration hooks. Trail of Bits has published a set of security-focused skills in the community <a href="https://github.com/VoltAgent/awesome-agent-skills">awesome-agent-skills</a> catalog — their <code class="language-plaintext highlighter-rouge">differential-review</code> skill in particular maps cleanly onto the PR-delta pattern I described above. For a broader library, <a href="https://github.com/mukul975/Anthropic-Cybersecurity-Skills">mukul975/Anthropic-Cybersecurity-Skills</a> collects several hundred cybersecurity skills mapped to MITRE ATT&amp;CK, NIST CSF, D3FEND, and the NIST AI RMF — more than any one team will use, but a useful reference for how skills can be tagged and organized against existing security frameworks.</p>

<p>These are community efforts, quality varies, and treating any of them as turnkey is a mistake. Read them like you’d read any third-party security tool — as a starting point to fork and shape to your own environment, not as gospel. The real value shows up once a skill is specific to <em>your</em> codebase, <em>your</em> mitigation library, and <em>your</em> threat patterns. The open-source ones are there to steal patterns from.</p>

<h2 id="where-this-breaks">Where this breaks</h2>

<p>Some honest limits. The agent is not good at business-logic threats — the abuse cases that come from understanding what the product <em>does</em>, not just how it’s built. An agent can notice you’ve added an unauthenticated endpoint; it probably won’t notice that the combination of two authenticated endpoints lets a low-privilege user leak a high-privilege user’s data through a race condition in the feature spec. Those are still your job.</p>

<p>The agent is also, famously, confident when it shouldn’t be. If the model doesn’t understand the architecture, it generates threats that sound like threats. An appsec engineer who doesn’t read the output critically ends up shipping hallucinated risks to the dev team, which is the fastest way to destroy whatever goodwill the practice had built up. Rule of thumb: the threat catalog the agent produces is a draft for <em>you</em>, not for the developers. Developers see what you’ve reviewed and endorsed.</p>

<p>And finally: none of this removes the need for the workshop. The value of getting the architect, the dev, and the security person in a room for an hour is not that they’re collectively drawing a DFD. It’s that they’re arguing about what actually matters. The agent takes the drawing off the table so they can spend the hour on the argument. That’s the point.</p>

<h2 id="the-shape-of-the-change">The shape of the change</h2>

<p>Threat modeling has always had the right idea. Find the design flaws before they become code. The problem was never the concept; it was that the cost of doing it well, and doing it continuously, exceeded what any team was willing to pay. Static DFDs rotted. Threat-model-as-code made them diffable but didn’t make them self-maintaining. Every serious attempt to close the gap ran into the same wall: producing and maintaining a good model is a judgment-heavy, full-time job per product surface, and nobody has that headcount.</p>

<p>What changes with agents is the unit economics of that judgment. The security engineer’s attention is still the scarce input. What’s different is how much surface area one hour of that attention can cover — because the drafting, the diffing, and the drift detection have stopped being the expensive part.</p>

<p>The threat model nobody reads isn’t doomed to stay that way. It just needs to stop being a document and start being a living artifact the system checks itself against. We finally have the tools to build that. The only question is whether the hands holding them know what they’re looking for.</p>]]></content><author><name>Bartosz Ptaszyński</name><email>bartosz@foobarto.me</email></author><category term="threat-modeling" /><summary type="html"><![CDATA[Every appsec engineer has a folder. Mine lives in a Confluence space; yours might be in a git repo, a shared drive, or the bottom of someone's laptop. Inside are threat models. Maybe a dozen of them. Some are good. Some were good once. Most were last edited during the feature's original design review and haven't been opened since, unless somebody wanted to reference them in a compliance questionnaire.]]></summary></entry></feed>