A Reverse Vibe-Coding semi-manifesto


(Image by Gemini AI)

Coding with AI is great... Coding with AI is a disaster... Coding with AI is a 2000-piece jigsaw puzzle with both 80 of the pieces and the box art missing. The box art represents the process; the puzzle pieces are the tooling.

That has been my experience with AI for coding so far over the last 20 months or so. Three and a half months ago I threw away the box art, realizing it didn't belong to the actual puzzle, and ever since I've been trying to envision what the complete puzzle should actually look like.

While it's still far from ideal, I have found the start of a coding-with-AI workflow that seems to work for me, even if good tooling is still missing and I don't have the time to write any polished tooling myself.
Discussing these subjects on social media, I get the impression that even if I'm still partially lost, I might be significantly less lost than the vast majority of people currently discussing coding with AI. I don't claim to have all the answers, so this post is not a Manifesto, but I feel I have enough of them to make this a humble semi-manifesto, or proto-manifesto, that the community can build on cooperatively to create both a full RVC manifesto and the direly needed tooling for the workflow. I place this post and all the ideas mentioned in it in the public domain. Attribution and usage of the RVC name would be appreciated but are not required. But please, make derived work benefit the community.

That workflow is what I refer to as reverse vibe-coding, and I would like to share it with others who are trying to find a workflow that works. I'm not assuming my workflow will be suitable for everyone, but I hope this blog post will get other people to think about it, and maybe write the polished tools we are currently missing, so we can make reverse vibe-coding (RVC) or something similar the new AI coding workflow trend for 2026. If you love your IDE integration or manual prompting for integrative work, this may not be the post for you. If you value the good old git workflow and aren't afraid of merging and topic branches, enjoy the ride I'm about to take you on.

I won't bore you with what brought me to this workflow, but I'll run you through the reasons why I feel the different parts of this workflow are the way forward for AI assisted development.

Why call it "reverse" vibe-coding

With vibe-coding, people are using Large Language Models, code assistance tools and agentic hooks to write code prototypes at record speed, often with minimal architectural oversight. Many, mostly startups, are taking it even further, using vibe-coding to create production code. There is also a more conservative model, AI-assisted development, that recognizes LLMs as useful but tries to work around their limitations in order to make the results suitable for long-term product development and maintenance over the multi-year lifecycle of a software product.

In vibe-coding, the user is there for the ideation, architecture is usually ignored, and the large majority of the code will be LLM-generated. In the workflow I arrived at over the last few months, we come close to reversing that. LLMs play a major role in both ideation and architectural design, and NO eventual code will be verbatim LLM-generated. Instead, code is either hand-coded or generated by local SMALL Language Models. Note that what is 'reversed' is not the use of AI; it is where we allow verbosity, risk and uncertainty to emerge. Vibe-coding pushes uncertainty and risk into production code; reverse vibe-coding pushes it upstream into ideation and architectural exploration.

There is more to the workflow than that, including a strong focus on provenance, and the table below gives a quick overview.

                     | AI-assisted Development | Vibe Coding       | Reverse Vibe-Coding
Use-case             | Full product lifecycle  | Prototypes & MVPs | Full product lifecycle
Ideation             | Human                   | Human             | Human/LLM
Architecture         | Human                   | LLM/none          | Human/LLM (vibespiration)
Boilerplate          | LLM                     | LLM               | SLM
Core business logic  | LLM/Human               | LLM               | Human, hand-coded
Refactoring          | LLM                     | LLM               | SLM
DRY focus            | Medium                  | None              | High (manual)
Review               | Human                   | None              | LLM/Human (AI -> Human)
git workflow         | Merge/rebase            | Rebase            | Merge (strict provenance)
Integration          | IDE                     | IDE               | git / agentic hooks
Senior role          | Increased vetting       | Non-existent      | Architect & final sign-off
Due diligence        | Yes                     | Ignored           | Structurally minimized

In the rest of this post we will look into all the different aspects of the reverse vibe-coding workflow, and at how it differs from both the AI-assisted development workflow and the vibe-coding workflow.

Back to a merge-based git workflow

Let's start off with a main concern: provenance.

[Image: a paraphrased Lord Kelvin quote]

The above, very much paraphrased, quote from Lord Kelvin expresses what I feel is at the core of modern industry: if you want any hope of improving your process, you need to be able to measure it. Let's bring in the complete, non-paraphrased quote for good measure:

When you can measure what you are speaking about, and express it in numbers,
you know something about it; but when you cannot measure it, 
when you cannot express it in numbers, 
your knowledge is of a meagre and unsatisfactory kind.

In software development, provenance is the main prerequisite for measurement. Without knowing where code came from, when it changed, and under what conditions, we cannot reason quantitatively about quality, regressions, or risk. And we cannot improve our process going forward or learn from past mistakes.

Before AI-assisted coding and vibe-coding existed, it was common for git workflows to be clean and merge-based. We had rules like the following (a small hook sketch that enforces the last three appears after the list):

  • Commits must be single-issue
  • Commits should be small
  • Commit often, but:
  • Don't commit something that doesn't compile (at least if it did compile before).
  • Don't commit something that doesn't pass local linters or static analysis tools.
  • Don't commit something that doesn't pass local unit tests.

Some people used git with many topic branches; others worked with days-long local runs of merges from trunk until an issue was done and ready to be merged back into trunk or the sprint branch. Those differences are less relevant for this post.

By keeping the commits small and clean, this workflow allowed for one of the most powerful tools that git has to offer in terms of test-driven development: git bisect. In many situations you run into a bug that you didn't have test coverage for before, but you know the bug wasn't there three weeks ago. Once you write the test, the git bisect command, with a little navigating into old merged-in topic branches, will help you locate exactly when the bug was introduced and what code was changed in that commit. When you find what introduced the bug, you can go back to the current head and fix it, and with git blame you can show it to the team member who introduced the bug so they can learn from it.
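
To make that last step concrete: once the new regression test exists, git bisect can even drive the search automatically via git bisect run, which treats exit code 0 as good, 125 as skip, and anything else as bad. A minimal sketch of such a wrapper, where the build and test commands are placeholders:

#!/usr/bin/env python3
# test_regression.py: hypothetical wrapper script for `git bisect run`.
# Typical session:
#   git bisect start
#   git bisect bad                  # current HEAD shows the bug
#   git bisect good <known-good>    # e.g. the commit from three weeks ago
#   git bisect run python3 test_regression.py
import subprocess
import sys

if subprocess.run(["make", "-j"]).returncode != 0:       # placeholder build step
    sys.exit(125)    # 125 tells bisect to skip revisions that don't build

test = subprocess.run(["./build/new_regression_test"])   # placeholder test binary
sys.exit(0 if test.returncode == 0 else 1)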

With AI for code assistance, or more specifically with AI use integrated into IDEs, the workflow has become messier in many teams, and even more so with the vibe-coding crowd. Ideally you would want to keep AI contributions separate from human contributions, but in the IDE context the process tends to be iterative and often a bit more chaotic than in our earlier workflow. It becomes logical to work on topic branches that are broken much of the time until an entire feature is ready, and then to simply rebase it onto trunk as a single commit.

This turns the entire AI/human interaction into one huge blob, and at the same time broadens the search space after a git bisect.
In practice, IDE-integrated AI makes it impossible to separate AI-generated transformations from human intent and contributions. We still know who to blame, but the actual bug will be hidden in days or more worth of commits that a rebase simply squashed into one.

Most vibe-coding settings have switched to a rebase-based git workflow to handle the chaos and accept the price. Quite a few AI-assisted coding teams have done the same, while others have stuck to the old proven practice.

In reverse vibe-coding we change how the human developer interacts with the AI in such a way that we can safely choose a merge-based approach again. But we are going to abandon part of the old rules for AI while reviving them for humans. We do this by choosing a git-only approach to agents and integrative AI usage.

A git-only approach to agents

So how can we still use AI for integrative tasks like boilerplate creation and refactoring without having to give up on provenance? And could we actually improve on provenance and get the best of both worlds while acknowledging the undeniably messy iterative use of AI? My answer is to move all agent interaction out of the IDE and into git itself. I found that, even if polished tooling for this is completely missing today, we can do just that by making simple agents integrate with git hooks, turning the agent into a virtual team member with its own git account, and by using topic branches that either get merged or deleted.

Let's explore a workflow:

  • The repo has a directory named airc.d (AI Run-Commands Directory) that contains a subdirectory named after the SLM used; for this post we'll just use slm as the name.
  • Under the airc.d/slm directory are files with a specific naming convention. These are Domain-Specific-Language files, more on that later. The naming scheme is Flyway-like.
  • We define B files for boilerplate agents and R files for refactor agents. Names are to be chosen with consecutive numeric values, for example airc.d/slm/R1.rvp (where the extension rvp stands for Reverse Vibe Prompts; these files are not prompts in the chat sense, but declarative instructions).
  • We create airc.d/slm/R1.rvp, fill it with an initial prompt description, git-add it, and push to trunk or another high level branch.
  • The push triggers an RVP processor on the git server (sketched after this list) that pulls trunk, creates a topic branch, starts the agent(s) with the actual prompts, and waits for the agent to complete.
  • The RVP processor patches the topic branch, does the necessary git adds and git commit, and pushes to the new topic branch.
  • The user checks out the topic branch, looks at the code the agent created, and expands on the airc.d/slm/R1.rvp file, if needed after doing some manual fixes.
  • If after N iterations the user sees it's going nowhere, they delete the remote topic branch; if they are happy, they do some final tweaks and merge to trunk or the high-level branch.
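
To give an impression of how little machinery this needs, below is a minimal sketch of the server-side RVP processor described above. Everything in it, the branch naming, the agent invocation and the bot identity, is hypothetical glue for illustration, not existing tooling:

#!/usr/bin/env python3
# Hypothetical server-side RVP processor, triggered by a push that adds or
# extends an airc.d/<slm>/*.rvp file on trunk.
import subprocess

def sh(*cmd, cwd):
    subprocess.run(cmd, cwd=cwd, check=True)

def run_agent(workdir, rvp_path):
    # Placeholder: expand the .rvp file into concrete SLM prompts and apply
    # the resulting changes to the working tree.
    raise NotImplementedError

def process(workdir, rvp_path, issue):
    topic = f"rvp/{issue}"                              # e.g. rvp/R1
    sh("git", "fetch", "origin", "trunk", cwd=workdir)
    sh("git", "checkout", "-B", topic, "origin/trunk", cwd=workdir)

    run_agent(workdir, rvp_path)

    sh("git", "add", "-A", cwd=workdir)
    sh("git", "-c", "user.name=rvp-agent",              # the agent commits under
       "-c", "user.email=rvp-agent@example.invalid",    # its own git identity
       "commit", "-m", f"RVP: apply {rvp_path}", cwd=workdir)
    sh("git", "push", "origin", topic, cwd=workdir)

Because the agent commits under its own identity and only ever touches its own topic branch, git blame and git log keep human and SLM contributions cleanly separated.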

Unlike IDE-based agents, this approach makes every AI-driven transformation visible, reviewable, and attributable, creating full provenance.

We will look at the idea of the rvp files next, but this illustrates the basic git workflow that maintains provenance and creates a clear split between human and SLM contributions.

Moving back from English prompting to a prompting DSL

While the git part is already a stack of hacks in my personal setup, I kept the previous section conceptual enough not to go into the hacky details. As stated, what is missing from my workflow is polished tooling. Now we come to the most duct-tape-and-WD40 part of the workflow, the part that needs tooling the most: the Domain-Specific Language for prompting.

Natural language is optimized for ambiguity and human interpretation. Refactoring and boilerplate generation require the opposite: precision, repeatability, and minimal semantic drift. Using English prompts for these tasks couples correctness to model interpretation rather than to explicit intent. Further, a prompt that works on one LLM or SLM may not be optimal on another, or may not work at all. We need an abstraction over the prompts that maps common patterns of refactoring and boilerplate to appropriate prompts.

I considered showing some of my current RVP files, but decided against it. The concrete syntax is highly provisional and tightly coupled to my experimental setup. To avoid distracting from the underlying concepts, I’ll use simplified pseudo code to illustrate what the DSL is meant to express.

The idea of the DSL is that most non-creative prompts for things like refactoring and boilerplate code look roughly similar for one AI, but the exact same need may have to be expressed differently for another AI. What we want is a DSL that allows us to extract, compose, prompt and recompose, and on top of that can express branching if multiple prompts might give alternative outcomes.

Note that the branching differs slightly from the previous overview, and it's a feature I've only been experimenting with for a few days now.

So just for illustration, let's assume we are talking to an LLM instead of an SLM. We have some file src/foo.cpp that contains a class Foo that is growing rather big. The class manages 5 different members, and we think about moving three of them, bar, baz and qux, into a new class Quux and replacing the three members in the Foo class with a managed member quux of type Quux.

What we need in the DSL is basically a way to say:

  • Do a one to two class refactor
  • The class is in the file src/foo.cpp and is named Foo
  • The members bar, baz and qux should end up in the new class named Quux
  • The original class should name its new Quux member with ownership model X as quux

This is not my version of the DSL, but the pseudo code below captures its essence in a non-quirky way. It is illustrative pseudo-syntax, not a proposal of any kind:

[1] refactor<[cpp_class_split_basic, cpp_class_split_proxied], src/foo.cpp Foo>
    old: bar, baz, qux
    new: quux <uniqueptr, Quux>
    

What we are saying here is:

  • This is the original (1) prompt.
  • Create two topic branches, one for the cpp_class_split_basic prompt template for the used SLM and one for the cpp_class_split_proxied prompt template.
  • Include the src/foo.cpp code in the prompt
  • Tell the SLM that it is about the Foo class
  • Tell the SLM that bar, baz and qux should move out of the Foo class
  • Tell the SLM that the new class Quux should now hold these members and the updated Foo class should have a unique_ptr reference named quux to an object of the Quux class.
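
Just to make the mapping concrete, here is a minimal sketch of how a processor could parse such an entry into a structure that a prompt-template engine then fills in per SLM. The field names and the regular expressions are illustrative, tied to the pseudo-syntax above rather than to any real RVP format:

import re
from dataclasses import dataclass

@dataclass
class RefactorEntry:
    number: int              # the [1] entry number
    templates: list[str]     # e.g. ["cpp_class_split_basic", "cpp_class_split_proxied"]
    source_file: str         # e.g. "src/foo.cpp"
    class_name: str          # e.g. "Foo"
    old_members: list[str]   # members to move out of the class
    new_member: str          # e.g. "quux"
    ownership: str           # e.g. "uniqueptr"
    new_class: str           # e.g. "Quux"

def parse_entry(text: str) -> RefactorEntry:
    head = re.search(r"\[(\d+)\]\s+refactor<\[([^\]]+)\],\s*(\S+)\s+(\w+)>", text)
    old = re.search(r"old:\s*(.+)", text)
    new = re.search(r"new:\s*(\w+)\s*<(\w+),\s*(\w+)>", text)
    return RefactorEntry(
        number=int(head.group(1)),
        templates=[t.strip() for t in head.group(2).split(",")],
        source_file=head.group(3),
        class_name=head.group(4),
        old_members=[m.strip() for m in old.group(1).split(",")],
        new_member=new.group(1),
        ownership=new.group(2),
        new_class=new.group(3),
    )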

Now when the prompt is processed, the output will be re-applied to src/foo.cpp , and the rvp file will have a tiny bit appended:

[1] refactor<[cpp_class_split_basic, cpp_class_split_proxied], src/foo.cpp Foo>
    old: bar, baz, qux
    new: quux <uniqueptr, Quux>
    done: 5270c46bec8b3cd7468b5dd94168ac410eca1e97, 591747298a3790fde1710f3aa2d03b55020575bc

The latter is used for replayability if needed, containing both the commit id of the repo itself and that of the prompt-templates repo at the time of the first run. This is there to make AI-assisted refactors reproducible in the same way builds are reproducible, something that is currently almost entirely missing from AI tooling.
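
A short sketch of how the processor could record that replay line after a successful run, assuming the prompt templates live in their own git checkout; the helper names and paths are hypothetical:

import subprocess

def repo_head(path):
    # Current commit id of the git checkout at `path`.
    return subprocess.run(["git", "rev-parse", "HEAD"], cwd=path,
                          capture_output=True, text=True, check=True).stdout.strip()

def append_done(rvp_file, code_repo, template_repo):
    # Append the 'done:' replay line to the processed .rvp entry.
    with open(rvp_file, "a") as f:
        f.write(f"    done: {repo_head(code_repo)}, {repo_head(template_repo)}\n")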

The user should not normally edit a numbered entry that has a done section.

The DSL bit can be improved upon a lot, and my own DSL looks even worse than the syntax of the pseudo DSL in this post, but I hope both the concept and the value are clear. It's the second bit of tooling, after the git-hook agentic processing, that needs a lot of work to become polished and complete.

Small language models (SLMs) for integration (for now)

Now we come to the reason why I virtually threw away the box art: the lack of provenance in the current generation of LLMs themselves.

By provenance in this case, I specifically mean the ability to trace generated output back to its training sources with enough fidelity to make licensing and attribution decisions.

We have all heard about LLMs regurgitating training data. Regurgitation in itself is not a big problem; the models are usually trained on open source code or on code licensed to the LLM companies in such a way that LLM company customers should be able to use it. The real problem arises when the training data is regurgitated close to verbatim, at least in structure, but the original licence is not carried along.

Normal compositional usage of the training data, drawing on hundreds of sources, can be seen as transformational in a way that doesn't require our code to adhere to the licence-duplication or attribution prerequisites posed by specific open source training files. But the moment the entire structure of a piece of open source code gets regurgitated, and likely even when the merging used only a small countable number of open source files, possibly with a mutual derivation history, we have no such luck: our AI-generated code should both copy the original open source licence and attribute the original authors.

LLMs could fix this by implementing provenance. They currently don't. I could write a separate blog post on the economics of provenance for LLMs (it should be viable, the cost being real but relatively low), but the base conclusion is that right now LLMs don't do any provenance, or where they do, it is not exposed sufficiently to the users.

This factor, together with DRY principles and the need to unload the senior from a merge-monkey role, weighs toward the conclusion that we need to hand-code business logic. We absolutely don't want to hand-code boilerplate or do refactoring by hand, though. And while the chances of regurgitation seem smaller for these non-creative coding tasks, LLMs remain a black box, and the due diligence practice that most AI-assisted dev shops are doing really cuts into productivity and requires setting up more infrastructure.

Here due diligence refers to the responsibility LLM customers have to look through common open source repositories with some kind of similarity scan, to get some level of confidence that they aren't committing IP theft and copyright infringement. It is a practice that in many shops is barely more than handwaving, a way to tick the box, but for others it is an ethical responsibility to try hard.
It must be said: due diligence without provenance is fundamentally speculative. As said, the AI companies could fix this by implementing provenance, but they don't.

For our reverse vibe-coding workflow we choose to take it seriously enough to take three ethical standpoints:

  • We should use a non-blackbox SLM that publishes its training set
  • We should choose an ethical SLM that excludes the most restrictive open source licences from the training data
  • We limit the use of the integrative SLM to non-creative tasks like boilerplate and refactoring

Please note that this choice is a practical one flowing from the lack of accessible provenance offered by the current generation of LLMs. If in the future LLMs start to provide provenance hooks, the integrative use of LLMs should be reconsidered in this workflow.

Summarizing, there are four things that all push us in the SLM direction right now, and toward the stance of limiting what to use integrative AI for:

  • LLMs without provenance create a huge legal compliance challenge.
  • Senior engineers should not be compliance filters
  • DRY and maintainability push toward less generated code
  • Due diligence workflows are expensive and brittle

Micro vibe-coding for experiments

We are not completely abolishing vibe-coding. We are just not using vibe-coding in any way to create production code. Vibe-coding has a place in production-grade product development, and that place is in exploratory development and experiments. Not full-on prototypes, that's another playing field, but experiments that make the developer comfortable with techniques, libraries, frameworks, etc. Big enough to run and test, but not a single line of code will be copy-pasted into production. It's about learning, experimenting, finding out what works and what doesn't.
The experiments are deliberately disposable: they exist to inform future design decisions, not to be refined or integrated.
The user can use all of the vibe-coding IDE tools here, preferably in an IDE setup separate from production, maybe even on a different machine, to reduce the temptation of copy-pasting to production.
No code should be copied into production. Not because the code is bad, but because its provenance is not separable from the experiment.

Vibespiration for architecture

Now we get to the place where LLMs actually shine. On social media we see people claiming that vibe-coding has solved coding and that software engineering is now shifting to architecture, implying that architecture is still mainly a human task. In the reverse vibe-coding workflow we flip the narrative: neither is solved, but rather than letting AI do most of the heavy lifting on the code side, we let it do much of the heavy lifting on the architecture side.

We aren't going to vibe-architect the architecture, but a 50/50 split of the exploration effort on architectural design seems perfectly reachable. The human remains the architect, the LLM is an exploratory engine, and the combination creates a lot of synergy.

LLMs are great sparring partners for exploring. They can explore multiple roads from A to B, evaluate the internal consistency of a broad architectural concept, and as such be a great source of inspiration. You could say we are using AI for vibing our inspiration. Vibespiration, so to speak.

I'm currently working on a tech stack in my spare time, and one of the pet projects in this stack is a least-authority domain-specific language for Web 3.0 bots and Layer-2 Web 3.0 nodes called Merg-E. The language combines a lot of moving parts, and weeks of brainstorming and vibespiration sessions with different LLMs have allowed me to tune what would otherwise have been a much larger language into a minimal, internally consistent language and runtime architecture that I now need to implement.

Where, as we will see in the next section, LLMs tend to bloat up code bases in vibe-coding, on the architecture side of things they can actually help explore in a holistic way that allows us to cut out things like unneeded abstractions.

It still needs the human brain to pick between the good and the bad exploratory paths, but it is a part of the development workflow where LLMs shine the brightest.

Vibespiration, in short: exploratory and holistic architecture design.

Back to hand-coding and manual DRY (the reverse part)

This bit is going to be the hardest sell for the readers who are what I like to call believers, and for those who never actually looked into the legal aspects of using apparently clean AI output without due diligence: we are going to ban AI from any integrative code generation of the creative kind. Boilerplate is fine, refactoring is great, up to a point, but our core business logic will be the part we hand-code again.

But with all the productivity gained on other aspects of the workflow, we are going to do it with extra focus and deliberation. With the help of LLMs we have a good, internally coherent architecture, and every bit of business logic we create should fit holistically into that architecture. We have created space and time to think deliberately about DRY, the principle of Don't Repeat Yourself, in a way today's LLMs can't touch yet and maybe never will. LLMs tend to be very verbose, both syntactically and semantically, and their choices of abstractions are more about pattern recognition and generation than in any way deliberate. The LOC count for any given functionality is usually much higher for AI-generated code than for human-written code, and that is for focused local functionality.

There are important things to think about when doing DRY: subtle balances between DRY that effectively uses things like parameterized generic programming, and overshooting the abstractions by using generics on algorithmically similar but cognitively distinct pieces of logic. Right now humans are capable of identifying the right balance in a way AI is not, and might not be until we reach AGI.

Other reasons to avoid using LLMs for creative coding were already explored, and we won't rehash them here. In the reverse vibe-coding workflow we not only go back to hand-coding core business logic, we do so with more care, slowly and thoughtfully, and above all holistically. One subject we have not touched on yet, though, is senior overload. In many AI-assisted dev teams the senior devs are experiencing an increased vetting load because of AI-generated code. In theory the junior and mid-level devs should already do pre-vetting, but in practice many devs tend to trust AI input more than they should, creating more downstream vetting load. Due diligence in many teams is also a task that tends to fall predominantly on seniors.

Iterative reversing first-tier reviews

Because core business logic is human-authored, the volume of AI-generated code that requires human vetting is kept small. Instead, what we primarily need to review now is human-written code, and this is precisely where LLMs can be effective as the first tier of reviews.

LLMs can provide additional “eyeballs” early and often, without the high latency and coordination overhead of traditional human code reviews. Rather than a bulk, post-hoc activity, this review can be iterative and local. A developer who writes code, even before it is commit-ready, can ask, “Hey LLM, please check this code: what is wrong with it?”
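
As an illustration, such an ask can be as small as piping the current diff to whatever model you trust. The sketch below assumes an OpenAI-compatible endpoint served locally; the URL, model name and prompt are placeholders, not part of any specific tool:

#!/usr/bin/env python3
# Hypothetical first-tier review helper: send the uncommitted diff to a
# local model and print its remarks. Purely advisory, nothing commit-gating.
import json
import subprocess
import urllib.request

def current_diff():
    return subprocess.run(["git", "diff"], capture_output=True,
                          text=True, check=True).stdout

def ask_model(prompt, url="http://localhost:8080/v1/chat/completions"):
    body = json.dumps({
        "model": "local-review-model",   # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    diff = current_diff()
    if diff.strip():
        print(ask_model("Please check this diff: what is wrong with it?\n\n" + diff))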

This is not a zero-cost process. LLMs produce false positives, occasionally flagging correct code as problematic or suggesting changes that are either unnecessary, stylistic rather than substantive, or in some cases even conceptually wrong (suggesting 'free' in a loop that does 'getdelim' is a famous example). However, the cost is defensible for two reasons:

  1. Genuine issues are caught early
  2. The overall review burden on senior developers is significantly reduced.

Crucially, AI-based iterative reviews are not a replacement for senior vetting. They are a load reducer. By filtering out obvious issues, inconsistencies, and missed edge cases early, they allow senior reviewers to focus on architectural correctness, domain logic, and long-term maintainability rather than routine defect detection.

Monte-Carlo (Property-Based) Testing

This part of the workflow may seem like an odd one out. That is because it originated as an independent eye-opener during my roughly twenty-month journey with AI-assisted coding. While reverse vibe-coding is less exposed to this issue than unchecked, agentic vibe-coding workflows, the underlying problem exists regardless of AI, and would still exist even if everyone stopped using AI tomorrow, only at a much slower pace.

In software engineering, Test-Driven Development (TDD) has long been a pillar of code quality. Unit tests are typically written using a fixed, deterministic set of inputs. Run the test a thousand times, and it will use exactly the same values every time. For a long time, we considered this a virtue: determinism makes failures reproducible and fixes immediately visible.

The problem emerges once iteration enters the picture.

When bugs are fixed iteratively, especially in automated or semi-automated loops, there is a real risk that code starts to overfit the tests. The tests pass, but the underlying problem is not actually solved. In purely human-driven workflows this risk exists but is limited by human iteration speed. With AI in the loop though, iteration speeds can increase dramatically, and brute-force adherence-to-tests becomes possible: the code remains broken, but all tests are green.

This phenomenon mirrors what has been observed with some AI benchmarks. LLMs are extremely good at overfitting, and when placed in feedback loops with static benchmarks or tests, they may optimize for the benchmark rather than for the intended behavior. Importantly, this is not an AI-specific flaw; it is a fundamental weakness of static tests in recursive workflows. AI has merely made the problem visible.

So how do we address it?

In data engineering and computational statistics, a common technique for dealing with uncertainty and complex input spaces is a Monte Carlo simulation. Rather than testing against a small, fixed set of values, inputs are sampled randomly according to an appropriate probability distribution and the experiment is repeated many times.

We can apply a lightweight version of this idea to testing.

Instead of unit-tests that operate on a fixed set of inputs, we can write tests that execute many times, each time sampling randomized inputs within well-defined constraints. The goal is not to compute a probability distribution of the output, but to prevent any iterative process, human or AI, from overfitting to a narrow, deterministic test set.

I have been using hand-spun tests like these with my own duct-tape-and-WD40 tooling, but after having Gemini review this blog post for accuracy, it told me I'm basically doing what is called property-based testing. After looking into the subject, it does roughly match my own implementation, and there actually is some existing tooling, which is great.
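
For readers who want to try this without hand-rolled tooling, Python's Hypothesis library is one example of that existing tooling. A minimal sketch, where the function under test is a made-up example:

from hypothesis import given, strategies as st

def clamp(value, low, high):
    # Hypothetical function under test: clamp value into [low, high].
    return max(low, min(high, value))

# Instead of a handful of fixed inputs, state a property that must hold for
# many randomly sampled inputs within well-defined constraints.
@given(value=st.integers(), low=st.integers(), high=st.integers())
def test_clamp_stays_in_range(value, low, high):
    if low > high:               # keep the constraint well-defined
        low, high = high, low
    result = clamp(value, low, high)
    assert low <= result <= high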

Closing

As the first of two closers, a reminder: while right now we are excluding LLMs from integrative code generation and opting purely for SLMs for that task, provenance offered by AI companies would remove the rationale for that choice, and LLMs should then be reconsidered.

As a final closer I need to stress the current lack of polished tooling. Polished tooling for git integration and a DSL, polished tooling for iterative LLM code reviews, and polished tooling for Monte-Carlo TDD. Having a stack of about 20 pet projects already, I'm making do with my duct tape and WD40 solutions. But for broader usage of this workflow, polished tooling is essential. If you as a reader believe in this proposed workflow, please consider starting an open source project for one of the needed tools.

Let's rehash the core principles of the RVC workflow:

The Reverse Vibe-Coding Principles:

  1. Human Core: Business logic is hand-coded; AI handles the "scaffolding" and "sculpting" (boilerplate/refactor).
  2. Immutable Provenance: Every AI change must be a distinct, attributable git commit.
  3. Vibespiration, not Vibe-Implementation: Use LLMs to explore architecture, not to bypass it.
  4. Deterministic Intent, Stochastic Verification: Use a DSL for precision in coding, use Property-Based Testing to catch AI and human "overfitting".

Like many people right now who have discovered that the puzzle and the box art for the use of AI in code don't match, I'm searching for what the actual box art should look like, drawing up provisional puzzle pieces (tools) to replace the missing ones. I think I'm close enough to an idea of the box art to share my ideas and provisional process as this proto-manifesto.