My take after running engineering teams at multiple companies: documentation survives when it lives next to the code. File-level header comments explaining each component's purpose and role in the architecture. A good README tying it all together. If you compartmentalize architecture into folders, a README per folder. This works for humans, LLMs, and GitHub search alike.
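The README-per-folder convention above can even be enforced mechanically. A minimal sketch of a CI check (the function name, accepted filenames, and layout are my own assumptions, not a real tool):

```python
# Minimal sketch: fail CI when a source folder lacks a README.
# The root path and accepted filenames are assumptions; adapt to your layout.
from pathlib import Path

README_NAMES = ("README.md", "README.rst", "README.txt")

def folders_missing_readme(root: str) -> list[str]:
    """Return `root` and its subfolders that have no README file."""
    folders = [Path(root)] + [p for p in Path(root).rglob("*") if p.is_dir()]
    return sorted(
        str(f) for f in folders
        if not any((f / name).is_file() for name in README_NAMES)
    )
```

A CI job would call this and exit nonzero when the list is non-empty, so a new component folder can't land without its README.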
ADRs, Notion docs, and Confluence pages die because they're separate from the code. Out of sight, out of mind.
If you want to be really disciplined about it, set up an LLM-as-judge git hook that runs on each PR. It checks whether code changes are consistent with the existing documentation and blocks the merge if docs need updating. That way the enforcement is automated and you only need a little human discipline, not a lot.
There's no way to avoid some discipline though. But the less friction you add, the more likely it sticks.
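As a sketch of the LLM-as-judge gate described above: the judge is injected as a plain callable so the control flow is testable without a model; in a real hook it would wrap an LLM API call. The function name, prompt, and verdict format are all invented for illustration.

```python
# Sketch of a docs-consistency gate for a PR hook. The judge callable
# stands in for an LLM call; everything here is an assumption, not a
# real tool's API.
from typing import Callable

def docs_gate(code_diff: str, docs: str, judge: Callable[[str], str]) -> bool:
    """Return True when the merge may proceed (docs still consistent)."""
    prompt = (
        "Do these code changes contradict the documentation? "
        "Answer exactly CONSISTENT or STALE.\n\n"
        f"--- diff ---\n{code_diff}\n\n--- docs ---\n{docs}"
    )
    # Block the merge whenever the judge flags the docs as stale.
    return "STALE" not in judge(prompt).upper()
```

In CI you'd exit nonzero when this returns False, blocking the merge until the docs catch up with the code.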
> documentation survives when it lives next to the code.
15+ years ago, this was pretty much the standard. Every decision - whether major or just a hack to handle a corner-case - used to be recorded in the code itself. Then tools like Jira and Confluence came in and these things moved to undiscoverable nooks and corners of the organization. AI search tools like Glean and Rovo have improved the discoverability, though I'd still prefer things to remain in the code.
I suppose you are trying to "warm up" the audience before announcing your product, which is... fine, I guess.
I also had an idea for a solution to this problem a long time ago.
I wanted to make a thing that would let you record a meeting (in the company where I worked back then, such things were mostly discussed in person), transcribe it, and link parts of the conversation to relevant tickets, pull requests and git commits.
Back then the tech wasn't ready yet, but now it actually looks relatively easy to do.
For now, I try to leave such breadcrumbs manually, whenever I can. For example, if the reason why a part of the code exists seems non-obvious to me, I will write an explanation in a comment/docstring and leave a link to a ticket or a ticket comment that provides additional context.
I worked on the problem of recording 'design rationale' ~25 years ago. It is a big problem, particularly for long-lived artefacts such as nuclear reactors. Nobody is quite sure exactly why decisions were made, as the original designers have forgotten, retired or been run over by buses. And this makes changing things difficult and risky. The biggest problem is that there is no real incentive for the people making the decisions to write down why they made them:
* they may see it as reducing their career security
* they may see it as opening them up to potential prosecution
* it takes a lot of time
First, recognize that, for the first time ever, having good docs actually pays dividends. LLMs love reading docs and they're fantastic at keeping them up to date. Just don't go overboard, and don't duplicate anything that can be easily grepped from the codebase.
Second, for #3, it's a new hire's job to make sure the docs are useful for new hires. Whenever they hit friction because the docs are missing or wrong, they go find the info, and then update the docs. No one else remembers what it's like to not know the things they know. And new hires don't yet know that "nobody writes anything" at your company.
In general, like another poster said, docs must live as close as possible to the code. LLMs are fantastic at keeping docs up to date, but only if they're in a place that they'll look. If you have a monorepo, put the docs in a docs/ folder and mention it in CLAUDE.md.
ADRs (architecture decision records) aren't meant to be maintained, are they? They're basically RFCs, a tool for communication of a proposal and a discussion. If someone writes a nontrivial proposal in a slack thread, say "I won't read this until it's in an ADR."
IMHO, PRs and commits are a pretty terrible place to bury this stuff. How would you search through them, dump all commit descriptions longer than 10 words into a giant .md and ask an LLM? No, you shouldn't rely on commits to tell you the "why" for anything larger in scope than that particular commit.
It's not magic, but I maintain a rude Q&A document that basically has answers to all the big questions. Often the questions were asked by someone else at the company, but sometimes they're to remind myself ("Why Kafka?" is one I keep revisiting because I want to ditch Kafka so badly, but it's not easy to replace for our use case). But I enjoy writing. I'm not sure this process scales.
ADRs are the only way I've ever seen it done well for a sufficiently large project, let alone something like an entire product line or suite of many projects. Sometimes those span multiple organizations. Think of the Internet and the IETF RFCs. Yes, they don't give a complete picture. Implementations may not match the specification. I don't really agree they require maintenance. It's just that you have to write up a new one any time you change a decision and give a reason why. Yes, it takes a lot of organizational discipline to do that. You probably can't be in panic mode, and it won't work for a startup that needs to ship in five weeks or they can't make payroll. But there isn't really a substitute for discipline.
As maligned as it can be, the single best organization I've ever been a part of for code archaeology, on a huge multi-decade project that spanned many different companies and agencies of the government, simply made diligent use of the full Atlassian suite. Bitbucket, Jira, Confluence, Fisheye, and Crucible all had the integrations turned on. Commits and PRs had a Jira ticket number in them. Follow that link to the original story, epic, whatever the hell it was, and that had further links to ADRs with peer review comments. I don't know that I ever really had to ask a question. Just find a line of interest and follow a bunch of links and you've got years of history on exactly what a whole bunch of different people (not just the one who committed code) were thinking and why they made the decisions they made.
I've always thought about the tradeoffs involved. They were waterfall. They didn't deliver fast. Their major customers were constantly trying to replace them with cheaper, more agile alternatives. But competitors could never match the strict non-functional requirements for security, reliability, and performance, and the intolerance of regressions, so it never happened, and they've had a decades-long monopoly in what they do because of it.
(Don’t take this as advice. Just writing my own experience with this.)
This is the reason why I take the time to summarize all “why” decisions and implementation tradeoffs being made in my (too lengthy) PR descriptions with links, etc. I’ve gotten into the habit of using <details> blocks to collapse everything because I’ve gotten feedback multiple times that no one reads my walls of text. However, I still write it (with short <summary> lines now) because I’ve lost track of the number of times I’ve been able to search my PRs and quickly answer my own or someone else’s “why” question. I do it mostly for me because I find it invaluable, as I prefer writing shit down instead of relying on my flaky memory. People are forgetful and people come and go. What doesn’t disappear is documentation tied to code commits (well… unless you nuke your repo).
> Why Redis over in-memory cache?

Sometimes the answer to "why?" is that the dev had a hammer and the codebase was starting to look an awful lot like a nail. In-memory cache isn't considered as a serious option nearly enough imho.
Keep the reasoning as close to the code as possible.
1. Code should be self-explanatory; so should variable names, function names, and the overall shape of the code.
2. For the remaining non-obvious, bigger design decisions, add a comment header (e.g. JSDoc) above the main code block, and possibly refactor it out into its own file. Better to have a large comment header (and possibly some inline comments) outlining an important architectural part than to let that knowledge dissipate over time, into separate external docs, or with departing coworkers.
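The same idea in a Python module header, as one possible shape. Every detail below (the module's purpose, the incidents, the constants) is invented purely to illustrate what a rationale-carrying header can look like:

```python
"""Outbound webhook retry queue.

Why this is a separate module rather than inline retries (illustrative
rationale; every specific here is made up for the example):

- inline retries held the request thread open and tripped upstream timeouts
- delivery is deliberately at-least-once; consumers are expected to be
  idempotent, so redelivery is safe

If you change the backoff constants, re-check any consumer-side dedup
window that assumes redelivery completes within the worst-case total.
"""

BASE_DELAY_SECONDS = 30  # doubles on each attempt
MAX_ATTEMPTS = 6         # worst case: 30 + 60 + ... + 960 seconds

def total_backoff_seconds() -> int:
    """Worst-case cumulative delay across all retry attempts."""
    return sum(BASE_DELAY_SECONDS * 2 ** i for i in range(MAX_ATTEMPTS))
```

The header answers "why is this here at all", while the inline comments answer "what will break if I touch this", which is usually the split that matters.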
Doesn't really answer your question, but IME this is sort of unavoidable unless you're massive and you can afford to have people who just document this kind of stuff as their job.
Reason being, a lot of this stuff happens for no good reason, or by accident, or for reasons that no longer apply. Someone liked the tech so used it - then left. Something looked better in a benchmark, but then the requirements drifted and now it's actually worse but no one has the time to rewrite. Something was inefficient but implemented as a stop gap, then stayed and is now too hard to replace.
So you can't explain the reasons when much of the time there aren't any.
The non-solutions are:
- document the high level principles and stick to them. Maybe you value speed of deployment, or stability, or control over codebase. Individual software choices often make sense in light of such principles.
- keep people around and be patient when explaining what happened
- write wiki pages, without that much effort at being systematic and up to date. Yes, they will drift out of sync, but they will provide breadcrumbs to follow.
I can't say ADRs work that great, in my experience, but the flaw was more connecting them to other architectural stuff to make them actually discoverable and drawing the boundaries in a logical way (what goes into an ADR and what goes into a living design doc?).
"Not maintained" seems kinda weird to me, because at least as I see an ADR, it's like a point in time decision right? "In this situation, we looked at these options, and chose this for these reasons". You don't go back and update it. If you're making a big change, you make a new ADR with your new reasons.
One place I worked did have an interesting idea of basically forcing (not quite) the new hires to take notes on all their onboarding questions/answers as they went and then sticking it in the company docs. It at least meant that incorrect onboarding docs got fixed quickly. Sometimes you had good reasons for stuff, sometimes the reason is "dunno, that's just what we do and it seems hard to change".
> - Why Redis over in-memory cache? - Why GraphQL for this one service but REST everywhere else? - Why that strange exception in the auth flow for enterprise users?
These are all implementation details that shouldn't actually matter. What does matter is that the properties of your system are accounted for and validated. That goes in your test suite, or type system if your language has a sufficiently advanced type system.
If replacing Redis with an in-memory cache is a problem technically, your tests/compiler should prevent you from switching to an in-memory cache. If you don't have that, that is where you need to start. Once you have those tests/types, many of the questions will also get answered. It won't necessarily answer why Redis over Valkey, but it will demonstrate with clear intent why not an in-memory cache.
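One way to read the suggestion above is to encode "why not an in-memory cache" as a contract test instead of a comment. A minimal sketch, with invented class and function names: the contract is that a value written through one client is visible through a freshly created client, which a networked store like Redis satisfies and a per-process dict cannot.

```python
# Sketch of a backend contract test. Names are invented for illustration.

class InMemoryCache:
    """Per-process cache: each instance has its own private dict."""
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

def satisfies_shared_visibility(make_client) -> bool:
    """Contract: writes via one client are visible via another client."""
    writer, reader = make_client(), make_client()
    writer.set("session:42", "alice")
    return reader.get("session:42") == "alice"
```

In a real suite this contract would run against the production backend factory; swapping in InMemoryCache makes the test fail, and that failure is the documented "why not".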
For context, my engineering team is fairly small – no guarantees this scales well for larger organizations. I capture the reasons for decisions on why code was written a particular way or why a particular architecture was decided upon in commit messages. We follow a squash-and-rebase flow for commits, so each PR is ultimately a single commit before merging. During that squash process, I'll update the commit message to sometimes be a few paragraphs long. Later, when I'm curious why we made a decision in the past, I can use git blame to navigate back to the point where I can find the answer.
ADRs, but give ownership to the team. They should sit in the most relevant repo, with a central repo called ADRs holding issue templates and a README that links off to all the repos and their ADRs. An ADR cannot be approved, and its issue closed, until all the docs are in place. Everyone can see the open ADRs in the main repo, view each issue, and comment on it. Accountability is there if an assigned issue stays open for days or weeks.
GitHub issues templates are perfect for ADR templates. All Hands for engineering is a great place to mention them and for teams to comment on the decision and outcomes.
If it’s something in the code, that’s where I use comments. It’s the only place people have a chance of seeing it. Even when I add these comments some people ask me about the code instead of reading them. This isn’t just for others, I forget as well. Something to the effect of…
# This previously used ${old-solution}, but has moved to ${new-solution} because ${reason}
Or
# This is ugly and doesn’t make sense, but ${clean-logical-way} doesn’t work due to ${reason}. If you change ${x} it will break.
Or
# This was a requirement from ${person} on ${date}. We want to remove this, but will need to wait until ${person} no longer needs it or leaves the company.

More: https://max.engineer/reasons-to-leave-comment

Much more: https://max.engineer/maintainable-code
Simple: ask "why" in a PR review, put the answer in a code comment. If there is a bigger / higher level "why", add it to git commit description. This way it's auto-maintained with code, or stays frozen at a point in time in a git commit.
* File issues in a project tracker (GitHub, Jira, Asana, etc.)
* Use the issue id at the start of every commit message for that issue
* Use a single branch per issue, whose name also starts with the issue id
* Use a single PR to merge that branch and close the issue
* Don't squash merge PRs
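The convention in the list above is easy to enforce in a commit-msg or pre-push hook. A sketch, where the "ABC-123" id pattern is an assumption to match to your own tracker:

```python
# Sketch of an issue-id convention check for commit messages and branch
# names. The pattern is an assumption; adjust it to your tracker's ids.
import re

ISSUE_ID = re.compile(r"^[A-Z][A-Z0-9]*-\d+\b")

def starts_with_issue_id(text: str) -> bool:
    """True when a commit message or branch name leads with an issue id."""
    return ISSUE_ID.match(text) is not None
```

A commit-msg hook would read the message file, call this, and exit 1 on False, so every commit stays linkable back to its issue.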
You can use git blame to get the why.
git blame gives you the change set and the commit message. Use the issue id in the commit message to get to the issue. The issue description and comments provide part of the story.
Use the issue id to track down the branch and PR. The PR comments give you the rest of the story.
Sometimes the best way to learn why a (Chesterton's) fence is blocking the road is... to remove it and see what happens!
Sorry, not really an answer to your problem. But I feel you, this is a genuinely hard problem.
Keep in mind that, pretty often, the reason something is the way it is comes down to "no real reason", "that seemed easier at the time" or "we didn't know better". At least if you don't work on critical systems.
This is called rationale and it goes in the design document. As work proceeds, it goes into tickets and meeting notes, and gets fed back into the design doc.
Did he write down everything he learned? That way the next person only needs to cover the intervening time period.
Conceivably LLMs might be good at answering questions from an unorganized mass of timestamped documents/tickets/chat logs. All the stuff that exists anyway without any extra continuous effort required to curate it - I think that's key.
Your company is missing an architect role. An architect would know why Redis over an in-memory cache and would have that pattern documented. They would definitely know why GraphQL for the one service but REST everywhere else; they would have it documented from design approval meetings.
Jokes aside, I think LLMs will enable us to handle information in a much better and smoother way. We should use them!