SQLite in Production: Lessons from Running a Store on a Single File (ultrathink.art)

by thunderbong 129 comments 209 points

[−] yokuze 41d ago

> The technical fix was embarrassingly simple: stop pushing to main every ten minutes.

Wait, you push straight to main?

> We added a rule — batch related changes, avoid rapid-fire pushes. It's in our CLAUDE.md (the governance file that all our AI agents follow):

> Avoid rapid-fire pushes to main — 11 pushes in 2h caused overlapping Kamal deploys with concurrent SQLite access.

Wait, you let _Claude_ push your e-commerce code straight to main which immediately results in a production deploy?

[−] chasil 38d ago
This is the actual problem:

"Kamal runs blue-green deploys — it starts a new container, health-checks it, then stops the old one. During the switchover, both containers are running. Both mount ultrathink_storage. Both have the SQLite files open."

WAL mode requires shared access to System V IPC mapped memory. This is unlikely to work across containers.

In case anybody needs a refresher:

https://en.wikipedia.org/wiki/Shared_memory

https://en.wikipedia.org/wiki/CB_UNIX

https://www.ibm.com/docs/en/aix/7.1.0?topic=operations-syste...

[−] simonw 38d ago
Thanks for this, the anecdote with the lost data was very concerning to me.

I think you're exactly right about the WAL shared memory not crossing the container boundary. EDIT: It looks like WAL works fine across Docker boundaries, see https://news.ycombinator.com/item?id=47637353#47677163

I don't know much about Kamal but I'd look into ways of "pausing" traffic during a deploy - the trick where a proxy pretends that a request is taking another second to finish when it's actually held in the proxy while the two containers switch over.

From https://kamal-deploy.org/docs/upgrading/proxy-changes/ it looks like Kamal 2's new proxy doesn't have this yet, they list "Pausing requests" as "coming soon".

[−] hedora 38d ago
Pausing requests and then momentarily running two SQLite instances probably won't prevent corruption. It might make it less likely and harder to catch in testing.

The easiest approach is to kill sqlite, then start the new one. I’d use a unix lockfile as a last-resort mechanism (assuming the container environment doesn’t somehow break those).
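A minimal sketch of the lockfile idea (the path and function name here are illustrative, and it assumes the shared bind mount is a local filesystem where POSIX advisory locks work; flock() is unreliable over NFS):

```python
import fcntl

def acquire_exclusive_lock(path):
    """Block until this process holds the only flock() on `path`.

    flock() locks are released automatically when the holding process
    exits, so a crashed old container can't wedge the new one forever.
    """
    fd = open(path, "w")
    fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the old container lets go
    return fd  # keep this handle alive for the life of the process

# On container boot, before opening SQLite:
#   lock = acquire_exclusive_lock("/data/app.lock")
#   db = sqlite3.connect("/data/app.db")
```

Because both containers mount the same directory, the new one simply waits on the lock until the old one is killed, serializing access instead of overlapping it.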

[−] Retr0id 38d ago

> I think you're exactly right about the WAL shared memory not crossing the container boundary.

I don't, fwiw (so long as all containers are bind-mounting the same underlying fs).

[−] chasil 38d ago
You might consider taking the database(s) out of WAL mode during a migration.

That would eliminate the need for shared memory.
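A sketch of what that could look like (the PRAGMAs are standard SQLite; the wrapper functions and paths are mine). Switching to `journal_mode=DELETE` removes the `-wal`/`-shm` files entirely, so the briefly overlapping containers coordinate through ordinary file locks instead of a shared memory mapping:

```python
import sqlite3

def leave_wal_for_deploy(db_path):
    """Drop back to the rollback journal so no shared -shm mapping is needed."""
    conn = sqlite3.connect(db_path)
    # This only succeeds once no other connection still has the WAL open.
    mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
    conn.close()
    return mode  # "delete" on success

def restore_wal(db_path):
    """Re-enable WAL once the old container is gone."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.close()
    return mode  # "wal"
```

The mode change is persistent (it's stored in the database header), so it survives the old container shutting down and the new one starting up.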

[−] gcr 38d ago
The SQLite documentation says in strong terms not to do this. https://sqlite.org/howtocorrupt.html#_filesystems_with_broke...

See more: https://sqlite.org/wal.html#concurrency

[−] PunchyHamster 38d ago

> WAL mode requires shared access to System V IPC mapped memory.

Incorrect. It requires mmap() access to an ordinary file:

"The wal-index is implemented using an ordinary file that is mmapped for robustness. Early (pre-release) implementations of WAL mode stored the wal-index in volatile shared-memory, such as files created in /dev/shm on Linux or /tmp on other unix systems. The problem with that approach is that processes with a different root directory (changed via chroot) will see different files and hence use different shared memory areas, leading to database corruption."

> This is unlikely to work across containers.

I'd imagine the SQLite code would fail outright if that were the case. In the case of k8s, at least, mounting the same storage into two containers in most configurations causes Kubernetes to co-locate both pods on the same node, so it should be fine.

It is far more likely they just fucked up the code and lost data that way...
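The behavior quoted from the docs is easy to confirm locally. A throwaway sketch (file names are just illustrative) showing the wal-index appear as an ordinary `-shm` file next to the database, rather than any System V segment:

```python
import os
import sqlite3
import tempfile

# Create a throwaway database and put it in WAL mode.
db_dir = tempfile.mkdtemp()
db_path = os.path.join(db_dir, "demo.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t(x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# The wal-index lives in demo.db-shm, a regular file that SQLite mmaps.
# Any process that mounts the same directory maps the same file.
print(sorted(os.listdir(db_dir)))
# → ['demo.db', 'demo.db-shm', 'demo.db-wal']
```

Which is why bind-mounting the same host directory into two containers gives both the same mapping, whereas chroot-style differing roots (the failure mode the docs describe) would not.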

[−] ncruces 38d ago
This thread in the SQLite forum should be instructive: https://sqlite.org/forum/forumpost/90d6805c7cec827f
[−] Retr0id 38d ago

> This is unlikely to work across containers.

Why not?

[−] voidfunc 38d ago
Ooh new historical Unix variant I had never heard of.. neat!
[−] crabmusket 41d ago
Patient: doctor, my app loses data when I deploy twice during a 10 minute interval!

Doctor: simply do not do that

[−] littlestymaar 38d ago

> Wait, you let _Claude_ push your e-commerce code straight to main which immediately results in a production deploy?

Yikes. Thank you, I'm not going to read “Lessons learned” from someone this careless.

[−] xnorswap 38d ago
I'm fairly confident they let it write the blog post too.
[−] bombcar 38d ago
Hey, Apple still takes their store down during product launches!
[−] tensegrist 38d ago
i hate to be so blunt but look around the site and then tell me you're surprised
[−] burnt-resistor 38d ago
I suspect they don't wear helmets or seatbelts either. Sigh. The "I'm so proud and ignorant of unnecessarily risky behaviors" meme is tiring.

The Meta dev model, where reviewed diffs merge into main (rebase-style) after automated tests run, is pretty good.

Also, staging environments plus canary-style gradual/exponential prod deployment and rollback approaches help derisk change too.

Finally, have real, tested backups and restore processes (not just replicated copies) and the ability to roll back.