I love this article as it shows how fast computers really are.
There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."
One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.
So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.
If you need scale, cache the most often accessed data in memory and you have the best of both worlds.

My winning combo is sqlite + in-memory cache.
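For illustration, a minimal read-through cache in Go might look like this; the `loader` function is a stand-in for whatever SQLite query you'd run, and eviction/TTLs are deliberately left out of the sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a minimal read-through cache: hits are served from memory,
// misses fall through to a loader (e.g. a SQLite query) and are stored.
type Cache struct {
	mu     sync.RWMutex
	data   map[string]string
	loader func(key string) (string, error)
}

func NewCache(loader func(string) (string, error)) *Cache {
	return &Cache{data: make(map[string]string), loader: loader}
}

func (c *Cache) Get(key string) (string, error) {
	c.mu.RLock()
	v, ok := c.data[key]
	c.mu.RUnlock()
	if ok {
		return v, nil
	}
	v, err := c.loader(key) // cache miss: hit the database
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.data[key] = v
	c.mu.Unlock()
	return v, nil
}

func main() {
	calls := 0
	c := NewCache(func(key string) (string, error) {
		calls++ // stands in for a SQLite lookup
		return "value-for-" + key, nil
	})
	c.Get("a")
	c.Get("a") // second read is served from memory
	fmt.Println(calls) // prints 1: the loader ran only once
}
```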
The article is fine, but I wanted to call this out.
"Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open()."
Technically not true. Applications like SQLite can use mmap to map the file into a locally addressable memory space. This lets you skip the syscalls when reading and writing. The kernel can page that data in dynamically, much faster than a userland process can.

Later in the article they go over reading the entire file into memory; again, mmap is much better at this. Would have been nice to see that approach used.
Writing your own storage is a great way to understand how databases work (if you do it efficiently, keeping indexes, correct data structures, etc.) and to come to the conclusion that if your intention wasn't just tinkering, you should've used a database from day 1.
Many eons ago I wrote a small sales web application in Perl. I couldn't install anything on the ISP's machine, so I used file-backed hashes: one for users, one for orders, another for something else.
As the years went by, I expected the client to move to something better, but he just stuck with it until he died after about 20 years; the family then took over and had everything redone (it now runs WordPress).
The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.

The very small bonus you get on small apps is hardly worth the time you spend redeveloping the wheel.
> Binary search beats SQLite... For a pure ID lookup, you're paying for machinery you're not using.
You'll likely end up quite a chump if you follow this logic.
sqlite has pretty strong durability and consistency mechanisms that their toy disk binary search doesn't have.
(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)
Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.
I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.
I dunno. Even in embedded systems every time I've started without a database I've eventually come to need something like a database, and in every case I've found myself building essentially an ad-hoc poorly managed database into the application including marshalling/unmarshalling, file management, notification, and so on because each new feature over the top of regular files was just that much easier to add versus switching to a database system.
However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.
And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data, but now you have two copies of everything, and you have to somehow make the updates atomic so that either copy is consistent at a point in time. Why not use a database?
Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification, because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run in to without them.

So my opinion has thoroughly shifted to "start with a database, and if you _really_ don't need one it'll be obvious." But you probably do.
Made the performance on my machine go from 27,700.99 r/s to 89,687.36 r/s.
I also tried making the get-user query a prepared statement, and storing the timestamp as a unix timestamp integer, but that didn't make much difference for me.
I love SQLite, I love the idea, I love having something mature and lightweight, but like the author I discovered it's overkill or dare I say insufficient for certain use cases.
I was building a client side dictionary app with search functionality and thought that using sqlite's wasm port would be the perfect solution.
I used SQLite for a couple years and it started to wear on me. The database files were bigger than they should be, they didn't compress super well, the loading time was a bit slow. Linear search wasn't very fast either. Editing SQLite files and doing joins and stuff was slow and frustrating as well. I wanted something simpler.
I didn't need atomics, I didn't need writes, so I just handrolled a set of indexes from the source tsv files and compressed those with zstd and decompressed them via wasm on every load. Decompressing and loading was now faster than straight loading via SQLite and since my module was 52kb of wasm instead of 800kb for SQLite I could load as many instances of the module as I wanted with no regrets. I use stringzilla for linear scans and it's ridiculously fast.
SQLite is great, but it's not the solution for every problem.
You need databases if you need any kind of atomicity. Doing atomic writes is extremely fragile if you are just on top of the filesystem.
This is also why many databases have persistence issues and can easily corrupt on-disk data on a crash. RocksDB on Windows was a simple example a couple of years back: it regularly had corruption issues during development.
people wildly underestimate the os page cache and modern nvme drives tbh. disk io today is basically ram speeds from 10 years ago. seeing startups spin up managed postgres + redis clusters + prisma on day 1 just to collect waitlist emails is peak feature vomit.
a jsonl file and a single go binary will literally outlive most startup runways.
also, the irony of a database gui company writing a post about how you dont actually need a database is pretty based.
This is a cool exercise, but I would hesitate to choose files over SQLite or another Dockerised relational database in production.
They are overoptimising for the simplest part of writing the application: the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches that this article has avoided talking about; perhaps because they haven't experienced them yet.

See: https://danluu.com/file-consistency/
What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs?
Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long-run? The answer is likely very few. Therefore, this is likely doubling up the work required.
There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.
I suggest every developer write a database from scratch at least once, and use it for something real. Or, even better, let somebody else use it for something real. Then you will know "why database".
Don't know if it counts, but my London cinema listings website just uses static json files that I upload every weekend. All of the searching and stuff is done client side. Although I do use sqlite to create the files locally.
Total hosting costs are £0 ($0) other than the domain name.
Separate from performance, I feel like databases are a sub-specialty that has its own cognitive load.
I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling.
I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.
I feel like someone who works for a DB company ought to mention at least some of the pitfalls of file-based backing stores (data loss due to crashes, file truncation, fsync weirdness, etc.)
I'm so old I remember working on databases that were designed to use RAW, not files. I'm betting some databases still do, but probably only for mainframe systems nowadays.
A few months back I decided to write an embedded db for my firm's internal JS framework. Learned a lot about how/why databases work the way they do. I use stuff like reading memory cached markdown files for static sites, but there are certain things that a database gives you (chief of which for me was query ergonomics—I loved MongoDB's query language but grew too frustrated with the actual runtime) that you'll miss once you move past a trivial data set.
I think a better way to ask this question is "does this application and its constraints necessitate a database? And if so, which database is the correct tool for this context?"
Not to nitpick, but it would be interesting to see profiling info of the benchmarks
Different languages and stdlib methods can often spend time doing unexpected things that make what look like apples-to-apples comparisons not quite equivalent.
I think this whole article and post is an attention / points seeking exercise. It is hard to imagine a programmer who would not know the difference between a DBMS and just a bunch of files, and when to use which.
File systems are nice if you need to do manual or transparent script-based manipulations. Like 'oh hey, I just want to duplicate this entry and hand-modify it, and put these others in an archive.' Or use your OS's access control and network sharing easily with heterogeneous tools accessing the data from multiple machines. Or if you've got a lot of large blobs that aren't going to get modified in place.
What the world needs is a hybrid - database ACID/transaction semantics with the ability to cd/mv/cp file-like objects.
SQLite did decently well but I think they should’ve done an additional benchmark with the database loaded completely into memory.
Since they’re using Go to accept requests and forwarding them to their SQLite connection, it may have been worthwhile to produce the same interface with Rust to demonstrate whether or not SQLite itself was hitting its performance limit or if Go had some hand in that.
Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.
Please …
Every few years the pendulum swings. First it was “relational databases are too rigid, just use NoSQL.” Then “NoSQL is a mess, just go back to Postgres.” Now: “do you even need a database at all, just use flat files.”
Each wave is partially right. But… each wave is about to rediscover, the hard way, exactly why the previous generation made the choices they did.
SQLite is the answer to every painful lesson learned, every scar from a long debug night the last time someone thought "a JSON file is basically a database."
Back when Ajax was a hip new thing, someone asked me: why XML? We later answered that question with JSON, of course, but his hypothetical solution was much more hilarious. I'm calling it the HTML database. He made two mockups: one where everything was a richly formatted static HTML file, and one that used a blogging platform. It had things like archives by month, each with their own URL, and a paginated front page with a chunk from each posting.
It was pretty funny and I couldn't really see anything wrong with it.
In his demo he pulled in a document with a table in it and inserted the table into a div. One version used the DOMParser, but he also made one that just chopped out a substring.
In the 80's, because of resource constraints, a standard thing to ask was "why is my data not formatted the way I need it?"
IOW when to format it properly? when it is created? When it is consumed? Both? At some other time?
If you are going to create it one time and consume it 100,000 times, HTML is a great "DB".
I sympathize with this so hard. I frequently conduct system design interviews where the problem could easily be handled on a single machine with a flat file, let alone SQLite. Only the rare candidate mentions this; mostly I get a horde of microservices and queues and massive distributed databases that are totally unnecessary.
In many cases not. E.g. for caching with python, diskcache is a good choice.
For small amounts of data, a JSON file does the job (you pointed to JSONL as an option).
But for larger collections, that should be searchable/processable, postgres is a good choice.
Memory of course, as you wrote, also seems reasonable in many cases.

Also a notable mention for JSON5, which supports comments! https://json5.org/
DB-less is underrated. You can take it even further.
One of my favourite websites ever loads its entire main content index into the client as a JSON file. It's about 5 MBs I think, so maybe 10% of the ad and autoplay video crap most "modern" websites make you download in the first few seconds. Except that this initial download actually gives you an amazing UX in return, because all the searching, filtering, pagination etc. happens entirely client-side without further network requests. I don't know any other website that is as much of a joy to just browse through. You never wait for anything until the point that you found what you want and actually want to open a detail page. Just wonderful.
I will still reach for a database 99% of the time, because I like things like SQL and transactions. However, I've recently been working on a 100% personal project to manage some private data; extracting insights, graphing trends, etc. It's not high volume data, so I decided to use just the file system, with data backed by yaml files, with some simple indexing, and I haven't run into any performance issues yet. I probably never will at my scale and volume.
In this particular case having something that was human readable, and more importantly diffable, was more valuable to me than outright performance.
Having said that, I will still gladly reach for a database with a query language and all the guarantees that comes with 99% of the time.
There's a whole thing these days about building solvers (e.g. SAT or Ising) out of exotic hardware that does compute in memory. A while back I wondered if one could leverage distributed DB logic to build solvers for massive problems, something like compute in DB.
While this is certainly cool to see, and I love seeing how fast webservers can go, the counter question "Do you even need 25,000 RPS and sub-ms latency?" comes to mind.
I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.
The index format is simple: one line per record, exactly 58 bytes: <36-char UUID>:<20-digit byte offset in data file>\n.
It would be much better to write all of this as binary data, omitting separators.
• Since it’s fixed-width and simple, inspecting the data is still pretty easy—there are tools for working with binary data of declared schema, or you could write a few-liner to convert it yourself. You don’t lose much by departing ASCII.
• You might want to complicate it a little by writing a version tag at the start of the file or outside it so you can change the format more easily (e.g. if you ever add a third column). I will admit the explicit separators do make that easier. You can also leave that for later, it probably won't hurt.
• UUID: 36 bytes → 16 bytes.
• Offset: 20 bytes (zero-padded base-ten integer) → 8 bytes.
• It removes one type of error altogether: now all bit patterns are syntactically valid.
• It’ll use less disk space, be cheaper to read, be cheaper to write, and probably take less code.
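To make the comparison concrete, here is a rough Go sketch of such a binary record: a 16-byte raw UUID plus an 8-byte big-endian offset, 24 bytes instead of 58 (the function names are mine, not the article's):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// A binary index record: 16-byte raw UUID + 8-byte big-endian offset.
// Every bit pattern is a syntactically valid record, and there are no
// separators to parse.
const recordSize = 16 + 8

func encodeRecord(uuid [16]byte, offset uint64) [recordSize]byte {
	var rec [recordSize]byte
	copy(rec[:16], uuid[:])
	binary.BigEndian.PutUint64(rec[16:], offset)
	return rec
}

func decodeRecord(rec [recordSize]byte) (uuid [16]byte, offset uint64) {
	copy(uuid[:], rec[:16])
	return uuid, binary.BigEndian.Uint64(rec[16:])
}

func main() {
	uuid := [16]byte{0xde, 0xad, 0xbe, 0xef}
	rec := encodeRecord(uuid, 123456)
	u, off := decodeRecord(rec)
	fmt.Println(bytes.Equal(u[:], uuid[:]), off, recordSize) // prints "true 123456 24"
}
```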
I also want to register alarm at the sample code given for func FindUserBinarySearch. To begin with, despite a return type of (*User, error), it always returns nil error—it swallows all I/O errors and ignores JSON decode errors. Then:
• That strings.TrimRight will only do anything if your data is corrupted. Not important when you control the writing, but worth noting that UUID string comparison is case-insensitive.
• The separate TrimLeft is superfluous: ParseInt doesn't mind leading zeroes, and it'll probably skip them faster than a separate TrimLeft call.
• All of this is begging to make data corruption difficult to debug: most corruption will now become dataOffset 0. Congratulations! You are now root.
I need a filesystem that does some database things. We got teased with that with WinFS and BeOS's BFS, but it seems the football always gets yanked away, and the mainstream of filesystems always reverts back to the APIs established in the 1980s.
I'm a big fan of using S3 as a database. A lot of apps can get a lot of mileage just doing that for a good chunk of their data; that which just needs lookup by a single field (usually ID, but doesn't have to be).
In order to ask this question it's important to understand the lifecycle of the data in question. If it is constantly being updated and requires "liveness" (updates are reflected in queries immediately), the simple answer is: yes, you need a database.
But if you have data that is static or effectively static (data that is updated occasionally or batched), then serving via custom file handling can have its place.
If the records are fixed width and sorted on the key value, then it becomes trivial to do a binary search on the mmapped file. It's about as lightweight as could be asked for.
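A sketch of that idea in Go, using `sort.Search` over a byte slice that could just as well come from an mmapped file; the record widths here are made-up example values, not anything from the article:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Fixed-width records sorted by key make lookups trivial: binary-search
// over record indices, comparing the key bytes in place.
const keyLen, recLen = 4, 12 // 4-byte key + 8-byte payload (example widths)

func findRecord(data []byte, key []byte) ([]byte, bool) {
	n := len(data) / recLen
	// Find the first record whose key is >= the target key.
	i := sort.Search(n, func(i int) bool {
		return bytes.Compare(data[i*recLen:i*recLen+keyLen], key) >= 0
	})
	if i < n && bytes.Equal(data[i*recLen:i*recLen+keyLen], key) {
		return data[i*recLen : (i+1)*recLen], true
	}
	return nil, false
}

func main() {
	// Three sorted records: keys "aaaa", "bbbb", "cccc", each with an
	// 8-byte payload.
	data := []byte("aaaapayload1bbbbpayload2ccccpayload3")
	rec, ok := findRecord(data, []byte("bbbb"))
	fmt.Println(ok, string(rec)) // prints "true bbbbpayload2"
}
```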
Sure. Go ahead and use JSONL files and implement every feature of SQL querying. Congrats, you just reinvented a database while trying to prove you don't need a database.
so the article benchmarks speed, but I don't really think most people are making the choice based on speed. Robustness is really the driving force (well, at least for me). So for CRUD operations I have way more confidence with something like SQLite than with files. Files are workable in many situations, but it's much easier if you default to something like SQLite and use files only when you think there is a specific advantage.
Not to destroy the article's author, and to appreciate his effort: he proves something that might be useful in an extreme case of optimization, with a limited amount of data and NO NEED to update/write the files. Just a read-only cache.
If you ever need to update a single byte in your data, please USE A PROPER DATABASE. Databases do a lot of fancy things to ensure you are not going to corrupt/break your data on disk, among other safety measures.
Let's put it this way. I always end up needing the functionality and ACID guarantees of a database. I always wish I had a database. But sometimes I'm forced to use the project's legacy data stores (often flat-file data lakes) and watch every wheel get reinvented as we struggle to glue consistency, transactions, a bespoke query language, etc. onto an unwilling pile of unstructured data.
I remember reading a story from Robert C. Martin, if I recall correctly, about writing an application and trying to decide which DB to use. In the end, they put the DB access paths behind an abstraction and decided that they'd just use the file system to start with, and easily switch it out later. In the end, they shipped, and never did need to use a real DB.
I'd argue for using LevelDB or similar if I just wanted to store arbitrary data based on a single indexable value like TFA. That said, I'd probably just default to SQLite myself since the access, backup, restore patterns are relatively well known and that you can port/grow your access via service layers that include Turso or Cloudflare D1, etc.