5NF and Database Design (kb.databasedesignbook.com)

by petalmind 83 comments 203 points

[−] sgarland 30d ago
NOTE: this is critiquing the author's 4NF definition (from a link in TFA), not TFA itself.

> If you read any text that defines 4NF, the first new term you hear is “multivalued dependency”. [Kent 1983] also uses “multivalued facts”. I may be dumb but I only very recently realized that it means just “a list of unique values”. Here it would be even better to say that it’s a list of unique IDs.

This is an inaccurate characterization, and the rest of the post only makes sense when viewed through this strawman. The reason 4NF is explained in the "weird, roundabout way" is because it demonstrates [one of] the precise problem[s] the normal form sets out to solve: a combinatorial explosion of rows.

If you have a table:

    CREATE TABLE Product(
      product_id INT NOT NULL,
      supplier_id INT NOT NULL,
      warehouse_id INT NOT NULL
    );
If you only ever add an additional supplier or an additional warehouse for a given product, it's only adding one row. But if you add both to the same product, you now have 4 rows for a single product; if you add 5 suppliers and 3 warehouses to the same product, you now have 15 rows for a single product, etc. This fact might be lost on someone if they're creating a table with future expansion in mind without thinking it through, because they'd never hit the cross-product, so the design would seem reasonable.
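A minimal sketch of that blow-up and the standard 4NF decomposition, using an in-memory SQLite database (the 5-supplier / 3-warehouse numbers are from the comment above; the `ProductSupplier` / `ProductWarehouse` table names are illustrative, not from the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Unnormalized: one row per (product, supplier, warehouse) pairing.
cur.execute("CREATE TABLE Product(product_id INT, supplier_id INT, warehouse_id INT)")

# To keep the table consistent, every supplier of product 1 must be paired
# with every warehouse: 5 suppliers x 3 warehouses = 15 rows.
suppliers, warehouses = range(1, 6), range(1, 4)
cur.executemany(
    "INSERT INTO Product VALUES (1, ?, ?)",
    [(s, w) for s in suppliers for w in warehouses],
)
rows_unnormalized = cur.execute("SELECT COUNT(*) FROM Product").fetchone()[0]

# 4NF decomposition: the two independent multivalued facts get their own tables,
# so rows grow additively (5 + 3 = 8) instead of multiplicatively (5 * 3 = 15).
cur.execute("CREATE TABLE ProductSupplier(product_id INT, supplier_id INT)")
cur.execute("CREATE TABLE ProductWarehouse(product_id INT, warehouse_id INT)")
cur.executemany("INSERT INTO ProductSupplier VALUES (1, ?)", [(s,) for s in suppliers])
cur.executemany("INSERT INTO ProductWarehouse VALUES (1, ?)", [(w,) for w in warehouses])
rows_normalized = (
    cur.execute("SELECT COUNT(*) FROM ProductSupplier").fetchone()[0]
    + cur.execute("SELECT COUNT(*) FROM ProductWarehouse").fetchone()[0]
)

print(rows_unnormalized, rows_normalized)  # 15 8
```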

The conclusion reached (modulo treating an array as an atomic value) is in fact in 4NF, but it's not clear why it's needed if you redefine multivalued dependency to mean a set.

[−] petalmind 29d ago
I think I understand this "Cartesian product" reasoning behind 4NF/5NF, I just find it irrelevant I guess.

Cartesian product is explained in Kent: case (3) in https://www.bkent.net/Doc/simple5.htm#label4.1 ("A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages")

I do not explicitly mention this Cartesian product even though it is present in both posts ("sports / languages" in 4NF, and "brands / flavours" in 5NF).

> it demonstrates [one of] the precise problem[s] the normal form sets out to solve: a combinatorial explosion of rows.

I just don't understand this wording of "a combinatorial explosion of rows" — what's so dramatic here? I don't need four iterations of algebra-dense papers to explain this concept; frankly, I think it's pretty simple.

And my implicit argument is, I guess, exactly that you could design tables that handle both problems without invoking 4NF and 5NF — people are doing that all the time.

[−] jerf 30d ago
In a roundabout way this article captures well why I don't really like thinking in terms of "normal forms", especially as a numbered list like that. The key insights are really 1. Avoid redundancy and 2. This may involve synthesizing relationships that don't immediately obviously exist from a human perspective. Both of those can be expanded on at quite some length, but I never found much value in the supposedly-blessed intermediate points represented by the nominally numbered "forms". I don't find them useful either for thinking about the problem or for communicating about it.

Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM) doesn't mean it is actually useful... sometimes it just means that it made it easy to write multiple choice test questions. (e.g., "What does Layer 2 of the OSI network model represent? A: ... B: ... C: ... D: ..." to which the most appropriate real-world answer is "Who cares?")

[−] da_chicken 30d ago
I still see value in the numbering.

Breaking 1NF is essentially always incorrect. You're fundamentally limiting your system and making it so that you will struggle to perform certain queries. Only break 1NF when you're absolutely 100% certain that nobody anywhere will ever need to do anything even slightly complex with the data you're looking at. And then, probably still apply 1NF anyway. Everyone who ever has to use your system is going to hate you when they find this table, because you didn't think of the situation that they're interested in. "Why does this query use 12 CTEs and random functions I've never heard of and take 5 minutes to return 20,000 rows?" "You broke 1NF."
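A small sketch of that pain, assuming a hypothetical `orders` table that crams item IDs into one comma-separated column (all table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Breaking 1NF: a list of item IDs crammed into one text column.
cur.execute("CREATE TABLE orders(order_id INT, item_ids TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "10,20,30"), (2, "20,40"), (3, "5")])

# "Which orders contain item 20?" now needs string surgery instead of a
# plain equality predicate. A naive LIKE '%20%' would also match 120 or 201,
# so we have to pad with delimiters and match ',20,' explicitly.
flat = cur.execute(
    "SELECT order_id FROM orders "
    "WHERE ',' || item_ids || ',' LIKE '%,20,%'"
).fetchall()

# The 1NF version makes the same question a one-line, indexable query.
cur.execute("CREATE TABLE order_items(order_id INT, item_id INT)")
for oid, ids in cur.execute("SELECT order_id, item_ids FROM orders").fetchall():
    cur.executemany("INSERT INTO order_items VALUES (?, ?)",
                    [(oid, int(i)) for i in ids.split(",")])
normalized = cur.execute(
    "SELECT order_id FROM order_items WHERE item_id = 20"
).fetchall()

print(sorted(flat), sorted(normalized))  # both find orders 1 and 2
```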

2NF is usually incorrect to break. Like it's going to be pretty obnoxious to renormalize your data using query logic, but it won't come up nearly as frequently. If it's really never going to come up that often in practical terms, then okay.

3NF and BCNF are nice to maintain, but the number of circumstances where they're just not practical or necessary starts to feel pretty common. Further, the complexity of the query to undo the denormalization will not be as obnoxious as it is for 1NF or 2NF. But if you can do it, you probably should normalize to here.
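A quick sketch of the kind of 3NF violation being traded away here: a hypothetical `employee` table where `dept_name` depends on `dept_id` rather than on the key, and the update anomaly that invites (names are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# 3NF violation: dept_name is transitively dependent on the key
# (emp_id -> dept_id -> dept_name), so it's duplicated per employee.
cur.execute("CREATE TABLE employee(emp_id INT, dept_id INT, dept_name TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(1, 10, "Sales"), (2, 10, "Sales"), (3, 20, "Support")])

# Update anomaly: renaming the department in one row but not the other
# leaves the database disagreeing with itself about dept 10's name.
cur.execute("UPDATE employee SET dept_name = 'Field Sales' WHERE emp_id = 1")
names_for_dept_10 = {r[0] for r in cur.execute(
    "SELECT dept_name FROM employee WHERE dept_id = 10")}

print(names_for_dept_10)  # two conflicting names for the same department
```

The 3NF fix is the obvious one (a `department(dept_id, dept_name)` table plus a foreign key), which makes the inconsistent state unrepresentable.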

4NF and higher continue along the same lines, but increasingly get into what feel like pretty arbitrary requirements, or situations where the cost you're paying in indexes starts to become higher than the relational-algebra benefits. Your database's disk-usage-by-table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.

[−] petalmind 30d ago

> Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM)

One problem is that normal forms are underspecified even by the academy.

E.g., Millist W. Vincent "A corrected 5NF definition for relational database design" (1997) (!) shows that the traditional definition of 5NF was deficient. 5NF was introduced in 1979 (I was one year old then).

2NF and 3NF should basically be merged into BCNF, if I understand correctly, and treated like a general case (as per Darwen).

Also, the numeric sequence is not very useful because there are at least four non-numeric forms (https://andreipall.github.io/sql/database-normalization/).

Also, personally I think that 6NF should be foundational, but that's a separate matter.

[−] sgarland 30d ago
What always frustrates me is that when people on here discuss deeply technical and/or meta-aspects of programming (e.g. type theory), it's taken at face value, but the same is not true of databases. They are generally treated as a dumb data store that you can throw anything into, and when someone explains why that's a bad idea, or why an academic concept like normal form is still applicable, it's met with criticism.

Even when it's purely performance-related, it usually gets a shrug, and "it's good enough." Cool, you're wrecking the B+tree, maybe don't do that. It's as if I said, "I'm using an array to store these millions of items that I later need to de-duplicate," and when someone suggests maybe using a set, I dismiss it.

[−] thegdsks 30d ago
Agreed. In practice I just ask "am I storing the same fact in two places?" & fix it if yes. Never once sat down and thought "let me check if this is in 4NF specifically."
[−] awesome_dude 30d ago
The levels do the most important thing in computer science: they give us discrete, meaningful levels to talk/argue about at the watercooler.
[−] wolttam 30d ago
Why shouldn’t we care about layer 2? You can do really fun and interesting things at the MAC layer.
[−] cremer 30d ago
The numbered forms are most useful as a teaching device, not an engineering specification. Once you have internalized 2NF and 3NF violations through a few painful bugs, you start spotting partial and transitive dependencies by feel rather than by running through definitions. The forms gave you the vocabulary. The bugs gave you the instinct.
[−] minkeymaniac 30d ago
Normalize till it hurts, then denormalize till it works!
[−] estetlinus 31d ago
The lost art of normalizing databases. "Why is the ARR so high on client X? Oh, we're counting it 11 times lol".

I would maybe throw in date as a key too. Bad idea?

[−] reval 30d ago
I haven't finished reading this, but I am commenting because of the form. Lead with the conclusions, table of contents, and then sources? This is someone who is confident in what they write. I wish more writing trusted the audience to decide whether the writing is important instead of stringing the audience along. Keep up the good work.
[−] Quarrelsome 30d ago
Especially loved the linked article that was dressing down the formal definitions of 4NF.
[−] iFire 30d ago
https://en.wikipedia.org/wiki/Essential_tuple_normal_form is cool!

Since I had bad memory, I asked the ai to make me a mnemonic:

* Every

* Table

* Needs

* Full-keys (in its joins)

[−] tadfisher 31d ago
I love reading about the normal forms, because it makes me sound like I know what I'm talking about in the conversation where the backend folks tell me, "if we normalized that data then the database would go down". This is usually followed by arguments over UUID versions for some reason.
[−] ibrahimhossain 30d ago
The article makes a good point about when 5NF becomes impractical. In my experience, stopping at BCNF or 4NF often strikes a better balance unless you have very clear join dependencies. How do others decide where to stop normalizing in real-world apps?
[−] bvrmn 30d ago
For me, NF>3 seems like an implicit encoding of the underlying data logic. These forms impose additional restrictions on the data (usually contrived and artificial ones that break quickly in real life) that aren't directly expressed as data tuples. Because of that they're hard to explain, and the natural reaction is: "why don't you just store the data?"
[−] akdev1l 30d ago
My brain has been blunted too far due to dynamodb and NoSQL storage usage and now I can’t even normalize anymore
[−] blueybingo 30d ago
the missing piece in most normalization discussions is the OLAP vs OLTP split. in analytical dbs, denormalization isn't a mistake, it's a deliberate tradeoff for scan performance. teaching normal forms without that context sets people up to make the wrong calls when they hit a warehouse workload
[−] artyom 30d ago
Color me impressed. Even being very well versed in database design myself, this is just pragmatic and straight to the point, the way I'd have liked it back in the day.

I think the main problem with how the formal definitions of 4NF and 5NF are taught is that common sense (which is mostly "sufficient" to understand 1NF-3NF) starts to slip away, and you start needing the mathematical background that E. F. Codd (and others) had. And trying to avoid that is how those weird examples came up.

[−] DeathArrow 31d ago
There are use cases where it's better not to normalize the data.
[−] carlyai 30d ago
love this
[−] arh5451 30d ago
i like it but i find the writing style difficult to read.
[−] umutnaber 30d ago
Nice work, well done!
[−] mergisi 29d ago
[dead]