What counts as the “simplest” hash function depends entirely on what you are using it for and the guarantees you want it to make. An optimal permutation of an integer is different from a small string hash, which is different again from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.
The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but it is satisfied when you are dealing with, e.g., random (version 4) UUIDs, apart from their few fixed version and variant bits.
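A minimal sketch of the truncation idea, assuming the keys are random (version 4) UUIDs. The function name `bucket_for` and the bucket counts are made up for illustration:

```python
import uuid

def bucket_for(key: uuid.UUID, num_buckets: int = 1024) -> int:
    """Pick a bucket by truncating the key instead of hashing it.

    Assumes the leading bytes of the key are already uniformly random.
    That holds for version 4 UUIDs, whose fixed version and variant
    bits sit further in (bytes 6 and 8).
    """
    # Truncation: keep only the first 4 of the 16 bytes.
    prefix = int.from_bytes(key.bytes[:4], "big")
    return prefix % num_buckets

# Rough check of the spread over 16 buckets.
counts = [0] * 16
for _ in range(16000):
    counts[bucket_for(uuid.uuid4(), 16)] += 1
```

With time-based or otherwise structured UUIDs the assumption breaks, and a real hash is needed again.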
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicates (amongst other things). You don't want to compare every file against every other file. So as a first optimisation, you compare files only by some hash. But conventional hashes like SHA-256 or CRC take O(n) time in the size of the file. So you compute cheaper hashes first, even if they are weaker. Truncation, i.e. only looking at the first few bytes, is very cheap. Determining the length of a file is even cheaper.
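The staged approach can be sketched roughly as follows. This is not how rmlint is actually implemented, just an illustration of "cheapest hash first"; the function names and the 4096-byte prefix are assumptions:

```python
import hashlib
import os
from collections import defaultdict

def group_by(paths, key):
    """Group paths by key(path); keep only groups with more than one member."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [g for g in groups.values() if len(g) > 1]

def head(path, n=4096):
    """Truncation 'hash': the first n bytes of the file."""
    with open(path, "rb") as f:
        return f.read(n)

def full_digest(path):
    """Expensive O(n) hash over the whole file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def find_duplicates(paths):
    """Return groups of paths whose size, prefix and SHA-256 all match."""
    candidates = [list(paths)]
    # Cheapest check first; each stage only sees survivors of the previous one.
    for key in (os.path.getsize, head, full_digest):
        candidates = [g for group in candidates for g in group_by(group, key)]
    return candidates
```

Most files are eliminated by the length check alone, so the O(n) digest only runs on the few candidates that survive both cheap stages.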
Now, I'm no expert on the matter, but the fs deduplicators I've seen were block-based, not file-based. Those clearly cannot use the file length, as they are blissfully unaware of files (or any structure, for that matter). They use a rather expensive hash function (you really want to avoid hash collisions), but, at least some ten years ago, memory rather than processing speed was the limiting factor.
The point about hash tables using top bits instead of bottom bits is the kind of thing that feels obvious once someone says it and yet here we are. Genuine question: have you seen any real-world hash table implementations that actually do this, or is it purely "this is what we should have done 40 years ago"?
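For concreteness, one way to index by the top bits is multiplicative (Fibonacci) hashing: multiply by an odd constant and keep the high bits of the product. A sketch assuming 64-bit keys and a power-of-two table size; `bucket_top_bits` and `bucket_low_bits` are illustrative names, not any particular library's API:

```python
MASK64 = (1 << 64) - 1
# 2**64 divided by the golden ratio, as an odd 64-bit integer.
GOLDEN = 0x9E3779B97F4A7C15

def bucket_top_bits(key: int, log2_buckets: int) -> int:
    """Index with the top bits of a 64-bit multiplicative hash."""
    h = (key * GOLDEN) & MASK64
    return h >> (64 - log2_buckets)

def bucket_low_bits(key: int, log2_buckets: int) -> int:
    """Index with the bottom bits of the same hash, for comparison."""
    h = (key * GOLDEN) & MASK64
    return h & ((1 << log2_buckets) - 1)

# Keys that are multiples of 256 share their low 8 bits. Multiplication only
# carries information upward, so the low 8 bits of the product are identical
# for all of them, while the top bits still differ.
keys = [i * 256 for i in range(1000)]
low = {bucket_low_bits(k, 8) for k in keys}   # a single bucket
top = {bucket_top_bits(k, 8) for k in keys}   # spread over many buckets
```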
> Like addition
I'm perplexed by the claim that addition is cheaper than XOR, especially since addition is built on top of XOR. Am I missing anything? Is it JavaScript-specific?
While I generally like to reinvent the wheel, for hash functions I strongly recommend using a proven one. djb2, by the venerable Daniel Bernstein, satisfies all the requirements of TFA.
    def djb2(data: bytes) -> int:
        h = 5381
        for byte in data:
            # Mask assumes 32-bit wraparound, as with a C unsigned int.
            h = (h * 33 + byte) & 0xFFFFFFFF
        return h
PS: Of course, if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition.
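The shift-and-add remark follows from 33 = 32 + 1, so h * 33 equals (h << 5) + h. A small sketch to check it; the names `times_33` and `djb2_shift` are chosen here for illustration, and the 32-bit word size is an assumption:

```python
def times_33(h: int) -> int:
    """Multiply by 33 using only a shift and an addition."""
    # 33 = 32 + 1, so h * 33 == (h << 5) + h
    return (h << 5) + h

def djb2_shift(data: bytes) -> int:
    """djb2 with the multiplication spelled out as shift-and-add."""
    h = 5381
    for byte in data:
        h = (times_33(h) + byte) & 0xFFFFFFFF  # assumed 32-bit word size
    return h
```

In practice a compiler will usually make this substitution on its own when it is profitable, so writing `h * 33` is fine.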
I get why technically it is a hash function, but still, no.