Jennifer Aniston and Friends Cost Us 377GB and Broke Ext4 Hardlinks (blog.discourse.org)

by speckx 31 comments 47 points

[−] replooda 35d ago
In short: a deduplication effort frustrated by ext4's per-inode hardlink limit, and a solution that is compatible with different file systems.
[−] UltraSane 35d ago
The real problem is they aren't deduplicating at the filesystem level like sane people do.
[−] otterley 35d ago
From the article:

> [W]e shipped an optimization. Detect duplicate files by their content hash, use hardlinks instead of downloading each copy.
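
Not the article's actual code, but the idea reads roughly like this. A minimal sketch; SHA-256 as the content hash and the function names are my assumptions:

    # Illustrative only -- not Discourse's actual implementation.
    import hashlib
    import os

    def file_sha256(path):
        # Stream the file through SHA-256 so large uploads don't need to fit in RAM.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def dedupe_with_hardlinks(paths):
        seen = {}  # content hash -> path of the first copy we kept
        for path in paths:
            digest = file_sha256(path)
            canonical = seen.setdefault(digest, path)
            if canonical != path:
                os.remove(path)           # drop the duplicate bytes...
                os.link(canonical, path)  # ...and hardlink back to the original
        return seen

The catch: every os.link() bumps the link count on the canonical file's inode, and ext4 caps that at 65,000 links per inode, which is the limit the article eventually hit.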

[−] UltraSane 35d ago
I meant TRANSPARENT filesystem-level dedupe. They are doing it at the application level. Filesystem-level dedupe makes it impossible to store the same file more than once and doesn't consume hardlinks for the references. It is really awesome.
[−] mmh0000 35d ago
Filesystem/file level dedupe is for suckers. =D

If the greatest filesystem in the world were a living being, it would be our God. That filesystem, of course, is ZFS.

Handles this correctly:

https://www.truenas.com/docs/references/zfsdeduplication/

[−] UltraSane 35d ago
I was talking about block level dedupe.
[−] mmh0000 35d ago
I thought you might be.

I just wanted to mention ZFS.

Have I mentioned how great ZFS is yet?

[−] otterley 35d ago
ZFS is great! However, it's too complicated for most Linux server use cases (especially with just one block device attached); it's not the default (root filesystem); and it's not supported for at least one major enterprise Linux distro family.
[−] vmilner 34d ago
[−] burnt-resistor 35d ago
File system dedupe is expensive because it requires another hash calculation that cannot be shared with application-level hashing, is a relatively rare OS/filesystem feature, doesn't play nice with backups (the deduplicated files get duplicated again in the backup), and doesn't scale across boxes.

A simpler solution is application-level dedupe that doesn't require fs-specific features. Simple scales and wins. And plays nice with backups.

Hash = sha256 of the file, and the absolute filename = {{aa}}/{{bb}}/{{cc}}/{{d}}, where:

aa = the 2 most significant hex digits of the hash
bb = the next 2 hex digits
cc = the 2 hex digits after that
d = the remaining hex digits
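
A minimal sketch of that layout; STORE_ROOT and the function names are made up for illustration, while the hashing and path split follow the description above:

    import hashlib
    import os
    import shutil

    # Illustrative only; point this at whatever directory the application owns.
    STORE_ROOT = "/var/lib/uploads"

    def content_path(digest):
        # {{aa}}/{{bb}}/{{cc}}/{{d}}, as described above
        aa, bb, cc, d = digest[0:2], digest[2:4], digest[4:6], digest[6:]
        return os.path.join(STORE_ROOT, aa, bb, cc, d)

    def store(path):
        # Hash the file, then move it to its content-addressed location.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        dest = content_path(h.hexdigest())
        if os.path.exists(dest):
            os.remove(path)  # identical content already stored; drop the copy
        else:
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.move(path, dest)
        return dest

Each of the three directory levels fans out to at most 256 entries, and identical content always lands on the same path, so duplicates collapse without consuming a single hardlink.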

[−] UltraSane 35d ago
All good backup software should be able to do deduped incremental backups at the block level. I'm used to Veeam and Commvault.
[−] burnt-resistor 34d ago
That costs even more unreusable time and effort. It's simpler to dedupe at the application level rather than shift the burden onto N things. I guess you don't understand or appreciate simplicity.
[−] UltraSane 34d ago
This article shows it really isn't that simple and is easy to mess up. Who cares if your storage and backup software both dedupe?
[−] otterley 35d ago
For ZFS, at least, zfs send is the backup solution. And it performs incremental backups with the -i argument.
[−] UltraSane 34d ago
zfs send is really awesome when combined with dedupe and incremental sends.
[−] dj_rock 35d ago
We were on a break...of your filesystem!
[−] uticus 35d ago
And I thought this was a reference to a Win95 problem https://www.slashgear.com/1414245/jennifer-aniston-matthew-p...
[−] niobe 35d ago
Completely Claude-written, FWIW. I recognise the style.
[−] trixn86 35d ago
The Problem. The fix. The Limit.

Is it just me, or is everybody else just as fed up with the same AI tropes every time?

I've reached the point where I just close the tab the moment I read a heading like "The problem". At least use tropes.fyi, please.

[−] otterley 35d ago
Another reason to use XFS -- it doesn't have ext4's 65,000-per-inode hard link limit (XFS allows on the order of 2^31 links per inode).

(Some say ZFS as well, but it's not nearly as easy to use, and its license is still not GPL-friendly.)

[−] bravetraveler 35d ago
As is always the case, short vs long term... but I think I'd put effort into migrating to a filesystem that is aware of duplication instead of trying to recreate one with links [while retaining duplicates, just fewer].

Effectiveness is debatable; this approach still has duplication. An insignificant amount, I'll admit. The filesystem handling this at the block level is probably less problematic/prone to rework and more efficient.

edit: Eh, ignore me. I see this is preparing for [whatever filesystem hosts chose] thanks to 'ameliaquining' below. Originally thought this was all Discourse-proper, processing data they had.

[−] UltraSane 35d ago
This makes them look rather incompetent. Storing the exact same file 246,173 times is just stupid. Dedupe at the filesystem level and make your life easier.