A neat trick I was told is to always have ballast files on your systems. Just a few GiB of zeros that you can delete in cases like this. This won't fix the problem, but will buy you time and free space for stuff like lock files so you can get a working system.
They fill up their mobile apps with junk data just to make the APK/IPA bigger. So if they need to push an urgent update, they won't have users who can't update because their phones are full to the brim.
I know two Italian banks that do it, Unicredit and Intesa. The latter made the news when a user found out that one of the filler files was a burp recording [1].
[1] https://www.ilfattoquotidiano.it/2024/12/20/intesa-san-paolo... (in Italian)
Whoever gave them that idea was doing a bad deed.
And you can tell by the fact that the filler data is called "burp.mp3" and things like that.
Doesn't this create an arms-race situation where every 'critical' app claims more disk space than necessary, just in case, and accelerates the issue?
Kinda sorta, but there's a limit: users will typically install X apps, and apps of Y size need Z extra space to update; user content fills up the rest. I would imagine a typical 256 GB phone is probably over this limit, and people who take lots of videos/photos just need to clean up their phones a little more often.
But you still need a bunch of extra space to download and unpack the new version, and there are so many apps that need to share space, and a banking app should only need about 0.1% of a phone's storage...
Better fill those files with random bytes, to ensure the filesystem doesn’t apply some “I don’t actually have to store all-zero blocks” sparse-file optimization. To my knowledge no non-compressing file system currently does this, but who knows about the future.
If you add conv=sparse to the dd command with a smaller block size, it will sparsify what you copy too; use the wrong cp flags and sparse files will explode back to their full size.
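In that spirit, a minimal sketch of a random-byte ballast file (the path and size here are placeholders, not from the article):
    # 2 GiB of incompressible, non-sparsifiable ballast
    dd if=/dev/urandom of=/var/ballast bs=1M count=2048 status=progress
    # emergency release:
    rm /var/ballast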
That's a much harder problem to deal with than the filesystem layers, because the stat size will usually look smaller.
Shit like that just wastes space that the SSD could use for wear levelling...
A good way to do this is to create a swap file, both because then you can use it as a swap file until you need to delete it and because swap files are required to not be sparse.
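The usual recipe for that, sketched with a placeholder size (dd rather than fallocate, since swap files must not be sparse):
    dd if=/dev/zero of=/swapfile bs=1M count=2048   # allocate real blocks, never sparse
    chmod 600 /swapfile                             # mkswap warns about loose permissions
    mkswap /swapfile
    swapon /swapfile
    # emergency release: swapoff /swapfile && rm /swapfile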
Similarly, I always leave some space unallocated in LVM volume groups. It means that I can temporarily expand a volume easily if needed.
It also serves to leave some space unused to help out the wear-levelling on the SSDs in the RAID array that is the PV [1] for LVM. I'm not 100% sure this is needed any more [2], but I've not looked into that sufficiently, so until I do I'll keep the habit.
--------
[1] if there are multiple PVs, from different drives/arrays, in the VG, then you might need to manually skip a bit on each one, because LVM will naturally fill one before using the next. Just allocate a small LV specially on each and don't use it. You can remove one/all of them and add the extents to the full LV if/when needed. Giving it a useful name also reminds you why that bit of space is carved out.
[2] drives under-allocate by default IIRC
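As for the temporary expansion mentioned above, it's a one-liner once the VG has free extents; a sketch with made-up VG/LV names:
    vgs vg0                       # shows the free extents in the volume group
    lvextend -r -L +5G vg0/data   # grow the LV and its filesystem in one step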
> A neat trick I was told is to always have ballast files on your systems.
ZFS has a "reservation" mechanism that's handy:
> The minimum amount of space guaranteed to a dataset, not including its descendants. When the amount of space used is below this value, the dataset is treated as if it were taking up the amount of space specified by refreservation. The refreservation reservation is accounted for in the parent datasets' space used, and counts against the parent datasets' quotas and reservations.
* https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...
Quotas prevent users/groups/directories (ZFS datasets) from using too much space, but reservations ensure that particular areas always have a minimum amount set aside for them.
Would it be more pragmatic to allocate a swap file instead? Something that provides a theoretical benefit in the short term vs a static reservation.
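If you do go the reservation route rather than swap, it's a single property on a dataset; a sketch with a made-up pool name:
    zfs create tank/ballast
    zfs set refreservation=2G tank/ballast
    # emergency release:
    zfs set refreservation=none tank/ballast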
I always called it a “bit-mass”. Like a thermal mass used in freezers in places where the power is not very stable.
I knew I didn't invent the concept, as there are so many systems that cannot recover if the disk is totally full. (A write may be required in many systems in order to execute the very command that would remove things gracefully.)
The latest thing I found with this issue is Unreal Engine's Horde build system. It's so tightly coupled with caches, object files, and database references that a manual clean-up is extremely difficult and likely to create an unstable system. But you can configure it to keep fewer build artefacts around, and then it will clear itself out gracefully; it just needs to be able to write to the disk to do it.
Now that I think about it, I don’t do this for inodes, but you can run out of those too and end up in a weird “out of disk” situation despite having lots of usable capacity left.
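df has a separate mode for exactly this, so it's worth checking both when a box claims it's out of space:
    df -h /   # block usage
    df -i /   # inode usage; IUse% at 100% with blocks free means inode exhaustion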
This saved us a couple of times, at least until I had time to add monitoring to their old system to track disk usage. It was also helpful to use a tool called ncdu, which helps you visualize where most disk space is being used so you can track down the problem.
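For anyone who hasn't used it, the invocation is trivial (-x keeps it from wandering across mount points):
    ncdu -x /   # browse with the arrow keys; d deletes the selected item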
We have a script that slowly expands a volume as demand grows, up to a limit. That way we don't have to think about things like "does the logs partition need to be 1 or 10 GB": it expands up to the sane limit, and if it hits that we get a disk-usage alert before it fills up, so we can either see what's going on (an app shat in the logs) or spot the one app in ten that needs some special tuning there.
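Not their script, but the shape of the idea; a sketch with made-up names, thresholds, and cap:
    #!/bin/sh
    # grow vg0/logs by 1G when /var/log passes 80% full, capped at 10G
    USE=$(df --output=pcent /var/log | tail -1 | tr -dc '0-9')
    SIZE=$(lvs --noheadings --units g -o lv_size vg0/logs | tr -dc '0-9.')
    if [ "$USE" -gt 80 ] && [ "${SIZE%.*}" -lt 10 ]; then
        lvextend -r -L +1G vg0/logs
    fi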
Similar to the old game development trick of hiding some memory away and then freeing it up near the end of development when the budget starts getting tight.
> A neat trick I was told is to always have sleep statements in your code. Just a few sleep statements that you can delete in cases like this. This won't fix the problem, but will buy you time and free up latency for stuff like slow algorithms so you can get faster code.
FTFY ;)
Putting limits on folders where data may be added (with partitions or project quotas) is a proactive way to keep a misbehaving process from filling the whole disk. Hitting that partition or quota may still cause some problems, depending on the applications writing there, but the impact is likely to be smaller and easier to fix than running out of space for everything.
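On XFS, for example, project quotas can cap a single directory. A sketch assuming the filesystem is mounted with prjquota (the project id, name, and paths are made up):
    # map project id 42 to the directory, and give it a name
    echo "42:/var/log" >> /etc/projects
    echo "logs:42" >> /etc/projid
    # initialize the project and cap it at 2 GiB
    xfs_quota -x -c 'project -s logs' /
    xfs_quota -x -c 'limit -p bhard=2g logs' /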
I've run into that "process still has deleted files open" situation a few times. df shows the disk full, but du can't account for all of it; that's your clue to run lsof and look for open files marked "deleted".
Even more confusing can be cases where a file is opened, then deleted or renamed without being closed, and then a different file is created under the original path. To quote the man page, "lsof reports only the path by which the file was opened, not its possibly different final path."
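lsof can filter for those directly, and /proc gives you a way to reclaim the space without killing the process (the pid/fd numbers below are placeholders; take them from the lsof output):
    lsof -nP +L1           # open files with link count 0, i.e. deleted but still held open
    : > /proc/1234/fd/5    # truncate the deleted file through the holding process's fd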
I'm not sure that his problems are really over if a LOT of people were downloading a 2GB file. It would depend on the plan. Especially if his server is in the US.
But maybe the European Hetzner servers still have really big limits even for small ones.
But still, if people keep downloading, that could add up.
It can be difficult to reason about seemingly innocuous things at scale. I have definitely fallen into the trap of increasing file size from 8 KB to 10 KB and having it cause massive problems when multiplied across all customers at once.
One thing that jumps out is that the root filesystem, /nix/store, logs, temp files, and application data were all on the same partition. Putting /tmp, /var/log, and /nix on separate mount points (or at least using quotas) is a normal defense against exactly this kind of cascading failure. A runaway temp dir can't break your app's ability to send outgoing emails.
The author ended up doing this for /nix under pressure, but it's very much standard best practice on any Unix/Linux box, especially one with only 40 GB.
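A sketch of what that looks like in /etc/fstab (the device names are invented; the point is the separate mount points, plus a hard size cap on /tmp):
    /dev/vg0/root  /         ext4   defaults        0  1
    /dev/vg0/logs  /var/log  ext4   defaults        0  2
    /dev/vg0/nix   /nix      ext4   defaults        0  2
    tmpfs          /tmp      tmpfs  size=2g,nosuid  0  0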
I remember a story of an Oracle Database customer whose production was broken for days until an Oracle support escalation identified the problem as a mere "No disk space left".
> Plausible Analytics, with a 8.5GB (clickhouse) database
And this is why I tried Plausible once and never went back to it.
To get basic but effective analytics, use GoAccess and point it at the Caddy or Nginx logs. It's written in C and thus barely uses memory. With a few hundred visits per day, the logs are currently 10 MB per day. Caddy will automatically truncate the logs if they go above 100 MB.
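For reference, a one-off report is a single command (paths assume nginx's default combined log format):
    goaccess /var/log/nginx/access.log --log-format=COMBINED -o /var/www/html/report.html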
Disc Space Insurance File
https://gist.github.com/klaushardt/9a5f6b0b078d28a23fd968f75...
The authorization can probably be done somehow in nginx as well.
> I rushed to run du -sh on everything I could, as that’s as good as I could manage.
I recently came across gdu [1] and have installed/used it on every machine since then.
[1]: https://github.com/dundee/gdu
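For the non-interactive version of that hunt, sorted du output gets you most of the way:
    du -xsh /* 2>/dev/null | sort -h   # biggest top-level directories last
    gdu /                              # or browse the same data interactively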
> Note: this was written fully by me, human.
> 5. Implement infrastructure monitoring.
Assuming you're on something like Ubuntu, the monit program is brilliant.
It's open source and self hosted, configured using plain text files, and can run scripts when thresholds are met.
I personally have it configured to hit a Slack webhook for a monitoring channel. Instant notifications for free!
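For the disk-space case specifically, the monitrc stanza is short; the thresholds and the webhook script here are my own placeholders:
    check filesystem rootfs with path /
      if space usage > 85% then exec "/usr/local/bin/notify-slack.sh"
      if space usage > 95% then alert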