Everything old is new again: memory optimization (nibblestew.blogspot.com)

by ibobev 163 comments 231 points

[−] muskstinks 50d ago
I'm always confused as hell by how little insight we have into memory consumption.

I look at memory profiles of rnomal apps and often think "what is burning that memory".

Modern memory compression works so well, so what's happening? Open your task manager, look through the apps, and you might ask yourself the same thing.

For example (let's ignore Chrome, MS Teams and all the other bloat): Sublime consumes 200MB. I have 4 text files open. What is it doing?

It took Chrome YEARS just to implement tab suspension, despite everyone being aware of the issue. And add-ons that could do this already existed.

I bought more RAM just for Chrome...

[−] pjc50 50d ago
https://learn.microsoft.com/en-us/sysinternals/downloads/vmm... for an empty Sublime Text window gives me:

- 100MB 'image' (i.e. executable code: the executable itself plus all the loaded OS libraries)

- 40MB heap

- 50MB "mapped file", mostly fonts opened with mmap() or the Windows equivalent

- 45MB stack (each thread gets 2MB)

- 40MB "shareable" (no idea)

- 5MB "unusable" (appears to be address space that's not usable because of fragmentation, not actual RAM)

Generally if something's using a lot of RAM, the answer will be bitmaps of various sorts: draw buffers, decompressed textures, fonts, other graphical assets, and so on. In this case it's just allocated but not yet used heap+stacks, plus 100MB for the code.

Edit: I may be underestimating the role of binary code size. Visual Studio "devenv.exe" is sitting at 2GB of 'image'. Zoom is 500MB. VSCode is 300MB. Much of which is app-specific, not just Windows DLLs.

[−] wat10000 50d ago
Turning these numbers into "memory consumption" gets complicated to the point of being intractable.

The portions that are allocated but not yet used might just be page table entries with no backing memory, making them free. Except for the memory tracking the page table entries. Almost free....

A lot of "image" will be mmapped and clean. Anything you don't actually use from that will be similarly freeish. Anything that's constantly needed will use memory. Except if it's mapped into multiple processes, then it's needed but responsibility is spread out. How do you count an app's memory usage when there's a big chunk of code that needs to sit in RAM as long as any of a dozen processes are running? How do you count code that might be used sometime in the next few minutes or might not be depending on what the user does?

[−] gmueckl 49d ago
This assumes that executable code pages can be shared between processes. I'm skeptical that this is still a notable optimization on modern systems because dynamic linking writes to executable memory to perform relocations in the loaded code. So this would counteract copy on write. And at least with ASLR, the result should be different for each process anyway.
[−] cataphract 49d ago
The dynamic loader writes to the GOT. The executable segment where .text lives is not written to (it's position-independent code in dynamic libraries).

ASLR is not an obstacle -- the exact same code can be mapped at different base addresses in different processes, so the mappings can still be backed by the same physical memory.

[−] manwe150 49d ago
That's true on most systems (modern or not), but it has actually never been true on Windows, due to PE/COFF format limitations. Also, that system doesn't/can't do effective ASLR, because the binary slide is part of the object file spec.
[−] gmueckl 49d ago
I can't reconcile this with the code that GCC generates for accessing global variables. There is no additional indirection there, just a constant 0 address that needs to be replaced later.
[−] cataphract 49d ago
Assuming the symbol is defined in the library, when the static linker runs (ld -- we're not talking ld.so), it will decide whether the global variable is preemptable or not, that is, if it can be resolved to a symbol outside the dso. Generally, by default it is, though this depends on many things -- visibility attributes, linker scripts, -Bsymbolic, etc. If it is, ld will have the final code reach into the GOT. If not, it can just use instruction (PC) relative offsets.
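
A small way to see that decision at work, assuming GCC or Clang on an ELF system (the exact codegen depends on compiler and flags, so treat this as a sketch):

    // Build as a shared library:  g++ -O2 -fPIC -shared demo.cpp -o libdemo.so
    // Inspect with:               objdump -dR libdemo.so

    // Default visibility: the definition can be preempted (interposed) by the
    // main executable or another DSO, so accesses typically go through the
    // GOT, which the dynamic loader fills in at load time.
    int counter_default = 0;

    // Hidden visibility: the symbol cannot be preempted, so the compiler can
    // use a plain PC-relative access with no GOT indirection.
    __attribute__((visibility("hidden"))) int counter_hidden = 0;

    int bump_default() { return ++counter_default; }
    int bump_hidden()  { return ++counter_hidden; }
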
[−] gmueckl 49d ago
OK, I spent a few additional minutes digging into this. It's been too long since I looked at those mechanisms. Turns out my brain was stuck in pre-PIE world.

Global variables in PIC shared libraries are really weird: the shared library's variable is placed into the main program image's data segment, and the relocation happens in the shared library, which means that an indirection is generated in the library's machine code.

[−] wat10000 49d ago
Are you looking at the code before or after the static linker runs?
[−] wat10000 49d ago
Dynamic linking doesn't have to write to code. I'm not familiar with other platforms, but on macOS, relocations are all in data, and any code that needs a relocation will indirect through non-code pages. I assume it's similar on other OSes.

This optimization is essential. A typical process maps in hundreds of megabytes of code from the OS. There are hundreds of processes running at any given time. Eyeballing the numbers on an older Mac I have here (a newer one would surely be worse) I'd need maybe 50GB of RAM just to hold the code of all the running processes if the pages couldn't be shared.

[−] muskstinks 50d ago
Tx for the breakdown. I will play around with it later on my windows machine.

But isn't it crazy how we throw out so much memory just because of random buffers? It feels wrong to me

[−] Capricorn2481 50d ago
But I have Sublime Text open with a hundred files and it's using 12MB.
[−] corysama 49d ago
Some ten years ago I used an earlier version of https://unity.com/how-to/analyze-memory-usage-memory-profili... to accidentally discover a memory leak that was due to some 3rd party code with a lambda that captured an ancient, archived version of Microsoft's C# vector which had a bug. There were multiple layers of impossibility of me finding that through inspection. But, with a functional tool, it was obvious.

Ten years before that I worked on a bespoke commercial game engine that had its own memory tracker. First thing we did with it was fire up a demo program, attach the memory analyzer to it, then attach a second instance of the memory analyzer to the first one and found a memory error in the memory analyzer.

Now that I'm out of gamedev, I feel like I'm working completely blind. People barely acknowledge the existence of debuggers. I don't know how y'all get anything to work.

A quick google for open-source C++ solutions turns up https://github.com/RudjiGames/MTuner which happens to have been updated today. From a game developer, of course XD

[−] inetknght 50d ago

> I look at memory profiles of rnomal apps and often think "what is burning that memory".

As a corollary to this: I look at CPU utilization graphs. Programs are completely idle. "What is burning all that CPU?!"

I remember using a computer with RAM measured in two-digit amounts of MiB and a CPU measured in the low hundreds of MHz. It felt just as fast as -- sometimes faster than -- modern computers. Where is all of that extra RAM being used?! Where is all of that extra performance going?! There's no need for it!

[−] gwbas1c 50d ago
Basically, the short answer is that most memory managers allocate more memory than a process needs, and then reuse it.

I.e., in a JVM (Java) or .NET (C#) process, the garbage collector allocates memory from the operating system and keeps reusing it, handing out memory it has reclaimed as the program needs more.

These systems are built with the assumption that RAM is cheap and CPU cycles aren't, so they are highly optimized CPU-wise, but otherwise are RAM inefficient.
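
The same pattern shows up in native code too. A toy arena is a minimal sketch of the "grab a big block up front, hand out pieces, reuse it" idea (illustrative only, not how the JVM or .NET heaps are actually implemented):

    #include <cstddef>
    #include <vector>

    // Toy arena: one big upfront allocation that gets reused. From the OS's
    // point of view the process "uses" the whole block, even when most of it
    // is logically free inside the program.
    class Arena {
        std::vector<std::byte> block_;
        std::size_t used_ = 0;
    public:
        explicit Arena(std::size_t bytes) : block_(bytes) {}
        void* alloc(std::size_t n) {
            if (used_ + n > block_.size()) return nullptr;  // arena exhausted
            void* p = block_.data() + used_;
            used_ += n;
            return p;
        }
        void reset() { used_ = 0; }  // "frees" everything but keeps the block
    };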

[−] ben-schaaf 50d ago
Completely agree, it would be very helpful to get even just a breakdown of what the RAM is being used for. It's unfortunately a lot of work to instrument.

> Sublime consumes 200MB. I have 4 text files open. What is it doing?

To add to what others have said: depending on the platform, a good amount will be the system itself, various buffers and caches. If you have a folder open in the side bar, Sublime Text will track and index all the files in there. There's also no limit on the undo history, which is kept in RAM.

There's also the possibility that that 200MB includes the subprocesses, meaning the two Python plugin hosts and any processes your plugins spawn - which can include heavy LSP servers.

[−] senfiaj 50d ago
It's partly because there are layers of abstractions (frameworks, libraries / runtimes / VM, etc). Also, today's software often has other pressures, like development time, maintainability, security, robustness, accessibility, portability (OS / CPU architecture), etc. It's partly because the complexity / demand has increased.

https://waspdev.com/articles/2025-11-04/some-software-bloat-...

[−] pjmlp 50d ago
It is a matter of tooling.

Visual Studio runs the memory profiler in debug mode right from the start; it is the default configuration, and you need to disable it.

https://learn.microsoft.com/en-us/visualstudio/profiling/mem...

[−] Orygin 50d ago
200MB for Sublime does not seem so bad when compared to Postman using 4GB on my machine...
[−] koverstreet 47d ago
I did memory allocation profiling for the Linux kernel. Sure would be nice if we had the same capabilities in userspace.
[−] toss1 49d ago

>>I'm always confused as hell how little insight we have in memory consumption.

>>I look at memory profiles of rnomal apps and often think "what is burning that memory".

Because companies starting with Microsoft approach it as an infinite resource, and have done so literally for generations of programmers — it is now ancient tradition.

Back in the early x86 days, when both memory and memory handles were constrained (64k of them, IIRC), I went to an MS developer conference. One problem starting to plague everyone was users' computers running out of memory when actual memory in use was less than half: the problem was not that memory was exhausted, but that all available handles were consumed.

I randomly ended up talking to the (at the time) leader of the Excel team, so I thought I'd ask him about good practices, asking "Does it make sense to have the software look at the task and make an estimate of the full amount of RAM required and allocate it off one handle and track our usage ourselves within that block?" I was speechless when he answered: "Sure, if you wanted to optimize the snot out of it — we just allocate another handle."

That two-line answer just blew my mind and instantly explained so much about problems I saw at the time, and since.

It also made sense in the context of another talk they gave at a previous conference, where the message was that they anticipate the increased power of the next generation of hardware and write their new version for that hardware, not the then-current hardware. It makes sense, but in this new light it seems almost like a cousin of planned obsolescence — "How can we squander all the new power Intel is giving us?". And the result is that, decades after word processing and spreadsheets had usable performance on 640K DOS machines, new machines with orders of magnitude more power and RAM actually run slower from a user perspective.

I'm hoping this memory crunch (having postponed a memory upgrade for my daily driver, I now notice it is 10x the price) will at least have the benefit of driving developers to recover some of the craft of designing with optimization in mind.

[−] veunes 50d ago
Part of the problem is that modern apps aren't really "one thing" anymore
[−] BiteCode_dev 49d ago
rnomal ?
[−] Capricorn2481 50d ago

> Sublime consumes 200MB. I have 4 text files open. What is it doing?

Huh? Sublime Text? I have like 100 files open and it uses 12MB. Sublime is extremely lean.

Do you have plugins installed?

[−] 1vuio0pswjnm7 50d ago
Been waiting for online commentary about programming to start acknowledging this situation as it pertains to writing programs

Memory and storage are not "cheap" anymore. Power may also rise in cost

Under these conditions, memory usage and binary size are irrefutably relevant^1

To some, this might feel like going backwards in time toward the mainframe era. Another current HN item with over 100 points, "Hold on to your hardware", reflects on how consumer hardware may change as a result

To me, the past was a time of greater software efficiency; arguably this was necessitated by cost. Perhaps higher costs in the present and future could lead to better software quality. But whether today's programmers are up for the challenge is debatable. It's like young people in finance whose only experience is in a world with "zero" interest rates. It's easier to whine about lowering rates than to adapt

With the money and political support available to "AI" companies, the incentive for efficiency of any kind is lacking. Perhaps their "no limits" operations, e.g., their effect on supply, will provide an incentive for others' efficiency.

1. As an underpowered computer user that compiles own OS and writes own simple programs, I've always rejected large binary size and excessive memory use, even in times of "abundance"

[−] canpan 50d ago
String views were a solid addition to C++. Still underutilized. It does not matter which language you are using when you make thousands of tiny memory allocations during parsing. https://en.cppreference.com/w/cpp/string/basic_string_view.h...
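
For illustration, a tokenizer that hands back views into the original buffer instead of allocating a std::string per token could look like this (a sketch; the views are only valid while the underlying buffer is alive):

    #include <cctype>
    #include <cstddef>
    #include <string_view>
    #include <vector>

    // Split on whitespace without copying: each element points back into
    // `text`, so no per-token heap allocation is made.
    std::vector<std::string_view> split_ws(std::string_view text) {
        std::vector<std::string_view> out;
        std::size_t i = 0;
        while (i < text.size()) {
            while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
            std::size_t start = i;
            while (i < text.size() && !std::isspace(static_cast<unsigned char>(text[i]))) ++i;
            if (i > start) out.push_back(text.substr(start, i - start));
        }
        return out;
    }
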
[−] baud9600 49d ago
Strange days we live in. Python and C++? What about a line of bash:

    tr -s '[:space:]' '\n' < file.txt | sort | uniq -c | sort -rn

I'd like to know the memory profile of this. The bottleneck is obviously sort, which buffers everything in memory. So if we replace this with awk, using a hash map to keep count of unique words, then it's a much smaller data set in memory:

    tr -s '[:space:]' '\n' < file.txt | awk '{c[$0]++} END{for(w in c) print c[w], w}' | sort -rn

I’m guessing this will beat Python and C++?

[−] griffindor 50d ago
Nice!

> Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.

I wish I knew the input size when attempting to estimate, but I suppose part of the challenge is also estimating the runtime's startup memory usage.

> Compute the result into a hash table whose keys are string views, not strings

If the file is mmap'd, and the string view points into that, presumably decent performance depends on the page cache having those strings in RAM. Is that included in the memory usage figures?

Nonetheless, it's a nice optimization that the kernel chooses which hash table keys to keep hot.
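
A minimal sketch of that shape (not the article's actual code; POSIX mmap assumed): the hash-table keys are string_views pointing straight into the mapping, so the bytes they reference live in the page cache rather than on the process heap.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <cctype>
    #include <cstddef>
    #include <cstdio>
    #include <string_view>
    #include <unordered_map>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) return 1;

        // Map the whole file read-only; the keys below point into this mapping.
        void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mem == MAP_FAILED) return 1;
        std::string_view text(static_cast<const char*>(mem), st.st_size);

        std::unordered_map<std::string_view, long> counts;
        std::size_t i = 0;
        while (i < text.size()) {
            while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
            std::size_t start = i;
            while (i < text.size() && !std::isspace(static_cast<unsigned char>(text[i]))) ++i;
            if (i > start) ++counts[text.substr(start, i - start)];
        }
        std::printf("%zu distinct words\n", counts.size());

        munmap(mem, st.st_size);
        close(fd);
    }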

The other perspective on this is that we sought out languages like Python/Ruby because the development cost was high, relative to the hardware. Hardware is now more expensive, but development costs are cheaper too.

The takeaway: expect more of a push towards efficiency!

[−] gwbas1c 50d ago
A lot of frameworks that use variants of "mark and sweep" garbage collection instead of automatic reference counting are built with the assumption that RAM is cheap and CPU cycles aren't, so they are highly optimized CPU-wise, but otherwise are RAM inefficient.

I wonder if runtimes like .NET or the JVM will introduce reference counting as a way to lower the RAM footprint.

[−] tzot 50d ago
Well, we can use memoryview when building the dict, avoiding the creation of string objects until it's time for output:

    import re, operator
    def count_words(filename):
        with open(filename, 'rb') as fp:
            data= memoryview(fp.read())
        word_counts= {}
        for match in re.finditer(br'\S+', data):
            word= data[match.start(): match.end()]
            try:
                word_counts[word]+= 1
            except KeyError:
                word_counts[word]= 1
        word_counts= sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
        for word, count in word_counts:
            print(word.tobytes().decode(), count)
We could also use mmap.mmap.
[−] fix4fun 50d ago
Digression: nowadays, when RAM is expensive, good old zram is gaining popularity ;) Check it on trends.google.com: since 2025-09, searches for it have doubled ;)
[−] bcjdjsndon 50d ago
A few things

- Since GC languages became prevalent, and maybe high-level programming in general, coders aren't as economical with their designs. Memory isn't something a coder should worry about, apparently.

- Far more people code apps in web languages because they don't know anything else. These are anywhere from 5-10 levels of abstraction away from the metal, and naturally inefficient.

- Increasing scope... I can only describe this one by example: web browsers must implement all manner of standards, to the point that it has become a mammoth task, especially compared to the 90s. The same goes for compilers and OSes; heck, even computers themselves were one-man jobs at some point, because things were simpler and we knew less.

[−] 6510 48d ago
Someone on YouTube[0] suggested we use (something like) KolibriOS[1] in a VM as a kind of web browser. Then we could have snappy, effective desktop apps for everything, cross-platform(!)

The 12 MB OS looks surprisingly mature. We are so conditioned for bloat that with each click I'm surprised how fast it responds. I don't remember ever being surprised by the same thing twice in a row, but here it stays surprising how everything opens in the next frame, even on very poor hardware.

Besides running it in a VM, I read there is a Synergy[3] client for it[4]. I've used crappy PCs on extra screens, and having a dedicated machine for a single application is fun and useful, and it makes old stuff useful again. You can run heavy applications on the main computer; it doesn't matter that the extra one can't.

[0] - https://www.youtube.com/watch?v=v3NVKOsWkQs

[1] - https://www.kolibrios.org/en

[3] - https://www.youtube.com/watch?v=tlt7X0H5GJw

[4] - https://board.kolibrios.org/viewtopic.php?t=2544

[−] dgb23 50d ago
Not a C++ programmer and I think the solution is neat.

But it's not necessarily an apples-to-apples comparison. It's not unfair to Python because of the runtime overhead. It's unfair because it's a different algorithm with fundamentally different memory characteristics.

A fairer comparison would be to stream the file in C++ as well and maintain internal state for the count. For most people that would also be the first/naive approach when programming something like this, I think. And it would showcase what the actual overhead of the Python version is.
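
For reference, that streaming shape would look roughly like this (a sketch, sorting of the output omitted; with std::string keys there is one small allocation per distinct word, unlike the article's string_view-keyed table):

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::ifstream in(argv[1]);
        std::unordered_map<std::string, long> counts;
        std::string word;
        // operator>> yields one whitespace-separated token at a time, so only
        // the current word and the table itself need to stay in memory.
        while (in >> word) ++counts[word];
        for (const auto& [w, n] : counts) std::cout << n << ' ' << w << '\n';
    }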

[−] zahlman 50d ago

> This sounds like a job for Python. Indeed, an implementation takes fewer than 30 lines of code.

I don't know if the implementation is written in a "low-level" way to be more accessible to users of other programming languages, but it can certainly be done more simply by leveraging the standard library:

  from collections import Counter
  import sys

  with open(sys.argv[1]) as f:
      words = Counter(word for line in f for word in line.split())

  for word, count in words.most_common():
      print(count, word)
At the very least, manually creating a (count, word) list from the dict items and then sorting and reversing it in-place is ignoring common idioms. sorted creates a copy already, and it can be passed a sort key and an option to sort in reverse order. A pure dict version could be:

  import sys

  with open(sys.argv[1]) as f:
      counts = {}
      for line in f:
          for word in line.split():
              counts[word] = counts.get(word, 0) + 1

  stats = sorted(counts.items(), key=lambda item: item[1], reverse=True)

  for word, count in stats:
      print(count, word)
(No, of course none of this is going to improve memory consumption meaningfully; maybe it's even worse, although intuitively I expect it to make very little difference either way. But I really feel like if you're going to pay the price for Python, you should get this kind of convenience out of it.)

Anyway, none of this is exactly revelatory. I was hoping we'd see some deeper investigation of what is actually being allocated. (Although I guess really the author's goal is to promote this Pystd project. It does look pretty neat.)

[−] veunes 50d ago
Not "C++ everywhere again" but maybe "understanding memory again"
[−] tombert 50d ago
I've been rewriting a lot of my stuff in Rust to save memory.

Rust is high-level enough to still be fun for me (tokio gives me most of the concurrency goodies I like), but the memory usage is often like 1/10th or less compared to what I would write in Clojure.

Even though I love me some lisp, pretty much all my Clojure utilities are in Rust land now.

[−] kristianp 49d ago
How much memory does the C++ compiler use when compiling the program? I wonder how that compares to the python program? Not a completely unrelated metric.

Would the rust compiler use much more memory compiling a comparable program to the C++ version?

[−] yakkomajuri 50d ago
The abrupt ending was funny and then I realized the author is Finnish and it all made sense.

Nice post.

(P.S. I'm also Finnish)

[−] wbsun 49d ago
It’s less about 'old vs. new' and more about the evolving trade-offs dictated by the constraints of the era. There have always been engineers trying to squeeze every last drop of performance out of the bits available to them.
[−] perching_aix 49d ago
Since we're doing this, I do wonder then: is going from a 1.3K input file to 21K peak memory usage (16x) really optimal?

It's certainly a lot better than 1000x, sure, but still surprised me.

[−] 90d 50d ago
Speaking of optimization: is Windows just too far gone at this point? The amount of resources it uses at "idle" is comical.
[−] est 50d ago
I think the py version can be shortened to:

    from collections import Counter
    import sys

    stats = Counter(x for l in open(sys.argv[1]) for x in l.split())

[−] callamdelaney 50d ago
I shove everything in memory, it's a design decision. Memory is still cheap, relatively.
[−] biorach 50d ago
"copyright infringement factories"
[−] gostsamo 50d ago

> how much memory a native code version of the same functionality would use.

Native to what? How is C++ more native than Python?

[−] amelius 50d ago

> AI sociopaths have purchased all the world's RAM in order to run their copyright infringement factories at full blast

The ultimate bittersweet revenge would be to run our algorithms inside the RAM owned by these cloud companies. Should be possible using free accounts.

[−] yieldcrv 50d ago
as long as you know what architecture questions to ask, agentic coding can help with this next phase of optimization really quickly

delaying comp sci differentiation for a few months

I wonder if assembly-based solutions will come back into vogue.