Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m (huggingface.co)

by tamnd 167 comments 408 points

[−] 6thbit 59d ago
From YC /legal

> Except as expressly authorized by Y Combinator, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Site or the Site Content, in whole or in part

Not to pretend this isn't already happening widely behind the curtain, but coming from a "Show HN" it seems daring.

[−] Wowfunhappy 59d ago
I can't comment on what is legal, but I very much dislike the idea that my comments are the property of Y Combinator. I assume that by writing here, I am putting information out into the world for anyone to use as they wish.
[−] stackghost 59d ago
HN/YC cares more about community aesthetics than your right to be forgotten.

Try to have your account and its contents deleted. The best I was offered for my 2011-vintage account was to randomize the username, and the reason I was given was that browsing an old thread with a bunch of deleted comments "looks bad".

[−] Wowfunhappy 59d ago
I agree with this policy; deleting comments isn't fair to all the other people who replied to that comment. I don't see how this goes against what I said?
[−] stackghost 59d ago
I was responding to your statement that you don't like that your comments are the property of YC. I was elaborating on how they hold our content (that we author) hostage because it looks pretty.

Not wanting your comments to be property of YC but then also being okay with them refusing to delete your content doesn't make sense to me. Those seem like fundamentally-opposed viewpoints.

Now that I'm thinking about it, I wonder what they do with GDPR deletion requests?

[−] wzdd 58d ago
If comments here were for anybody to use as they wish, then anybody could use them for whatever they liked, and (thus) YC could refuse to delete them. Being okay with both of those isn't holding fundamentally opposed viewpoints; one is a logical consequence of the other.
[−] Wowfunhappy 59d ago
I don't want Y Combinator to be the gatekeeper of who can see and use my comments. I think they should belong to everybody.
[−] hnfong 58d ago
AFAICT, you retain the copyrights to your comments, but YC has a license to essentially do whatever they want with them.

So, you could additionally give a license to the world to use your posted comments freely. That doesn't mean HN can't add terms to say clients can't copy the site as a condition for use.

[−] YVoyiatzis 58d ago
Your comments, yes. But the contextual thread as a whole, no.
[−] keepamovin 58d ago
I did a Show HN like this a month or so back: https://hackerbook.dosaygo.com/

https://news.ycombinator.com/item?id=46435308

https://github.com/DOSAYGO-STUDIO/HackerBook

The mods and community had no problem with it.

Differences: sharded SQLite, used the BigQuery export, build script is open on GitHub, interactive “archived website” view of HN, updated weekly (each build costs a couple of dollars on a custom GitHub runner).

[−] tamnd 58d ago
@keepamovin thanks, your project was a big inspiration for this.

I built my own pipeline with a slightly different setup. I use Go to download and process the data, and update it every 5 minutes using the HN API, trying to stay within fair use. It is also easy to tweak if someone wants faster or slower updates.
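
Conceptually, each refresh is tiny; here is a minimal Python sketch of the idea (the real pipeline is a Go binary):

  import requests, time

  BASE = "https://hacker-news.firebaseio.com/v0"
  last = requests.get(f"{BASE}/maxitem.json").json()
  while True:
      time.sleep(300)  # 5-minute cadence, same as the dataset
      newest = requests.get(f"{BASE}/maxitem.json").json()
      # Fetch every item created since the last pass.
      batch = [requests.get(f"{BASE}/item/{i}.json").json()
               for i in range(last + 1, newest + 1)]
      # ...append the batch to the current Parquet partition here...
      last = newest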

One part I really like is the "dynamic" README on Hugging Face. It is generated automatically by the code and keeps updating as new commits come in, so you can just open it and quickly see the current state.

The code is still a bit messy right now (I open sourced it together with around 3.6M lines across 100+ other tools, hidden in a corner of GitHub; anyone interested can play Sherlock Holmes and find it :) ), but I will clean it up, open source it as a clearer standalone repository, and write a proper blog post explaining how it works.

[−] keepamovin 58d ago
Wow tamnd that is lovely to hear. I’m so glad you told me it was an inspiration.

Your big download plus quick refreshes is smart. Is your background in data/AI?

Because I don't know much about Hugging Face beyond it being a hub for that.

[−] tamnd 58d ago
Connecting directly with the author of the project that inspired me is awesome.

Let's collaborate and see how we can make our two projects work together. DuckDB has an extension that can write to SQLite: https://duckdb.org/docs/stable/core_extensions/sqlite. Starting from the Parquet files, we could use DuckDB to write into SQLite databases. This could cut the ingest time to around five minutes instead of a week.
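
A minimal sketch of that conversion (database and table names are just placeholders):

  import duckdb

  con = duckdb.connect()
  con.execute("INSTALL sqlite;")
  con.execute("LOAD sqlite;")
  # Attach a SQLite file as a writable database, then copy the
  # Parquet data into it in one pass.
  con.execute("ATTACH 'hackerbook.db' AS hb (TYPE sqlite)")
  con.execute("""
      CREATE TABLE hb.items AS
      SELECT * FROM read_parquet('items/*.parquet')
  """)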

If I have some free time this weekend, I would definitely like to contribute to your project. Would you be interested?

As for my background, I focus on data engineering and data architecture. I help clients build very large-scale data pipelines, ranging from near real-time systems (under 10 ms) to large batch processing systems (handling up to 1 billion business transactions per day across thousands of partners). Some of these systems use mathematical models I developed, particularly in graph theory.

Happy to chat.

[−] krapp 58d ago
This site offers a public, non-rate-limited API. IANAL, but I'm reasonably certain that counts as authorization for anyone to use the data, as long as they do so through the API. It certainly isn't the case that you need explicit legal permission to use Hacker News comment data in your project.

There have been tons of alternative frontends and projects using HN data over the years, posted to Show HN without an issue. I think their primary concern is protecting the YCombinator brand itself, with "the Site" and "Site Content" referring to YCombinator and not HN specifically.

[−] tamnd 58d ago
[Author here] The whole pipeline runs on a single ~$10/month VPS, but it can process hundreds of TB even with just 12GB RAM and a 200GB SSD.

The main reason I built this was to have HN data that is easy to query and always up to date, without needing to run your own pipeline first. There are also some interesting ideas in the pipeline, like what I call "auto-heal". Happy to share more if anyone is interested :)

A lot of the choices are trade-offs, as usual with data pipelines. I chose Parquet because it is columnar and compressed, so tools like DuckDB or Polars can read only the columns they need. This matters a lot as the dataset grows. I went with Hugging Face mainly because it is simple and already handles distribution and versioning. I can just push data as commits and get a built-in history without managing extra infrastructure (and, more conveniently, if you read the README, you can query it directly using Python or DuckDB).
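
For example, a query straight off the Hub looks roughly like this (the dataset path is a placeholder; the README has the real one):

  import duckdb

  # DuckDB reads Parquet directly from Hugging Face over hf:// paths
  # (via the autoloaded httpfs extension) and fetches only the columns
  # the query actually touches.
  duckdb.sql("""
      SELECT by, title, score
      FROM read_parquet('hf://datasets/<user>/<dataset>/**/*.parquet')
      WHERE type = 'story'
      ORDER BY score DESC
      LIMIT 10
  """).show()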

The pipeline is incremental. Instead of rebuilding everything, it appends small batches every few minutes using the API. That keeps it fresh while staying cheap to run. The data is also partitioned by time, so queries do not need to scan the entire dataset (and I use very simple tech, just a Go binary running in a "screen" session, using only a few MB of RAM for the whole pipeline).
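
With the time partitioning, a query only has to open the slice it needs; roughly (the directory layout here is illustrative):

  import duckdb

  # Partition-pruning sketch: filtering on Hive-style partition columns
  # means only the matching Parquet files are opened at all.
  duckdb.sql("""
      SELECT count(*) AS comments
      FROM read_parquet('data/*/*/*.parquet', hive_partitioning = true)
      WHERE year = 2026 AND month = 2 AND type = 'comment'
  """).show()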

[−] jiggawatts 59d ago
I could have used this just yesterday!

I've been evaluating Gemini Embedding 2 using Hacker News comments and I wasted half a day making a wrapper for the HN API to collect some sample data to play with.

In case anyone is curious:

- The ability to simply truncate the provided embedding to a prefix (and then renormalize) is useful because it lets users re-use the same (paid!) embedding API response for multiple indexes at different qualities (see the sketch after this list).

- Traditional enterprise software vendors are struggling to keep up with the pace of AI development. Microsoft SQL Server for example can't store a 3072 element vector with 32-bit floats (because that would be 12 KB and the page size is only 8 KB). It supports bfloat16 but... the SQL client doesn't! Or Entity Framework. Or anything else.

- Holy cow everything is so slow compared to full text search! The model is deployed in only one US region, so from Australia the turnaround time is something like 900 milliseconds. Then the vector search over just a few thousand entries with DiskANN is another 600-800 ms! I guess search-as-you-type is out of the question for... a while.

- Speaking of slow, the first thing I had to do was write an asynchronous parallel bounded queue data processor utility class in C# that supports chunking of the input and rate limit retries. This feels like it ought to be baked into the standard library or at least the AI SDKs because it's pretty much mandatory if working with anything other than "hello world" scenarios.

- Gemini Embedding 2 has the headline feature of multi-modal input, but they forgot to implement anything other than "string" for their IEmbeddingGenerator abstraction when used with Microsoft libraries. I guess the next "Preview v0.0.3-alpha" version or whatever will include it.
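
For the first point, the truncate-and-renormalize trick is tiny in code; a sketch (the prefix length is arbitrary):

  import numpy as np

  def truncate_embedding(vec, dim=768):
      # Keep a prefix of the full 3072-dim vector and renormalize to
      # unit length so cosine similarity still behaves.
      v = np.asarray(vec, dtype=np.float32)[:dim]
      return v / np.linalg.norm(v)

  # One paid API response can feed several indexes at different qualities:
  # coarse = truncate_embedding(full_vector, 256)
  # fine   = truncate_embedding(full_vector, 1536)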

[−] 0cf8612b2e1e 59d ago
Under the Known Limitations section:

  deleted and dead are integers. They are stored as 0/1 rather than booleans.
Is there a technical reason to do this? You have the type right there.
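
Either way, a cast at query time papers over it (the file path is a placeholder):

  import duckdb

  # Workaround sketch: cast the 0/1 flags back to booleans when querying.
  duckdb.sql("""
      SELECT id, CAST(deleted AS BOOLEAN) AS deleted,
             CAST(dead AS BOOLEAN) AS dead
      FROM read_parquet('items/*.parquet')
  """).show()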
[−] Imustaskforhelp 59d ago
As someone who made a project analysing Hacker News using ClickHouse, I really feel like this is a project made for me (especially the updated-every-5-minutes aspect, which could have helped my project back then too!)

Your project actually helps me out a ton with one of the new Hacker News project ideas I had put on the back burner.

I had thought of making a ping website: people could just write @username, and a service would detect it and send mail to that user if they had signed up (similar to a service run by someone from the HN community which mails you every time someone responds to your thread directly, but this time as a sort of ping).

[The idea came about when I tried to ping someone to show them something relevant and thought, wait a minute, something like a ping that mails you might be interesting. I tried to see if I could hook things up with Algolia or another service, but sadly nothing made much sense back then, so the idea stayed in the back of my mind. This dataset sort of solves it by being updated every 5 minutes.]

Your 5-minute updates really make it possible. I will look at what I can do with that in the coming days, but I am seeing some discrepancy in the 5-minute updates: the latest one in the README seems to be 16 March. I would love to know whether it really is being updated every 5 minutes, because that would be phenomenal if true, and it is exciting to think of the new possibilities it unlocks.
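
A rough sketch of the mention check I have in mind, assuming the today/ layout from the README (the username is hypothetical):

  import duckdb

  # Mention-detector sketch: scan only the freshest 5-minute blocks in
  # the today/ directory for "@someuser" in comment text.
  rows = duckdb.sql("""
      SELECT id, by, text
      FROM read_parquet('today/*.parquet')
      WHERE type = 'comment' AND text ILIKE '%@someuser%'
  """).fetchall()
  # rows would then be matched against signed-up users and mailed out.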

[−] xnx 59d ago
The best source for this data used to be ClickHouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't been updated since 2025-12-26.
[−] politician 59d ago
This is great. I've soured on this site over the past few years due to the heavy partisanship that wasn't as present in the early days (eternal September), but there are still quite a few people whose opinions remain thought-provoking and insightful. I'm going to use this corpus to make a local self-hosted version of HN with the ability to a) show inline article summaries and b) follow those folks.
[−] brtkwr 59d ago
This comment should make it into the download in a few mins.
[−] lyu07282 59d ago
Please upload to https://academictorrents.com/ as well if possible
[−] Onavo 59d ago
Is it possible to download only a subset? E.g. Show HNs or HN Whoishiring. The Show HN and Whoishiring threads are very useful for classroom data science, i.e. a good dataset for students to learn the basics of data cleaning and engineering.
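
A filtered extract along these lines might do it (paths are placeholders):

  import duckdb

  # Subset sketch: keep only "Show HN" stories and write them out as a
  # single small Parquet file for classroom use.
  duckdb.sql("""
      COPY (
          SELECT * FROM read_parquet('data/**/*.parquet')
          WHERE type = 'story' AND title ILIKE 'Show HN:%'
      ) TO 'show_hn.parquet' (FORMAT parquet)
  """)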
[−] dacapoday 58d ago
Very nice job! I built a small CLI to browse and query it from the terminal: https://github.com/dacapoday/hn
[−] palmotea 59d ago

> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.

Wouldn't that lose deleted/moderated comments?