What category theory teaches us about dataframes (mchav.github.io)

by mchav 65 comments 190 points

[−] rich_sasha 42d ago
The article starts well, trying to condense pandas' gazillion inconsistent and continuously deprecated functions, each with tens of keyword arguments, into a small set of composable operations. But then it lost me.

The more interesting nugget for me is about this project they mention: https://modin.readthedocs.io/en/latest/index.html called Modin, which apparently went to the effort of analysing common pandas uses and compressed the API into a mere handful of operations. Which sounds great!

Sadly for me, the purpose seems to have been to then recreate the full pandas API on top of that core, backed by things like Ray and Dask. So it's the same API, just much faster.

To me it's a shame. Pandas is clearly quite ergonomic for various exploratory interactive analyses, but the API is, imo, awful. The speed is usually not a concern for me - slow operations often seem to be avoidable, and my data tends to fit in (a lot of) RAM.

I can't see that their more condensed API is public facing and usable.

[−] few 42d ago
I felt like one or two decades ago, all the rage was about rewriting programs into just two primitives: map and reduce.

For example filter can be expressed as:

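  from functools import reduce  # not a builtin in Python 3
  data = [1, 2, 3, 4, 5, 6]  # example input (assumed)
  is_even = lambda x: x % 2 == 0
  mapped = map(lambda x: [x] if is_even(x) else [], data)  # keep -> [x], drop -> []
  filtered = reduce(lambda x, y: x + y, mapped, [])  # concatenate: [2, 4, 6]

But then the world moved on from it because it was too rigid.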
[−] pavodive 42d ago
When I started reading about pandas' complexity and the smaller set of operations needed, I couldn't help but think of the simplicity of R's data.table.

Granted, it's got more than 15 functions, but its simplicity seems to me very similar to what the author presented in the end.

[−] caseyross 42d ago
Interesting idea. I feel like it could be productive to categorize operations by their result shape as well:

- Row select: From N rows, produce 0-N rows.

- Column select: From N columns, produce 0-N columns.

- Table add: From MxN and OxP tables, produce at most an (M+O)x(N+P) table.

- Table subtract: From MxN and OxP tables, produce at least a 0x0 table.

This line of thinking reveals some normally hard-to-see similarities, such as groupby and dedupe sharing the same underlying mechanism. (i.e., both are "collapsing" row selects.)
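
To make the "collapsing row select" idea concrete, here is a minimal pure-Python sketch. The `collapse` helper and the sample rows are hypothetical, made up for illustration, and are not from the article or from pandas. Both dedupe and groupby partition the rows by a key and then reduce each partition to a single row:

```python
def collapse(rows, key, combine):
    """Partition rows by key(row), then reduce each partition to one row."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    # dicts preserve insertion order, so groups come out in first-seen order
    return [combine(k, group) for k, group in groups.items()]

rows = [
    {"city": "Oslo", "sales": 3},
    {"city": "Lima", "sales": 5},
    {"city": "Oslo", "sales": 3},
    {"city": "Oslo", "sales": 4},
]

# Dedupe: the key is the whole row, and each group collapses to its first row.
deduped = collapse(
    rows,
    key=lambda r: tuple(sorted(r.items())),
    combine=lambda k, group: group[0],
)

# Groupby-sum: the key is one column, and each group collapses to an aggregate.
totals = collapse(
    rows,
    key=lambda r: r["city"],
    combine=lambda k, group: {"city": k, "sales": sum(r["sales"] for r in group)},
)
```

Only the `key` and `combine` arguments differ between the two calls, which is the shared mechanism described above.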

[−] voxleone 42d ago
It’s almost suspiciously elegant: focus on transformations and their composition, and the structure takes care of itself.
[−] toxik 42d ago
Pandas and so on exist for the same reason Django's ORM and SQLAlchemy do: people do not want to string interpolate to talk to their database. SQL is great for DBAs, and absolutely sucks for programmers. Microsoft was really onto something with LINQ, in my opinion.
[−] getnormality 42d ago
Hmm. Folks trying to discover the elegant core of data frame manipulation by studying... pandas usage patterns. When R's dplyr solved this over a decade ago, mostly by respecting SQL and following its lead.

The pandas API feels like someone desperately needed a wheel and had never heard of a wheel, so they made a heptagon, and now millions of people are riding on heptagon wheels. Because it's locked in now, everyone uses heptagon wheels, what can you do? And now a category theorist comes along, studies the heptagon, and says hey look, you could get by on a hexagon. Maybe even a square or a triangle. That would be simpler!

No. Stop. Data frames are not fundamentally different from database tables [1]. There's no reason to invent a completely new API for them. You'll get within 10% of optimal just by porting SQL to your language. Which dplyr does, and then closes most of the remaining optimality gap by going beyond SQL's limitations.

You found a small core of operations that generates everything? Great. Also, did you know Brainfuck is Turing-complete? Nobody cares. Not all "complete" systems are created equal. A great DSL is not just about getting down to a small number of operations. It's about getting down to meaningful operations that are grammatically composable. The relational algebra that inspired SQL already nailed this. Build on SQL. Don't make up your own thing.

Like, what is "drop duplicates"? What are duplicates? Why would anyone need to drop them? That's a pandas-brained operation. You want the distinct keys defined by a select set of key columns, like SQL and dplyr provide.

Who needs a separate select and rename? Select is already using names, so why not do your name management there? One flexible select function can do it all. Again, like both SQL and dplyr.

Who needs a separate difference operation? There's already a type of join, the anti-join, that gets that done more concisely and flexibly, and without adding a new primitive, just a variation on the concept of a join. Again, like both SQL and dplyr.
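
As a rough illustration of the last two points, here is a pure-Python sketch. The `anti_join` and `select` helpers and the sample data are hypothetical; this is neither dplyr's nor SQL's implementation. A set difference falls out of an anti-join keyed on every column, and one `select` can handle both picking and renaming:

```python
def anti_join(left, right, on):
    """Return the rows of left whose key columns have no match in right."""
    right_keys = {tuple(r[c] for c in on) for r in right}
    return [row for row in left if tuple(row[c] for c in on) not in right_keys]

def select(rows, **columns):
    """Pick and rename columns in one step, written as new_name="old_name"."""
    return [{new: row[old] for new, old in columns.items()} for row in rows]

orders = [{"id": 1, "cust": "a"}, {"id": 2, "cust": "b"}, {"id": 3, "cust": "c"}]
shipped = [{"id": 2, "cust": "b"}]

# Set difference is the special case where the key is every column.
difference = anti_join(orders, shipped, on=["id", "cust"])

# The more flexible case: match on a subset of the columns.
unshipped = anti_join(orders, [{"id": 3}], on=["id"])

# Selecting and renaming happen in the same call.
renamed = select(unshipped, order_id="id", customer="cust")
```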

Props to pandas for helping so many people who have no choice but to do tabular data analysis in Python, but the pandas API is not the right foundation for anything, not even a better version of pandas.

[1] No, row labels and transposition are not a good enough reason to regard them as different. They are both just structures that support pivoting, which is vastly more useful, and again, implemented by both R and many popular dialects of SQL.

[−] jiehong 42d ago
[−] hermitcrab 42d ago

>a dataframe is a tuple (A, R, C, D): an array of data A, row labels R, column labels C, and a vector of column domains D.

What is 'a vector of column domains D'? A description of how the data A maps to columns?
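
One possible reading, sketched in Python below, is that D holds one domain per column: the set of values (roughly, the type) that entries in that column may take. This is an interpretation of the quoted definition, not something the article confirms. The `DataFrame` tuple and `conforms` check are hypothetical names used only for illustration:

```python
from typing import NamedTuple

class DataFrame(NamedTuple):
    A: list  # m x n array of values, stored here as a list of rows
    R: list  # m row labels
    C: list  # n column labels
    D: list  # n column domains, modelled here as Python types (an assumption)

df = DataFrame(
    A=[["ann", 31], ["bob", 27]],
    R=[0, 1],
    C=["name", "age"],
    D=[str, int],
)

def conforms(frame):
    """Check that every value in column j belongs to the domain D[j]."""
    return all(isinstance(v, d) for row in frame.A for v, d in zip(row, frame.D))

ok = conforms(df)
bad = conforms(df._replace(A=[["ann", "thirty-one"]]))
```

Under this reading, D does not describe how A maps onto columns (C and the array layout do that); it only constrains which values each column may contain.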

[−] jeremyscanvic 42d ago
It's very insightful how they explain the difference between dataframes and SQL tables / standard relational structures!
[−] kiviuq 42d ago
there is also ZIO Prelude and ZIO schema...
[−] jmount 42d ago
I like this sort of study, but it really misses the point by not giving more credit to Codd and others for some of the observations and designs.
[−] hermitcrab 42d ago
I guess this article is an interesting exercise from a pure maths point of view. But, as someone developing a drag-and-drop data wrangling tool, the important thing is creating a set of composable operations/primitives that are meaningful and useful to your end user. We have ended up with 73 distinct transforms in Easy Data Transform. Sure, they overlap to an extent, but we feel they are at the right semantic level for our users, who are not category theorists.