Pandas Exercises for Data Analysis (Interactive) (machinelearningplus.com)

by selva86 33 comments 126 points


[−] ronfriedhaber 59d ago
Pandas is terrific, yet even its original author has noted inherent shortcomings [1], and there exist alternatives.

Polars seems to be the most prominent competitor in the Python DataFrame space, and DuckDB appears to pursue an approach similar to SQLite, but columnar.

I am personally working on a solution to a broader problem, which can also be viewed as an alternative to Pandas [2].

[1] https://wesmckinney.com/blog/apache-arrow-pandas-internals/

[2] https://github.com/ronfriedhaber/autark

[−] arijun 59d ago
For your link [1], many of those issues have been addressed with pandas 2.0 (which I believe Wes McKinney [pandas' original author] contributed to). So it's a bit disingenuous to point to that post and say "See? Even Wes disowns it!"

That being said, if I were to start a new project requiring that kind of work today, I would probably try Polars first. Their greenfield implementation allowed them to get rid of many of the crusty edges of pandas.

[−] 0x696C6961 59d ago
Would be nice to have a polars version of this.
[−] selva86 64d ago
Built this as an interactive tool for our popular 101 Pandas exercises. The code runs entirely locally in your browser. Would love feedback on the ease of use and the editor UX.
[−] short_sells_poo 59d ago
You'll get a lot of responses saying Polars is better than Pandas. I argue those people are missing the point and don't understand Pandas' real strength or why people choose Pandas today.

Pandas was never meant to be a technologist's tool. It was meant to be a researcher's tool and was unfortunately co-opted to be a technical solution as well. It has not fully escaped its roots.

Pandas is fantastic for doing iterative and interactive research on semi-structured data. It has a lot of QoL facilities and utility functions for seamlessly dealing with exploratory timeseries analytics for in-core data, that is, data that fits into memory.

For example, I can take two time series and calculate their product:

ts3 = ts1 * ts2

This one line does a huge amount of heavy lifting by automatically aligning the timestamps and columns between the two inputs so that I'm not accidentally multiplying two entries that have the same ordinal but not the same timestamp or column label.
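To make the alignment concrete, here is a minimal sketch. The two series and their dates are made up for illustration; only the `ts1`, `ts2`, `ts3` names come from the line above. The series overlap on two of their three timestamps, so the product is computed label by label and the non-overlapping dates come out as NaN.

```python
import pandas as pd

# Hypothetical data: two daily series whose date ranges are offset by one day.
ts1 = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)
ts2 = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

# Label-aligned product: values are matched by timestamp, not by position.
# The result covers the union of both indexes; dates present in only one
# input become NaN.
ts3 = ts1 * ts2

# For contrast, a purely positional product on the raw arrays pairs
# 2024-01-01 with 2024-01-02, and so on, which is the mistake described above.
positional = ts1.to_numpy() * ts2.to_numpy()

print(ts3)
print(positional)
```

If the NaN rows are unwanted, `ts3.dropna()` keeps only the timestamps that both inputs share.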

Can I do the same with Polars? Yes, but it comes with considerably more cognitive overhead. And this is just one example.

Pandas is ultimately a flawed product, as its origins go back more than a decade to when R's data frame was cutting edge. A lot of innovation has happened since then, and the API and internals of Pandas mean that certain choices that were made early on are nontrivial to change.

This doesn't change the fact that Pandas is still immensely useful. Eventually perhaps Polars will come close to it, but so far its focus hasn't been on the ergonomics of interactive use, unfortunately.

As it stands, I use pandas for research and polars for production systems.

[−] rithdmc 59d ago
Dope. I've just started using Pandas in some personal projects, and am quickly hitting my knowledge ceiling. I think this will be useful. I'll check it out properly after work.
[−] driftnode 58d ago
The author posted a Polars version in the comments and almost nobody noticed. Meanwhile the top comments are still asking for it. Building something useful and having people ignore what you made to request what you already made is a special kind of frustration.
[−] kasperset 59d ago
I don't hear much about Ibis here. https://ibis-project.org On paper it sounds like a good idea. Any opinions about this option?
[−] sghaz 59d ago
The pricing page says, "This page doesn’t seem to exist. It looks like the link pointing here was faulty. Maybe try searching?"
[−] fud101 59d ago
what is the permission it asks for? it seems suspicious af.
[−] kjkjadksj 59d ago
If you think pandas is comfortable, wait until you try base R. Such a comfortable language for data wrangling and analysis.