06-07, 10:20–11:05 (Europe/London), Doddington Forum
Lots of data in the real world has missing values, but historically prevalent data science tools have had limited support for such data. This talk will compare traditional numerical approaches, the more modern alternative Arrow, and ArcticDB, the client-side Dataframe database developed at Man Group.
Data in the real world is complex, and one form that complexity often takes is missing values. In the Dataframe world, this can mean that your data is no longer representable as a nice rectangle of dense values. So what are the options?
Pandas has historically dominated the data science ecosystem, and offers a couple of options. Certain datatypes, such as floats, timestamps, and strings, have a "natural" representation for missing values (NaN, NaT, and None respectively). Integer types present more of a challenge because, for a given bit-width, every bit pattern represents a legitimate value, so none is left over to mean "missing". Pandas offers SparseArray with a user-defined fill value. This is memory efficient, but it is still not possible to differentiate between a missing value and a value that is present and equal to the fill value.
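The behaviour described above can be seen in a few lines. This is a minimal sketch, assuming pandas and NumPy are installed; the variable names are illustrative only and are not taken from the talk.

```python
import numpy as np
import pandas as pd

# Floats have a natural missing marker: NaN.
floats = pd.Series([1.5, np.nan, 3.0])

# Timestamps use NaT as their missing marker.
stamps = pd.Series(pd.to_datetime(["2024-01-01", None]))

# A NumPy-backed int64 column has no spare bit pattern for "missing".
# Reindexing onto a label that does not exist introduces a gap, and
# pandas responds by converting the whole column to float64.
ints = pd.Series([1, 2, 3], dtype="int64")
ints_with_gap = ints.reindex([0, 1, 2, 3])

# SparseArray stores only the values that differ from fill_value.
sparse = pd.arrays.SparseArray([0, 0, 7, 0], fill_value=0)
stored = sparse.sp_values      # only the 7 is physically stored
dense = np.asarray(sparse)     # every other slot reads back as 0

print(floats.isna().tolist())
print(stamps.isna().tolist())
print(ints_with_gap.dtype)
print(stored.tolist(), dense.tolist())
```

After the round trip through SparseArray, a slot that reads back as 0 could have been a genuine zero or an absent value; nothing in the array records which.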
Arrow is the modern alternative in-memory Dataframe representation format, and it comes equipped with built-in handling for missing values that does not depend on the column type in any way. However, the Arrow sparse data representation has its own drawbacks in terms of both memory usage and processing speed.
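Arrow's type-independent handling rests on a validity bitmap: one bit per slot records whether the value is present, while the values buffer still reserves a full-width slot for every row. The toy class below is a pure-Python sketch of that idea, not the pyarrow API; the name NullableInt64Column and its methods are hypothetical and exist only for illustration.

```python
from array import array


class NullableInt64Column:
    """Toy column: a dense int64 values buffer plus one validity bit per slot."""

    def __init__(self, items):
        self.length = len(items)
        # Every slot gets 8 bytes, even when the value is missing.
        self.values = array("q", (0 if x is None else x for x in items))
        # One bit per slot, packed least-significant bit first.
        self.validity = bytearray((self.length + 7) // 8)
        for i, x in enumerate(items):
            if x is not None:
                self.validity[i // 8] |= 1 << (i % 8)

    def is_valid(self, i):
        return bool((self.validity[i // 8] >> (i % 8)) & 1)

    def get(self, i):
        # The bitmap, not the stored number, decides whether a value exists.
        return self.values[i] if self.is_valid(i) else None

    def nbytes(self):
        return self.values.itemsize * len(self.values) + len(self.validity)


col = NullableInt64Column([0, None, 7, None])
print(col.get(0), col.get(1), col.get(2))

# A column of 1000 slots with a single value present still pays for all 1000.
mostly_empty = NullableInt64Column([None] * 999 + [42])
print(mostly_empty.nbytes())
```

Unlike a fill value, the bitmap keeps a genuine 0 distinct from a missing slot. The cost is that the footprint scales with the total number of rows rather than the number of present values, which is one way to read the memory drawback mentioned above.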
This talk will compare and contrast, with examples, the above two approaches, along with the more sophisticated approach taken in ArcticDB. As a database, ArcticDB faces all of the same challenges as Pandas and Arrow for its in-memory processing, plus the extra consideration of efficiently serialising these data structures to disk.
No previous knowledge expected
Alex Owens has been working in a combination of Python and C++ for the past 8 years. For the last 3 and a half of those, he has been a senior engineer on the new open-source Dataframe database, ArcticDB, which is backed by long-time Python enthusiasts Man Group and Bloomberg.