04-18, 10:55–11:30 (US/Eastern), Auditorium 4
Learn how to wrangle data in Python with DuckDB, a fast, open-source, in-process analytical SQL database!
Learn how to use DuckDB to process data in Python! In the era of "big data," many data practitioners immediately reach for distributed computing solutions when facing large datasets. Modern hardware combined with efficient tools like DuckDB makes this much less necessary than it was a few years ago. This talk will demonstrate how to effectively wrangle data using DuckDB in Python, offering a powerful alternative to pandas and Spark for the majority of data science workflows.
This session will cover:
- Understanding DuckDB's architecture and its integration with the Python ecosystem
- Practical examples of migrating from pandas to DuckDB
- Performance benchmarks comparing DuckDB against pandas and other popular Python data processing methods
- Real-world scenarios where DuckDB shines, including handling larger-than-memory datasets
- Discussion of the "shrinking size" of big data and when to consider DuckDB versus distributed computing solutions
This talk is aimed at Python data practitioners who regularly work with medium to large datasets (100 MB to 100 GB) and are looking to optimize their data processing workflows. The presentation will include both conceptual explanations and hands-on code examples.
No previous knowledge expected
Will Angel is a Data Solution Architect at Excella, where he leads data teams that help clients solve data problems. Will is the author of Virtual Power: The Future of Energy Flexibility, an organizer for the Data Visualization and Data Engineers DC Meetups, and the executive director at Data Community DC, a 501(c)(3) nonprofit dedicated to data education in the national capital area. In his free time, Will enjoys wildlife photography, gardening, reading, cooking, art, DIY electronics, and traveling.