04-19, 15:30–17:00 (US/Eastern), Room 130
Tired of waiting for massive datasets to load on your local machine? In this beginner-friendly tutorial, we’ll explore how to scale your data analysis skills from pandas to PySpark using a real-world anime dataset. We’ll walk through the basics of distributed computing, discuss why Spark was created, and demonstrate the benefits of working with PySpark for big data tasks—including reading, cleaning, and transforming millions of records with ease. By the end of this workshop, you’ll understand how PySpark harnesses cluster computing to handle large-scale data and you’ll be comfortable applying these techniques to your own projects.
Participant Requirements:
- A laptop (any OS) with an internet connection
- A Google account (to access Colab notebooks and slides)
- Familiarity with Python and pandas
Here's the Google Colab notebook to follow along 👇🏾
https://colab.research.google.com/drive/1fi0cTQ1NIE5kDEH0ynp2sqDuVeiBJJWU?usp=sharing
Here are the slides 👇🏾
https://drive.google.com/file/d/11JIih1VzLxTJ9O6PeGzqD_e8vumTZQmw/view?usp=sharing
This tutorial aims to close the gap between small-scale data analysis and big data processing. If you’ve ever tried to load a multi-gigabyte CSV into pandas or Excel, you know the frustration of crashing programs and endless waits. You’ll see how to level up your data skills using PySpark’s distributed DataFrame API.
We’ll do more than introduce Spark concepts: we’ll work through a lively anime dataset full of ratings, genres, and user insights, so you can see how PySpark handles real-world tasks (like filtering, grouping, and joining) at scale. You’ll get comfortable with Spark’s architecture and learn how it uses lazy evaluation, cluster computing, and in-memory operations to achieve speedups. One highlight of the workshop is its hands-on approach: all exercises run in Google Colab, so there is no cluster installation or environment wrangling. We’ll walk through the entire pipeline: loading massive CSV files, performing transformations that mirror pandas operations, and drawing insights through SQL-like queries.
Expect a fast-paced but accessible look at Spark’s key features, practical code examples, and best practices to keep your big data workflows efficient and transparent.
Tutorial Outline
- Why Spark? A short overview of Hadoop MapReduce and how Spark rose to address its shortcomings.
- Distributed Data 101: Breaking down Spark’s architecture, executors, and lazy evaluation.
- Hands-On Setup: Launching PySpark in Google Colab so everyone can follow along in real time.
- Exploring the Anime Dataset: Reading data from CSV, structuring DataFrames, and performing data cleaning.
- Common Operations at Scale: Filtering, grouping, and aggregating millions of rows with PySpark.
- Comparisons to Pandas: Mapping familiar DataFrame operations to their Spark counterparts.
- Final Thoughts: Discussion of where Spark fits into modern data stacks, plus pointers for advanced usage (MLlib, streaming, cluster optimization).
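The lazy evaluation mentioned in the outline can be illustrated without Spark at all. The sketch below is a toy, pure-Python stand-in, not Spark's implementation: the `LazyPipeline` class and its method names are hypothetical. It only shows the core idea that transformations are recorded first and executed when an action such as `collect()` is called.

```python
class LazyPipeline:
    """Toy lazily evaluated dataset (illustrative only, not Spark's design)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step, do no work yet.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, do no work yet.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: run every recorded step over the source data.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []  # records each time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


# Building the pipeline only records the plan.
pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)

# The action triggers execution of the whole plan.
result = pipeline.collect()
```

Because the whole plan is known before anything runs, an engine built this way can inspect and reorganize the steps before executing them, which is the opportunity Spark's optimizer takes advantage of.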
Previous knowledge expected: familiarity with Python and pandas
Cynthia is a geospatial software engineer with a passion for teaching and making technical concepts approachable. Currently working as a backend software engineer, she develops innovative geospatial solutions that solve real-world problems. Cynthia has a strong background in Python and data science, with experience mentoring students in data analytics at Springboard and teaching Python to beginners at Masterschool.
In addition to her professional work, Cynthia is an experienced public speaker. She’s presented at PyTexas and at Arlington Code-The-Curb on her “Park and Stride” project—a web app that helps commuters integrate walking into their daily routines. Her approachable teaching style combines hands-on learning with practical insights.
Outside of work, Cynthia is passionate about graph theory, computer vision, and geospatial data. She’s currently exploring the intersection of LiDAR technology and urban mobility. When she’s not coding or mentoring, Cynthia enjoys dancing Samba and blogging about ways beginners can break into tech on her website, cynscode.com.