2025-12-11 –, Data Engineering & Infrastructure
Geospatial analysis often involves harmonizing and processing raster datasets from diverse sources with varying resolutions, coordinate systems, and data formats. This talk demonstrates how to build efficient, scalable pipelines for zonal statistics extraction using Python’s scientific computing stack, in particular xarray and Dask, to handle rasters that would otherwise overwhelm traditional processing approaches.
Through a real-world case study of processing multi-source geospatial data for small-area estimation of poverty, we’ll explore practical strategies for memory-efficient raster harmonization, parallel computing workflows, and automated statistical aggregation across administrative boundaries.
This talk addresses a common challenge faced by data scientists, data engineers, researchers, and geospatial analysts working with large-scale geospatial data: how to efficiently process and harmonize raster datasets that exceed memory limits, while maintaining both data integrity and computational performance. Attendees are expected to have a basic familiarity with Python and an understanding of fundamental geospatial concepts.
I will begin by outlining prevalent issues in geospatial data processing, such as memory constraints when working with large rasters, the difficulty of harmonizing datasets with varying resolutions and projections, and the computational cost of performing zonal statistics across multiple layers. To address these challenges, I will demonstrate how libraries like xarray and rioxarray offer elegant abstractions for geospatial data manipulation, while Dask facilitates out-of-core computation and parallel processing. A technical walkthrough will showcase a flexible pipeline designed to handle key data processing scenarios: downsampling, upsampling, masking, managing missing values, and other steps.
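As a rough illustration of the masking, missing-value, downsampling, and upsampling steps mentioned above, here is a minimal NumPy-only sketch. It is not the pipeline shown in the talk: the function names, the `NODATA` sentinel, and the toy raster are all hypothetical, and it assumes the two grids are already aligned and differ only by an integer resolution factor. In an xarray/rioxarray workflow, `DataArray.coarsen(...)` and `rio.reproject_match(...)` would typically play these roles.

```python
import numpy as np

NODATA = -9999.0  # hypothetical nodata sentinel, chosen for this example only


def mask_nodata(raster, nodata=NODATA):
    """Return a float copy of the raster with nodata cells replaced by NaN."""
    out = raster.astype("float64")  # astype returns a copy by default
    out[out == nodata] = np.nan
    return out


def downsample_mean(raster, factor):
    """Aggregate factor x factor blocks using a mean that ignores NaN cells."""
    rows, cols = raster.shape
    if rows % factor or cols % factor:
        raise ValueError("raster shape must be divisible by factor")
    # Rearrange so that each output cell owns one flat block of input cells.
    blocks = raster.reshape(rows // factor, factor, cols // factor, factor)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(
        rows // factor, cols // factor, factor * factor
    )
    valid = ~np.isnan(blocks)
    counts = valid.sum(axis=-1)
    sums = np.where(valid, blocks, 0.0).sum(axis=-1)
    # Blocks with no valid cells stay NaN instead of becoming 0.
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)


def upsample_nearest(raster, factor):
    """Repeat each cell factor times along both axes (nearest neighbour)."""
    return np.repeat(np.repeat(raster, factor, axis=0), factor, axis=1)


# Toy 4x4 raster with a nodata region in the upper-right corner.
fine = np.array(
    [
        [1.0, 3.0, NODATA, NODATA],
        [5.0, 7.0, NODATA, NODATA],
        [2.0, 2.0, 4.0, NODATA],
        [2.0, 2.0, 4.0, 4.0],
    ]
)

coarse = downsample_mean(mask_nodata(fine), factor=2)  # 2x2 grid
restored = upsample_nearest(coarse, factor=2)          # back to 4x4
```

Converting the sentinel to NaN before aggregating keeps nodata cells from being averaged in as real values, which is one common way to protect data integrity during resampling.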
I will give a live coding demonstration drawn from a project on zonal statistics for small-area poverty estimation. This will include processing layers such as population density, distance to healthcare, and nightlights to produce harmonized zonal statistics at administrative level three for a selected country. To wrap up, we’ll briefly touch on optimization strategies, including chunking techniques and memory management.
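To make the idea of chunked zonal statistics concrete, the sketch below computes a per-zone mean by accumulating sums and counts one block of rows at a time, so only a single block needs to be in memory at once. It is a simplified NumPy stand-in for what Dask does with chunked arrays, not the code from the demonstration. It assumes the administrative boundaries have already been rasterized onto the value grid as integer zone ids in the range 0 to `n_zones - 1`, with negative ids marking cells outside every zone; the function name and toy arrays are hypothetical.

```python
import numpy as np


def zonal_mean_chunked(values, zones, n_zones, chunk_rows=2):
    """Mean of `values` per zone id, accumulated one row block at a time.

    Assumes `zones` holds integer ids in [0, n_zones) and that negative
    ids mark cells that belong to no zone.
    """
    sums = np.zeros(n_zones)
    counts = np.zeros(n_zones)
    for start in range(0, values.shape[0], chunk_rows):
        v = values[start:start + chunk_rows].ravel()
        z = zones[start:start + chunk_rows].ravel()
        # Skip missing values and cells outside every zone.
        valid = ~np.isnan(v) & (z >= 0)
        sums += np.bincount(z[valid], weights=v[valid], minlength=n_zones)
        counts += np.bincount(z[valid], minlength=n_zones)
    # Zones with no valid cells are reported as NaN.
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)


# Toy value layer with two missing cells.
values = np.array(
    [
        [10.0, 20.0, 30.0, np.nan],
        [10.0, 20.0, 30.0, 30.0],
        [1.0, 1.0, 5.0, 5.0],
        [1.0, 1.0, 5.0, np.nan],
    ]
)

# Rasterized zone ids; -1 marks cells outside all zones.
zones = np.array(
    [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [2, 2, -1, -1],
        [2, 2, -1, -1],
    ]
)

means = zonal_mean_chunked(values, zones, n_zones=4, chunk_rows=2)
```

Because sums and counts are additive, the result does not depend on the chunk size, which is what allows the chunk size to be tuned for memory without changing the statistics.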
Clinton Oyogo David is a Data Scientist at Oxford Policy Management, specializing in geospatial analytics, data engineering, dashboard development, and automation. He has led data-intensive projects across Africa and Asia, building data pipelines, dashboards, and analyses for various organisations. Clinton combines a background in statistics with a deep interest in scalable data solutions that inform policy and drive impact. His recent work focuses on harmonizing large raster datasets using tools like xarray and Dask to support small-area estimation of poverty and sustainable development research.