2025-11-09, Tutorial Track 4
DataMaps are ML-powered visualizations of high-dimensional data, and in this talk the data is collections of embedding vectors. Interactive DataMaps run in-browser as web-apps, potentially without any code running on the web server. DataMap tech can be used to visualize, say, the entire collection of chunks in a RAG vector database.
The best-of-breed tools of this new DataMap technique are liberally licensed open source. This presentation is an introduction to building with those repos. The maths will be mentioned only in passing; the topic here is simply how-to with specific tools. Talk attendees will be learning about Python tools, which produce high-quality web UIs.
DataMapPlot is the premier tool for rendering a DataMap as a web-app. Here is a live demo thereof:
http://connoiter.com/datamap/cff30bc1-0576-44f0-a07c-60456e131b7b
00-10: Intro to DataMaps
10-15: A pipeline blueprint and a DataMap file format that gets assembled by pipelines
15-35: Demo tour of tools such as UMAP, HDBSCAN, DataMapPlot, Toponymy, etc.
35-40: Q & A
DataMaps are a new visualization technique for high-dimensional data that is especially useful when working with embedding vectors, which are proliferating wildly with the success of LLMs and RAG systems.
The DataMap conceptual model can be framed as an extended metaphor based on real-world geographic maps -- think Google Maps for high-dimensional data. To wit:
- The scene opens on a moonlit 3D landscape viewed as if from a satellite
- The data being mapped are represented as points of light scattered across the surface of the landscape, positioned such that points similar to each other in the original high-dimensional space sit near each other on the 3D map
- The points are grouped into a tree of clusters
- The world starts as a landless water-world but the water recedes
- Eventually islands appear, which are the clusters
- Continued draining of the landscape leads to cluster agglomeration
- DataMap viewers allow users to navigate within the 3D space to perform exploratory data analysis on the high-dimensional data (read: embedding vectors).
The tech behind the above metaphor:
- The elevation is the probability density
- The placement of points is determined via ML-based dimensionality reduction algorithms (UMAP, t-SNE, etc.)
- The hierarchical clustering is sometimes called Topic Modeling. In DataMaps, this is usually performed by HDBSCAN and variants (FLASC, etc.)
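The receding-water picture can be made concrete with a toy sketch in plain Python. This is not HDBSCAN, UMAP, or DataMapPlot code, only an illustration of the idea that elevation is density and that islands appear and then merge as the water level drops. The function names `density` and `islands`, the `bandwidth` and `link_radius` parameters, and the sample points are all made up for this example.

```python
import math

def density(points, bandwidth=0.5):
    """Unnormalized Gaussian kernel density estimate at each 2D point."""
    out = []
    for x1, y1 in points:
        total = 0.0
        for x2, y2 in points:
            d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
            total += math.exp(-d2 / (2 * bandwidth ** 2))
        out.append(total / len(points))
    return out

def islands(points, elevation, water_level, link_radius=1.0):
    """Count connected groups of points that sit above the water level.

    Two dry points belong to the same island if they are within
    link_radius of each other (directly or through other dry points).
    """
    dry = [i for i, e in enumerate(elevation) if e > water_level]
    parent = {i: i for i in dry}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in dry:
        for b in dry:
            if a < b and math.dist(points[a], points[b]) <= link_radius:
                parent[find(a)] = find(b)
    return len({find(i) for i in dry})

# Hypothetical data: two dense blobs joined by a sparse bridge of points.
blob = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1)]
bridge = [(1.0, 0.1), (1.6, 0.1), (2.2, 0.1)]
points = blob + bridge + [(x + 3.0, y) for x, y in blob]

elevation = density(points)  # "elevation is the probability density"

# Drain the water: nothing is dry, then two islands, then one merged island.
for level in (0.5, 0.3, 0.1):
    print(level, islands(points, elevation, level))
# 0.5 0
# 0.3 2
# 0.1 1
```

Real pipelines replace both steps with far more robust machinery: the 2D positions come from a dimensionality reduction such as UMAP rather than being given, and the cluster tree comes from HDBSCAN or a variant rather than a fixed linking radius.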
The tooling is code that implements various topological data analysis (TDA) algorithms but the maths aspect will not be covered in any depth; this is all about how to build DataMap pipelines with open source tools.
The following open source tools will be covered along with demo code:
- UMAP
- HDBSCAN
- DataMapPlot
- Toponymy
- NOMAD
- Vectorizers
- Vectron
- King Tutte
The core FOSS tool for making DataMaps is DataMapPlot. Here is a live demo thereof:
https://connoiter.com/datamap/cff30bc1-0576-44f0-a07c-60456e131b7b
No previous knowledge expected
Founder/CTO of Connoiter, producing liberally licensed open source DataMap tooling and driving the effort to have a widely useful DataMap data schema in order to promote interoperability and reduce bit rot.