09-26, 11:05–11:40 (Europe/Amsterdam), Voyager
This talk presents a principled methodology for partitioning item-level data into homogeneous time-series, with the objective of maximizing monitoring coverage and improving the detection of anomalies and drifts. We discuss the theoretical underpinnings of clustering algorithms for this task and describe practical algorithms enabling efficient search for optimal partitioning. We exemplify our approach with a real-world application in large-scale monitoring environments from the online payment domain.
In time-series monitoring, item-level data - such as clicks, transactions, or service requests - is typically aggregated into pre-defined groups based on measurable categorical features (e.g., country, operating system, integration version). The resulting time series are then analyzed for significant shifts in distribution with anomaly and drift detection pipelines.
However, when the dimensionality or cardinality of the available categorical features is high, the naive stratification based on a pre-defined list of features often leads to a highly imbalanced distribution of the sizes of the time-series: a small subset of time-series contains the vast majority of observations, while the remainder comprises series with very few data points. Time-series based on a small number of items are unsuitable for reliable monitoring and are generally excluded from further analysis, resulting in incomplete coverage of the underlying item space. Moreover, heuristic or static grouping strategies frequently fail to produce homogeneous series in terms of the monitored metrics. Ideally, items exhibiting similar behavioral characteristics — those prone to fail or shift together — should be grouped within the same time-series. Without a principled approach, disparate items may be inappropriately grouped, resulting in noisy, non-informative, or misleading time-series.
This presentation introduces a systematic framework for learning an “optimal” hierarchy of categorical features for stratification. We begin by formally defining the framework and optimization objectives, followed by a comparative analysis of greedy and non-greedy algorithms designed to solve the partitioning problem efficiently. Given the computational intractability of exhaustively searching over all feature permutations, the described methods emphasize scalability and principled feature selection.
To illustrate the efficacy of our approach, we will present case studies from the payments domain, demonstrating how we reduced the number of monitored time-series from millions of heterogeneous streams to thousands of homogeneous time-series, simultaneously increasing the coverage and detection power.