Publications

Breaking the Curse of Dimensionality: On the Stability of Modern Vector Retrieval

Published in (Preprint), 2025

Modern vector databases enable efficient retrieval over high-dimensional neural embeddings, powering applications from web search to retrieval-augmented generation. However, classical theory predicts such tasks should suffer from the curse of dimensionality, where distances between points become nearly indistinguishable, thereby crippling efficient nearest-neighbor search. We revisit this paradox through the lens of stability, the property that small perturbations to a query do not radically alter its nearest neighbors. Building on foundational results, we extend stability theory to three key retrieval settings widely used in practice: (i) multi-vector search, where we prove that the popular Chamfer distance metric preserves single-vector stability, while average pooling aggregation may destroy it; (ii) filtered vector search, where we show that sufficiently large penalties for mismatched filters can induce stability even when the underlying search is unstable; and (iii) sparse vector search, where we formalize and prove novel sufficient stability conditions. Across synthetic and real datasets, our experimental results match our theoretical predictions, offering concrete guidance for model and system design to avoid the curse of dimensionality.

Recommended citation: Vihan Lakshman, Blaise Munyampirwa, Julian Shun, Benjamin Coleman https://arxiv.org/pdf/2512.12458

SDBench: A Comprehensive Benchmark Suite for Speaker Diarization

Published in Interspeech, 2025

Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates.

Recommended citation: Berkin Durmus, Blaise Munyampirwa, Eduardo Pacheco, Atila Orhon, Andrey Leonov https://arxiv.org/pdf/2507.16136

Down with the Hierarchy: The H in HNSW stands for Hubs

Published in ECIR, 2026, 2025

Driven by recent breakthrough advances in neural representation learning, approximate near-neighbor (ANN) search over vector embeddings has emerged as a critical computational workload. With the introduction of the seminal Hierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have established themselves as the overwhelmingly dominant paradigm for efficient and scalable ANN search.

Recommended citation: Blaise Munyampirwa, Vihan Lakshman, Benjamin Coleman https://arxiv.org/pdf/2412.01940

Deep learning detects actionable molecular and clinical features directly from head/neck squamous cell carcinoma histopathology slides

Published in International Journal of Radiation Oncology, Biology, Physics, 2020

The purpose of this abstract is to describe the application of deep learning to digital histopathology slide data for detection of clinically relevant features. Deep learning is a form of artificial intelligence which can process graphical data and “learn” to extract hidden features. Here we test the ability of deep learning to detect human papilloma virus, location of origin, and other features.

Recommended citation: J. Dolezal, J.N. Kather, S. Kochanny, J. Schulte, A. Patel, B. Munyampirwa, S. Morin, A. Srisuwananukorn, N. Cipriani, D. Basu, A. Pearson. International Journal of Radiation Oncology, Biology, Physics, Volume 106, Issue 5, 1165 https://www.redjournal.org/article/S0360-3016(19)34202-6/abstract