Portfolio item number 1
Short description of portfolio item number 1
Portfolio item number 2
Short description of portfolio item number 2
Published in International Journal of Radiation Oncology, Biology, Physics, 2020
The purpose of this abstract is to describe the application of deep learning to digital histopathology slide data for detection of clinically relevant features. Deep learning is a form of artificial intelligence that can process graphical data and “learn” to extract hidden features. Here we test the ability of deep learning to detect human papillomavirus, location of origin, and other features.
Recommended citation: J. Dolezal, J.N. Kather, S. Kochanny, J. Schulte, A. Patel, B. Munyampirwa, S. Morin, A. Srisuwananukorn, N. Cipriani, D. Basu, A. Pearson. International Journal of Radiation Oncology, Biology, Physics, Volume 106, Issue 5, 1165. https://www.redjournal.org/article/S0360-3016(19)34202-6/abstract
Published in ECIR, 2026
Driven by recent breakthrough advances in neural representation learning, approximate near-neighbor (ANN) search over vector embeddings has emerged as a critical computational workload. With the introduction of the seminal Hierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have established themselves as the overwhelmingly dominant paradigm for efficient and scalable ANN search.
Recommended citation: Blaise Munyampirwa, Vihan Lakshman, Benjamin Coleman. https://arxiv.org/pdf/2412.01940
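Graph-based indexes like HNSW answer queries with a greedy best-first walk over a proximity graph. The sketch below is illustrative only — the function names, toy graph, and parameters are assumptions, not taken from the paper — but it shows the core beam-search routine that HNSW runs on each layer:

```python
import heapq

def dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(graph, vectors, query, entry, ef):
    """Beam search over a proximity graph: expand the closest unexplored
    candidate until no candidate can improve the current top-`ef` results.
    All names here are illustrative."""
    visited = {entry}
    d0 = dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    best = [(-d0, entry)]        # max-heap (negated) of current results
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break                # no remaining candidate can improve the results
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                nd = dist(vectors[nb], query)
                if len(best) < ef or nd < -best[0][0]:
                    heapq.heappush(candidates, (nd, nb))
                    heapq.heappush(best, (-nd, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
    return sorted((-d, n) for d, n in best)

# Toy chain graph 0-1-2-3 embedded on a line; the search walks toward the query.
vectors = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nearest = greedy_search(graph, vectors, query=(3, 0), entry=0, ef=2)
# nearest == [(0, 3), (1, 2)]: node 3 at distance 0, node 2 at distance 1
```

The `ef` parameter trades recall for speed: a wider beam visits more nodes but is less likely to get stuck in a local minimum of the distance function.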
Published in Interspeech, 2025
Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates.
Recommended citation: Berkin Durmus, Blaise Munyampirwa, Eduardo Pacheco, Atila Orhon, Andrey Leonov. https://arxiv.org/pdf/2507.16136
Published as a preprint, 2025
Modern vector databases enable efficient retrieval over high-dimensional neural embeddings, powering applications from web search to retrieval-augmented generation. However, classical theory predicts such tasks should suffer from the curse of dimensionality, where distances between points become nearly indistinguishable, thereby crippling efficient nearest-neighbor search. We revisit this paradox through the lens of stability, the property that small perturbations to a query do not radically alter its nearest neighbors. Building on foundational results, we extend stability theory to three key retrieval settings widely used in practice: (i) multi-vector search, where we prove that the popular Chamfer distance metric preserves single-vector stability, while average pooling aggregation may destroy it; (ii) filtered vector search, where we show that sufficiently large penalties for mismatched filters can induce stability even when the underlying search is unstable; and (iii) sparse vector search, where we formalize and prove novel sufficient stability conditions. Across synthetic and real datasets, our experimental results match our theoretical predictions, offering concrete guidance for model and system design to avoid the curse of dimensionality.
Recommended citation: Vihan Lakshman, Blaise Munyampirwa, Julian Shun, Benjamin Coleman. https://arxiv.org/pdf/2512.12458
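The contrast between Chamfer scoring and average pooling can be shown in a few lines. This is a hypothetical sketch, not the paper's exact formulation: Chamfer scoring matches each query vector to its best document vector, while average pooling collapses each set to a single centroid first, discarding exactly the per-vector structure that distinguishes documents:

```python
import numpy as np

def chamfer_score(Q, D):
    """Chamfer (MaxSim-style) similarity: each query vector is matched to
    its best document vector and the scores are summed. Illustrative
    sketch, not the paper's exact formulation."""
    return float((Q @ D.T).max(axis=1).sum())

def avg_pool_score(Q, D):
    """Average pooling collapses each multi-vector set to one centroid
    before comparing, discarding per-vector structure."""
    return float(Q.mean(axis=0) @ D.mean(axis=0))

Q  = np.array([[1.0, 0.0], [0.0, 1.0]])   # query with two distinct sub-vectors
D1 = np.array([[1.0, 0.0], [0.0, 1.0]])   # document matching both sub-vectors exactly
D2 = np.array([[0.5, 0.5], [0.5, 0.5]])   # document equal to D1's centroid, repeated

# Chamfer separates D1 (score 2.0) from D2 (score 1.0);
# average pooling scores both documents identically (0.5).
```

Because `D2` has the same centroid as `D1`, average pooling cannot tell them apart, while Chamfer scoring still rewards the document whose individual vectors match the query's individual vectors.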
Published:
Near neighbor search over vector embeddings is a linchpin of modern ML infrastructure, forming a core component of established applications to search and retrieval as well as emerging LLM applications via retrieval-augmented generation (RAG). The seminal Hierarchical Navigable Small World (HNSW) graph index is perhaps the most popular choice in current vector database implementations. In this talk, we share two methods to significantly optimize the HNSW memory consumption and query latency, by removing the hierarchical component of the index and reordering the graph layout. Our extensive benchmark studies show that these methods are simple, easy to productionize, and offer robust performance improvements (on the order of 20-30% peak memory and latency).
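Graph-layout reordering of the kind mentioned above can be sketched as relabeling nodes so that nodes expanded together during search sit at nearby ids, and hence nearby memory addresses in an array-backed index. The BFS relabeling below is a minimal illustrative sketch under that assumption; production systems optimize stronger locality objectives:

```python
from collections import deque

def bfs_reorder(graph, entry):
    """Relabel nodes in BFS order from the search entry point so that
    neighborhoods visited together become contiguous id ranges.
    Minimal illustrative sketch of graph-layout reordering."""
    order, seen = [], {entry}
    queue = deque([entry])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    # Map each old id to its position in BFS order, then rewrite the graph.
    remap = {old: new for new, old in enumerate(order)}
    return {remap[u]: [remap[v] for v in graph[u]] for u in graph}

# Scattered ids 0, 3, 1 become the contiguous BFS order 0, 1, 2.
reordered = bfs_reorder({0: [3], 3: [0, 1], 1: [3]}, entry=0)
# reordered == {0: [1], 1: [0, 2], 2: [1]}
```

The relabeled graph encodes the same topology; only the storage order changes, which is why such reorderings can reduce cache misses during traversal without affecting recall.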