53 streams from a single H100.
A custom NIM server, vLLM serving with TRT-LLM kernels, and SNAC isolated onto a gRPC tier with micro-batch scheduling. Orpheus runs 2.2× over industry-leading throughput.
Led the zero-to-one build of a custom NIM server combining vLLM's serving layer with TRT-LLM kernels and decoupled SNAC decoding for scalable, optimized deployment.
Isolated SNAC onto an optimized gRPC server with micro-batch scheduling, achieving 2.2× industry-leading throughput at an average of 53 real-time concurrent streams per H100.
Accelerated SNAC decoding with CUDA Graphs over gRPC. Orchestrated a MegaPod setup on K8s with 7 LLM pods across 7 H100s and 3 SNAC instances on a single H100 — throughput bottlenecked only by the LLM backbone.
Trained a custom EAGLE-3 draft model for Orpheus with 63%+ draft acceptance, pushing vLLM throughput by 2× to support 16 concurrent real-time streams. Also implemented Suffix Tree Decoding for Orpheus and Qwen3 32B for agentic workflows.