hi, i'm aryan.
inference & optimizations engineer at sarvam ai. below is what i've shipped so far.
i've also just started writing.
Experience
Inference & Optimizations Intern
Sarvam AI
Orpheus TTS: 2.2x Over Industry, 53 Streams per H100
- Led the zero-to-one build of a custom NIM server for the Orpheus TTS model, combining vLLM's serving layer with TRT-LLM kernels and decoupled SNAC decoding for scalable, optimized deployment.
- Isolated SNAC onto an optimized gRPC server with micro-batch scheduling, achieving 2.2x the industry benchmark throughput at an average of 53 real-time concurrent streams per H100.
- Accelerated SNAC decoding with CUDA Graphs behind the gRPC server. Orchestrated a MegaPod setup on K8s with 7 LLM pods across 7 H100s and 3 SNAC instances on a single H100, leaving throughput bottlenecked only by the LLM backbone.
- Trained a custom EAGLE-3 draft model for Orpheus with a 63%+ draft acceptance rate (draft sampling in vLLM), doubling vLLM throughput to support 16 concurrent real-time streams. Also implemented Suffix Tree Decoding for Orpheus and Qwen3 32B for agentic workflows.
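The micro-batch scheduling idea behind the SNAC server can be sketched in plain Python: concurrent requests are drained into a small batch within a short window, then processed in one call. The class name, batch size, and 5 ms window here are illustrative, not the production server:

```python
import asyncio

class MicroBatcher:
    """Collect concurrent requests into small batches before running the model.

    Illustrative sketch: the real server fronts a GPU decoder over gRPC;
    here `model_fn` is any callable that processes a list of inputs at once.
    """
    def __init__(self, model_fn, max_batch=8, window_ms=5):
        self.model_fn = model_fn
        self.max_batch = max_batch
        self.window = window_ms / 1000
        self.queue = asyncio.Queue()

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        while True:
            item, fut = await self.queue.get()
            batch = [(item, fut)]
            deadline = asyncio.get_running_loop().time() + self.window
            # Keep draining until the window closes or the batch is full.
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.model_fn([it for it, _ in batch])
            for (_, f), out in zip(batch, outputs):
                f.set_result(out)

async def demo():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results

print(asyncio.run(demo()))  # -> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The win is amortization: one batched decode over N waiting requests costs far less than N single-item launches, at the price of at most one window of added latency.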
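Suffix-tree decoding drafts tokens for free by matching the current suffix of the sequence against earlier context and copying what followed it, which works well on the repetitive structure of TTS token streams and agentic transcripts. A minimal pure-Python sketch of the lookup (the real implementation indexes token IDs with a suffix tree; this linear scan is only illustrative):

```python
def draft_from_history(tokens, max_suffix=4, num_draft=3):
    """Propose draft tokens by finding the longest recent suffix that
    already occurred earlier in `tokens`, and copying what followed it.
    Returns [] when no suffix repeats (the target model then decodes normally)."""
    for n in range(min(max_suffix, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier positions for the same n-gram (most recent match first).
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                cont = tokens[start + n:start + n + num_draft]
                if cont:
                    return cont
    return []

# A repeated phrase lets the drafter copy the continuation verbatim.
seq = ["the", "cat", "sat", "on", "the", "mat", ".", "the", "cat", "sat"]
print(draft_from_history(seq))  # -> ['on', 'the', 'mat']
```

As with any speculative scheme, the drafted tokens are then verified in a single forward pass of the target model, so wrong guesses cost one step and right guesses save several.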
Streaming ASR: Disaggregated Setup, 8 → 84 Streams, p95 TTFT 150ms
- Architected a disaggregated streaming ASR pipeline from scratch with 5 TensorRT engines on a single H100, and built a scalable, reliable gRPC server for online serving, achieving an industry-leading SLA with p95 TTFT of 150ms on 1s streaming audio chunks.
- Scaled streaming ASR from 8 to 84 concurrent streams per H100 using CUDA streams, CUDA events, continuous batching, and a disaggregated encoder/decoder setup with per-slot synchronization.
- Replaced Python locks with CUDA events for GPU-side encoder-decoder coordination, eliminating GIL contention on the hot path. Simplified from 5 streams to 2 dedicated streams with event-driven signaling, hitting the memory-bound theoretical max at 84 sessions.
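Continuous batching with per-slot state, as used above, can be sketched in pure Python: sessions join and leave a fixed set of slots independently, and every step decodes all active slots in one batched call. Names like `ContinuousBatcher` and the toy `step_fn` are illustrative, not the production pipeline:

```python
class ContinuousBatcher:
    """Toy continuous-batching loop: a fixed number of slots, sessions join
    and leave independently, and every step decodes all active slots at once.
    `step_fn` stands in for one batched GPU decode step (illustrative)."""
    def __init__(self, num_slots, step_fn):
        self.slots = [None] * num_slots   # per-slot session state
        self.step_fn = step_fn

    def admit(self, session):
        for i, s in enumerate(self.slots):
            if s is None:
                self.slots[i] = session
                return i
        return -1  # no free slot: caller queues the session

    def step(self):
        active = [(i, s) for i, s in enumerate(self.slots) if s is not None]
        if not active:
            return []
        # One batched step over all active sessions (on the GPU this is a
        # single launch; here step_fn returns (token, done) per session).
        results = self.step_fn([s for _, s in active])
        finished = []
        for (i, s), (tok, done) in zip(active, results):
            s["out"].append(tok)
            if done:
                finished.append(s)
                self.slots[i] = None  # free the slot immediately
        return finished

def fake_decode(sessions):
    out = []
    for s in sessions:
        s["remaining"] -= 1
        out.append(("tok", s["remaining"] == 0))
    return out

b = ContinuousBatcher(2, fake_decode)
b.admit({"id": "a", "remaining": 1, "out": []})
b.admit({"id": "b", "remaining": 3, "out": []})
done = b.step()                                   # "a" finishes, freeing a slot
b.admit({"id": "c", "remaining": 2, "out": []})   # "c" joins mid-flight
```

The point is that short sessions never block long ones: a slot is recycled the moment its session completes, which is what lifts utilization versus static batching.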
Whisper ASR: 1800+ RTFx on Single H100, 5-50x Scalability on NIM
- Optimized Whisper-Large-V2 on a custom NIM server, delivering a 5-50x scalability gain and 2.29x-4.43x latency reduction while achieving 1800+ RTFx on a single H100 for production voice-agent workloads.
AI Intern
Bhashini · C4GT DMP'24
Model Compression & Edge Deployment
- Engineered a pipeline to compress ASR NeMo modules using 8-bit integer quantization, achieving a 73% size reduction (down to 143 MB) for edge-device inference.
- Reduced a CTranslate2 model from 2.7 GB to 224 MB using static quantization, a 92% reduction with only 0.2% accuracy loss.
- Developed a demo Android application for ASR across all 22 Indic languages, achieving under 1s inference for 7s audio. Built a macOS application for offline NMT inference.
- Presented work at DPG Dialogues'24 as a lightning talk. Watch →
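The 8-bit quantization used in the bullets above can be sketched with NumPy as a symmetric per-tensor scheme: store int8 weights plus one fp32 scale, which is where the roughly 4x storage saving over fp32 comes from. This is a minimal sketch; the actual NeMo and CTranslate2 pipelines use their own calibrated quantization schemes:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: store int8 weights plus one
    fp32 scale, and dequantize as q * scale. ~4x smaller than fp32 storage."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # -> 4 (fp32 vs int8 storage)
```

Rounding to the nearest of 255 levels bounds the per-weight error at half a scale step, which is why accuracy loss stays small when the weight distribution has no extreme outliers.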
Projects
Moonlight Protein Engineering
- Applied Meta's ESM-2 model to generate embeddings from moonlighting protein sequences.
- Benchmarked Bi-LSTMs, MLPs, and classical ML models (SVM, Random Forest, KNN, etc.) in a detailed comparative analysis.
- Improved accuracy by 7.2% using ensembling, reaching ~81.7%.
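One common ensembling approach is soft voting: average each model's per-class probabilities, then take the argmax. A minimal NumPy sketch (the three toy probability arrays are illustrative, not the project's actual models):

```python
import numpy as np

def soft_vote(prob_list):
    """Average per-class probabilities from several models, then argmax.
    Each element of prob_list is an (n_samples, n_classes) array."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

# Three toy "models" that disagree on sample 1; averaging resolves it.
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.8, 0.2], [0.7, 0.3]])
p3 = np.array([[0.6, 0.4], [0.8, 0.2]])
print(soft_vote([p1, p2, p3]).tolist())  # -> [0, 0]
```

Averaging works because uncorrelated errors partially cancel, which is how an ensemble can beat its best single member.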
Education
Cluster Innovation Centre, University of Delhi
B.Tech. (IT & Mathematical Innovation) · CGPA: 8.82
Skills
Python · Machine Learning · Deep Learning · PyTorch · Triton Server · TensorRT · ONNX · TensorRT-LLM · vLLM · SGLang · Docker · NVIDIA NIM · NVIDIA Dynamo · LLM Perf · LMCache · Spec Decoding