Aryan Sharma

53 streams from a single H100.

A custom NIM server, vLLM serving with TRT-LLM kernels, and SNAC isolated onto a gRPC tier with micro-batch scheduling. Orpheus runs at 2.2× the industry-leading throughput.

Concurrent streams per H100: 53 (Orpheus TTS, real-time)

Led the zero-to-one build of a custom NIM server combining vLLM's serving layer with TRT-LLM kernels and decoupled SNAC decoding for scalable, optimized deployment.

Isolated SNAC onto an optimized gRPC server with micro-batch scheduling, achieving 2.2× industry-leading throughput at an average of 53 real-time concurrent streams per H100.

Accelerated SNAC decoding with CUDA Graphs over gRPC. Orchestrated a MegaPod setup on K8s with 7 LLM pods across 7 H100s and 3 SNAC instances on a single H100 — throughput bottlenecked only by the LLM backbone.

Trained a custom EAGLE-3 draft model for Orpheus with 63%+ draft acceptance, pushing vLLM throughput by 2× to support 16 concurrent real-time streams. Also implemented Suffix Tree Decoding for Orpheus and Qwen3 32B for agentic workflows.

8 → 84 streams. p95 TTFT: 150 ms.

5 TensorRT engines on a single H100. Replaced Python locks with CUDA events on the hot path — eliminated GIL contention, hit the memory-bound theoretical max at 84 sessions.

Concurrent streams: 8 → 84 (single H100)

p95 TTFT: 150 ms (1 s audio chunks)

Architected a disaggregated streaming ASR pipeline from scratch with 5 TensorRT engines on a single H100. Built a scalable, reliable gRPC server for online serving — industry-leading SLA at p95 TTFT of 150ms on 1s streaming audio chunks.

Scaled streaming ASR from 8 to 84 concurrent streams per H100 using CUDA streams, CUDA events, continuous batching, and a disaggregated encoder/decoder setup with per-slot synchronization.

Replaced Python locks with CUDA events for GPU-side encoder-decoder coordination, eliminating GIL contention on the hot path. Simplified from 5 streams to 2 dedicated streams with event-driven signaling, hitting the memory-bound theoretical max at 84 sessions.

1800+ RTFx on a single H100.

A 5–50× scalability gain on production voice-agent workloads. 2.29×–4.43× latency reduction.

Real-time factor: 1800+ RTFx (whisper-large-v2, NIM)

Optimized Whisper-Large-V2 on a custom NIM server, delivering a 5–50× scalability gain and 2.29×–4.43× latency reduction while achieving 1800+ RTFx on a single H100 for production voice-agent workloads.

2.7 GB → 224 MB. Same model. 0.2% loss.

ASR + NMT compressed for edge inference across 22 Indic languages. Under 1s for 7s of audio on Android.

Size reduction: 92% (CTranslate2, static quantization)

Accuracy loss: 0.2% (end-to-end)

Engineered a pipeline to compress ASR NeMo modules using 8-bit integer quantization, achieving 73% size reduction to 143MB for edge device inference.

Reduced CTranslate2 model from 2.7GB to 224MB using static quantization — 92% reduction with only 0.2% accuracy loss.

Developed a demo Android application for ASR across all 22 Indic languages, achieving under 1 second inference for 7 seconds of audio. Built a macOS application for offline NMT inference.

Presented at DPG Dialogues '24 as a lightning talk.

hi, i'm aryan.

inference & optimizations engineer at sarvam ai. below is what i've shipped so far.

Experience

Inference & Optimizations Intern

Sarvam AI

Jan'25 — Present

Orpheus TTS: 2.2x Over Industry, 53 Streams per H100

Led the zero-to-one build of a custom NIM server for the Orpheus TTS model, combining vLLM's serving layer with TRT-LLM kernels and decoupled SNAC decoding for scalable, optimized deployment.
Isolated SNAC onto an optimized gRPC server with micro-batch scheduling, achieving 2.2x industry-leading throughput at an average of 53 real-time concurrent streams per H100.
Accelerated SNAC decoding with CUDA Graphs over gRPC serving. Orchestrated a MegaPod setup on K8s with 7 LLM pods across 7 H100s and 3 SNAC instances on a single H100, so that throughput is bottlenecked only by the LLM backbone.
Trained a custom EAGLE-3 draft model for Orpheus with 63%+ draft acceptance rate (draft sampling in vLLM), pushing vLLM throughput by 2x to support 16 concurrent real-time streams. Also implemented Suffix Tree Decoding for Orpheus and Qwen3 32B for agentic workflows.
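
The micro-batch scheduling mentioned above can be illustrated with a minimal sketch. It is not the production scheduler: the function name `collect_micro_batch` and the parameters `max_batch` and `max_wait_s` are hypothetical, and the sketch only shows the general idea of waiting briefly so that several decode requests can share one batched call.

```python
import queue
import time
from typing import Any, List


def collect_micro_batch(requests: "queue.Queue[Any]", max_batch: int, max_wait_s: float) -> List[Any]:
    """Block for the first request, then gather more until the batch is
    full or max_wait_s has elapsed since the first request arrived.

    Hypothetical helper for illustration; not taken from the real server.
    """
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived within the window
    return batch
```

The two knobs trade latency for throughput: a larger `max_wait_s` fills batches more often but delays the first request in each batch.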

Streaming ASR: Disaggregated Setup, 8 → 84 Streams, p95 TTFT 150ms

Architected a disaggregated streaming ASR pipeline from scratch with 5 TensorRT engines on a single H100, and built a scalable, reliable gRPC server for online serving, achieving industry-leading SLA with p95 TTFT of 150ms on 1s streaming audio chunks.
Scaled streaming ASR from 8 to 84 concurrent streams per H100 using CUDA streams, CUDA events, continuous batching, and a disaggregated encoder/decoder setup with per-slot synchronization.
Replaced Python locks with CUDA events for GPU-side encoder-decoder coordination, eliminating GIL contention on the hot path. Simplified from 5 streams to 2 dedicated streams with event-driven signaling, hitting the memory-bound theoretical max at 84 sessions.
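
As a rough illustration of continuous batching over a fixed number of slots, the sketch below admits waiting sessions as soon as a slot frees up instead of waiting for a whole batch to finish. The `SlotPool` class and its methods are hypothetical names, and the CUDA streams and events used for the actual per-slot synchronization are deliberately left out so the example runs on CPU.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class SlotPool:
    """Fixed set of decoding slots (hypothetical; e.g. 84 in the setup above)."""

    def __init__(self, num_slots: int) -> None:
        self.slots: List[Optional[str]] = [None] * num_slots
        self.waiting: Deque[str] = deque()

    def submit(self, session_id: str) -> None:
        """Queue a new streaming session until a slot is free."""
        self.waiting.append(session_id)

    def release(self, session_id: str) -> None:
        """Free the slot held by a finished session."""
        self.slots[self.slots.index(session_id)] = None

    def step(self) -> Dict[int, str]:
        """Fill free slots from the waiting queue, then return the slot map
        that one batched forward pass would cover."""
        for idx, occupant in enumerate(self.slots):
            if occupant is None and self.waiting:
                self.slots[idx] = self.waiting.popleft()
        return {i: s for i, s in enumerate(self.slots) if s is not None}
```

Because each session keeps its slot index for its whole lifetime, per-slot state can stay in place between steps.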

Whisper ASR: 1800+ RTFx on Single H100, 5-50x Scalability on NIM

Optimized Whisper-Large-V2 on a custom NIM server, delivering a 5-50x scalability gain and 2.29x-4.43x latency reduction while achieving 1800+ RTFx on a single H100 for production voice-agent workloads.
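
For readers unfamiliar with the metric, the sketch below assumes the usual definition of RTFx as audio duration divided by processing time. How the 1800+ figure was measured (batch size, concurrency, audio lengths) is not stated here, so the helper names and the numbers in the usage note are illustrative only.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Compute time implied by a given RTFx for a given amount of audio."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value
```

Under this definition, an RTFx of 1800 corresponds to processing one hour of audio (3600 s) in about 2 s of compute.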

AI Intern

Bhashini · C4GT DMP'24

Jun'24 — Sept'24

Model Compression & Edge Deployment

Engineered a pipeline to compress ASR NeMo modules using 8-bit integer quantization, achieving a 73% size reduction to 143MB for edge device inference.
Reduced a CTranslate2 model from 2.7GB to 224MB using static quantization, a 92% reduction with only 0.2% accuracy loss.
Developed a demo Android application for ASR across all 22 Indic languages, achieving under 1 second inference for 7 seconds of audio. Built a macOS application for offline NMT inference.
Presented the work at DPG Dialogues '24 as a lightning talk.
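
The general idea behind static 8-bit quantization can be sketched as follows. This is a generic symmetric, per-tensor scheme in pure Python, not the NeMo or CTranslate2 implementation, and it does not attempt to reproduce the sizes reported above. The function names are hypothetical.

```python
from typing import List, Sequence


def calibrate_scale(calibration_values: Sequence[float]) -> float:
    """Fix the scale ahead of time so the largest calibration magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values: Sequence[float], scale: float) -> List[int]:
    """Map floats to int8 codes, clamping anything outside the calibrated range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]
```

"Static" here means the scale is computed once from calibration data and then reused at inference time, rather than being recomputed for each input.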

Projects

Moonlight Protein Engineering

GitHub →

Dec'22 — Feb'23

Applied Meta's ESM-2 model to generate embeddings from moonlighting protein sequences.
Benchmarked Bi-LSTMs, MLPs, and classical ML techniques (SVM, Random Forest, KNN, etc.) in a detailed comparative analysis.
Improved accuracy by 7.2% using ensembling techniques, reaching ~81.7% accuracy.

Education

Cluster Innovation Centre, University of Delhi

B.Tech. (IT & Mathematical Innovation) · CGPA: 8.82

2022 — 2026

Skills

Python · Machine Learning · Deep Learning · PyTorch · Triton Server · TensorRT · ONNX · TensorRT-LLM · vLLM · SGLang · Docker · NVIDIA NIM · NVIDIA Dynamo · LLM Perf · LMCache · Spec Decoding
