FILE A.SHARMA / 2026
EDITION VOL.01 / NO.01
Inference Engineering Quarterly · Delhi · 2026

The Inference Engineer

Squeezes more concurrent streams onto NVIDIA H100s than the industry standard. Builds disaggregated production pipelines for speech and language. Writes the kernel-level details down so you don't have to learn them the hard way.

53 streams / H100
84 concurrent ASR
1800+ whisper RTFx
2.2× over industry

53 streams from a single H100.

A custom NIM server, vLLM serving with TRT-LLM kernels, and SNAC isolated onto a gRPC tier with micro-batch scheduling. Orpheus runs at 2.2× the industry-leading throughput.

CONCURRENT STREAMS / H100: 53 (Orpheus TTS · real-time)

Led the zero-to-one build of a custom NIM server combining vLLM's serving layer with TRT-LLM kernels and decoupled SNAC decoding for scalable, optimized deployment.

Isolated SNAC onto an optimized gRPC server with micro-batch scheduling, achieving 2.2× the industry-leading throughput at an average of 53 real-time concurrent streams per H100.
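As a rough illustration of micro-batch scheduling, the sketch below collects requests for a few milliseconds (or until the batch fills) and then runs them in a single call. It is not the production SNAC server: `MicroBatcher`, `decode_batch`, and the batch-size and wait values are hypothetical, and plain asyncio stands in for gRPC.

```python
import asyncio


class MicroBatcher:
    """Groups requests that arrive close together into one batched call."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.005):
        self.batch_fn = batch_fn      # takes a list of items, returns a list of results
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]          # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.batch_fn([item for item, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)


async def demo():
    sizes = []

    def decode_batch(frames):          # hypothetical stand-in for a batched SNAC decode
        sizes.append(len(frames))
        return [f * 2 for f in frames]

    batcher = MicroBatcher(decode_batch, max_batch=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, sizes
```

The wait window trades a little latency for larger batches, so in practice it would be tuned against the real-time budget of each stream.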

Accelerated SNAC decoding with CUDA Graphs over gRPC. Orchestrated a MegaPod setup on K8s with 7 LLM pods across 7 H100s and 3 SNAC instances on a single H100 — throughput bottlenecked only by the LLM backbone.

Trained a custom EAGLE-3 draft model for Orpheus with a 63%+ draft acceptance rate, doubling vLLM throughput to support 16 concurrent real-time streams. Also implemented Suffix Tree Decoding for Orpheus and Qwen3 32B for agentic workflows.
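To see why a draft acceptance rate in this range can translate into roughly 2× throughput, here is a back-of-the-envelope estimate. It assumes each drafted token is accepted independently with the same probability and that drafts form a simple chain, which simplifies how EAGLE-3 and vLLM actually behave. The draft length of 3 is a hypothetical value, not a figure from this page.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Counts the accepted prefix of the draft plus the one token the target
    model always contributes: 1 + a + a^2 + ... + a^draft_len.
    """
    return sum(accept_prob ** i for i in range(draft_len + 1))


# Hypothetical draft length; 0.63 is the acceptance rate quoted above.
tokens = expected_tokens_per_step(0.63, 3)
print(f"{tokens:.2f} tokens per target pass")
```

Under these assumptions each target pass yields a bit over two tokens, before accounting for the cost of running the draft model.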

8 → 84 streams. p95 TTFT: 150 ms.

5 TensorRT engines on a single H100. Replacing Python locks with CUDA events on the hot path eliminated GIL contention and hit the memory-bound theoretical max at 84 sessions.

CONCURRENT STREAMS: 8 → 84 (single H100)
P95 TTFT: 150 ms (1s audio chunks)

Architected a disaggregated streaming ASR pipeline from scratch with 5 TensorRT engines on a single H100. Built a scalable, reliable gRPC server for online serving — industry-leading SLA at p95 TTFT of 150ms on 1s streaming audio chunks.
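For readers unfamiliar with the metric, p95 TTFT is the time-to-first-token that 95% of requests meet or beat. A minimal nearest-rank calculation might look like the sketch below. The sample latencies are made up for illustration and are not measurements from this pipeline.

```python
def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # ceil(p / 100 * n) using integer math
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds: 19 fast requests and one slow outlier.
ttft_ms = [100] * 19 + [400]
p95 = percentile_nearest_rank(ttft_ms, 95)
worst = percentile_nearest_rank(ttft_ms, 100)
```

Here the single slow request shows up in the maximum but not in p95, which is why tail percentiles and maxima are usually reported separately.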

Scaled streaming ASR from 8 to 84 concurrent streams per H100 using CUDA streams, CUDA events, continuous batching, and a disaggregated encoder/decoder setup with per-slot synchronization.
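One way to picture continuous batching with per-slot state is a fixed pool of slots that sessions join and leave between steps, rather than waiting for a whole batch to finish. The sketch below is a simplified, CPU-only illustration. `SlotPool` and its methods are hypothetical names, and the real pipeline's slot bookkeeping is not described on this page.

```python
class SlotPool:
    """Fixed number of batch slots; sessions are admitted and released at any step."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))
        self.active = {}                      # session_id -> slot index

    def admit(self, session_id):
        """Return a slot for the session, or None if the pool is full."""
        if not self.free:
            return None
        slot = self.free.pop()
        self.active[session_id] = slot
        return slot

    def release(self, session_id):
        """Free the session's slot so a waiting session can take it next step."""
        self.free.append(self.active.pop(session_id))

    def current_batch(self):
        """Slots that should be included in the next forward pass."""
        return sorted(self.active.values())


pool = SlotPool(84)                           # 84 matches the session count quoted above
slots = [pool.admit(f"session-{i}") for i in range(84)]
overflow = pool.admit("session-84")           # None: no free slot
pool.release("session-0")
reused = pool.admit("session-84")             # takes the slot that was just freed
```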

Replaced Python locks with CUDA events for GPU-side encoder-decoder coordination, eliminating GIL contention on the hot path. Simplified from 5 streams to 2 dedicated streams with event-driven signaling, hitting the memory-bound theoretical max at 84 sessions.
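The general pattern of ordering an encoder and a decoder with a CUDA event, rather than a host-side lock, can be sketched in PyTorch as follows. This is a minimal illustration and not the pipeline's code: `encode_then_decode`, `encoder`, and `decoder` are hypothetical, streams are created per call only for brevity, and the function falls back to sequential execution when PyTorch or a GPU is unavailable.

```python
try:
    import torch
except ImportError:          # lets the sketch load on machines without PyTorch
    torch = None


def encode_then_decode(encoder, decoder, chunk):
    """Run encoder and decoder on separate CUDA streams, ordered by an event."""
    if torch is None or not torch.cuda.is_available():
        return decoder(encoder(chunk))        # CPU fallback: plain sequential call

    enc_stream = torch.cuda.Stream()
    dec_stream = torch.cuda.Stream()
    enc_done = torch.cuda.Event()

    # `chunk` was produced on the current stream, so the encoder stream waits for it.
    enc_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(enc_stream):
        features = encoder(chunk)
        enc_done.record(enc_stream)           # marks the point where features are ready

    # GPU-side dependency: the host thread does not block and holds no Python lock.
    dec_stream.wait_event(enc_done)
    with torch.cuda.stream(dec_stream):
        features.record_stream(dec_stream)    # tell the allocator about cross-stream use
        out = decoder(features)

    dec_stream.synchronize()                  # demo only, so the caller can read `out`
    return out
```

In a long-running server the streams and events would typically be created once and reused, and the final host-side synchronize would be replaced by another event or a callback.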

1800+ RTFx on a single H100.

A 5–50× scalability gain on production voice-agent workloads. 2.29×–4.43× latency reduction.

REAL-TIME FACTOR: 1800+ (whisper-large-v2 · NIM)

Optimized Whisper-Large-V2 on a custom NIM server, delivering a 5–50× scalability gain and 2.29×–4.43× latency reduction while achieving 1800+ RTFx on a single H100 for production voice-agent workloads.
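RTFx is the inverse real-time factor: seconds of audio processed per second of wall-clock time. The short sketch below shows the arithmetic. The one-hour and two-second figures are illustrative inputs chosen to land on 1800, not measurements from this deployment.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    return audio_seconds / wall_seconds


# Illustrative only: one hour of audio transcribed in two seconds of wall time.
example = rtfx(3600, 2.0)
```

At 1800 RTFx, aggregate throughput across all concurrent requests works out to half an hour of audio per second.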

2.7 GB → 224 MB. Same model. 0.2% loss.

ASR + NMT compressed for edge inference across 22 Indic languages. Under 1s for 7s of audio on Android.

SIZE REDUCTION: 92% (CTranslate2 · static quant)
ACCURACY LOSS: 0.2% (end-to-end)

Engineered a pipeline to compress ASR NeMo modules using 8-bit integer quantization, achieving 73% size reduction to 143MB for edge device inference.
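For intuition about what 8-bit integer quantization does to a weight tensor, here is a minimal symmetric per-tensor sketch in pure Python. It is not the NeMo pipeline described above: the function names and the sample weights are hypothetical, and real toolchains usually quantize per channel and calibrate activations as well.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: one float scale, integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]              # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte plus a shared scale, so if the original weights were float32 (an assumption here) the saving is bounded at about 75%, and the rounding error per weight stays within half a quantization step.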

Reduced the CTranslate2 model from 2.7GB to 224MB using static quantization: a 92% reduction with only 0.2% accuracy loss.
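The quoted percentages can be checked with a couple of lines of arithmetic. The sketch assumes decimal units (2.7 GB taken as 2700 MB). The original size of the NeMo ASR model is not stated on this page, so the value derived below is only what the 73% and 143 MB figures imply.

```python
def reduction_pct(before_mb: float, after_mb: float) -> float:
    """Size reduction as a percentage of the original size."""
    return 100 * (1 - after_mb / before_mb)


ct2_reduction = reduction_pct(2700, 224)      # CTranslate2 model, about 91.7%
implied_asr_original_mb = 143 / (1 - 0.73)    # about 530 MB before quantization
```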

Developed a demo Android application for ASR across all 22 Indic languages, achieving under 1 second inference for 7 seconds of audio. Built a macOS application for NMT offline inference.

Presented at DPG Dialogues '24 as a lightning talk. Watch →

DOSSIER · EDUCATION

Cluster Innovation Centre, University of Delhi

B.Tech. (IT & Mathematical Innovation) · CGPA 8.82 · 2022 — 2026

DOSSIER · STACK
  • Python
  • PyTorch
  • TensorRT
  • TRT-LLM
  • vLLM
  • SGLang
  • Triton Server
  • ONNX
  • NVIDIA NIM
  • NVIDIA Dynamo
  • LLM Perf
  • LMCache
  • Spec Decoding
  • Docker
  • Kubernetes
  • gRPC

DOSSIER · WRITING

notes on CUDA, inference, and the things i broke along the way →

Started 2025. First post: CUDA Streams and Events: A Real-World Guide — how to go from 8 to 84 concurrent ASR sessions on an H100.

hi, i'm aryan.

inference & optimizations engineer at sarvam ai. below is what i've shipped so far.

also just started writing.

Experience

Inference & Optimizations Intern

Sarvam AI

Jan'25 — Present
Orpheus TTS: 2.2x Over Industry, 53 Streams per H100
  • Led the zero-to-one build of a custom NIM server for the Orpheus TTS model, combining vLLM's serving layer with TRT-LLM kernels and decoupled SNAC decoding for scalable, optimized deployment.
  • Isolated SNAC onto an optimized gRPC server with micro-batch scheduling, achieving 2.2x the industry-leading throughput at an average of 53 real-time concurrent streams per H100.
  • Accelerated SNAC decoding with CUDA Graphs over gRPC serving. Orchestrated a MegaPod setup on K8s with 7 LLM pods across 7 H100s and 3 SNAC instances on a single H100, making throughput bottlenecked only by the LLM backbone.
  • Trained a custom EAGLE-3 draft model for Orpheus with 63%+ draft acceptance rate (draft sampling in vLLM), pushing vLLM throughput by 2x to support 16 concurrent real-time streams. Also implemented Suffix Tree Decoding for Orpheus and Qwen3 32B for agentic workflows.
Streaming ASR: Disaggregated Setup, 8 → 84 Streams, p95 TTFT 150ms
  • Architected a disaggregated streaming ASR pipeline from scratch with 5 TensorRT engines on a single H100, and built a scalable, reliable gRPC server for online serving, achieving industry-leading SLA with p95 TTFT of 150ms on 1s streaming audio chunks.
  • Scaled streaming ASR from 8 to 84 concurrent streams per H100 using CUDA streams, CUDA events, continuous batching, and a disaggregated encoder/decoder setup with per-slot synchronization.
  • Replaced Python locks with CUDA events for GPU-side encoder-decoder coordination, eliminating GIL contention on the hot path. Simplified from 5 streams to 2 dedicated streams with event-driven signaling, hitting the memory-bound theoretical max at 84 sessions.
Whisper ASR: 1800+ RTFx on Single H100, 5-50x Scalability on NIM
  • Optimized Whisper-Large-V2 on a custom NIM server, delivering a 5-50x scalability gain and 2.29x-4.43x latency reduction while achieving 1800+ RTFx on a single H100 for production voice-agent workloads.

AI Intern

Bhashini · C4GT DMP'24

Jun'24 — Sept'24
Model Compression & Edge Deployment
  • Engineered a pipeline to compress ASR NeMo modules using 8-bit integer quantization, achieving 73% size reduction to 143MB for edge device inference.
  • Reduced the CTranslate2 model from 2.7GB to 224MB using static quantization: a 92% reduction with only 0.2% accuracy loss.
  • Developed a demo Android application for ASR across all 22 Indic languages, achieving under 1 sec inference for 7 sec of audio. Built a macOS application for NMT offline inference.
  • Presented work at DPG Dialogues'24 as a lightning talk. Watch →

Projects

Moonlight Protein Engineering

GitHub →

Dec'22 — Feb'23
  • Applied Meta's ESM-2 model to generate embeddings from moonlight protein sequences.
  • Compared Bi-LSTMs, MLP, and classical ML techniques (SVM, Random Forest, KNN, etc.) with detailed comparative analysis.
  • Improved accuracy by 7.2% with ensembling techniques, reaching ~81.7%.
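
The page does not say which ensembling technique was used. As one common possibility, the sketch below combines per-model class predictions by majority vote. The three prediction lists are hypothetical and do not come from the project.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models, one vote per model per sample."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]


# Hypothetical binary predictions from three classifiers on four samples.
bilstm = [1, 0, 1, 1]
svm = [1, 1, 0, 1]
random_forest = [0, 0, 1, 1]
ensemble = majority_vote([bilstm, svm, random_forest])
```

With an odd number of binary classifiers there are no ties, so every sample gets the label at least two models agree on.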

Education

Cluster Innovation Centre, University of Delhi

B.Tech. (IT & Mathematical Innovation) · CGPA: 8.82

2022 — 2026

Skills

Python · Machine Learning · Deep Learning · PyTorch · Triton Server · TensorRT · ONNX · TensorRT-LLM · vLLM · SGLang · Docker · NVIDIA NIM · NVIDIA Dynamo · LLM Perf · LMCache · Spec Decoding