Co-Founder

Pankaj Gupta

About

Pankaj Gupta is a co-founder of Baseten, where he leads model performance. He has spent his career making systems faster and more efficient, from data processing libraries at Twitter to search infrastructure at Uber and media processing at Adobe. A graduate of IIT Delhi, Pankaj now lives in the Bay Area, where he enjoys gardening and evening walks around his neighborhood.

GPU guides

Testing Llama 3.3 70B inference performance on NVIDIA GH200 in Lambda Cloud

The NVIDIA GH200 Superchip combines an NVIDIA Hopper GPU with a Grace ARM CPU via the high-bandwidth NVLink-C2C interconnect.

Model performance

Driving model performance optimization: 2024 highlights

Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Glossary

A quick introduction to speculative decoding

Speculative decoding improves LLM inference latency by using a smaller draft model to generate tokens that the larger target model verifies in parallel, accepting those that match its own predictions.
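
For intuition, here is a minimal sketch of the draft-and-verify loop in Python. The `draft_model` and `target_model` functions are hypothetical stand-ins for real LLMs, and acceptance is simplified to exact-match greedy verification:

```python
# Toy speculative decoding loop. draft_model and target_model are
# hypothetical stand-ins: each maps a token sequence to the next token.

def draft_model(tokens):
    # Cheap draft model: guesses the next token as last + 1.
    return tokens[-1] + 1

def target_model(tokens):
    # Expensive target model: same rule, except it skips multiples of 5.
    nxt = tokens[-1] + 1
    return nxt if nxt % 5 else nxt + 1

def speculative_decode(tokens, n_draft=4, n_new=12):
    while n_new > 0:
        # 1. Draft model proposes n_draft tokens autoregressively (cheap).
        draft = list(tokens)
        for _ in range(n_draft):
            draft.append(draft_model(draft))
        proposed = draft[len(tokens):]

        # 2. Target model verifies the proposals (in a real engine this is
        #    a single batched forward pass, not a Python loop).
        accepted = []
        for tok in proposed:
            expected = target_model(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token matches: keep it
            else:
                accepted.append(expected)  # mismatch: take target's token, stop
                break
        tokens += accepted[:n_new]
        n_new -= len(accepted)
    return tokens

print(speculative_decode([0]))
```

When the draft model agrees with the target often, most iterations emit several tokens per expensive verification pass, which is where the latency win comes from.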

GPU guides

Evaluating NVIDIA H200 Tensor Core GPUs for LLM inference

Are NVIDIA H200 GPUs cost-effective for model inference? We tested an 8xH200 cluster provided by Lambda to identify the inference workloads it suits best.

Model performance

How to serve 10,000 fine-tuned LLMs from a single GPU

LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.
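
As a rough illustration of why swapping is cheap, here is a toy Python sketch (not Baseten's implementation): the frozen base weight stays resident while each request picks a small low-rank (A, B) pair, so switching fine-tunes means moving kilobytes, not gigabytes. All names and sizes below are made up:

```python
import numpy as np

# Toy per-request LoRA swapping. The base weight W is shared; each
# "fine-tune" is just a small (A, B) pair, so serving a different
# fine-tune per request only swaps a few small matrices.

d, r = 64, 8                      # hidden size and LoRA rank (made-up sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))       # frozen base weight, shared by all requests

# Hypothetical adapter registry: one low-rank (A, B) pair per fine-tune.
adapters = {
    name: (rng.normal(size=(r, d)), rng.normal(size=(d, r)))
    for name in ("customer-a", "customer-b", "customer-c")
}

def forward(x, adapter_name, scale=1.0):
    """Base projection plus this request's LoRA delta: (W + scale * B @ A) x."""
    A, B = adapters[adapter_name]
    return x @ W.T + scale * (x @ A.T) @ B.T

# Each request in a batch can target a different fine-tune.
batch = [("customer-a", rng.normal(size=d)), ("customer-b", rng.normal(size=d))]
outputs = [forward(x, name) for name, x in batch]
print([o.shape for o in outputs])
```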

GPU guides

Using fractional H100 GPUs for efficient model serving

Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
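
A sketch of what that split can look like with NVIDIA's MIG tooling, driven from Python. This assumes root access and a MIG-capable driver; profile names and availability vary by GPU and driver version, so treat it as an illustration rather than a recipe:

```python
import subprocess

# Split GPU 0 into two MIG slices using nvidia-smi. The 3g.40gb profile
# (roughly half of an 80 GB H100) is assumed here; check `nvidia-smi mig
# -lgip` for the profiles your driver actually offers.

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])                 # enable MIG mode on GPU 0
run(["nvidia-smi", "mig", "-cgi", "3g.40gb,3g.40gb", "-C"]) # create two GPU instances
                                                            # plus compute instances (-C)
run(["nvidia-smi", "mig", "-lgi"])                          # list the resulting instances
```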

Model performance

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
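
For background on what FP8 (E4M3) quantization does to the weights, here is a small numpy simulation. Real engines such as TensorRT-LLM use calibrated scales and hardware FP8 tensor cores; this only mimics the rounding, ignoring subnormals:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def round_to_e4m3(x):
    # Round to 3 mantissa bits (plus the implicit leading bit), which is
    # the precision of E4M3. Subnormals and exact exponent limits ignored.
    m, e = np.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16     # keep 4 significant binary digits
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)

scale = E4M3_MAX / np.abs(w).max()       # per-tensor scale onto the FP8 range
w_fp8 = round_to_e4m3(w * scale) / scale # quantize-dequantize round trip
rel_err = np.abs(w - w_fp8).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.3%}")
```

The round trip loses only a few percent of precision per element while halving weight memory relative to FP16, which is why FP8 on H100s is attractive for latency-bound inference.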


Machine learning infrastructure that just works

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.