Lead Developer Advocate

Philip Kiely

About

Philip Kiely is a software developer and author based out of Chicago. Originally from Clive, Iowa, he graduated from Grinnell College with honors in Computer Science. Philip joined Baseten in January 2022 and works across documentation, technical content, and developer experience. Outside of work, he's a lifelong martial artist, voracious reader, and, unfortunately, a Bears fan.

Model performance

How we built high-throughput embedding, reranker, and classifier inference with TensorRT-LLM

Discover how we optimized embedding, reranker, and classifier inference using TensorRT-LLM, doubling throughput and achieving ultra-low latency at scale.

Model performance

How multi-node inference works for massive LLMs like DeepSeek-R1

Running DeepSeek-R1 on H100 GPUs requires multi-node inference to connect the 16 H100s needed to hold the model weights.

GPU guides

Testing Llama 3.3 70B inference performance on NVIDIA GH200 in Lambda Cloud

The NVIDIA GH200 Superchip combines an NVIDIA Hopper GPU with an ARM CPU via a high-bandwidth interconnect.

ML models

Private, secure DeepSeek-R1 in production in US & EU data centers

Dedicated deployments of DeepSeek-R1 and DeepSeek-V3 offer private, secure, high-performance inference that's cheaper than OpenAI.

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Glossary

A quick introduction to speculative decoding

Speculative decoding improves LLM inference latency by using a smaller model to generate draft tokens that the larger target model can accept during inference.

GPU guides

Evaluating NVIDIA H200 Tensor Core GPUs for LLM inference

Are NVIDIA H200 GPUs cost-effective for model inference? We tested an 8xH200 cluster provided by Lambda to discover suitable inference workload profiles.

News

Export your model inference metrics to your favorite observability tool

Export model inference metrics like response time and hardware utilization to observability platforms like Grafana, New Relic, Datadog, and Prometheus.


Machine learning infrastructure that just works

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.