The Smart Way to Optimize GPU Costs with Red Hat AI
04 Jun 2020
One of the most pressing questions for companies investing in AI projects is this: Are we truly using our GPU resources efficiently? LLM workloads are becoming increasingly complex, and when deployed on the wrong architecture, GPU costs can multiply rapidly. As the spectrum expands from simple chatbots to autonomous agent workloads, choosing the right inference infrastructure is no longer a preference; it is a necessity.
In this article, starting with vLLM Inference Server, the core component of the Red Hat AI product family, we examine how vLLM, together with llm-d, Semantic Router, KEDA, and Kueue, creates an intelligent and cost-effective AI inference architecture.
What Is vLLM Inference Server and Why Does It Matter?
Traditional LLM inference architectures suffer from serious resource waste. Every request that enters the system, whether it’s a simple “hello” message or a 20,000-token code analysis, is processed by pre-allocating a fixed memory block sized to the maximum sequence length. When the user finishes in 50 tokens, the rest of that reservation sits idle and cannot be used by other requests until the request completes.
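To put a rough number on that waste, the arithmetic can be sketched as follows. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example and not taken from any specific model; only the 20,000-token reservation and the 50-token request come from the scenario above.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV Cache needed for one token.

    The factor 2 accounts for storing one key and one value vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

max_seq_len = 20_000   # reservation sized to the maximum sequence length
used_tokens = 50       # what the short request actually needed

reserved_bytes = per_token * max_seq_len
used_bytes = per_token * used_tokens
wasted_fraction = 1 - used_bytes / reserved_bytes

print(f"KV Cache per token: {per_token / 1024:.0f} KiB")
print(f"Reserved: {reserved_bytes / 2**30:.2f} GiB, used: {used_bytes / 2**20:.2f} MiB")
print(f"Wasted share of the reservation: {wasted_fraction:.2%}")
```

Under these assumptions the wasted share depends only on the ratio of used tokens to reserved tokens, so the conclusion holds for other model shapes as well.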
Paged Attention and KV Cache Optimization
vLLM Inference Server uses the Paged Attention mechanism, inspired by virtual memory management in operating systems. With this approach, the KV Cache is divided into small, fixed-size blocks:
- New blocks are added as the request grows
- Blocks are released when the request completes
- Different requests using the same system prompt physically share the KV Cache blocks for that prefix — memory is not copied, it is reused
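The three behaviors in the list above can be illustrated with a toy block allocator. This is a minimal sketch of the idea, not vLLM's internal implementation: the class names, the block size of 16 tokens, and the reference-counting scheme are all assumptions made for illustration.

```python
BLOCK_SIZE = 16  # tokens per block; illustrative value


class BlockPool:
    """Toy pool of fixed-size KV Cache blocks with reference counting."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV Cache blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Reuse an existing block instead of copying its contents.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """One request; optionally starts from already-filled prefix blocks."""

    def __init__(self, pool: BlockPool, shared_prefix=()):
        self.pool = pool
        self.blocks = [pool.share(b) for b in shared_prefix]
        self.num_tokens = len(self.blocks) * BLOCK_SIZE

    def append_token(self) -> None:
        # A new block is added only when the current ones are full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.pool.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        # Blocks return to the pool once no request references them.
        for block in self.blocks:
            self.pool.release(block)
        self.blocks = []


pool = BlockPool(num_blocks=8)

# A system prompt that fills exactly two blocks.
system_prompt = Sequence(pool)
for _ in range(2 * BLOCK_SIZE):
    system_prompt.append_token()
prefix = tuple(system_prompt.blocks)

# Two requests share the prefix blocks and each add five tokens of their own.
req_a = Sequence(pool, prefix)
req_b = Sequence(pool, prefix)
for _ in range(5):
    req_a.append_token()
    req_b.append_token()

blocks_in_use = pool.num_blocks - len(pool.free)
print("blocks in use:", blocks_in_use)
```

In this sketch only full blocks are shared, so each request gets its own block for the tokens that follow the common prefix.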
This architecture makes it possible to achieve faster response times, more efficient memory utilization, and a direct reduction in hardware costs.
Disaggregated Inference with llm-d: Separating Prefill and Decode
In traditional LLM inference, two critical stages must run on the same GPU:
- Prefill (compute-intensive): The stage that processes the entire prompt in a single pass and builds the KV Cache for it
- Decode (memory-intensive): The stage that generates the response token by token, reading the KV Cache at every step
Running both processes on the same resource creates a bottleneck: when a long prefill operation starts, decode must wait, and active users experience slower response streaming.
llm-d’s Solution: Pod-Based Work Separation
llm-d separates prefill and decode processes, routing each task to the most suitable pod:
- Prefill pods: Optimized for high compute power
- Decode pods: Optimized for memory bandwidth
Once the prefill pod generates the KV Cache, it transfers it to the decode pod. This allows TTFT (Time to First Token) and response fluidity to be improved independently of each other.
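The effect on response fluidity can be approximated with a very small timing model. This is a simplified sketch under stated assumptions: the 20 ms decode step and 400 ms prefill are hypothetical figures, the function name is invented for illustration, and the cost of transferring the KV Cache between pods is ignored.

```python
def inter_token_gaps(decode_step_ms: int, num_tokens: int,
                     prefill_arrivals: set, prefill_ms: int,
                     shared_gpu: bool) -> list:
    """Gap before each generated token of one active user, in milliseconds.

    prefill_arrivals holds the token indices at which another request's
    prefill starts. On a shared GPU the decode stream waits for it; with
    separate prefill and decode pods it does not.
    """
    gaps = []
    for index in range(num_tokens):
        gap = decode_step_ms
        if shared_gpu and index in prefill_arrivals:
            gap += prefill_ms  # decode stalls behind the long prefill
        gaps.append(gap)
    return gaps


# Hypothetical timings: 20 ms per decode step, 400 ms per long prefill.
colocated = inter_token_gaps(20, 10, {3, 7}, 400, shared_gpu=True)
disaggregated = inter_token_gaps(20, 10, {3, 7}, 400, shared_gpu=False)

print("co-located    max gap:", max(colocated), "ms, total:", sum(colocated), "ms")
print("disaggregated max gap:", max(disaggregated), "ms, total:", sum(disaggregated), "ms")
```

A real deployment pays a transfer cost for the KV Cache handoff, so the gain is smaller than this model suggests, but the stall pattern it removes is the same.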
Semantic Router: The Intelligent Model Routing Layer
Enterprise AI infrastructures run dozens of different models simultaneously. A model that excels at mathematics may underperform in creative writing; routing a simple question to a model specialized in code analysis creates unnecessary cost. Using a large reasoning model for every request is both expensive and inefficient.
How Does It Balance Cost and Accuracy?
vLLM Semantic Router analyzes the content and complexity of incoming requests using a lightweight classifier such as ModernBERT:
- Simple queries → Routed to small, fast, and low-cost models
- Complex requests → Sent to powerful reasoning models with deep analytical capabilities
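The routing decision itself can be sketched in a few lines. The classifier below is a keyword and length heuristic standing in for a learned classifier such as ModernBERT, and the model names, hint list, and 200-word threshold are hypothetical placeholders rather than Semantic Router's actual configuration.

```python
# Hypothetical hints that a request needs multi-step reasoning.
REASONING_HINTS = ("prove", "derive", "debug", "step by step", "analyze")

# Hypothetical model names, one per complexity class.
ROUTES = {
    "simple": "small-fast-model",
    "complex": "large-reasoning-model",
}


def classify(prompt: str) -> str:
    """Stand-in for a lightweight classifier: returns 'simple' or 'complex'."""
    text = prompt.lower()
    is_long = len(text.split()) > 200
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return "complex" if is_long or needs_reasoning else "simple"


def route(prompt: str) -> str:
    """Pick the target model for a prompt."""
    return ROUTES[classify(prompt)]


print(route("hello"))
print(route("Prove that the square root of 2 is irrational"))
```

The design point is that the classifier must be far cheaper than the models it routes to; otherwise the routing step would consume the savings it is meant to create.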
Alongside routing, Semantic Router also enables semantic caching. Repeated or similar prompts can reuse existing inference results, which significantly reduces processing load in production environments with recurring question patterns.
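A semantic cache can be pictured as a lookup by similarity rather than by exact match. The sketch below uses a bag-of-words vector and cosine similarity as a stand-in for a real embedding model; the class name, the 0.8 threshold, and the sample prompts are assumptions for illustration only.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real system would use a model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored response when a new prompt is similar enough."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("how do I reset my password", "Use the reset link.")

hit = cache.get("how do I reset my password please")
miss = cache.get("what is the weather today")
print(hit, miss)
```

The threshold is the main tuning knob: set too low, users receive answers to questions they did not ask; set too high, the cache rarely hits.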
Measurable Results: MMLU-Pro Benchmark
The impact of Semantic Router was concretely measured in MMLU-Pro benchmark tests conducted with the Qwen3 30B model:
- Accuracy on complex tasks: 10.2% improvement
- Latency: 47.1% reduction
- Token usage: 48.5% reduction
GPU Resource Management and Job Queuing with Kueue
When working with GPU resources across large teams, an inevitable conflict arises: who gets to use how much, and when? When one team’s lengthy training job occupies all GPUs, other teams’ inference or experimental workloads are left waiting in the queue.
Kubernetes-Native Job Queue
Kueue operates as a Kubernetes-native job queuing system that prioritizes AI/ML workloads and distributes resources fairly across teams:
- When a workload is submitted, Kueue checks the available resource quota
- If sufficient quota exists, the workload is admitted and its pods are spun up; otherwise, it waits in the queue
- GPUs are not allocated needlessly — they are only used when there is actual work to process
- Resource quotas and priorities across teams are managed centrally
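The admission behavior described in the list above can be modeled with a toy queue. This is a minimal sketch, not Kueue's implementation: the class and workload names are hypothetical, a single shared quota stands in for per-team quotas, and workloads are admitted in strict first-in, first-out order, which is only one possible policy.

```python
from collections import deque


class GpuQueue:
    """Toy admission queue: workloads start only when quota allows."""

    def __init__(self, quota: int):
        self.quota = quota
        self.in_use = 0
        self.pending = deque()   # (name, gpus) waiting for admission
        self.running = {}        # name -> gpus currently held

    def submit(self, name: str, gpus: int) -> None:
        self.pending.append((name, gpus))
        self._admit()

    def finish(self, name: str) -> None:
        self.in_use -= self.running.pop(name)
        self._admit()

    def _admit(self) -> None:
        # Strict FIFO: stop at the first workload that does not fit.
        while self.pending and self.in_use + self.pending[0][1] <= self.quota:
            name, gpus = self.pending.popleft()
            self.running[name] = gpus
            self.in_use += gpus


queue = GpuQueue(quota=8)
queue.submit("train", 6)       # admitted: 6 of 8 GPUs in use
queue.submit("inference", 4)   # does not fit, waits
queue.submit("notebook", 2)    # waits behind "inference"

running_before = sorted(queue.running)
queue.finish("train")          # frees 6 GPUs, both waiting workloads fit
running_after = sorted(queue.running)
print(running_before, "->", running_after)
```

No pods exist for a waiting workload in this model, which mirrors the point that GPUs are only claimed once there is quota to run the work.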
Event-Driven Auto-Scaling with KEDA
OpenShift’s built-in auto-scaling mechanisms work with standard metrics such as CPU and memory. However, these are not the metrics that matter most for AI workloads — what truly matters is request volume hitting the model and response times.
Intelligent Scaling Based on AI Metrics
KEDA (Kubernetes Event-Driven Autoscaler) comes as the Custom Metrics Autoscaler in OpenShift and reads the metrics exposed by vLLM through Prometheus:
- Active request count
- KV Cache fill rate
- Queue length
KEDA’s most critical feature is its ability to scale to zero. When there are no active requests or queued items for a defined period, it shuts down the workload entirely. An idle GPU does not translate into idle cost.
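The scaling decision can be approximated with a short function. This is a simplified sketch of the idea rather than KEDA's actual algorithm: the function name, the target of 10 queued requests per replica, the 300-second cooldown, and the replica cap are all illustrative assumptions.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int,
                     idle_seconds: int, cooldown_seconds: int = 300,
                     max_replicas: int = 8) -> int:
    """Replica count for a model server, driven by queue length.

    With no queued work, one replica is kept until the idle period exceeds
    the cooldown, after which the workload scales to zero.
    """
    if queue_length == 0:
        return 0 if idle_seconds >= cooldown_seconds else 1
    needed = math.ceil(queue_length / target_per_replica)
    return min(max_replicas, needed)


print(desired_replicas(queue_length=25, target_per_replica=10, idle_seconds=0))
print(desired_replicas(queue_length=0, target_per_replica=10, idle_seconds=30))
print(desired_replicas(queue_length=0, target_per_replica=10, idle_seconds=600))
```

The cooldown trades cost against cold-start latency: a shorter value releases GPUs sooner, while the next request after a scale-to-zero has to wait for the model to load again.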
Conclusion: Benefits of an Integrated Architecture
When vLLM Inference Server, llm-d, Semantic Router, Kueue, and KEDA are used together, they form a complementary optimization chain:
- vLLM: Maximizes memory efficiency with Paged Attention
- llm-d: Specializes GPU usage through prefill/decode separation
- Semantic Router: Optimizes both cost and accuracy by routing the right request to the right model
- Kueue: Manages GPU resources across teams fairly and efficiently
- KEDA: Significantly reduces waste through scaling based on real AI metrics
The underlying idea of this architecture is simple: every resource should run when needed, and only as much as needed. The integration of these components within the Red Hat AI ecosystem makes enterprise AI infrastructures both cost-sustainable and performance-competitive.
Frequently Asked Questions (FAQ)
When should vLLM Inference Server be preferred?
In applications that receive high volumes of concurrent requests and generate prompts and responses of varying sizes, vLLM Inference Server delivers noticeably better GPU utilization rates and lower latency compared to traditional inference solutions.
Which components of this architecture does OpenShift AI offer out of the box?
vLLM comes as the default inference runtime; KEDA is integrated into OpenShift as the Custom Metrics Autoscaler operator. Kueue requires a separate installation but works compatibly within the OpenShift ecosystem. llm-d and Semantic Router are newer components and continue to mature within the Red Hat AI product family.
Is llm-d required for every model?
The prefill/decode separation offered by llm-d is especially valuable for workloads that require long context and high concurrency. In environments with short prompts or low traffic, standard vLLM on a single pod may be sufficient without the additional architectural complexity.
Author: Başak Buğa – VAS Engineer – Sekom