The Smart Way to Optimize GPU Costs with Red Hat AI

04 Jun 2020

One of the most pressing questions for companies investing in AI projects is this: are we truly using our GPU resources efficiently? LLM workloads are becoming increasingly complex, and when deployed on the wrong architecture, GPU costs can multiply rapidly. As the spectrum expands from simple chatbots to autonomous agent workloads, choosing the right inference infrastructure is no longer a preference; it is a necessity.

In this article, starting with vLLM Inference Server, the core component of the Red Hat AI product family, we examine how vLLM, together with llm-d, Semantic Router, KEDA, and Kueue, creates an intelligent and cost-effective AI inference architecture.

What Is vLLM Inference Server and Why Does It Matter?

Traditional LLM inference architectures suffer from serious resource waste. Every request that enters the system, whether it is a simple "hello" message or a 20,000-token code analysis, is processed by pre-allocating a fixed memory block sized to the maximum sequence length. When the user finishes in 50 tokens, the rest of that reservation has sat idle for the whole lifetime of the request, unavailable to any other request.
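To put a rough number on that waste, here is a minimal back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen only for illustration; the 20,000-token reservation and the 50-token request come from the example above.

```python
# Hypothetical model shape, assumed for illustration only (not from the article).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16 / bf16


def kv_bytes_per_token() -> int:
    """KV cache bytes for one token: one key and one value vector per layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def reserved_vs_used(max_seq_len: int, actual_tokens: int) -> tuple[int, int]:
    """Bytes reserved up front versus bytes the request actually needed."""
    per_token = kv_bytes_per_token()
    return max_seq_len * per_token, actual_tokens * per_token


reserved, used = reserved_vs_used(max_seq_len=20_000, actual_tokens=50)
idle_fraction = 1 - used / reserved

print(f"per token: {kv_bytes_per_token() / 1024:.0f} KiB")
print(f"reserved: {reserved / 2**30:.2f} GiB, used: {used / 2**20:.2f} MiB")
print(f"idle share of the reservation: {idle_fraction:.2%}")
```

Under these assumed numbers, almost the entire reservation is never touched; the exact figures change with the model, but the shape of the problem does not.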

Paged Attention and KV Cache Optimization

vLLM Inference Server uses the Paged Attention mechanism, inspired by virtual memory management in operating systems. With this approach, the KV Cache is divided into small, fixed-size blocks:

New blocks are added as the request grows
Blocks are released when the request completes
Different requests using the same system prompt physically share the KV Cache blocks for that prefix — memory is not copied, it is reused

This architecture makes it possible to achieve faster response times, more efficient memory utilization, and a direct reduction in hardware costs.
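The block behaviour described above can be sketched in a few lines. This is a toy model of the idea, not vLLM's implementation: the class name `PagedKVCache`, the 4-token block size, and the reference-counting scheme are assumptions made for illustration.

```python
BLOCK_SIZE = 4  # tokens per block; deliberately tiny for the example


class PagedKVCache:
    """Toy block allocator with prefix sharing (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.refcount = {}          # block id -> number of sequences using it
        self.prefix_to_block = {}   # token prefix -> shared full block
        self.block_to_prefix = {}   # reverse index, used on release

    def _new_block(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV cache blocks")
        block = self.free_blocks.pop()
        self.refcount[block] = 1
        return block

    def admit(self, tokens: list) -> list:
        """Build a block table for a prompt, reusing full blocks of a known prefix."""
        table = []
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            prefix = tuple(tokens[:end])
            block = self.prefix_to_block.get(prefix)
            if block is not None:
                self.refcount[block] += 1  # share the block, copy nothing
            else:
                block = self._new_block()
                self.prefix_to_block[prefix] = block
                self.block_to_prefix[block] = prefix
            table.append(block)
        if len(tokens) % BLOCK_SIZE:
            table.append(self._new_block())  # private, partially filled block
        return table

    def append_token(self, table: list, length_before: int) -> None:
        """Grow by one token; a new block is needed only when the last one is full."""
        if length_before % BLOCK_SIZE == 0:
            table.append(self._new_block())

    def release(self, table: list) -> None:
        """Drop a sequence; a block returns to the pool when nobody uses it."""
        for block in table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                prefix = self.block_to_prefix.pop(block, None)
                if prefix is not None:
                    del self.prefix_to_block[prefix]
                self.free_blocks.append(block)


cache = PagedKVCache(num_blocks=8)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]          # fills two blocks exactly
table_a = cache.admit(system_prompt + [100, 101])
table_b = cache.admit(system_prompt + [200])
print(table_a[:2] == table_b[:2])                 # the prefix blocks are shared
print(len(cache.free_blocks))                     # blocks still free
```

Two requests with the same eight-token system prompt occupy four blocks instead of six here, because the two prefix blocks are counted once and referenced twice.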

Disaggregated Inference with llm-d: Separating Prefill and Decode

In traditional LLM inference, two critical stages must run on the same GPU:

Prefill (compute-intensive): The stage that processes the entire prompt in one pass and builds the KV Cache for it
Decode (memory-intensive): The stage that generates the response token by token, rereading the KV Cache at every step

Running both processes on the same resource creates a bottleneck: when a long prefill operation starts, decode must wait, and active users experience slower response streaming.

llm-d’s Solution: Pod-Based Work Separation

llm-d separates prefill and decode processes, routing each task to the most suitable pod:

Prefill pods: Optimized for high compute power
Decode pods: Optimized for memory bandwidth

Once the prefill pod has generated the KV Cache, it transfers the cache to the decode pod. TTFT (Time to First Token) and streaming smoothness (the delay between consecutive tokens) can then be tuned independently of each other.
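A toy timing model makes the trade-off easier to see. All timings below are invented for illustration, and the model is deliberately crude: it ignores batching and any interleaving of prefill with decode, and it is not a description of how llm-d schedules work internally.

```python
# All timings are made-up illustrative values, not measurements.
DECODE_STEP_MS = 20    # time to emit one token for an active stream
PREFILL_MS = 800       # time to process one long incoming prompt
KV_TRANSFER_MS = 30    # time to ship the KV cache from prefill pod to decode pod


def colocated_gaps(num_tokens: int, prefill_at: int) -> list[int]:
    """Inter-token gaps of an active stream when a prefill lands on the same GPU."""
    gaps = []
    for i in range(num_tokens):
        gap = DECODE_STEP_MS
        if i == prefill_at:
            gap += PREFILL_MS  # decode stalls until the prefill finishes
        gaps.append(gap)
    return gaps


def disaggregated_gaps(num_tokens: int) -> list[int]:
    """Same stream on a dedicated decode pod: the prefill runs elsewhere."""
    return [DECODE_STEP_MS] * num_tokens


def ttft_colocated() -> int:
    """First token of the new request: prefill, then one decode step."""
    return PREFILL_MS + DECODE_STEP_MS


def ttft_disaggregated() -> int:
    """Same, plus the cost of moving the KV cache between pods."""
    return PREFILL_MS + KV_TRANSFER_MS + DECODE_STEP_MS


print("worst gap, colocated:", max(colocated_gaps(10, prefill_at=4)), "ms")
print("worst gap, disaggregated:", max(disaggregated_gaps(10)), "ms")
print("TTFT colocated / disaggregated:", ttft_colocated(), "/", ttft_disaggregated(), "ms")
```

In this sketch the active stream no longer sees the long stall, while the new request pays a small KV transfer cost on its first token. Whether that exchange is worthwhile depends on prompt lengths and traffic, which is the point made in the FAQ below.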

Semantic Router: The Intelligent Model Routing Layer

Enterprise AI infrastructures run dozens of different models simultaneously. A model that excels at mathematics may underperform in creative writing; routing a simple question to a model specialized in code analysis creates unnecessary cost. Using a large reasoning model for every request is both expensive and inefficient.

How Does It Balance Cost and Accuracy?

vLLM Semantic Router analyzes the content and complexity of incoming requests using a lightweight classifier such as ModernBERT:

Simple queries → Routed to small, fast, and low-cost models
Complex requests → Sent to powerful reasoning models with deep analytical capabilities
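The routing decision itself can be sketched as follows. The keyword-and-length heuristic is a stand-in for the lightweight classifier mentioned above (it is not ModernBERT and not the Semantic Router's API), and the model names, hint list, weights, and threshold are all hypothetical.

```python
# Hypothetical model names and heuristic; a real deployment would use a trained classifier.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-reasoning-model"
REASONING_HINTS = ("prove", "derive", "step by step", "debug", "analyze", "optimize")


def complexity_score(prompt: str) -> float:
    """Crude 0..1 score from reasoning keywords and prompt length."""
    text = prompt.lower()
    hint_hits = sum(hint in text for hint in REASONING_HINTS)
    length_term = min(len(text.split()) / 200, 1.0)
    return min(1.0, 0.4 * hint_hits + 0.6 * length_term)


def route(prompt: str, threshold: float = 0.35) -> str:
    """Send complex prompts to the large model and everything else to the small one."""
    return LARGE_MODEL if complexity_score(prompt) >= threshold else SMALL_MODEL


print(route("hello"))
print(route("Prove step by step that the sum of two even numbers is even"))
```

The threshold is the cost and accuracy dial: raising it keeps more traffic on the cheap model, lowering it sends more traffic to the reasoning model.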

Semantic Router can also enable Semantic Caching: repeated or similar prompts reuse existing inference results, which significantly reduces processing load in production environments with recurring question patterns.
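A semantic cache can be sketched as a similarity lookup in front of the model. The bag-of-words "embedding" below is a simple stand-in for a real sentence-embedding model, and the class name `SemanticCache` and the 0.8 threshold are assumptions made for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real cache would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache: returns a stored answer when a prompt is similar enough."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        query = embed(prompt)
        best = max(self.entries, key=lambda entry: cosine(query, entry[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]  # cache hit: no GPU work needed
        return None         # cache miss: forward the prompt to a model


cache = SemanticCache()
cache.put("how do I reset my password", "Use the account settings page.")
print(cache.get("how do I reset my password please"))
print(cache.get("what is the weather today"))
```

The threshold trades hit rate against the risk of answering a different question than the one asked, so it needs tuning against real traffic.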

Measurable Results: MMLU-Pro Benchmark

The impact of Semantic Router was concretely measured in MMLU-Pro benchmark tests conducted with the Qwen3 30B model:

Accuracy on complex tasks: +10.2% improvement
Latency: -47.1% reduction
Token usage: -48.5% reduction

GPU Resource Management and Job Queuing with Kueue

When working with GPU resources across large teams, an inevitable conflict arises: who gets to use how much, and when? When one team’s lengthy training job occupies all GPUs, other teams’ inference or experimental workloads are left waiting in the queue.

Kubernetes-Native Job Queue

Kueue operates as a Kubernetes-native job queuing system that prioritizes AI/ML workloads and distributes resources fairly across teams:

When a workload is submitted, Kueue checks it against the available resource quotas
If sufficient resources exist, the workload is admitted and its pods are spun up; otherwise, it waits in the queue
GPUs are not allocated needlessly — they are only used when there is actual work to process
Resource quotas and priorities across teams are managed centrally
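The admission logic in the list above can be sketched as a small quota-aware queue. This is a toy model of the idea only: `GpuQueue` and its methods are hypothetical names, it does not use Kueue's API or resource objects, and it admits strictly in priority and arrival order.

```python
import heapq
from itertools import count


class GpuQueue:
    """Toy per-team queue with a GPU quota (hypothetical, not Kueue's API)."""

    def __init__(self, gpu_quota: int):
        self.gpu_quota = gpu_quota
        self.running = {}        # workload name -> GPUs held
        self._pending = []       # heap of (-priority, arrival order, name, gpus)
        self._arrival = count()

    @property
    def gpus_in_use(self) -> int:
        return sum(self.running.values())

    @property
    def pending(self) -> list:
        return [name for _, _, name, _ in sorted(self._pending)]

    def submit(self, name: str, gpus: int, priority: int = 0) -> None:
        heapq.heappush(self._pending, (-priority, next(self._arrival), name, gpus))
        self._admit()

    def finish(self, name: str) -> None:
        del self.running[name]   # GPUs are released the moment the work is done
        self._admit()

    def _admit(self) -> None:
        # Strict ordering: stop at the first workload that does not fit the quota.
        while self._pending:
            _, _, name, gpus = self._pending[0]
            if self.gpus_in_use + gpus > self.gpu_quota:
                break
            heapq.heappop(self._pending)
            self.running[name] = gpus


queue = GpuQueue(gpu_quota=4)
queue.submit("training-job", gpus=4)
queue.submit("batch-eval", gpus=2)
queue.submit("urgent-inference", gpus=1, priority=10)
print(queue.pending)          # waiting, highest priority first
queue.finish("training-job")
print(sorted(queue.running))  # both waiting workloads now fit
```

No pod is created for a workload until its GPUs fit inside the quota, which is what keeps GPUs from being held by work that cannot run yet.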

Event-Driven Auto-Scaling with KEDA

OpenShift’s built-in auto-scaling mechanisms work with standard metrics such as CPU and memory. However, these are not the metrics that matter most for AI workloads — what truly matters is request volume hitting the model and response times.

Intelligent Scaling Based on AI Metrics

KEDA (Kubernetes Event-driven Autoscaling) ships as the Custom Metrics Autoscaler in OpenShift and reads the metrics exposed by vLLM through Prometheus:

Active request count
KV Cache fill rate
Queue length

KEDA’s most critical feature is its ability to scale to zero. When there are no active requests or queued items for a defined period, it shuts down the workload entirely. An idle GPU does not translate into idle cost.
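The scaling decision can be sketched as a simple function of those metrics. This is a simplified model of the behaviour described above, not KEDA's code or configuration: the function name, parameter names, and the rule of keeping one warm replica until the cooldown elapses are assumptions for illustration. In a real setup the inputs would come from Prometheus queries.

```python
import math


def desired_replicas(
    waiting_requests: int,
    running_requests: int,
    target_per_replica: int,
    max_replicas: int,
    idle_seconds: float,
    cooldown_seconds: float,
) -> int:
    """Replica count from request load, with scale to zero after an idle cooldown."""
    load = waiting_requests + running_requests
    if load == 0:
        # No work at all: release every GPU once the idle period is long enough,
        # otherwise keep a single warm replica (an assumption of this sketch).
        return 0 if idle_seconds >= cooldown_seconds else 1
    # Enough replicas that each one handles about target_per_replica requests.
    return min(max_replicas, max(1, math.ceil(load / target_per_replica)))


# Hypothetical values: 8 requests per replica, at most 4 replicas, 300 s cooldown.
print(desired_replicas(20, 5, 8, 4, idle_seconds=0, cooldown_seconds=300))
print(desired_replicas(0, 0, 8, 4, idle_seconds=10, cooldown_seconds=300))
print(desired_replicas(0, 0, 8, 4, idle_seconds=300, cooldown_seconds=300))
```

The cooldown matters because starting a model from zero means loading weights onto the GPU again, so a very short idle window can trade GPU cost for slow first responses.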

Conclusion: Benefits of an Integrated Architecture

When vLLM Inference Server, llm-d, Semantic Router, Kueue, and KEDA are used together, they form a complementary optimization chain:

vLLM: Maximizes memory efficiency with Paged Attention
llm-d: Specializes GPU usage through prefill/decode separation
Semantic Router: Optimizes both cost and accuracy by routing the right request to the right model
Kueue: Manages GPU resources across teams fairly and efficiently
KEDA: Significantly reduces waste through scaling based on real AI metrics

The underlying idea of this architecture is simple: every resource should run when needed, and only as much as needed. The integration of these components within the Red Hat AI ecosystem makes enterprise AI infrastructures both cost-sustainable and performance-competitive.

Frequently Asked Questions (FAQ)

When should vLLM Inference Server be preferred?

In applications that receive high volumes of concurrent requests and generate prompts and responses of varying sizes, vLLM Inference Server delivers noticeably better GPU utilization rates and lower latency compared to traditional inference solutions.

Which components of this architecture does OpenShift AI offer out of the box?

vLLM comes as the default inference runtime; KEDA is integrated into OpenShift as the Custom Metrics Autoscaler operator. Kueue requires a separate installation but works compatibly within the OpenShift ecosystem. llm-d and Semantic Router are newer components and continue to mature within the Red Hat AI product family.

Is llm-d required for every model?

The prefill/decode separation offered by llm-d is especially valuable for workloads that require long context and high concurrency. In environments with short prompts or low traffic, standard vLLM on a single pod may be sufficient without the additional architectural complexity.

Author: Başak Buğa – VAS Engineer – Sekom
