The Smart Way to Optimize GPU Costs with Red Hat AI

04 Jun 2020

One of the most pressing questions for companies investing in AI projects is this: Are we truly using our GPU resources efficiently? LLM workloads are becoming increasingly complex, and when deployed on the wrong architecture, GPU costs can multiply rapidly. As the spectrum expands from simple chatbots to autonomous agent workloads, choosing the right inference infrastructure is no longer a preference; it is a necessity.

In this article, we start with vLLM Inference Server, the core component of the Red Hat AI product family, and examine how vLLM works together with llm-d, Semantic Router, KEDA, and Kueue to create an intelligent and cost-effective AI inference architecture.

What Is vLLM Inference Server and Why Does It Matter?

Traditional LLM inference architectures suffer from serious resource waste. Every request that enters the system, whether it is a simple “hello” message or a 20,000-token code analysis, is served by pre-allocating a fixed memory block sized to the maximum sequence length. When the response finishes after 50 tokens, the rest of that reservation has been held for the whole request without ever being used.

Paged Attention and KV Cache Optimization

vLLM Inference Server uses the Paged Attention mechanism, inspired by virtual memory management in operating systems. With this approach, the KV Cache is divided into small, fixed-size blocks:

  • New blocks are added as the request grows
  • Blocks are released when the request completes
  • Different requests using the same system prompt physically share the KV Cache blocks for that prefix — memory is not copied, it is reused

This architecture makes it possible to achieve faster response times, more efficient memory utilization, and a direct reduction in hardware costs.
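The block behavior described above can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the names `BlockPool`, `Sequence`, `fork_prefix`, and the block size of 16 tokens are all assumptions made for illustration. It only shows the bookkeeping idea: blocks are added on demand, released on completion, and shared by reference count when two requests have the same prefix.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


class BlockPool:
    """Toy pool of fixed-size KV blocks with reference counting."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second request points at the same physical block; nothing is copied.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


@dataclass
class Sequence:
    pool: BlockPool
    blocks: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_tokens(self, n: int) -> None:
        """Grow the sequence, adding a block only when the last one is full."""
        for _ in range(n):
            if self.num_tokens % BLOCK_SIZE == 0:
                self.blocks.append(self.pool.allocate())
            self.num_tokens += 1

    def fork_prefix(self) -> "Sequence":
        """Start a new request that reuses this sequence's full blocks."""
        full = self.num_tokens // BLOCK_SIZE
        shared = [self.pool.share(b) for b in self.blocks[:full]]
        return Sequence(self.pool, shared, full * BLOCK_SIZE)

    def finish(self) -> None:
        """Release every block; shared blocks are freed by their last user."""
        for b in self.blocks:
            self.pool.release(b)
        self.blocks.clear()
        self.num_tokens = 0
```

For example, a 32-token system prompt occupies two blocks. A request forked from it with `fork_prefix()` that adds 5 tokens allocates only one new block, and calling `finish()` on that request returns just that one block to the pool while the shared prefix stays resident.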

Disaggregated Inference with llm-d: Separating Prefill and Decode

In traditional LLM inference, two critical stages must run on the same GPU:

  • Prefill (compute-intensive): The stage that processes the entire prompt and builds the KV Cache for it
  • Decode (memory-intensive): The stage that generates the response token by token, reading the KV Cache at every step

Running both processes on the same resource creates a bottleneck: when a long prefill operation starts, decode must wait, and active users experience slower response streaming.

llm-d’s Solution: Pod-Based Work Separation

llm-d separates prefill and decode processes, routing each task to the most suitable pod:

  • Prefill pods: Optimized for high compute power
  • Decode pods: Optimized for memory bandwidth

Once the prefill pod generates the KV Cache, it transfers it to the decode pod. This allows TTFT (Time to First Token) and the smoothness of token streaming during decode to be tuned independently of each other.
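The hand-off between the two pod types can be sketched as follows. This is a hedged illustration of the flow only, not llm-d's API: `KVCacheHandle`, `PrefillPod`, `DecodePod`, and `serve` are hypothetical names, the "least busy pod" selection is an assumed policy, and no real model or network transfer is involved.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class KVCacheHandle:
    """Stand-in for the KV Cache that a prefill pod hands to a decode pod."""
    request_id: str
    num_prompt_tokens: int
    produced_by: str


class PrefillPod:
    def __init__(self, name: str) -> None:
        self.name = name
        self.jobs_done = 0

    def prefill(self, request_id: str, prompt_tokens: list[str]) -> KVCacheHandle:
        # A real pod would run the model over the whole prompt here.
        self.jobs_done += 1
        return KVCacheHandle(request_id, len(prompt_tokens), self.name)


class DecodePod:
    def __init__(self, name: str) -> None:
        self.name = name
        self.active: dict[str, KVCacheHandle] = {}

    def receive(self, handle: KVCacheHandle) -> None:
        # Models the KV Cache transfer from the prefill pod.
        self.active[handle.request_id] = handle

    def decode(self, request_id: str, max_new_tokens: int) -> list[str]:
        # Placeholder tokens; a real pod would sample from the model.
        self.active.pop(request_id)
        return [f"<tok{i}>" for i in range(max_new_tokens)]


def serve(request_id, prompt_tokens, prefill_pods, decode_pods, max_new_tokens=4):
    """Route one request: prefill on one pod, then decode on another."""
    prefill_pod = min(prefill_pods, key=lambda p: p.jobs_done)
    decode_pod = min(decode_pods, key=lambda p: len(p.active))
    handle = prefill_pod.prefill(request_id, prompt_tokens)
    decode_pod.receive(handle)
    return handle, decode_pod.decode(request_id, max_new_tokens)
```

Because the two pools are separate objects, a long prompt only occupies a prefill pod; requests that are already streaming on a decode pod are not paused while it runs.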

Semantic Router: The Intelligent Model Routing Layer

Enterprise AI infrastructures run dozens of different models simultaneously. A model that excels at mathematics may underperform in creative writing; routing a simple question to a model specialized in code analysis creates unnecessary cost. Using a large reasoning model for every request is both expensive and inefficient.

How Does It Balance Cost and Accuracy?

vLLM Semantic Router analyzes the content and complexity of incoming requests using a lightweight classifier such as ModernBERT:

  • Simple queries → Routed to small, fast, and low-cost models
  • Complex requests → Sent to powerful reasoning models with deep analytical capabilities
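The routing decision above can be sketched in a few lines. The scoring function here is a deliberately crude, hypothetical stand-in for a trained classifier such as ModernBERT, and the model names, hint words, and threshold are assumptions chosen only to make the example runnable.

```python
# Hypothetical hint phrases; a real router would use a trained classifier.
REASONING_HINTS = ("prove", "derive", "step by step", "debug", "optimize", "why")


def complexity_score(prompt: str) -> float:
    """Return a rough 0..1 complexity estimate from length and hint phrases."""
    text = prompt.lower()
    length_part = min(len(text.split()) / 200, 1.0)
    hint_hits = sum(hint in text for hint in REASONING_HINTS)
    hint_part = min(1.0, 3 * hint_hits / len(REASONING_HINTS))
    return max(length_part, hint_part)


def route(prompt: str, threshold: float = 0.3) -> str:
    """Pick a model tier for the prompt (model names are placeholders)."""
    if complexity_score(prompt) >= threshold:
        return "large-reasoning-model"
    return "small-fast-model"
```

With these assumed settings, `route("hello")` selects the small tier, while a prompt that asks to prove something step by step selects the reasoning tier.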

Semantic Caching is enabled alongside Semantic Router. Repeated or similar prompts can reuse existing inference results, which significantly reduces processing load in production environments with recurring question patterns.
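A minimal sketch of the caching idea, assuming a similarity threshold decides when two prompts count as "the same question". The bag-of-words `embed` function is a hypothetical placeholder for a real embedding model, and `SemanticCache` is an illustrative name rather than the component's actual interface.

```python
from __future__ import annotations

import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    # Placeholder embedding: word counts instead of a learned vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached response of the most similar prompt, if close enough."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

A cache hit skips the model call entirely, so the threshold is the main trade-off: set it too low and unrelated prompts receive a stale answer, set it too high and near-duplicates miss the cache.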

Measurable Results: MMLU-Pro Benchmark

The impact of Semantic Router was concretely measured in MMLU-Pro benchmark tests conducted with the Qwen3 30B model:

  • Accuracy on complex tasks: +10.2% improvement
  • Latency: -47.1% reduction
  • Token usage: -48.5% reduction

GPU Resource Management and Job Queuing with Kueue

When working with GPU resources across large teams, an inevitable conflict arises: who gets to use how much, and when? When one team’s lengthy training job occupies all GPUs, other teams’ inference or experimental workloads are left waiting in the queue.

Kubernetes-Native Job Queue

Kueue operates as a Kubernetes-native job queuing system that prioritizes AI/ML workloads and distributes resources fairly across teams:

  • When a workload is queued, it checks available resource quotas
  • If sufficient resources exist, pods are spun up; otherwise, they are queued
  • GPUs are not allocated needlessly — they are only used when there is actual work to process
  • Resource quotas and priorities across teams are managed centrally
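The admission steps in the list above can be modeled with a small queue. This is a simplified sketch under stated assumptions: it tracks one GPU quota per team and admits in arrival order, whereas Kueue itself works with Kubernetes objects and supports features such as priorities and quota borrowing that are not modeled here. `Workload` and `GpuQueue` are illustrative Python names, not Kueue's API.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    team: str
    gpus: int


class GpuQueue:
    """Toy admission queue with one GPU quota per team."""

    def __init__(self, quotas: dict[str, int]) -> None:
        self.quotas = dict(quotas)
        self.in_use = {team: 0 for team in quotas}
        self.pending: deque[Workload] = deque()
        self.running: dict[str, Workload] = {}

    def _fits(self, w: Workload) -> bool:
        return self.in_use[w.team] + w.gpus <= self.quotas[w.team]

    def _admit(self, w: Workload) -> None:
        self.in_use[w.team] += w.gpus
        self.running[w.name] = w

    def submit(self, w: Workload) -> bool:
        """Admit the workload if its team has quota left, otherwise queue it."""
        if self._fits(w):
            self._admit(w)
            return True
        self.pending.append(w)
        return False

    def finish(self, name: str) -> list[str]:
        """Free the workload's GPUs and admit any queued work that now fits."""
        done = self.running.pop(name)
        self.in_use[done.team] -= done.gpus
        admitted, still_waiting = [], deque()
        while self.pending:
            w = self.pending.popleft()
            if self._fits(w):
                self._admit(w)
                admitted.append(w.name)
            else:
                still_waiting.append(w)
        self.pending = still_waiting
        return admitted
```

In this model a training team that has used its full quota cannot block the inference team: each team's workloads are checked only against that team's own limit.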

Event-Driven Auto-Scaling with KEDA

OpenShift’s built-in auto-scaling mechanisms work with standard metrics such as CPU and memory. However, these are not the metrics that matter most for AI workloads — what truly matters is request volume hitting the model and response times.

Intelligent Scaling Based on AI Metrics

KEDA (Kubernetes Event-Driven Autoscaler) comes as the Custom Metrics Autoscaler in OpenShift and reads the metrics exposed by vLLM through Prometheus:

  • Active request count
  • KV Cache fill rate
  • Queue length

KEDA’s most critical feature is its ability to scale to zero. When there are no active requests or queued items for a defined period, it shuts down the workload entirely. An idle GPU does not translate into idle cost.
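The scaling decision can be approximated by a small function. This is a simplified sketch, not KEDA's actual algorithm: it assumes a single metric (queue length), a per-replica target, and an idle cooldown, and every parameter name and default value below is chosen for illustration.

```python
import math


def desired_replicas(
    queue_length: int,
    target_per_replica: int,
    idle_seconds: float,
    cooldown_seconds: float = 300,
    max_replicas: int = 8,
) -> int:
    """Return how many model replicas should run for the current load."""
    if queue_length == 0:
        # Scale to zero only after the workload has been idle long enough.
        return 0 if idle_seconds >= cooldown_seconds else 1
    # Otherwise run enough replicas to keep each one at or below its target.
    needed = math.ceil(queue_length / target_per_replica)
    return min(max_replicas, max(1, needed))
```

With a target of 10 queued requests per replica, a queue of 25 yields 3 replicas, and an empty queue that has been idle for longer than the cooldown yields 0, which is the point at which the GPU is released.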

Conclusion: Benefits of an Integrated Architecture

When vLLM Inference Server, llm-d, Semantic Router, Kueue, and KEDA are used together, they form a complementary optimization chain:

  • vLLM: Maximizes memory efficiency with Paged Attention
  • llm-d: Specializes GPU usage through prefill/decode separation
  • Semantic Router: Optimizes both cost and accuracy by routing the right request to the right model
  • Kueue: Manages GPU resources across teams fairly and efficiently
  • KEDA: Significantly reduces waste through scaling based on real AI metrics

The underlying idea of this architecture is simple: every resource should run when needed, and only as much as needed. The integration of these components within the Red Hat AI ecosystem makes enterprise AI infrastructures both cost-sustainable and performance-competitive.

Frequently Asked Questions (FAQ)

When should vLLM Inference Server be preferred?

In applications that receive high volumes of concurrent requests and generate prompts and responses of varying sizes, vLLM Inference Server delivers noticeably better GPU utilization rates and lower latency compared to traditional inference solutions.

Which components of this architecture does OpenShift AI offer out of the box?

vLLM comes as the default inference runtime, and KEDA is integrated into OpenShift as the Custom Metrics Autoscaler operator. Kueue requires a separate installation but runs within the OpenShift ecosystem. llm-d and Semantic Router are newer components and continue to mature within the Red Hat AI product family.

Is llm-d required for every model?

The prefill/decode separation offered by llm-d is especially valuable for workloads that require long context and high concurrency. In environments with short prompts or low traffic, standard vLLM on a single pod may be sufficient without the additional architectural complexity.

Author: Başak Buğa – VAS Engineer – Sekom
