OpenShift Kubernetes Platform for LLMs & AI Workloads
Client
The client is a large telecommunications company operating in a highly regulated environment. They required an internal AI platform capable of hosting and operating Large Language Models (LLMs) while complying with strict security, data sovereignty, and operational policies.
Due to the sensitive nature of telecommunications data, all AI workloads needed to run on an isolated, controlled Kubernetes platform with no direct external connectivity. The platform also had to be robust enough to support future AI use cases beyond LLM inference.
Challenge
Building an AI platform for a telecommunications environment introduced several complex challenges.
Security and compliance
All workloads had to run in a fully isolated environment, enforcing strict access controls, auditable operations, and data locality guarantees.
Limited platform knowledge
Development teams had limited experience with OpenShift and Kubernetes-based AI platforms, increasing the risk of misconfiguration and slow adoption.
Performance requirements for LLM workloads
Large language models required GPU acceleration, fast local storage, and optimized scheduling to ensure acceptable model load times and inference performance.
Operational reliability
The platform needed robust monitoring, backup, and recovery mechanisms suitable for production use in a telecom environment.
Solution Overview
We designed and implemented a secure, production-ready OpenShift platform optimized for AI and LLM workloads. The solution emphasized strong isolation, predictable performance, and operational reliability while enabling development teams to adopt modern GitOps workflows.
The platform was deployed as a fully air-gapped environment, with controlled software supply chains and internal registries. GPU-enabled worker nodes and high-performance storage were integrated to support model training and inference, while GitOps-based deployment pipelines reduced operational risk and improved consistency across environments.
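As a concrete illustration of what a controlled software supply chain involves: disconnected OpenShift installations typically mirror release and operator images into the internal registry using the oc-mirror tool. The sketch below shows what such an image set configuration could look like; the registry hostname, OpenShift version, and operator selection are illustrative assumptions, not details from the engagement.

```yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  registry:
    # Hypothetical internal registry used to store mirror metadata
    imageURL: registry.internal.example.com/mirror/oc-mirror-metadata
    skipTLS: false
mirror:
  platform:
    channels:
      # OpenShift release channel to mirror (version is an assumption)
      - name: stable-4.14
        type: ocp
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
      packages:
        - name: odf-operator              # OpenShift Data Foundation
        - name: openshift-gitops-operator # Argo CD-based GitOps
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.14
      packages:
        - name: gpu-operator-certified    # NVIDIA GPU Operator
  additionalImages:
    - name: registry.redhat.io/ubi9/ubi:latest
```

Running oc-mirror against a configuration like this produces an archive that can be carried across the air gap and pushed into the internal registry, keeping the cluster's entire software inventory under explicit control.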
Platform Architecture & Design Decisions
Several key architectural decisions were made to support secure and performant AI workloads.
GPU-enabled worker nodes were introduced to handle LLM inference and experimentation. Local NVMe storage was used to minimize model load times, while OpenShift Data Foundation (Ceph) provided reliable persistent storage for platform services and shared data.
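To show how these decisions surface in workload manifests, here is a minimal sketch of an inference Deployment that requests a GPU through the device plugin resource exposed by the NVIDIA GPU Operator and mounts model storage from a persistent volume claim. The container image, PVC name, and node taint convention are hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      # Schedule only onto GPU worker nodes (label set by GPU Feature
      # Discovery) and tolerate the taint commonly applied to keep
      # general workloads off scarce GPU capacity.
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inference
          image: registry.internal.example.com/ai/llm-server:latest  # hypothetical internal image
          resources:
            limits:
              nvidia.com/gpu: 1  # GPU resource exposed by the NVIDIA device plugin
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-nvme  # hypothetical PVC backed by a local NVMe StorageClass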
GitOps was adopted as the primary deployment model using OpenShift GitOps (Argo CD) and GitLab. This ensured all changes were version-controlled, auditable, and repeatable — a critical requirement for regulated environments.
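A minimal example of the pattern follows; the repository URL, path, and target namespace are placeholders. Each environment is described by an Argo CD Application that tracks a GitLab repository, so the cluster converges on whatever has been committed and reviewed in Git.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-platform
  namespace: openshift-gitops  # namespace used by the OpenShift GitOps operator
spec:
  project: default
  source:
    repoURL: https://gitlab.internal.example.com/platform/llm-manifests.git  # hypothetical repo
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: llm-inference
  syncPolicy:
    automated:
      prune: true     # remove resources that were deleted from Git
      selfHeal: true  # revert out-of-band changes on the cluster
    syncOptions:
      - CreateNamespace=true
```

With prune and selfHeal enabled, any drift from the Git-defined state is automatically reverted, which is what makes every change traceable to a reviewed merge request.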
Results & Value Delivered
The engagement resulted in a secure, enterprise-grade AI platform built on OpenShift.
- Delivered a fully air-gapped Kubernetes platform suitable for telecom-grade AI workloads
- Enabled high-performance LLM hosting using NVIDIA GPUs and NVMe storage
- Improved platform reliability and observability through centralized monitoring
- Reduced deployment risk using GitOps-based CI/CD pipelines
- Increased developer confidence and adoption of OpenShift for AI workloads
The platform now serves as a foundation for future AI initiatives within the organization.
Technologies Used
- Red Hat OpenShift
- Kubernetes
- NVIDIA GPUs
- OpenShift Data Foundation (Ceph)
- NVMe local storage
- OpenShift GitOps (Argo CD)
- GitLab
- Velero
- Prometheus & Grafana
Concept
The platform was designed around several core principles:
Fully air-gapped environment
Ensured strict security and data sovereignty compliance.
GPU acceleration
NVIDIA GPUs enabled efficient execution of LLM inference and AI workloads.
Reliable persistent storage
OpenShift Data Foundation (Ceph) provided durable storage for platform services.
High-performance local storage
NVMe storage reduced model load times and improved overall performance.
GitOps-based CI/CD
OpenShift GitOps, GitLab, and Argo CD enabled controlled, repeatable deployments.
Backup and disaster recovery
Velero was used to support backup, restore, and disaster recovery workflows (see the sketch after this list).
Enterprise monitoring
Prometheus and Grafana were integrated with the company’s central monitoring systems.
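To illustrate the backup workflow mentioned above, the sketch below defines a nightly Velero Schedule. The namespaces, retention period, and cron expression are illustrative assumptions rather than the engagement's actual policy.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: platform-nightly
  namespace: velero  # often openshift-adp when installed via the OADP operator
spec:
  schedule: "0 2 * * *"  # run nightly at 02:00
  template:
    includedNamespaces:
      - llm-inference
      - openshift-gitops
    snapshotVolumes: true  # include persistent volume snapshots
    ttl: 720h              # retain backups for 30 days
```

A schedule like this gives operators a predictable restore point for both workloads and platform configuration, which complements the Git-based recovery path that GitOps already provides.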
Key Takeaways
- OpenShift is well-suited for secure, enterprise AI and LLM platforms
- Air-gapped Kubernetes environments require careful planning of software supply chains
- GPU scheduling and storage performance are critical for LLM workloads
- GitOps significantly reduces operational risk in regulated environments
Schedule a Meeting Now
Struggling with complex AWS environments, a Kubernetes cluster that isn’t working, or a need for guidance on implementing scalable and secure solutions? Schedule a free one-hour consultation with our experts today. We’ll discuss your unique challenges and identify opportunities for improvement.
Contact Us
