OpenShift Kubernetes Platform for LLMs & AI Workloads
Client
The client is a large telecommunications company operating in a highly regulated environment. They required an internal AI platform capable of hosting and operating Large Language Models (LLMs) while complying with strict security, data sovereignty, and operational policies.
Due to the sensitive nature of telecommunications data, all AI workloads needed to run on an isolated, controlled Kubernetes platform with no direct external connectivity. The platform also had to be robust enough to support future AI use cases beyond LLM inference.
Challenge
Building an AI platform for a telecommunications environment introduced several complex challenges.
Security and compliance
All workloads had to run in a fully isolated environment, enforcing strict access controls, auditable operations, and data locality guarantees.
Limited platform knowledge
Development teams had limited experience with OpenShift and Kubernetes-based AI platforms, increasing the risk of misconfiguration and slow adoption.
Performance requirements for LLM workloads
Large language models required GPU acceleration, fast local storage, and optimized scheduling to ensure acceptable model load times and inference performance.
Operational reliability
The platform needed robust monitoring, backup, and recovery mechanisms suitable for production use in a telecom environment.
Solution Overview
We designed and implemented a secure, production-ready OpenShift platform optimized for AI and LLM workloads. The solution emphasized strong isolation, predictable performance, and operational reliability while enabling development teams to adopt modern GitOps workflows.
The platform was deployed as a fully air-gapped environment, with controlled software supply chains and internal registries. GPU-enabled worker nodes and high-performance storage were integrated to support model training and inference, while GitOps-based deployment pipelines reduced operational risk and improved consistency across environments.
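As a concrete illustration of what a controlled software supply chain involves: disconnected OpenShift installations typically mirror release and operator images into the internal registry using the oc-mirror tool. The sketch below shows what such an image set configuration could look like; the registry hostname, OpenShift version, and operator selection are illustrative assumptions, not details from the engagement.

```yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  registry:
    # Hypothetical internal registry used to store mirror metadata
    imageURL: registry.internal.example.com/mirror/oc-mirror-metadata
    skipTLS: false
mirror:
  platform:
    channels:
      # OpenShift release channel to mirror (version is an assumption)
      - name: stable-4.14
        type: ocp
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
      packages:
        - name: odf-operator              # OpenShift Data Foundation
        - name: openshift-gitops-operator # Argo CD-based GitOps
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.14
      packages:
        - name: gpu-operator-certified    # NVIDIA GPU Operator
  additionalImages:
    - name: registry.redhat.io/ubi9/ubi:latest
```

Running oc-mirror against a configuration like this produces an archive that can be carried across the air gap and pushed into the internal registry, keeping the cluster's entire software inventory under explicit control.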
Platform Architecture & Design Decisions
Several key architectural decisions were made to support secure and performant AI workloads.
GPU-enabled worker nodes were introduced to handle LLM inference and experimentation. Local NVMe storage was used to minimize model load times, while OpenShift Data Foundation (Ceph) provided reliable persistent storage for platform services and shared data.
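To show how these decisions surface in workload manifests, here is a minimal sketch of an inference Deployment that requests a GPU through the device plugin resource exposed by the NVIDIA GPU Operator and mounts model storage from a persistent volume claim. The container image, PVC name, and node taint convention are hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      # Schedule only onto GPU worker nodes (label set by GPU Feature
      # Discovery) and tolerate the taint commonly applied to keep
      # general workloads off scarce GPU capacity.
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inference
          image: registry.internal.example.com/ai/llm-server:latest  # hypothetical internal image
          resources:
            limits:
              nvidia.com/gpu: 1  # GPU resource exposed by the NVIDIA device plugin
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-nvme  # hypothetical PVC backed by a local NVMe StorageClass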
GitOps was adopted as the primary deployment model using OpenShift GitOps (Argo CD) and GitLab. This ensured all changes were version-controlled, auditable, and repeatable — a critical requirement for regulated environments.
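A minimal example of the pattern follows; the repository URL, path, and target namespace are placeholders. Each environment is described by an Argo CD Application that tracks a GitLab repository, so the cluster converges on whatever has been committed and reviewed in Git.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-platform
  namespace: openshift-gitops  # namespace used by the OpenShift GitOps operator
spec:
  project: default
  source:
    repoURL: https://gitlab.internal.example.com/platform/llm-manifests.git  # hypothetical repo
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: llm-inference
  syncPolicy:
    automated:
      prune: true     # remove resources that were deleted from Git
      selfHeal: true  # revert out-of-band changes on the cluster
    syncOptions:
      - CreateNamespace=true
```

With prune and selfHeal enabled, any drift from the Git-defined state is automatically reverted, which is what makes every change traceable to a reviewed merge request.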
Results & Value Delivered
The engagement resulted in a secure, enterprise-grade AI platform built on OpenShift.
- Delivered a fully air-gapped Kubernetes platform suitable for telecom-grade AI workloads
- Enabled high-performance LLM hosting using NVIDIA GPUs and NVMe storage
- Improved platform reliability and observability through centralized monitoring
- Reduced deployment risk using GitOps-based CI/CD pipelines
- Increased developer confidence and adoption of OpenShift for AI workloads
The platform now serves as a foundation for future AI initiatives within the organization.
Technologies Used
- Red Hat OpenShift
- Kubernetes
- NVIDIA GPUs
- OpenShift Data Foundation (Ceph)
- NVMe local storage
- OpenShift GitOps (Argo CD)
- GitLab
- Velero
- Prometheus & Grafana
Concept
The platform was designed around several core principles:
Fully air-gapped environment
Ensured strict security and data sovereignty compliance.
GPU acceleration
NVIDIA GPUs enabled efficient execution of LLM inference and AI workloads.
Reliable persistent storage
OpenShift Data Foundation (Ceph) provided durable storage for platform services.
High-performance local storage
NVMe storage reduced model load times and improved overall performance.
GitOps-based CI/CD
OpenShift GitOps, GitLab, and Argo CD enabled controlled, repeatable deployments.
Backup and disaster recovery
Velero was used to support backup, restore, and disaster recovery workflows (see the sketch after this list).
Enterprise monitoring
Prometheus and Grafana were integrated with the company’s central monitoring systems.
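To illustrate the backup workflow mentioned above, the sketch below defines a nightly Velero Schedule. The namespaces, retention period, and cron expression are illustrative assumptions rather than the engagement's actual policy.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: platform-nightly
  namespace: velero  # often openshift-adp when installed via the OADP operator
spec:
  schedule: "0 2 * * *"  # run nightly at 02:00
  template:
    includedNamespaces:
      - llm-inference
      - openshift-gitops
    snapshotVolumes: true  # include persistent volume snapshots
    ttl: 720h              # retain backups for 30 days
```

A schedule like this gives operators a predictable restore point for both workloads and platform configuration, which complements the Git-based recovery path that GitOps already provides.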
Key Takeaways
- OpenShift is well-suited for secure, enterprise AI and LLM platforms
- Air-gapped Kubernetes environments require careful planning of software supply chains
- GPU scheduling and storage performance are critical for LLM workloads
- GitOps significantly reduces operational risk in regulated environments
Schedule a Meeting Now
Struggling with complex AWS environments, a Kubernetes cluster that isn’t working, or a need for guidance on implementing scalable and secure solutions? Schedule a free one-hour consultation with our experts today. We’ll discuss your unique challenges and identify opportunities for improvement.
Contact Us
