
Optimizing Infrastructure for Large Language Models: What Really Matters on the Hardware Side

Robin Mader

published on September 2, 2025

Organizations are increasingly introducing use cases based on large language models (LLMs). However, some of them cannot use commercial model APIs such as OpenAI, Azure OpenAI, or AWS Bedrock by default, because their data or use cases are subject to regulations that prohibit processing in the cloud.

If they don't want to be left behind, they need to host the models on their own infrastructure.

Below, we outline the key requirements you need to consider, the typical challenges you’ll encounter, and pragmatic ways to optimize your infrastructure.

1. Understand the Hardware Requirements

Which LLM are you deploying?

While smaller models may run on a few GPUs, state-of-the-art models often require several hundred gigabytes to around a terabyte of VRAM, spread across eight or more GPUs (for example, Qwen3 235B, one of the most popular current open-weight models, needs about 600 GB for full performance). In addition, disk space is a critical factor: storing a model typically requires at least as much disk capacity as VRAM. Although storage is significantly cheaper than VRAM, it should not be overlooked when planning infrastructure. Misjudging these requirements leads directly to bottlenecks and prevents efficient fine-tuning or inference.

Recommendation: Build a clear model profile upfront including memory needs, latency requirements, and batch sizes.
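
As a starting point for such a profile, a simple back-of-the-envelope calculation already helps. The sketch below estimates VRAM from the parameter count; the bytes-per-weight value and the 20% overhead factor for KV cache and runtime buffers are rough assumptions for illustration, not exact figures.

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,    # fp16/bf16 weights
                     overhead_factor: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus ~20% assumed overhead for
    KV cache, activations, and runtime buffers."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes * overhead_factor / 1e9

# Illustrative profiles; replace with the models you actually evaluate.
for name, params in [("8B model", 8), ("Qwen3 235B", 235)]:
    print(f"{name}: ~{estimate_vram_gb(params):.0f} GB VRAM at 16-bit precision")
```

For Qwen3 235B this lands at roughly 560 GB, in line with the figure mentioned above; real deployments should still be validated against the serving framework's own memory reporting.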

2. Design Your Infrastructure for Current but also Future Needs

Workloads change over time and place new requirements on the infrastructure, and optimizing infrastructure typically means changing it. For small LLMs, conventional GPU servers are often sufficient. Larger LLMs, however, typically require specialized servers that can host a high number of GPUs in parallel. While GPUs can be connected across servers via the network, this usually results in significantly higher latency, which is often a deal breaker for interactive LLM systems. Maintaining flexibility in your setup ensures that infrastructure can scale from small experiments to production workloads without costly redesigns.

Example: In practice, consolidating GPUs on a single server often reduces latency significantly compared to spreading them across machines.
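
How this plays out depends on the serving stack. The sketch below assumes vLLM as the inference server and uses a placeholder model name: tensor parallelism across GPUs on the same host is the default path, while spanning hosts is something you only enable if a single server cannot hold the model.

```python
# Sketch, assuming vLLM is installed; model name and GPU counts are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B",   # placeholder: use the model you profiled
    tensor_parallel_size=8,          # shard the model across 8 GPUs on the SAME server
    # pipeline_parallel_size=2,      # only if you must span servers; adds latency
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```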

3. GPUs as the Core: Why NVIDIA Leads

For production-ready LLMs, GPUs are not optional – they are the core of the system. The choice of hardware and vendor matters: mixing different GPUs in one setup often leads to compatibility issues and suboptimal performance. NVIDIA remains the market leader, offering the best combination of raw performance and a mature software ecosystem (CUDA, TensorRT). While alternative vendors can sometimes provide cost advantages, they often lag in software optimization, which makes NVIDIA the safer choice for most enterprise deployments.

Recommendation: Plan realistically. Complex models typically require multiple GPUs on the same server, not distributed across clusters or even data centers.
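
A quick sanity check before committing to a setup is to confirm that all visible GPUs are actually the same model. A minimal sketch, assuming PyTorch with CUDA support is installed:

```python
import torch

# List the device name of every visible GPU and warn on mixed setups.
names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
print(names)
if len(set(names)) > 1:
    print("Warning: mixed GPU types detected; expect compatibility and performance issues.")
```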

4. Create Space for Testing

Without dedicated test environments, optimizations often end up happening directly in production, a high-risk strategy.

Best Practice: Set up a sandbox where GPU configurations, memory allocation, and quantization strategies can be tested safely without affecting live workloads.
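
One lightweight way to carve out such a sandbox on a shared server is to restrict experiments to spare GPUs so production processes never see them. A minimal sketch; the GPU indices are illustrative and must match your own server layout:

```python
import os

# Must be set before any CUDA initialization (i.e. before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"   # reserve GPUs 6 and 7 for experiments

import torch
print(torch.cuda.device_count())  # this process now only sees the sandbox GPUs
```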

5. Budget Realistically

GPUs are expensive, and larger models require many of them. Beyond purchase costs, organizations must account for energy consumption, cooling, and floor space, as well as maintenance and potential hardware replacement. In practice, failures do occur: recently, we saw a customer face unexpected downtime and costs because a GPU server had to be replaced on short notice. While electricity is often a smaller portion of total costs, it adds up in continuous operation.

Recommendation: Build a Total Cost of Ownership (TCO) model that includes acquisition, operation and scalability.
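
The sketch below shows what a minimal TCO calculation can look like over a three-year horizon. All figures (GPU price, power draw, electricity price, maintenance rate) are placeholder assumptions for illustration, not real quotes.

```python
# Minimal 3-year TCO sketch; every number here is an assumed placeholder.
gpu_count, gpu_price_eur = 8, 30_000
acquisition = gpu_count * gpu_price_eur

power_kw, hours, eur_per_kwh = 6.0, 3 * 8760, 0.25   # continuous operation
energy = power_kw * hours * eur_per_kwh

maintenance = 0.10 * acquisition * 3                  # assumed 10% per year

tco = acquisition + energy + maintenance
print(f"3-year TCO: ~{tco:,.0f} EUR (energy share: {energy / tco:.0%})")
```

Even with these rough numbers, energy ends up as a visible but clearly smaller share than acquisition, which matches the pattern we see in practice.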

6. Plan for Redundancy

Failover deployments and dedicated development environments demand extra hardware and therefore additional budget. If not accounted for early, this can lead to unpleasant surprises.

7. Typical Optimization Challenges

  • Rising hardware requirements: Each new LLM generation tends to demand more VRAM and compute. Model upgrades are almost always tied to new hardware purchases.
  • Slow internal processes: Trial-and-error is part of optimization. Long approval chains or rigid change processes slow everything down.
  • Quantization and compression: They can drastically reduce VRAM needs with minimal quality loss, but striking the right balance is difficult and technically complex (see the sketch after this list).
  • Data protection and location: Hardware location often matters for compliance. On-prem may sound straightforward, but for international companies “on-prem” can still mean far away from where data is generated.
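
As an illustration of the quantization point, the sketch below loads a model in 4-bit precision with Hugging Face transformers and bitsandbytes (assuming both libraries are installed; the model name is a placeholder). The hard part is not the loading itself but validating, on your own evaluation data, that the quality loss stays acceptable.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization cuts weight memory to roughly a quarter of fp16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "Qwen/Qwen2.5-7B-Instruct"   # placeholder model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```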

Key Takeaways

  • Do the analysis before purchasing: Understand the LLM’s exact requirements first.
  • Optimize GPU setup: Keep GPUs physically close to minimize latency.
  • Invest in testing environments: Sandbox setups are critical for safe optimization.
  • Budget holistically: Include acquisition, operation, and redundancy costs.
  • Use quantization carefully: Reduce memory usage without sacrificing too much model quality.
  • Ensure infrastructure flexibility: The ability to quickly adjust and experiment with hardware setups is essential – especially in larger organizations where rigid approval processes and ticket systems can slow down innovation.

Many companies underestimate the complexity of LLM infrastructure until bottlenecks appear in production. A structured approach to hardware planning helps avoid costly mistakes. If you want to validate your setup or explore optimization paths, our team is ready to share best practices from real-world projects.

Reach out for an expert exchange. Contact us now!


Robin Mader
Senior Software Engineer
