The global AI compute market has never been more competitive — or more expensive. With NVIDIA's H100 commanding premium rental rates and the A100 still widely deployed across major cloud platforms, choosing the right GPU is a decision with real financial stakes. Rent the wrong chip and you're either paying for overkill capacity or bottlenecking your entire pipeline.
This guide focuses on the three GPUs you're most likely to encounter when renting AI compute in 2026: the H100 (Hopper), the A100 (Ampere), and the L40S (Ada Lovelace). We cover architecture, benchmarks, pricing, and specific use cases — everything you need to make a data-driven decision.
Quick Answer: Which GPU for Which Workload?
Architecture Overview: Hopper vs Ampere vs Ada Lovelace
Each GPU in this comparison was built for a different primary mission. Understanding the architecture helps you predict how each will handle your specific workload — not just the headline benchmark numbers.
NVIDIA H100 (Hopper Architecture)
The H100 is NVIDIA's ninth-generation data center GPU and the benchmark for modern AI infrastructure. Its defining features are fourth-generation Tensor Cores with FP8 support, 3.35 TB/s of HBM3 memory bandwidth, and a Transformer Engine purpose-built for attention-heavy architectures like GPT, LLaMA, and Mistral families. The SXM5 form factor enables NVLink 4.0 at 900 GB/s, making it the only viable option for clusters beyond eight GPUs.
The Transformer Engine dynamically selects FP8 or BF16 precision per layer during training — this single feature delivers 2–3× throughput improvements on attention-heavy workloads compared to the A100, without accuracy degradation.
NVIDIA A100 (Ampere Architecture)
Launched in 2020, the NVIDIA A100 defined the modern AI GPU era. Its third-generation Tensor Cores, Multi-Instance GPU (MIG) technology, and strong FP64 performance made it the go-to chip for both AI and HPC. In 2026, the A100 remains capable and is in massive deployment globally — but NVIDIA has indicated it's reaching end-of-life (EOL) status. Buying fresh A100 hardware today locks you into a legacy architecture just as model sizes and memory requirements are accelerating. For rental workloads, the A100 still makes sense in specific scenarios, but new AI infrastructure should plan for Hopper-class or newer.
The A100 is effectively legacy capacity in 2026. It remains serviceable for established ML pipelines and scientific workloads requiring FP64, but building new LLM stacks on A100 is not recommended. The 20% cost savings over the H100 don't justify the architectural gap for most modern workloads.
NVIDIA L40S (Ada Lovelace Architecture)
Released in late 2023, the L40S GPU is a deliberate hybrid — it targets data centers that need both AI compute and graphics/media acceleration without running two separate GPU fleets. It includes third-generation RT Cores, FP8 support, and 18,176 CUDA cores, making it surprisingly strong for AI inference. Its 48 GB of GDDR6 memory is bandwidth-limited compared to HBM3 (864 GB/s vs 3,350 GB/s on the H100), which constrains performance on memory-intensive training runs, but for inference of models under 30B parameters, the lower memory bandwidth is rarely the bottleneck.
The L40S is the only GPU in this comparison with RT Cores for real-time ray tracing. For organizations running mixed workloads — generative AI inference alongside 3D rendering, VFX, or digital twin pipelines — the L40S is the only chip that handles both natively.
Full Specification Comparison
| Specification | H100 SXM | A100 SXM4 | L40S |
|---|---|---|---|
| Architecture | Hopper | Ampere | Ada Lovelace |
| Memory | 80 GB HBM3 | 80 GB HBM2e | 48 GB GDDR6 |
| Memory Bandwidth | 3,350 GB/s | 2,000 GB/s | 864 GB/s |
| CUDA Cores | 16,896 | 6,912 | 18,176 |
| Tensor Cores | 528 (4th-gen) | 432 (3rd-gen) | 568 (4th-gen) |
| RT Cores | None | None | 142 (3rd-gen) |
| FP8 Tensor (TFLOPS, with sparsity) | 3,958 | N/A | 1,466 |
| BF16 Tensor (TFLOPS, with sparsity) | 1,979 | 624 | 733 |
| FP32 (TFLOPS) | 67 | 19.5 | 91.6 |
| FP64 (TFLOPS) | 34 | 9.7 | Negligible (≈1/64 of FP32) |
| NVLink Bandwidth | 900 GB/s | 600 GB/s | None |
| Form Factor | SXM5 / PCIe | SXM4 / PCIe | PCIe (dual-slot) |
| TDP | 700 W | 400 W | 350 W |
| MIG Support | Yes (7 instances) | Yes (7 instances) | No |
| Max GPU Cluster | 32+ (NVLink) | 16+ (NVLink) | 4–8 (PCIe) |
PCIe vs SXM: The H100 and A100 both come in PCIe variants, which are cheaper to rent but run at lower power limits and, in the H100's case, with slower HBM2e memory (2 TB/s vs 3.35 TB/s on SXM5). If your workload involves multi-GPU training, choose the SXM variant for the full NVLink benefit. The L40S is PCIe only.
Real-World Benchmarks
Raw spec numbers don't tell the full story. The following benchmark data comes from controlled tests on identical software stacks (PyTorch, CUDA 12.x, Transformers library) running BERT-base masked-LM training and inference workloads.
- Training throughput: BERT-base (tokens/second)
- Inference throughput: LLaMA-3 8B (tokens/second, FP16)
- Cost efficiency: cost per million training tokens
Key takeaway: Despite its higher hourly rate, the H100 GPU delivers the lowest cost-per-token for training workloads — by a significant margin — because its raw throughput far outpaces its price premium. The A100, counterintuitively, is the least cost-efficient option for modern LLM training. The L40S sits in the middle but shines for inference workloads where its lower memory bandwidth is less of a constraint.
GPU Rental Pricing in 2026
Rental prices vary significantly between providers. Hyperscalers (AWS, GCP, Azure) typically price 40–80% higher than specialized GPU clouds, but offer tighter SLA guarantees and integrated cloud ecosystems. The following represents market rates from specialized providers.
Pro tip on pricing: Always calculate cost-per-output (tokens, images, inferences) rather than comparing hourly rates directly. An H100 at $2.25/hr producing 3,000 tokens/second is dramatically cheaper per token than an A100 at $1.35/hr producing 1,300 tokens/second. For any sustained workload, run the math before committing to a cheaper hourly rate.
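That math is simple enough to script. A minimal sketch — the rates and throughputs below are the illustrative figures from this article, not live market data:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly rental rate and sustained throughput into $/1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Illustrative figures from this article:
h100 = cost_per_million_tokens(2.25, 3000)   # H100 at $2.25/hr, 3,000 tok/s
a100 = cost_per_million_tokens(1.35, 1300)   # A100 at $1.35/hr, 1,300 tok/s
print(f"H100: ${h100:.3f}/1M tokens, A100: ${a100:.3f}/1M tokens")
```

At these numbers the "expensive" H100 lands around $0.21 per million tokens versus roughly $0.29 for the "cheap" A100 — the hourly rate alone would have pointed you at the wrong GPU.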
Use Case Breakdown: Which GPU Wins Where?
LLM Training: H100 Wins Decisively
For training transformer-based models — LLaMA, Mistral, Falcon, GPT derivatives — the H100 is the clear choice. Its Transformer Engine, fourth-generation Tensor Cores, and FP8 precision deliver 2–6× better throughput than the A100 on these workloads. The memory bandwidth gap (3.35 TB/s vs 2 TB/s) means the H100 can sustain larger batch sizes without hitting memory bottlenecks, reducing time-to-accuracy by 30–50% on typical fine-tuning runs.
Inference: It Depends on Scale and Budget
For high-QPS production inference serving (thousands of requests per second), the H100 dominates — its low-latency HBM3 memory and FP8 support for quantized serving make it ideal for latency-sensitive SLAs. For cost-optimized inference of models under 20B parameters, the L40S is the smart pick: its $0.87/hr rate and solid FP8 support deliver better cost-per-token than either the A100 or the H100 at typical inference batch sizes.
Graphics + AI Hybrid: L40S Only
Neither the H100 nor A100 include RT Cores or video output capabilities. If your platform needs to combine AI inference with 3D rendering, real-time ray tracing, VFX pipelines, or digital twin visualization, the L40S is the only data center GPU that handles both natively. This makes it uniquely positioned for industries like architecture, automotive, media production, and medical imaging.
Scientific Computing: A100's Last Stronghold
The A100's standout feature in 2026 is its FP64 double-precision performance at 9.7 TFLOPS — significantly higher than the L40S's near-zero FP64 capability, and relevant for molecular dynamics simulations, quantum chemistry, and CFD workloads. If your HPC pipeline genuinely requires FP64 precision, the A100 remains the right choice. The H100 also supports FP64 at 34 TFLOPS and is technically superior, but the cost delta may not be justified for pure FP64 workloads.
Decision Framework: Which GPU Should You Rent?
Work through these questions in order. Your optimal GPU choice should be clear by the end.
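One way to encode that decision process is as a small function. This is a toy sketch — the thresholds come from the sections above, not from NVIDIA, and real deployments should validate them against measured workloads:

```python
def pick_gpu(needs_fp64: bool, needs_graphics: bool,
             model_params_b: float, training: bool, multi_node: bool) -> str:
    """Toy encoding of this guide's recommendations."""
    if needs_fp64:
        return "A100"   # last stronghold: strong FP64 for HPC simulations
    if needs_graphics:
        return "L40S"   # only option here with RT Cores for rendering
    if multi_node or (training and model_params_b > 20):
        return "H100"   # NVLink topology plus HBM3 bandwidth
    if not training and model_params_b <= 20:
        return "L40S"   # best cost-per-token for small-model inference
    return "H100"       # default for everything else

print(pick_gpu(needs_fp64=False, needs_graphics=False,
               model_params_b=8, training=False, multi_node=False))
```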
Frequently Asked Questions
Is the NVIDIA A100 still worth renting in 2026?
For most AI workloads, no. The A100 is approaching end-of-life status and delivers worse cost-per-token than the H100 for LLM training and than the L40S for inference. Its strong FP64 performance keeps it relevant for scientific computing and HPC simulations. If you're running established pipelines that are fully optimized for the Ampere architecture and have stable workloads, continuing to use A100 instances makes more sense than migrating mid-project. But new AI infrastructure should not be built on the A100.
Can the L40S replace the A100 for AI training?
For small to mid-scale training runs, yes. An 8×L40S configuration outperforms an 8×A100 system by approximately 1.7× in AI training throughput, largely due to its higher CUDA core count, stronger FP32 throughput, and FP8 support. The critical limitation is the 48 GB GDDR6 memory ceiling — models larger than approximately 20B parameters in FP16 won't fit on a single L40S, whereas the A100 can hold 30B+ parameter models. The L40S also can't scale beyond 4–8 GPUs without NVLink, ruling it out for frontier model training.
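A quick way to sanity-check that memory ceiling yourself — this estimates weights only; activations, optimizer state, and KV cache add substantially more on top:

```python
def weights_gib(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM needed for model weights alone.
    FP16/BF16 = 2 bytes per parameter; excludes activations,
    optimizer state, and KV cache, which add substantially more."""
    return num_params_billion * 1e9 * bytes_per_param / 2**30

# 20B params in FP16: ~37 GiB of weights -- already tight on a
# 48 GB L40S once activations are counted, comfortable on 80 GB.
print(f"{weights_gib(20):.1f} GiB")
```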
What about the NVIDIA H200? Should I wait?
The H200 shares the Hopper architecture with the H100 but adds 141 GB of HBM3e memory at 4.8 TB/s bandwidth — nearly double the H100's memory. For workloads that are memory-bound (100B+ parameter models, long-context inference, very large batch sizes), the H200 provides a meaningful upgrade. However, H200 availability remains limited and rental prices are significantly higher. If your models comfortably fit within 80 GB and you're not hitting memory ceilings, the H100 remains the sweet spot for the next 12–18 months.
How much does it cost to train LLaMA-3 8B on each GPU?
Estimated single-GPU fine-tuning cost for LLaMA-3 8B (LoRA, 1 epoch on a 10M token dataset): H100 ~$18, A100 ~$28, L40S ~$22. The H100 wins on total cost despite its higher hourly rate because fine-tuning completes roughly 2× faster. For full pre-training at scale, the cost advantage of the H100 compounds further across hundreds of GPU-hours.
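The arithmetic behind estimates like these is straightforward. A hedged sketch — the ~350 tokens/s effective throughput below is an assumption for illustration, and must be measured on your own stack, since it varies widely with batch size, sequence length, and LoRA configuration:

```python
def finetune_cost(dataset_tokens: float, tokens_per_second: float,
                  hourly_rate_usd: float, epochs: int = 1) -> float:
    """Estimated rental cost for a fine-tuning run.
    tokens_per_second is effective training throughput, measured
    on your own stack -- not the GPU's peak inference number."""
    hours = dataset_tokens * epochs / (tokens_per_second * 3600)
    return hours * hourly_rate_usd

# Assumed ~350 tok/s effective LoRA throughput on an H100 at $2.25/hr
# over a 10M-token dataset lands near the ~$18 figure quoted above.
print(f"${finetune_cost(10e6, 350, 2.25):.2f}")
```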
Does the L40S support FP8 precision like the H100?
Yes. The L40S includes FP8 support through its Ada Lovelace architecture, which is a key differentiator over the A100 (which lacks FP8). This enables quantized inference and mixed-precision training with FP8, making the L40S more capable for modern LLM inference pipelines than its memory bandwidth numbers alone would suggest.
Which GPU is best for Stable Diffusion and image generation?
The L40S. Image generation workloads benefit from high FP32 performance (91.6 TFLOPS on the L40S vs 67 on the H100), its GDDR6 memory at 864 GB/s is sufficient for diffusion models, and the lower rental cost allows longer batch generation runs. The L40S also includes hardware-accelerated video encoding/decoding, useful for video diffusion models. The H100 is technically faster but represents significant overkill for most image generation pipelines.
Final Verdict
There is no universally "best" GPU — only the best GPU for your specific workload, budget, and scale requirements. The mistake too many teams make is renting H100s for workloads that would run more cost-efficiently on an L40S, or cutting costs with L40S instances for training jobs that genuinely need the H100's memory bandwidth and NVLink topology.
The most successful AI teams architect heterogeneous environments — H100s for training runs, L40S instances for inference serving, and (where it exists) A100 legacy capacity for established pipelines. This workload-matched approach consistently delivers better ROI than committing to a single GPU type across all infrastructure.