How Much VRAM Do You Really Need? A Complete Guide for LLM Users
Demystify VRAM requirements for running local LLMs. Learn the VRAM formula, understand overhead, and find out which models fit your GPU — from 4 GB to 48 GB cards.
VRAM is the hard limit for local LLMs. Unlike system RAM, you cannot simply add more when you run out. This guide explains exactly how much VRAM you need and how to calculate it for any model.
The VRAM Formula
For a quantized model, VRAM usage is approximately:
VRAM = (Parameters × Bits per parameter / 8) + Overhead
Where:
- Parameters: The model size in billions (e.g., 7.0 for Llama 3.1 8B)
- Bits per parameter: 16 for FP16, 8 for Q8_0, 4 for Q4_K_M
- Overhead: Typically 1.5–3 GB for KV cache, activations, and context
VRAM by Model Size
Small Models (1–3B parameters)
These run on almost anything with a GPU:
- Q4_K_M: 1–2.5 GB — works on 4 GB cards like GTX 1650
- Q8_0: 2–4 GB — runs on 6 GB cards like RTX 2060
Best for: coding assistants, small chatbots, embedded applications
Mid-Range Models (7–8B parameters)
The most popular tier for local use:
- Q4_K_M: 4–5.5 GB — runs on RTX 3060 (12 GB), RTX 4060 (8 GB)
- Q8_0: 7–9 GB — runs on RTX 4070 (12 GB), RTX 3080 (10 GB)
Best for: general chat, writing, summarization, RAG
Large Models (13–35B parameters)
Serious performance requires serious hardware:
- Q4_K_M: 7–14 GB — runs on RTX 4080 (16 GB), RTX 3090 (24 GB)
- Q8_0: 14–28 GB — needs RTX 4090 (24 GB) or dual-GPU setups
Best for: complex reasoning, long-form writing, code generation
Massive Models (70B+ parameters)
- Q4_K_M: 35 GB+ — needs A6000 (48 GB) or Apple M2 Ultra
- Q2_K: 18–20 GB — borderline on RTX 4090 (24 GB)
Best for: research, advanced reasoning, maximum quality
Common Misconceptions
"More VRAM = Faster" Not exactly. More VRAM lets you run larger models, not the same model faster. A 7B Q4 model on an RTX 4090 runs at roughly the same speed as on an RTX 4080 — but the RTX 4090 can also run a 13B Q8 model that won't fit on the 4080.
"System RAM Can Substitute for VRAM" Technically yes (via CPU offloading), but practically no. Once you spill into system RAM, token generation slows to 1–3 tok/s — unusably slow for interactive chat. VRAM is 10–30× faster than system RAM for LLM inference.
"You Need 24 GB to Run Anything Good" Absolutely not. A 7B Q4 model running on 8 GB of VRAM can produce excellent results. Modern quantization has made local LLMs accessible to nearly every gaming PC from the last 5 years.
Quick Reference
| GPU VRAM | Best Model Tier | Recommended Quant | |----------|----------------|-------------------| | 4–6 GB | 1–3B | Q4_K_M | | 8 GB | 7–8B | Q4_K_M | | 12 GB | 7–8B | Q8_0 | | 16 GB | 7–13B | Q8_0 / Q4_K_M | | 24 GB | 13–35B | Q8_0 / Q4_K_M | | 48 GB | 35–70B+ | Q8_0 / Q4_K_M |
Use our Hardware Detection Tool to automatically find the best models for your GPU.