How Much VRAM Do You Really Need? A Complete Guide for LLM Users

VRAM is the hard limit for local LLMs. Unlike system RAM, you cannot simply add more when you run out. This guide explains exactly how much VRAM you need and how to calculate it for any model.

The VRAM Formula

For a quantized model, VRAM usage is approximately:

VRAM = (Parameters × Bits per parameter / 8) + Overhead

Where:

Parameters: The model size in billions (e.g., 7.0 for Llama 3.1 8B)
Bits per parameter: 16 for FP16, 8 for Q8_0, 4 for Q4_K_M
Overhead: Typically 1.5–3 GB for KV cache, activations, and context

VRAM by Model Size

Small Models (1–3B parameters)

These run on almost anything with a GPU:

Q4_K_M: 1–2.5 GB — works on 4 GB cards like GTX 1650
Q8_0: 2–4 GB — runs on 6 GB cards like RTX 2060

Best for: coding assistants, small chatbots, embedded applications

Mid-Range Models (7–8B parameters)

The most popular tier for local use:

Q4_K_M: 4–5.5 GB — runs on RTX 3060 (12 GB), RTX 4060 (8 GB)
Q8_0: 7–9 GB — runs on RTX 4070 (12 GB), RTX 3080 (10 GB)

Best for: general chat, writing, summarization, RAG

Large Models (13–35B parameters)

Serious performance requires serious hardware:

Q4_K_M: 7–14 GB — runs on RTX 4080 (16 GB), RTX 3090 (24 GB)
Q8_0: 14–28 GB — needs RTX 4090 (24 GB) or dual-GPU setups

Best for: complex reasoning, long-form writing, code generation

Massive Models (70B+ parameters)

Q4_K_M: 35 GB+ — needs A6000 (48 GB) or Apple M2 Ultra
Q2_K: 18–20 GB — borderline on RTX 4090 (24 GB)

Best for: research, advanced reasoning, maximum quality

Common Misconceptions

"More VRAM = Faster" Not exactly. More VRAM lets you run larger models, not the same model faster. A 7B Q4 model on an RTX 4090 runs at roughly the same speed as on an RTX 4080 — but the RTX 4090 can also run a 13B Q8 model that won't fit on the 4080.

"System RAM Can Substitute for VRAM" Technically yes (via CPU offloading), but practically no. Once you spill into system RAM, token generation slows to 1–3 tok/s — unusably slow for interactive chat. VRAM is 10–30× faster than system RAM for LLM inference.

"You Need 24 GB to Run Anything Good" Absolutely not. A 7B Q4 model running on 8 GB of VRAM can produce excellent results. Modern quantization has made local LLMs accessible to nearly every gaming PC from the last 5 years.

Quick Reference

| GPU VRAM | Best Model Tier | Recommended Quant | |----------|----------------|-------------------| | 4–6 GB | 1–3B | Q4_K_M | | 8 GB | 7–8B | Q4_K_M | | 12 GB | 7–8B | Q8_0 | | 16 GB | 7–13B | Q8_0 / Q4_K_M | | 24 GB | 13–35B | Q8_0 / Q4_K_M | | 48 GB | 35–70B+ | Q8_0 / Q4_K_M |

Use our Hardware Detection Tool to automatically find the best models for your GPU.