Introduction: Which model should you run?
Llama 4 comes in several sizes, but only the first two in the table below are realistically manageable on a consumer-grade PC. First, decide on your target based on your hardware.
| Model | VRAM Required (4-bit) | Recommended GPU / Environment |
|---|---|---|
| Llama 4 12B (Scout) | 8GB - 10GB | RTX 3060 / 4060 Ti (12GB/16GB) recommended. *Can just run on 8GB VRAM, but with no headroom. |
| Llama 4 70B | 24GB - 48GB | 1x RTX 3090/4090 (24GB) with a 4-bit GGUF, or Mac Studio (M2/M3 Max, 64GB+) |
| Llama 4 405B | 250GB+ | Not runnable on consumer PCs (Requires 4-8x H100) |
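As a rough rule of thumb behind these numbers: 4-bit quantization stores each parameter in about half a byte, so a 12B model needs roughly 12 × 0.5 ≈ 6 GB for the weights alone, plus a few more GB for the KV cache and runtime overhead, which is how you arrive at the 8GB-10GB row above. Even lower-bit formats (Q2/Q3) shrink the weights further, which is what makes the 70B model viable on a single 24GB card.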
Method A: Fastest Setup with Ollama
While it uses a terminal (the "black screen"), this is actually the simplest method. A Web UI can be added later.
Install Ollama
Go to the official site (ollama.com), click "Download for Windows," and run the installer.
Once installed, confirm the 🦙 icon is present in your system tray (bottom right).
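You can also confirm the installation from a terminal; if a version number prints, the CLI is ready:

```
ollama --version
```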
Run Llama 4
Open PowerShell or Command Prompt and enter one of the following commands; the model will be downloaded and started automatically.
12B Model (Mainstream)
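The exact model tag depends on what is published in the Ollama library (ollama.com/library); the tag below is a plausible example for the Scout variant, so check the library page if it isn't found:

```
ollama run llama4:scout
```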
* If you have a high-spec PC capable of running 70B:
70B Model (High-end)
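Again, treat the tag as a placeholder and substitute the actual 70B tag from the Ollama library:

```
# placeholder tag - check ollama.com/library for the published one
ollama run llama4:70b
```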
A `>>>` prompt will appear once the model finishes loading. Try talking to it! (Type `/bye` to exit.)
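The `>>>` prompt is the quickest way to test, but Ollama also serves a local HTTP API on port 11434; this is what Web UIs connect to later. A quick check from Command Prompt (in PowerShell, use `curl.exe` instead of `curl`), adjusting the model tag to whatever you pulled:

```
curl http://localhost:11434/api/generate -d "{\"model\": \"llama4:scout\", \"prompt\": \"Hello\", \"stream\": false}"
```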
Method B: Visual Interaction with LM Studio
Recommended for those who prefer not to use command lines or want to fine-tune GPU settings visually.
Install LM Studio
Go to the official site (lmstudio.ai), click "Download LM Studio for Windows," and install it.
Search and Download Models
Click the magnifying glass icon (Search) on the left and enter llama 4.
From the search results, pick a model with good "Compatibility" (marked in green) using the filters on the left.
- The Q4_K_M quantization format is recommended for its good balance of file size and output quality.
- Click the download button, and progress will be shown at the bottom of the screen.
Start Chatting
Click the speech bubble icon (AI Chat) on the left, and select the Llama 4 model you just downloaded from the dropdown at the top center.
You're ready to chat. Maxing out GPU Offload in the right-hand settings panel will fully utilize your GPU for faster responses.
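If you later want to call the model from your own scripts, LM Studio also has a local server mode (the Developer / Local Server tab) that exposes an OpenAI-compatible API, on localhost:1234 by default. With the server started and a model loaded, this should list it:

```
curl http://localhost:1234/v1/models
```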
Troubleshooting
Q. It's very slow / crashes
Most likely a lack of VRAM.
- Check your GPU's dedicated memory usage in the Task Manager "Performance" tab.
- If it's overflowing, try a smaller model (e.g., step down from 12B to an 8B-class model) or a lower quantization level (e.g., Q4 → Q2; see the Ollama example after this list).
- In LM Studio, lowering the GPU Offload slider slightly to offload some layers to main system memory (RAM) might get it to work, though responses will be slower.
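For the Ollama route, lower-bit quantizations are published as separate tags; the naming varies by model, so browse the Tags list on the model's library page (the tag below is hypothetical):

```
# hypothetical tag - check the model's Tags page for the real name
ollama run llama4:scout-q2_K
```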
Q. The language output is strange
Llama 4 is an English-centric model, but its multilingual capability is strong. If you encounter odd phrasing, try pinning the response language in the system prompt.
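An instruction along these lines usually works (swap in your own language; the exact wording is not critical):

```
You are a helpful assistant. Always respond in natural, fluent English, and do not mix in other languages.
```

In Ollama's interactive mode you can apply it with `/set system "..."`; in LM Studio, paste it into the System Prompt field of the right-hand settings panel.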