Introduction: Which model should you run?
Llama 4 comes in several sizes, but only the first two in the table below are realistically manageable on a consumer-grade PC. First, decide on your target based on your hardware.
| Model | VRAM Required (4-bit) | Recommended GPU / Environment |
|---|---|---|
| Llama 4 12B (Scout) | 8GB - 10GB | RTX 3060 / 4060 Ti (12GB/16GB) recommended. *Can just run on 8GB VRAM, but with no headroom. |
| Llama 4 70B | 24GB - 48GB | 1x RTX 3090/4090 (24GB) with a 4-bit GGUF, or Mac Studio (M2/M3 Max, 64GB+) |
| Llama 4 405B | 250GB+ | Not runnable on consumer PCs (Requires 4-8x H100) |
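As a rough rule of thumb behind these numbers: 4-bit quantization stores each parameter in about half a byte, so a 12B model needs roughly 12 × 0.5 ≈ 6 GB for the weights alone, plus a few more GB for the KV cache and runtime overhead, which is how you arrive at the 8GB-10GB row above. Even lower-bit formats (Q2/Q3) shrink the weights further, which is what makes the 70B model viable on a single 24GB card.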
Method A: Fastest Setup with Ollama
While it uses a terminal (the "black screen"), this is actually the simplest method. A Web UI can be added later.
Install Ollama
Go to the official site (ollama.com), click "Download for Windows," and run the installer.
Once installed, confirm the 🦙 icon is present in your system tray (bottom right).
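You can also confirm the installation from a terminal; if a version number prints, the CLI is ready:

```
ollama --version
```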
Run Llama 4
Open PowerShell or Command Prompt and enter one of the following commands; the model will be downloaded and started automatically.
12B Model (Mainstream)
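The exact model tag depends on what is published in the Ollama library (ollama.com/library); the tag below is a plausible example for the Scout variant, so check the library page if it isn't found:

```
ollama run llama4:scout
```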
* If you have a high-spec PC capable of running 70B:
70B Model (High-end)
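Again, treat the tag as a placeholder and substitute the actual 70B tag from the Ollama library:

```
# placeholder tag - check ollama.com/library for the published one
ollama run llama4:70b
```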
A `>>>` prompt will appear once the model finishes loading. Try talking to it! (Type `/bye` to exit.)
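The `>>>` prompt is the quickest way to test, but Ollama also serves a local HTTP API on port 11434; this is what Web UIs connect to later. A quick check from Command Prompt (in PowerShell, use `curl.exe` instead of `curl`), adjusting the model tag to whatever you pulled:

```
curl http://localhost:11434/api/generate -d "{\"model\": \"llama4:scout\", \"prompt\": \"Hello\", \"stream\": false}"
```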
Method B: Visual Interaction with LM Studio
Recommended for those who prefer not to use command lines or want to fine-tune GPU settings visually.
Install LM Studio
Go to the official site (lmstudio.ai), click "Download LM Studio for Windows," and install it.
Search and Download Models
Click the magnifying glass icon (Search) on the left and enter llama 4.
From the search results, pick a model with good "Compatibility" (marked in green) using the filters on the left.
- The Q4_K_M quantization format is recommended for its good balance of file size and output quality.
- Click the download button, and progress will be shown at the bottom of the screen.
Start Chatting
Click the speech bubble icon (AI Chat) on the left, and select the Llama 4 model you just downloaded from the dropdown at the top center.
You're ready to chat. Maxing out GPU Offload in the right-hand settings panel will fully utilize your GPU for faster responses.
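If you later want to call the model from your own scripts, LM Studio also has a local server mode (the Developer / Local Server tab) that exposes an OpenAI-compatible API, on localhost:1234 by default. With the server started and a model loaded, this should list it:

```
curl http://localhost:1234/v1/models
```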
Troubleshooting
Q. It's very slow / crashes
Most likely a lack of VRAM.
- Check your GPU's dedicated memory usage in the Task Manager "Performance" tab.
- If it's overflowing, try a smaller model (e.g., step down from 12B to an 8B-class model) or a lower quantization level (e.g., Q4 → Q2; see the Ollama example after this list).
- In LM Studio, lowering the GPU Offload slider slightly to offload some layers to main system memory (RAM) might get it to work, though responses will be slower.
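For the Ollama route, lower-bit quantizations are published as separate tags; the naming varies by model, so browse the Tags list on the model's library page (the tag below is hypothetical):

```
# hypothetical tag - check the model's Tags page for the real name
ollama run llama4:scout-q2_K
```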
Q. The language output is strange
Llama 4 is an English-centric model, but its multilingual capability is strong. If you encounter odd phrasing, try pinning the response language in the system prompt.
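An instruction along these lines usually works (swap in your own language; the exact wording is not critical):

```
You are a helpful assistant. Always respond in natural, fluent English, and do not mix in other languages.
```

In Ollama's interactive mode you can apply it with `/set system "..."`; in LM Studio, paste it into the System Prompt field of the right-hand settings panel.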