Running LLMs Locally on Mac: A Hardware Guide from 2015 MacBook to M4 Mac Mini
A practical guide to running large language models locally on Mac hardware. From a 2015 MacBook Pro to the M4 Mac Mini, here's what you can actually run, how fast it'll be, and when it makes more sense to just use an API.
I’ve been curious about running LLMs locally with Ollama across different Mac hardware I have access to. The question isn’t just “can it run?” — it’s “can it run well enough to actually be useful?” Here’s what I’ve learned testing across machines spanning a decade of Apple hardware.
The 2015 MacBook Pro: Humbling Beginnings
Let’s start at the bottom. A 2015 MacBook Pro with a dual-core Intel i7 at 3.1 GHz and 16 GB of RAM. No Apple Silicon, no GPU acceleration — Ollama runs on CPU only here since Intel integrated graphics don’t help.
The constraints are real:
- CPU-only inference (Intel Iris Graphics is useless for LLMs)
- 16 GB RAM (need room for macOS + the model)
- 2 cores / 4 threads (Broadwell era)
Here’s what actually works:
| Model | Size | Speed |
|---|---|---|
| Phi-3 Mini 3.8B (Q4) | ~2.3 GB | ~5-10 tok/s |
| Gemma 2 2B | ~1.6 GB | ~5-10 tok/s |
| Llama 3.2 3B | ~2 GB | ~5-10 tok/s |
| Mistral 7B (Q4) | ~4.1 GB | ~2-5 tok/s |
| Llama 3.1 8B (Q4) | ~4.7 GB | ~2-5 tok/s |
The best pick here is `ollama run phi3:mini` — it punches above its weight for a 3.8B model. You can bump up to Mistral 7B, but expect noticeable lag. Anything 13B or above? Don’t bother.
What is Quantization and Why Does it Matter?
Before jumping to better hardware, you need to understand quantization. It’s the reason a 7-billion-parameter model can fit in 4 GB of RAM instead of 28 GB.
A neural network is just billions of numbers (called weights). How you store those numbers changes everything:
| Format | Bits per weight | Size of a 7B model |
|---|---|---|
| FP32 (full precision) | 32 bits | ~28 GB |
| FP16 (half precision) | 16 bits | ~14 GB |
| Q8 (8-bit quantized) | 8 bits | ~7 GB |
| Q4 (4-bit quantized) | 4 bits | ~4 GB |
Think of it like JPEG compression. A raw photo might be 30 MB, but a decent JPEG is 3 MB and looks nearly identical. With quantization, you’re throwing away precision that the model’s output won’t meaningfully miss.
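The sizes in the table fall straight out of arithmetic: parameters times bits per weight. A minimal sketch in Python (real GGUF files come out slightly larger than the raw math, because embeddings and per-block quantization scales are kept at higher precision):

```python
def quantized_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Rough size of a model's weights in GB: params x bits, converted to bytes.

    Real quantized files run a bit larger, since embeddings and
    quantization scales are stored at higher precision.
    """
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at the precisions from the table above:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{quantized_size_gb(7e9, bits):.1f} GB")
```

This reproduces the table's ladder: 28 GB at FP32 down to 3.5 GB at Q4, with the listed ~4 GB reflecting that real-world overhead.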
FP16 means Floating Point 16-bit — storing each weight as a 16-bit decimal number. Those 16 bits break down like scientific notation:
```
[ 1 bit sign ] [ 5 bits exponent ] [ 10 bits mantissa ]
     +/-            how big             how precise
```
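You can poke at this layout directly: Python's `struct` module supports the IEEE half-precision format (format code `'e'`). A quick sketch that splits a number's FP16 encoding into its three fields and shows the precision loss:

```python
import struct

def fp16_fields(x: float):
    """Return the (sign, exponent, mantissa) bit fields of x's FP16 encoding."""
    (bits,) = struct.unpack('<H', struct.pack('<e', x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

def to_fp16(x: float) -> float:
    """Round a 64-bit Python float down to FP16 and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(fp16_fields(3.14159265))  # sign bit, biased exponent, 10-bit mantissa
print(to_fp16(3.14159265))      # only ~3 decimal digits of pi survive
```

With only 10 mantissa bits, pi comes back as 3.140625 — close enough for a model weight, which is exactly the bet quantization makes.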
When you see Ollama model names with suffixes like q4_0 or q4_K_M:
- q4_0 — simplest 4-bit quantization, fastest
- q4_K_M — smarter 4-bit that keeps important weights at higher precision (this is the default)
- q4_K_S — slightly more aggressive compression
For limited hardware, q4_0 squeezes out extra speed at a small quality tradeoff.
M2 Mac Mini (16 GB): The Entry Point
The base M2 with 16 GB is where things start getting interesting. Apple Silicon brings Metal GPU acceleration and unified memory — the GPU can access all your RAM directly.
Keep models under ~10-12 GB to leave room for macOS:
| Model | Size | Speed |
|---|---|---|
| Llama 3.1 8B | ~4.7 GB | ~40-50 tok/s |
| Gemma 2 9B | ~5.4 GB | ~35-40 tok/s |
| Mistral 7B | ~4.1 GB | ~45-50 tok/s |
| Qwen 2.5 14B (Q4) | ~9 GB | ~20-25 tok/s |
| DeepSeek Coder V2 16B | ~9 GB | ~20-25 tok/s |
The jump from Intel to M2 is night and day. Models that crawled at 2-5 tokens/sec on the 2015 MacBook Pro now fly at 40-50 tokens/sec. This is the 7B-14B sweet spot — genuinely useful models at genuinely usable speeds.
Avoid 27B+ models here. They’ll technically load but will swap to disk and crawl.
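A sanity check you can run before pulling a model: will the weights fit next to macOS and the inference overhead? A rough sketch (the 4 GB headroom figure is my assumption; tune it for your own setup and context length):

```python
def fits_in_ram(model_gb: float, total_ram_gb: float,
                headroom_gb: float = 4.0) -> bool:
    """True if the model's weights fit alongside the OS and KV cache.

    headroom_gb is an assumed buffer for macOS plus inference context;
    adjust it to taste.
    """
    return model_gb <= total_ram_gb - headroom_gb

# On a 16 GB M2: Qwen 2.5 14B at Q4 (~9 GB) fits, Gemma 2 27B (~16 GB) does not
print(fits_in_ram(9, 16))    # True
print(fits_in_ram(16, 16))   # False
```

Anything that fails this check will still load, but it will page to disk and crawl.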
M2 Pro Mac Mini (32 GB): The Sweet Spot
The M2 Pro upgrades what matters most for LLM inference: memory bandwidth. It doubles from 100 GB/s to 200 GB/s. LLM inference is memory-bandwidth-bound (it’s constantly reading weights from RAM), so this translates to roughly 2x faster token generation.
| | M2 | M2 Pro |
|---|---|---|
| CPU cores | 8 | 12 |
| GPU cores | 10 | 19 |
| Memory bandwidth | 100 GB/s | 200 GB/s |
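A back-of-envelope way to see why bandwidth dominates: generating one token means streaming every weight through the memory bus once, so tokens/sec is capped at roughly bandwidth divided by model size. Treat this as a scaling law rather than a benchmark predictor; absolute numbers depend on the quantization kernels and what else is resident. A sketch:

```python
def tok_s_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed: each token reads all weights once."""
    return bandwidth_gb_s / model_gb

model = 4.7  # Llama 3.1 8B at Q4
m2 = tok_s_ceiling(100, model)
m2_pro = tok_s_ceiling(200, model)
print(f"M2 ceiling: {m2:.0f} tok/s, M2 Pro ceiling: {m2_pro:.0f} tok/s")
print(f"Speedup from bandwidth alone: {m2_pro / m2:.1f}x")
```

The 2x speedup falls straight out of the bandwidth ratio, with no reference to CPU or GPU core counts.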
With 32 GB of RAM, you unlock the 27B-32B class of models:
| Model | Size | Speed |
|---|---|---|
| Llama 3.1 8B | ~4.7 GB | ~80-100 tok/s |
| Codestral 22B | ~13 GB | ~25-35 tok/s |
| Gemma 2 27B | ~16 GB | ~25-30 tok/s |
| Qwen 2.5 32B (Q4) | ~20 GB | ~18-22 tok/s |
The best picks: Qwen 2.5 32B for general use, Codestral 22B for coding, or Gemma 2 27B for the best quality/speed balance.
M4 Mac Mini: The Current Generation
The M4 lineup comes in three tiers, and the differences matter a lot for local LLMs:
| | M4 | M4 Pro | M4 Max |
|---|---|---|---|
| CPU | 10 cores | 14 cores | 16 cores |
| GPU | 10 cores | 20 cores | 40 cores |
| Bandwidth | 120 GB/s | 273 GB/s | 546 GB/s |
| Max RAM | 32 GB | 64 GB | 128 GB |
M4 base (16-32 GB) — ~20% faster than M2. Good for 7B-14B models.
M4 Pro (24-64 GB) — The local LLM sweet spot. With 48-64 GB, you can comfortably run 70B parameter models:
| Model | Size | Speed (M4 Pro) |
|---|---|---|
| Llama 3.1 8B | ~4.7 GB | ~100+ tok/s |
| Qwen 2.5 32B | ~20 GB | ~30-40 tok/s |
| Llama 3.3 70B (Q4) | ~40 GB | ~15-20 tok/s |
| DeepSeek R1 70B (Q4) | ~43 GB | ~12-18 tok/s |
M4 Max (64-128 GB) — The beast. 546 GB/s of bandwidth means 70B models run at 30-40 tok/s, and you can even squeeze in DeepSeek R1 671B at Q2 quantization (barely).
When to Just Use an API Instead
Running local is fun, but sometimes the math doesn’t work out. Here’s what the API landscape looks like:
Budget Tier (< $1 / million input tokens)
| Model | Provider | Input/Output per 1M tokens |
|---|---|---|
| Gemini 2.5 Flash-Lite | Google | $0.10 / $0.40 |
| GPT-4o Mini | OpenAI | $0.15 / $0.60 |
| DeepSeek V3.2 | DeepSeek | $0.28 / $0.42 |
| Mistral Small | Mistral | $0.02 / $0.30 |
Mid Tier ($1-$5 / million input tokens)
| Model | Provider | Input/Output per 1M tokens |
|---|---|---|
| GPT-5 | OpenAI | $1.25 / $10.00 |
| Gemini 2.5 Pro | Google | $1.25 / $10.00 |
| Claude Sonnet 4.5 | Anthropic | $3.00 / $15.00 |
Premium Tier
| Model | Provider | Input/Output per 1M tokens |
|---|---|---|
| Claude Opus 4 | Anthropic | $15.00 / $75.00 |
| o1-preview | OpenAI | $15.00 / $60.00 |
For context, a typical chat conversation uses 1-5K tokens. At budget API prices, 100 chats per day costs under $1/month. At mid-tier prices, maybe $15/month.
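The arithmetic behind those monthly estimates, as a sketch you can adapt (the chat size and input/output split are my assumptions; prices are dollars per million tokens, taken from the budget table above):

```python
def monthly_api_cost(chats_per_day: int, in_tokens: int, out_tokens: int,
                     in_price: float, out_price: float, days: int = 30) -> float:
    """Dollars per month; in_price/out_price are $ per 1M tokens."""
    total_in = chats_per_day * in_tokens * days
    total_out = chats_per_day * out_tokens * days
    return (total_in * in_price + total_out * out_price) / 1e6

# 100 chats/day at ~1K tokens each (assumed 750 in / 250 out),
# priced like Gemini 2.5 Flash-Lite ($0.10 / $0.40 per 1M)
print(f"${monthly_api_cost(100, 750, 250, 0.10, 0.40):.2f}/month")
```

Swap in the mid-tier prices and the same usage lands in the low double digits per month, which is the break-even question to ask before buying hardware.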
Local makes sense when:
- Privacy matters (data never leaves your machine)
- You’re doing high-volume batch processing
- You want to experiment and tinker
- You hate subscriptions
API makes sense when:
- You need frontier-model quality (GPT-5, Claude Opus 4, Gemini 2.5 Pro)
- Your usage is light to moderate
- You don’t want to babysit hardware
- You need the latest models immediately
Key Takeaways
- Memory bandwidth is king for local LLM inference, not CPU speed or core count. That’s why the M2 Pro (200 GB/s) feels 2x faster than the M2 (100 GB/s) on the same model.
- Quantization is your friend. Q4 models are ~7x smaller than FP32 with surprisingly small quality loss. Always use quantized models locally.
- The M4 Pro 48 GB is the best bang-for-buck local LLM machine right now. It unlocks 70B models, which is a massive quality jump over 7B-14B.
- Don’t sleep on DeepSeek’s API pricing. R1 reasoning at $0.55/million input tokens is absurdly cheap compared to running a 70B model on dedicated hardware.
- Match the model to the machine. A fast 8B model you’ll actually enjoy using beats a 70B model that takes 10 seconds per response.
Resources
- Ollama — The easiest way to run LLMs locally on Mac
- PricePerToken.com — Live API pricing comparison across 300+ models
- Helicone LLM Cost Calculator — Another great pricing comparison tool