Running LLMs Locally on Mac: A Hardware Guide from 2015 MacBook to M4 Mac Mini

A practical guide to running large language models locally on Mac hardware. From a 2015 MacBook Pro to the M4 Mac Mini, here's what you can actually run, how fast it'll be, and when it makes more sense to just use an API.

I’ve been curious about running LLMs locally with Ollama across different Mac hardware I have access to. The question isn’t just “can it run?” — it’s “can it run well enough to actually be useful?” Here’s what I’ve learned testing across machines spanning a decade of Apple hardware.

The 2015 MacBook Pro: Humbling Beginnings

Let’s start at the bottom. A 2015 MacBook Pro with a dual-core Intel i7 at 3.1 GHz and 16 GB of RAM. No Apple Silicon, no GPU acceleration — Ollama runs on CPU only here since Intel integrated graphics don’t help.

The constraints are real:

  • CPU-only inference (Intel Iris Graphics is useless for LLMs)
  • 16 GB RAM (need room for macOS + the model)
  • 2 cores / 4 threads (Broadwell era)

Here’s what actually works:

| Model | Size | Speed |
|-------|------|-------|
| Phi-3 Mini 3.8B (Q4) | ~2.3 GB | ~5-10 tok/s |
| Gemma 2 2B | ~1.6 GB | ~5-10 tok/s |
| Llama 3.2 3B | ~2 GB | ~5-10 tok/s |
| Mistral 7B (Q4) | ~4.1 GB | ~2-5 tok/s |
| Llama 3.1 8B (Q4) | ~4.7 GB | ~2-5 tok/s |

The best pick here is `ollama run phi3:mini` — Phi-3 Mini punches above its weight for a 3.8B model. You can step up to Mistral 7B, but expect noticeable lag. Anything 13B or above? Don’t bother.

What is Quantization and Why Does it Matter?

Before jumping to better hardware, you need to understand quantization. It’s the reason a 7-billion-parameter model can fit in 4 GB of RAM instead of 28 GB.

A neural network is just billions of numbers (called weights). How you store those numbers changes everything:

| Format | Bits per weight | Size of a 7B model |
|--------|-----------------|--------------------|
| FP32 (full precision) | 32 bits | ~28 GB |
| FP16 (half precision) | 16 bits | ~14 GB |
| Q8 (8-bit quantized) | 8 bits | ~7 GB |
| Q4 (4-bit quantized) | 4 bits | ~4 GB |

Think of it like JPEG compression. A raw photo might be 30 MB, but a decent JPEG is 3 MB and looks nearly identical. With quantization, you’re throwing away precision that the model’s output won’t meaningfully miss.
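The size column above is just parameter count times bits per weight (real GGUF files run slightly larger because some layers and metadata stay at higher precision). A quick sketch of that arithmetic:

```python
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage: params * bits per weight, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Reproduce the 7B column from the table above
for label, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: ~{model_size_gb(7, bits):.0f} GB for a 7B model")
```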

FP16 means Floating Point 16-bit — storing each weight as a 16-bit decimal number. Those 16 bits break down like scientific notation:

[1 bit sign] [5 bits exponent] [10 bits mantissa]
     +/-        how big          how precise
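Python's `struct` module supports half precision directly (format code `'e'`), so you can pull those three fields out of a real number:

```python
import struct

# Encode 3.14 as IEEE 754 half precision and extract the bit fields.
bits = int.from_bytes(struct.pack('>e', 3.14), 'big')

sign     = bits >> 15           # 1 bit:  0 means positive
exponent = (bits >> 10) & 0x1F  # 5 bits: biased by 15
mantissa = bits & 0x3FF         # 10 bits: fractional part

# Reassemble: (-1)^sign * 1.mantissa * 2^(exponent - 15)
value = (-1) ** sign * (1 + mantissa / 1024) * 2 ** (exponent - 15)
print(value)  # close to 3.14, but not exact -- precision was thrown away
```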

When you see Ollama model names with suffixes like `q4_0` or `q4_K_M`:

  • `q4_0` — simplest 4-bit quantization, fastest
  • `q4_K_M` — smarter 4-bit that keeps important weights at higher precision (this is the default)
  • `q4_K_S` — slightly more aggressive compression

For limited hardware, q4_0 squeezes out extra speed at a small quality tradeoff.

M2 Mac Mini (16 GB): The Entry Point

The base M2 with 16 GB is where things start getting interesting. Apple Silicon brings Metal GPU acceleration and unified memory — the GPU can access all your RAM directly.

Keep models under ~10-12 GB to leave room for macOS:

| Model | Size | Speed |
|-------|------|-------|
| Llama 3.1 8B | ~4.7 GB | ~40-50 tok/s |
| Gemma 2 9B | ~5.4 GB | ~35-40 tok/s |
| Mistral 7B | ~4.1 GB | ~45-50 tok/s |
| Qwen 2.5 14B (Q4) | ~9 GB | ~20-25 tok/s |
| DeepSeek Coder V2 16B | ~9 GB | ~20-25 tok/s |

The jump from Intel to M2 is night and day. Models that crawled at 2-5 tokens/sec on the 2015 MacBook Pro now fly at 40-50 tokens/sec. This is the 7B-14B sweet spot — genuinely useful models at genuinely usable speeds.

Avoid 27B+ models here. They’ll technically load but will swap to disk and crawl.
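That advice reduces to a rough fits-in-memory check. The 4 GB headroom figure is my assumption for macOS plus a few background apps, not an Ollama rule:

```python
def fits_comfortably(model_gb: float, ram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the model leaves enough unified memory for the OS to avoid swapping."""
    return model_gb <= ram_gb - headroom_gb

print(fits_comfortably(4.7, 16))   # Llama 3.1 8B on 16 GB -> True
print(fits_comfortably(16.0, 16))  # Gemma 2 27B on 16 GB  -> False
```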

M2 Pro Mac Mini (32 GB): The Sweet Spot

The M2 Pro upgrades what matters most for LLM inference: memory bandwidth. It doubles from 100 GB/s to 200 GB/s. LLM inference is memory-bandwidth-bound (it’s constantly reading weights from RAM), so this translates to roughly 2x faster token generation.

|  | M2 | M2 Pro |
|---|----|--------|
| CPU cores | 8 | 12 |
| GPU cores | 10 | 19 |
| Memory bandwidth | 100 GB/s | 200 GB/s |
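That bandwidth-bound behavior gives a useful back-of-envelope bound: generating one token streams (roughly) every weight from memory once, so bandwidth divided by model size caps tokens per second. Real throughput lands below this ceiling (overheads, KV-cache reads), but the ratio between machines holds:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    # Decoding one token reads ~all weights once, so this ratio is an
    # upper bound on tok/s for a memory-bandwidth-bound workload.
    return bandwidth_gb_s / model_gb

model = 4.7  # Llama 3.1 8B at Q4, in GB
m2, m2_pro = decode_ceiling_tok_s(100, model), decode_ceiling_tok_s(200, model)
print(f"M2 Pro ceiling / M2 ceiling = {m2_pro / m2:.1f}x")  # doubling bandwidth doubles the ceiling
```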

With 32 GB of RAM, you unlock the 27B-32B class of models:

| Model | Size | Speed |
|-------|------|-------|
| Llama 3.1 8B | ~4.7 GB | ~80-100 tok/s |
| Codestral 22B | ~13 GB | ~25-35 tok/s |
| Gemma 2 27B | ~16 GB | ~25-30 tok/s |
| Qwen 2.5 32B (Q4) | ~20 GB | ~18-22 tok/s |

The best picks: Qwen 2.5 32B for general use, Codestral 22B for coding, or Gemma 2 27B for the best quality/speed balance.

M4 Mac Mini: The Current Generation

The M4 lineup comes in three tiers, and the differences matter a lot for local LLMs:

|  | M4 | M4 Pro | M4 Max |
|---|----|--------|--------|
| CPU | 10 cores | 14 cores | 16 cores |
| GPU | 10 cores | 20 cores | 40 cores |
| Bandwidth | 120 GB/s | 273 GB/s | 546 GB/s |
| Max RAM | 32 GB | 64 GB | 128 GB |

M4 base (16-32 GB) — ~20% faster than M2. Good for 7B-14B models.

M4 Pro (24-64 GB) — The local LLM sweet spot. With 48-64 GB, you can comfortably run 70B parameter models:

| Model | Size | Speed (M4 Pro) |
|-------|------|----------------|
| Llama 3.1 8B | ~4.7 GB | ~100+ tok/s |
| Qwen 2.5 32B | ~20 GB | ~30-40 tok/s |
| Llama 3.3 70B (Q4) | ~40 GB | ~15-20 tok/s |
| DeepSeek R1 70B (Q4) | ~43 GB | ~12-18 tok/s |

M4 Max (64-128 GB) — The beast. 546 GB/s bandwidth means 70B models run at 30-40 tok/s, and you can even squeeze in Deepseek R1 671B at Q2 quantization (barely).

When to Just Use an API Instead

Running local is fun, but sometimes the math doesn’t work out. Here’s what the API landscape looks like:

Budget Tier (< $1 / million input tokens)

| Model | Provider | Input / Output per 1M tokens |
|-------|----------|------------------------------|
| Gemini 2.5 Flash-Lite | Google | $0.10 / $0.40 |
| GPT-4o Mini | OpenAI | $0.15 / $0.60 |
| DeepSeek V3.2 | DeepSeek | $0.28 / $0.42 |
| Mistral Small | Mistral | $0.02 / $0.30 |

Mid Tier ($1-$5 / million input tokens)

| Model | Provider | Input / Output per 1M tokens |
|-------|----------|------------------------------|
| GPT-5 | OpenAI | $1.25 / $10.00 |
| Gemini 2.5 Pro | Google | $1.25 / $10.00 |
| Claude Sonnet 4.5 | Anthropic | $3.00 / $15.00 |

Premium Tier

| Model | Provider | Input / Output per 1M tokens |
|-------|----------|------------------------------|
| Claude Opus 4 | Anthropic | $15.00 / $75.00 |
| o1-preview | OpenAI | $15.00 / $60.00 |

For context, a typical chat conversation uses 1-5K tokens. At budget API prices, 100 chats per day costs under $1/month. At mid-tier prices, maybe $15/month.
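Those monthly figures come from straightforward arithmetic. A sketch, assuming ~800 input and ~400 output tokens per chat (the split is my assumption; prices are the per-million rates from the tables above):

```python
def monthly_cost(chats_per_day, in_tokens, out_tokens, price_in, price_out):
    """price_in / price_out are dollars per million tokens."""
    per_chat = in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out
    return chats_per_day * 30 * per_chat

# 100 chats/day at budget vs. mid-tier pricing
print(f"Gemini 2.5 Flash-Lite: ${monthly_cost(100, 800, 400, 0.10, 0.40):.2f}/mo")
print(f"GPT-5:                 ${monthly_cost(100, 800, 400, 1.25, 10.00):.2f}/mo")
```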

Local makes sense when:

  • Privacy matters (data never leaves your machine)
  • You’re doing high-volume batch processing
  • You want to experiment and tinker
  • You hate subscriptions

API makes sense when:

  • You need frontier-model quality (GPT-5, Claude Opus 4, Gemini 2.5 Pro)
  • Your usage is light to moderate
  • You don’t want to babysit hardware
  • You need the latest models immediately

Key Takeaways

  • Memory bandwidth is king for local LLM inference, not CPU speed or core count. That’s why the M2 Pro (200 GB/s) feels 2x faster than the M2 (100 GB/s) on the same model.
  • Quantization is your friend. Q4 models are ~7x smaller than FP32 with surprisingly small quality loss. Always use quantized models locally.
  • The M4 Pro 48 GB is the best bang-for-buck local LLM machine right now. It unlocks 70B models, which is a massive quality jump over 7B-14B.
  • Don’t sleep on DeepSeek’s API pricing. R1 reasoning at $0.55/million input tokens is absurdly cheap compared to running a 70B model on dedicated hardware.
  • Match the model to the machine. A fast 8B model you’ll actually enjoy using beats a 70B model that takes 10 seconds per response.

This post is licensed under CC BY 4.0 by the author.