Running LLMs Locally on Mac: A Hardware Guide from 2015 MacBook to M4 Mac Mini

A practical guide to running large language models locally on Mac hardware. From a 2015 MacBook Pro to the M4 Mac Mini, here's what you can actually run, how fast it'll be, and when it makes more sense to just use an API.

I’ve been curious about running LLMs locally with Ollama across different Mac hardware I have access to. The question isn’t just “can it run?” — it’s “can it run well enough to actually be useful?” Here’s what I’ve learned testing across machines spanning a decade of Apple hardware.

The 2015 MacBook Pro: Humbling Beginnings

Let’s start at the bottom. A 2015 MacBook Pro with a dual-core Intel i7 at 3.1 GHz and 16 GB of RAM. No Apple Silicon, no GPU acceleration — Ollama runs on CPU only here since Intel integrated graphics don’t help.

The constraints are real:

  • CPU-only inference (Intel Iris Graphics is useless for LLMs)
  • 16 GB RAM (need room for macOS + the model)
  • 2 cores / 4 threads (Broadwell era)

Here’s what actually works:

| Model | Size | Speed |
|-------|------|-------|
| Phi-3 Mini 3.8B (Q4) | ~2.3 GB | ~5-10 tok/s |
| Gemma 2 2B | ~1.6 GB | ~5-10 tok/s |
| Llama 3.2 3B | ~2 GB | ~5-10 tok/s |
| Mistral 7B (Q4) | ~4.1 GB | ~2-5 tok/s |
| Llama 3.1 8B (Q4) | ~4.7 GB | ~2-5 tok/s |

The best pick here is `ollama run phi3:mini` — Phi-3 Mini punches above its weight for a 3.8B model. You can step up to Mistral 7B, but expect noticeable lag. Anything 13B or above? Don’t bother.

What is Quantization and Why Does it Matter?

Before jumping to better hardware, you need to understand quantization. It’s the reason a 7-billion-parameter model can fit in 4 GB of RAM instead of 28 GB.

A neural network is just billions of numbers (called weights). How you store those numbers changes everything:

| Format | Bits per weight | Size of a 7B model |
|--------|-----------------|--------------------|
| FP32 (full precision) | 32 bits | ~28 GB |
| FP16 (half precision) | 16 bits | ~14 GB |
| Q8 (8-bit quantized) | 8 bits | ~7 GB |
| Q4 (4-bit quantized) | 4 bits | ~4 GB |

Think of it like JPEG compression. A raw photo might be 30 MB, but a decent JPEG is 3 MB and looks nearly identical. With quantization, you’re throwing away precision that the model’s output won’t meaningfully miss.
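The size column above is just parameter count times bits per weight (real GGUF files run slightly larger because some layers and metadata stay at higher precision). A quick sketch of that arithmetic:

```python
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage: params * bits per weight, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Reproduce the 7B column from the table above
for label, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: ~{model_size_gb(7, bits):.0f} GB for a 7B model")
```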

FP16 means Floating Point 16-bit — storing each weight as a 16-bit decimal number. Those 16 bits break down like scientific notation:

[1 bit sign] [5 bits exponent] [10 bits mantissa]
     +/-        how big          how precise
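Python's `struct` module supports half precision directly (format code `'e'`), so you can pull those three fields out of a real number:

```python
import struct

# Encode 3.14 as IEEE 754 half precision and extract the bit fields.
bits = int.from_bytes(struct.pack('>e', 3.14), 'big')

sign     = bits >> 15           # 1 bit:  0 means positive
exponent = (bits >> 10) & 0x1F  # 5 bits: biased by 15
mantissa = bits & 0x3FF         # 10 bits: fractional part

# Reassemble: (-1)^sign * 1.mantissa * 2^(exponent - 15)
value = (-1) ** sign * (1 + mantissa / 1024) * 2 ** (exponent - 15)
print(value)  # close to 3.14, but not exact -- precision was thrown away
```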

When you see Ollama model names with suffixes like `q4_0` or `q4_K_M`:

  • `q4_0` — simplest 4-bit quantization, fastest
  • `q4_K_M` — smarter 4-bit that keeps important weights at higher precision (this is the default)
  • `q4_K_S` — slightly more aggressive compression

For limited hardware, q4_0 squeezes out extra speed at a small quality tradeoff.

M2 Mac Mini (16 GB): The Entry Point

The base M2 with 16 GB is where things start getting interesting. Apple Silicon brings Metal GPU acceleration and unified memory — the GPU can access all your RAM directly.

Keep models under ~10-12 GB to leave room for macOS:

| Model | Size | Speed |
|-------|------|-------|
| Llama 3.1 8B | ~4.7 GB | ~40-50 tok/s |
| Gemma 2 9B | ~5.4 GB | ~35-40 tok/s |
| Mistral 7B | ~4.1 GB | ~45-50 tok/s |
| Qwen 2.5 14B (Q4) | ~9 GB | ~20-25 tok/s |
| DeepSeek Coder V2 16B | ~9 GB | ~20-25 tok/s |

The jump from Intel to M2 is night and day. Models that crawled at 2-5 tokens/sec on the 2015 MacBook Pro now fly at 40-50 tokens/sec. This is the 7B-14B sweet spot — genuinely useful models at genuinely usable speeds.

Avoid 27B+ models here. They’ll technically load but will swap to disk and crawl.
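That advice reduces to a rough fits-in-memory check. The 4 GB headroom figure is my assumption for macOS plus a few background apps, not an Ollama rule:

```python
def fits_comfortably(model_gb: float, ram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the model leaves enough unified memory for the OS to avoid swapping."""
    return model_gb <= ram_gb - headroom_gb

print(fits_comfortably(4.7, 16))   # Llama 3.1 8B on 16 GB -> True
print(fits_comfortably(16.0, 16))  # Gemma 2 27B on 16 GB  -> False
```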

M2 Pro Mac Mini (32 GB): The Sweet Spot

The M2 Pro upgrades what matters most for LLM inference: memory bandwidth. It doubles from 100 GB/s to 200 GB/s. LLM inference is memory-bandwidth-bound (it’s constantly reading weights from RAM), so this translates to roughly 2x faster token generation.

|  | M2 | M2 Pro |
|---|----|--------|
| CPU cores | 8 | 12 |
| GPU cores | 10 | 19 |
| Memory bandwidth | 100 GB/s | 200 GB/s |
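That bandwidth-bound behavior gives a useful back-of-envelope bound: generating one token streams (roughly) every weight from memory once, so bandwidth divided by model size caps tokens per second. Real throughput lands below this ceiling (overheads, KV-cache reads), but the ratio between machines holds:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    # Decoding one token reads ~all weights once, so this ratio is an
    # upper bound on tok/s for a memory-bandwidth-bound workload.
    return bandwidth_gb_s / model_gb

model = 4.7  # Llama 3.1 8B at Q4, in GB
m2, m2_pro = decode_ceiling_tok_s(100, model), decode_ceiling_tok_s(200, model)
print(f"M2 Pro ceiling / M2 ceiling = {m2_pro / m2:.1f}x")  # doubling bandwidth doubles the ceiling
```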

With 32 GB of RAM, you unlock the 27B-32B class of models:

| Model | Size | Speed |
|-------|------|-------|
| Llama 3.1 8B | ~4.7 GB | ~80-100 tok/s |
| Codestral 22B | ~13 GB | ~25-35 tok/s |
| Gemma 2 27B | ~16 GB | ~25-30 tok/s |
| Qwen 2.5 32B (Q4) | ~20 GB | ~18-22 tok/s |

The best picks: Qwen 2.5 32B for general use, Codestral 22B for coding, or Gemma 2 27B for the best quality/speed balance.

M4 Mac Mini: The Current Generation

The M4 lineup comes in three tiers, and the differences matter a lot for local LLMs:

|  | M4 | M4 Pro | M4 Max |
|---|----|--------|--------|
| CPU | 10 cores | 14 cores | 16 cores |
| GPU | 10 cores | 20 cores | 40 cores |
| Bandwidth | 120 GB/s | 273 GB/s | 546 GB/s |
| Max RAM | 32 GB | 64 GB | 128 GB |

M4 base (16-32 GB) — ~20% faster than M2. Good for 7B-14B models.

M4 Pro (24-64 GB) — The local LLM sweet spot. With 48-64 GB, you can comfortably run 70B parameter models:

| Model | Size | Speed (M4 Pro) |
|-------|------|----------------|
| Llama 3.1 8B | ~4.7 GB | ~100+ tok/s |
| Qwen 2.5 32B | ~20 GB | ~30-40 tok/s |
| Llama 3.3 70B (Q4) | ~40 GB | ~15-20 tok/s |
| DeepSeek R1 70B (Q4) | ~43 GB | ~12-18 tok/s |

M4 Max (64-128 GB) — The beast. 546 GB/s bandwidth means 70B models run at 30-40 tok/s, and you can even squeeze in Deepseek R1 671B at Q2 quantization (barely).

When to Just Use an API Instead

Running local is fun, but sometimes the math doesn’t work out. Here’s what the API landscape looks like:

Budget Tier (< $1 / million input tokens)

| Model | Provider | Input / Output per 1M tokens |
|-------|----------|------------------------------|
| Gemini 2.5 Flash-Lite | Google | $0.10 / $0.40 |
| GPT-4o Mini | OpenAI | $0.15 / $0.60 |
| DeepSeek V3.2 | DeepSeek | $0.28 / $0.42 |
| Mistral Small | Mistral | $0.02 / $0.30 |

Mid Tier ($1-$5 / million input tokens)

| Model | Provider | Input / Output per 1M tokens |
|-------|----------|------------------------------|
| GPT-5 | OpenAI | $1.25 / $10.00 |
| Gemini 2.5 Pro | Google | $1.25 / $10.00 |
| Claude Sonnet 4.5 | Anthropic | $3.00 / $15.00 |

Premium Tier

| Model | Provider | Input / Output per 1M tokens |
|-------|----------|------------------------------|
| Claude Opus 4 | Anthropic | $15.00 / $75.00 |
| o1-preview | OpenAI | $15.00 / $60.00 |

For context, a typical chat conversation uses 1-5K tokens. At budget API prices, 100 chats per day costs under $1/month. At mid-tier prices, maybe $15/month.
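Those monthly figures come from straightforward arithmetic. A sketch, assuming ~800 input and ~400 output tokens per chat (the split is my assumption; prices are the per-million rates from the tables above):

```python
def monthly_cost(chats_per_day, in_tokens, out_tokens, price_in, price_out):
    """price_in / price_out are dollars per million tokens."""
    per_chat = in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out
    return chats_per_day * 30 * per_chat

# 100 chats/day at budget vs. mid-tier pricing
print(f"Gemini 2.5 Flash-Lite: ${monthly_cost(100, 800, 400, 0.10, 0.40):.2f}/mo")
print(f"GPT-5:                 ${monthly_cost(100, 800, 400, 1.25, 10.00):.2f}/mo")
```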

Local makes sense when:

  • Privacy matters (data never leaves your machine)
  • You’re doing high-volume batch processing
  • You want to experiment and tinker
  • You hate subscriptions

API makes sense when:

  • You need frontier-model quality (GPT-5, Claude Opus 4, Gemini 2.5 Pro)
  • Your usage is light to moderate
  • You don’t want to babysit hardware
  • You need the latest models immediately

Key Takeaways

  • Memory bandwidth is king for local LLM inference, not CPU speed or core count. That’s why the M2 Pro (200 GB/s) feels 2x faster than the M2 (100 GB/s) on the same model.
  • Quantization is your friend. Q4 models are ~7x smaller than FP32 with surprisingly small quality loss. Always use quantized models locally.
  • The M4 Pro 48 GB is the best bang-for-buck local LLM machine right now. It unlocks 70B models, which is a massive quality jump over 7B-14B.
  • Don’t sleep on DeepSeek’s API pricing. R1 reasoning at $0.55/million input tokens is absurdly cheap compared to running a 70B model on dedicated hardware.
  • Match the model to the machine. A fast 8B model you’ll actually enjoy using beats a 70B model that takes 10 seconds per response.

This post is licensed under CC BY 4.0 by the author.