Apple CEO Succession: A Hardware Engineer Takes Over. Three Months Running a 40GB AI Model on an M1 Max Tells Me Why That's the Right Call.

Published Apr 23, 2026 ยท Notes from one solo enthusiast, one Mac ยท โ† All posts

Apple CEO Succession: A Hardware Engineer Takes Over. Three Months Running a 40GB AI Model on an M1 Max Tells Me Why That's the Right Call.

The Succession Finally Happened โ€” And It's the Right Shape

Three days ago (April 20, 2026), Apple announced that John Ternus, SVP of Hardware Engineering, will succeed Tim Cook as CEO effective September 1, 2026. Cook moves to Executive Chairman. Ternus becomes Apple's 8th CEO.

The first market reaction on X/finance Twitter split cleanly. "Bold pick, a real builder" on one side. "Hardware engineer in a services-dominant era, multiple risk" on the other.

I'm drafting this on an M1 Max MacBook Pro. Qwen 3.6 35B-A3B MoE Q8 โ€” about 40GB of weights โ€” is pinned in Metal memory right now. The fan is quiet. From where I sit, a Hardware Engineering CEO isn't a compromise pick. It's the only pick that makes sense if you take Apple Silicon seriously as a strategic asset.

This post is a three-month dev diary on that setup. Not a product review. Not a "10x your AI productivity" take. Just what I've learned about the hardware underneath โ€” and why I think the Board got this one right.

The Setup: 64GB of Unified Memory, One Model, Zero Cloud

Hardware is an M1 Max MacBook Pro with the full 64GB unified memory. Yes, it's a $3k-class setup. That's the first honest thing to say.

The model is Qwen 3.6 35B-A3B MoE, Q8 quantization. Weights are ~40GB in Metal memory via mx.metal.set_wired_limit(45GB). That pin is load-bearing โ€” without it the macOS memory compressor will happily try to page out the model while you're mid-inference.

Hard ceiling at set_memory_limit(48GB). Scratch buffers capped at set_cache_limit(512MB). Buffer left for OS + apps: ~14-16GB, tight but stable. Everything runs offline. No cloud fallback. No API key. Just the laptop.

For that ~14-16GB buffer to actually hold: no Docker, no 30-tab Chrome session. I used to keep Chrome open with dozens of tabs; the memory pressure during long inference was noticeable enough that I stopped. My background load during heavy generation is Xcode (SwiftUI work) + terminal + editor. That's it.

The Q8 Tax: Trading Speed for Sanity

I moved from Q4 to Q8 on April 17. The motivation was pure quality. Q4 output was noticeably more muddled on longer reasoning tasks, especially anything requiring numerical precision or sustained argument.

Q8 runs in the 35-50 tok/s range depending on context length. Q4 was faster โ€” probably 10-15% more tok/s โ€” but the output just wasn't as good. When you're generating content you'll actually publish, that tradeoff isn't close.

The honest take: if your use case is chat-style short responses, Q4 might be fine. For long-form drafting, research synthesis, or anything that has to be correct-ish without a human checking every sentence, Q8 earns its extra memory.

The fp16 Moment: 21.18 to 26.22 tok/s From One Env Var

Running MLX on M1 Max defaults to bf16 for many kernels. For Qwen 3.6 MoE specifically, that was costing real throughput.

Setting MLX_FORCE_FP16=1 in the LaunchAgent environment bumped tok/s from 21.18 to 26.22. That's +24% from one flag. No recompile. No re-quantization. No weight re-download.

I don't know the full story of why bf16 is the default if fp16 wins here โ€” the MLX team almost certainly has a good reason at the kernel level. But empirically, on this hardware with this model, the flag is free speed.

Persisted it in the LaunchAgent plist, restarted, never looked back.

What Metal Memory Actually Wants: 45GB Wired, 48GB Ceiling, 512MB Scratch

Out of the box, Apple's memory compressor is aggressive. It will look at your 40GB model sitting in RAM, decide some of it is "idle," and start compressing pages. Every decompression on a subsequent inference is thrash.

The fix for MLX on M1 Max is a three-line config (pseudo-code โ€” real calls take bytes, I'm using GB suffixes for readability):

Before this, compressed swap on my machine was 19.69GB. After, it sits at 1.7GB. That's a 10x improvement on memory pressure from three lines of config. The buffer for macOS + Chrome + everything else stays at ~14-16GB, which survives a full day of normal laptop use. (I wrote up the full debugging path for the memory compression issue here โ€” it took me longer than I'd like to admit to figure out.)

The MoE Saturation Wall at 500 Tokens (The Thing Nobody Warns You About)

Qwen 3.6 is a Mixture-of-Experts model. On paper, sparse activation means you're only touching a fraction of weights per token, which is why it fits in 40GB at all.

What the papers don't emphasize: MoE models have a soft quality ceiling on single generation length. For Qwen 3.6 specifically, output degrades past roughly 500 tokens. Past 800 you start getting word salad. Past 1500 you get paragraphs that apologize to themselves mid-sentence.

The workaround is sectional generation. Split long outputs into 250-400 token sections, generate each independently, concatenate. State resets between calls. The model stays coherent the whole way through.

I automated it: a FastAPI endpoint that takes a research brief plus an ordered list of sections (heading + 1-sentence instruction + target word count) and fires one MLX call per section with max_tokens hard-capped under the degen zone. No shared context across calls. Outputs concatenate into a full draft. Maybe 40 lines of Python. If there's interest I'll clean it up and drop it as part of a small OSS package alongside the memory-safe runtime config.

This isn't an MLX issue. It's how MoE attention routing behaves under sustained sampling. Took me a while to isolate the variable.

The 4 AM Ghost: Managing Metal's Memory Drift

Even with wired_limit pinning, Metal accumulates scratch buffers over time. Long inference sessions leave compile cache and intermediate allocations that don't always free cleanly. After a couple of days of uptime, tok/s drifts down 5-10%.

The fix is a scheduled restart. I have a LaunchAgent KeepAlive set up to kill and relaunch the backend every day at 4 AM local time. Takes about 60 seconds end-to-end โ€” roughly 40 of those are MLX warmup.

It's not elegant. A properly designed memory system wouldn't need this. But it works, it's invisible because it runs while I sleep, and the next morning tok/s is back at baseline. I'll take a cron job over a memory leak any day.

What I Actually Lose vs Cloud (And What I Don't)

Honest comparison. What you lose going local:

What you don't lose:

For a solo dev with one user (me), the tradeoff leans local hard. Mileage varies if you're serving an API.

The Thing Nobody Prices About Apple Silicon: Unified Memory

Here's the structural point most Apple Silicon takes miss.

On x86 + Nvidia, VRAM is separate from system RAM. A $3k gaming laptop ships with at most 16GB of VRAM โ€” physically cannot hold Qwen 35B Q8, period. To match the 40GB I'm using here, you'd need two RTX 3090s (24GB each, NVLink bridge to share weights): ~$1,400-1,800 used for the cards alone, plus PSU, case, cooling, CPU. Easily another $1,500 before you have a running machine. And even then each forward pass is sharding across PCIe โ€” not unified memory. Two 4090s don't even solve it cleanly because Nvidia dropped NVLink on the 4090 line.

Meanwhile this thing fits in a backpack and runs at a quiet coffee shop.

On Apple Silicon, the 40GB of model weights live in the same physical RAM the OS and Chrome use. No PCIe bottleneck between CPU and GPU compute โ€” they literally share memory. That's not a Metal-is-faster-than-CUDA claim (per-op, it usually isn't). It's an architecture claim.

Which is why this MacBook runs models that most gaming desktops physically cannot. The chip speed is a subplot. The memory layout is the actual moat. (I made a longer version of this argument here, back when I was still surprised it was working at all.)

32 Watts During Inference (Yes, I Measured)

While drafting this post, I ran sudo powermetrics --samplers thermal,cpu_power during an active MLX generation. Under sustained load โ€” 35B MoE mid-response โ€” the M1 Max reported:

Nominal means the OS isn't throttling, fans are only occasionally audible, and the chassis is warm but not hot to the touch. Battery was at 80% charge while all of this ran โ€” I could have unplugged and kept going.

For context, a single RTX 4090 pulls 300-450W under ML load, and that's before you count the rest of the PC around it. This entire laptop, running the same class of model, is drawing roughly one-tenth that power.

Three Months In, I'm Long the Hardware Engineer

Which brings me back to Ternus.

Ternus joined Apple in 2001, became SVP Hardware Engineering in 2021, and has been the organizational owner of the physical devices Apple Silicon runs in. He didn't design the M-series chips themselves (Johny Srouji's team does that). But he shipped the iPads, MacBooks, and the unified-memory architecture decisions that make all of the numbers above possible.

Put another way: the CEO who's taking over in September has been running the team whose work lets me pin a 40GB model in Metal memory at 32 watts. That's not a marketing flywheel. That's a bet on the thing that's actually hard to replicate.

I could be wrong on the broader empire thesis. Services growth could decelerate. China revenue could weaken. iPhone upgrade cycles could stretch. Any of those hitting simultaneously resets the multiple regardless of who's in the chair.

But the floor โ€” the laptop on my desk that still runs the model โ€” doesn't move. And the Board just put the person who was running the team responsible for that floor in the CEO seat. From a solo-dev perspective, that reads as the Board knowing exactly which flywheel they need to defend for the next decade.

Come along for the ride โ€” see me fall or thrive, whichever comes first.