Daniel Isaac
@danpacary
Followers
711
Following
3K
Media
206
Statuses
2K
idk what I'm doing half the time. space, drones, AI & physics
Earth
Joined September 2023
I hijacked Apple's Neural Engine -- the chip built for Siri and photo filters. Reverse-engineered the private APIs and trained a full LLM on it. Zero fan noise. Zero GPU. Just the Neural Engine doing what nobody thought it could. Your Mac has one too.
34
116
1K
404 experiments. 105 hours. 1 Mac.
3 accelerators (MPS → ANE → MLX)
2 AI agents running simultaneously
63h wall time
77 keeps
19% keep rate
81% of experiments failed
ANE agent still running. Count going up.
more to come...
3
0
29
anyone can be a researcher, hacker, builder just do the work
Anyone can do this work. Even you, reading this, right now. Every M-series Mac has a Neural Engine. Doing nothing... We got it working with private APIs and Obj-C. Not pretty. Not easy. But it works. The code is open. The data is public. ncdrone/autoresearch-ANE
0
0
14
Credits – the people who made this possible:
maderix – ANE private APIs, dynamic weights, the entire foundation
Karpathy – autoresearch, climbmix-400B, rustbpe tokenizer
Vipul Divyanshu – 1x1 conv classifier, bridge APIs
thebasedcapital – Rust+ANE+Metal, direct eval,
1
0
4
What we actually built (that didn't already exist):
• 344 experiments across 3 accelerators (MPS → ANE → MLX) – systematic testing on a chip almost nobody trains on. Split LR scaling came from this grind.
• First bridge from Karpathy's climbmix-400B data to ANE native
1
0
0
Then I read the literature.
• Zero-init? maderix had DeepNet scaling.
• Classifier bottleneck? Vipul proved 1x1 conv is 10x faster.
• FP16 underflow? maderix documented the exact fix.
• Dispatch overhead? thebasedcapital had direct eval + fused mega-kernels.
• Conv2d
1
0
2
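The 1x1-conv classifier trick mentioned above is essentially a layout change: a linear head and a 1x1 convolution compute the same matrix multiply, but ANE's conv path executes it far faster. A minimal numpy sketch of the equivalence (shapes and function names are mine, not from the repo):

```python
import numpy as np

def linear_head(x, w):
    # x: [N, d] activations, w: [V, d] classifier weights -> [N, V] logits
    return x @ w.T

def conv1x1_head(x, w):
    # same weights applied as a 1x1 conv over an NCHW tensor [1, d, 1, N],
    # the layout ANE's conv engine is built around
    xc = x.T[None, :, None, :]                   # [1, d, 1, N]
    out = np.einsum('vd,bdhn->bvhn', w, xc)      # 1x1 conv == per-position matmul
    return out[0, :, 0, :].T                     # back to [N, V]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w = rng.standard_normal((32, 16))
assert np.allclose(linear_head(x, w), conv1x1_head(x, w))
```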
What I thought we discovered: I thought we found zero-init stabilizes training. Huge win. I thought we profiled the classifier – 22% of step time. Found the bottleneck. I thought we caught FP16 gradients silently dying. I thought we spotted ANE dispatch overhead stacking up. I
1
0
1
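"FP16 gradients silently dying" is classic half-precision underflow, and the standard remedy is loss scaling: multiply the loss (and therefore the gradients) by a constant so they stay representable in fp16, then unscale in fp32 before the optimizer step. A minimal numpy sketch; the scale value is hypothetical, not the specific fix maderix documented:

```python
import numpy as np

SCALE = 2.0 ** 12                       # hypothetical loss scale

grad = 1e-8                             # below fp16's subnormal floor (~6e-8)
naive = np.float16(grad)                # silently underflows to exactly 0.0
scaled = np.float16(grad * SCALE)       # survives in fp16
recovered = np.float32(scaled) / SCALE  # unscale in fp32 before the update

assert naive == 0.0 and recovered > 0.0
```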
344 experiments. 2 AI agents. 1 chip nobody trains on. Here's what I built – and what I didn't: I trained a 48.8M param model on Apple Neural Engine using private APIs. val_bpb = 1.595. First comparable benchmark on ANE – same data and tokenizer as Karpathy's H100 baseline.
4
3
33
I'm actually going to continue this for 24h... 12h isn't enough
Tonight's setup: two autonomous AI agents training GPT models simultaneously on the same M4 Max. One runs on Apple Neural Engine (native Obj-C, private APIs). The other on MLX (Python). They share a gossip file – each agent reads what the other discovered before running its
3
1
10
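The gossip file can be as simple as an append-only JSONL log that each agent writes to and filters on read. A minimal sketch of that pattern; the file name and record schema are my assumptions, not HyperspaceAI's actual format:

```python
import json
import pathlib

GOSSIP = pathlib.Path("gossip.jsonl")   # hypothetical shared file name

def publish(agent, finding):
    # each agent appends one JSON record per finished experiment
    with GOSSIP.open("a") as f:
        f.write(json.dumps({"agent": agent, "finding": finding}) + "\n")

def read_peers(me):
    # before launching a run, read everything the *other* agent learned
    if not GOSSIP.exists():
        return []
    records = [json.loads(line) for line in GOSSIP.read_text().splitlines()]
    return [r["finding"] for r in records if r["agent"] != me]
```

Append-plus-rescan keeps the two processes decoupled: neither blocks on the other, and a crashed agent loses nothing its peer already logged.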
M4 Max, 128GB. Karpathy climbmix-400B, rustbpe 8192. ANE: native Obj-C, private APIs, 48.8M params. MLX: Python, Apple MLX, 15.7M params. Shared gossip file. Both log + read every experiment. Next: overnight with optimized config. Credits maderix, karpathy, trevin-creator,
0
0
2
The bull case for ANE:
- 3.1x more params. More capacity, less optimization.
- Adam → Muon could close half the gap.
- Sweep dropped 0.354 bpb via cross-pollination.
- Overnight curve still trending down at 72K.
- No published ANE training metrics we can find.
1
0
1
Cross-pollination is real. Tonight: 98 ANE experiments, autonomous agent reading MLX gossip before each run. embed_lr insight from MLX? Applied. Softcap removal? Confirmed. Short warmup? Validated. Result: 2.490 → 2.136. That's −0.354 bpb in one session.
1
0
1
Before importing MLX findings, the agent calibrated fundamentals: learning rate, batch accumulation, warmup. These knobs were set for 72K-step overnights. 3K-step sweeps need different settings. Result: −0.012 bpb. Small, but necessary groundwork.
1
0
0
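One way to recalibrate those knobs is to express warmup as a fraction of total steps rather than an absolute count, so a 3K-step sweep and a 72K-step overnight get proportionally similar schedules. A hypothetical sketch, not the agent's actual schedule:

```python
import math

def lr_at(step, total_steps, base_lr, warmup_frac=0.02):
    # Linear warmup over a *fraction* of the run, then cosine decay.
    # warmup_frac=0.02 is a hypothetical default, not the agent's value.
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

# the same warmup_frac yields 60 warmup steps for a 3K sweep
# and 1440 for a 72K overnight, instead of one absolute number
```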
The gap isn't hardware. It's everything else.
Optimizer: pure Adam vs Muon+AdamW (~half the gap alone*)
Research: 55 ANE experiments vs 259 MLX
Architecture: 2 features vs 5+
Language: compiled Obj-C vs Python
*Muon estimate based on published ablations
1
0
2
Two very different paths to convergence. ANE: one 8-hour overnight run, 72K steps. Still trending down – not plateaued. MLX: 259 five-minute experiments, 30 improvements. Rapid iteration in Python. ANE iterates 60x slower. That compounds.
1
0
1
ANE vs MLX. Same chip. Same data. Same tokenizer. Same eval.
ANE: 1.5949 bpb (48.8M params, pure Adam)
MLX: 1.2661 bpb (15.7M params, Muon+AdamW)
Gap: 0.329. MLX wins – but ANE is under-optimized.
4
2
68
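For reference, bits-per-byte is token-level cross-entropy converted from nats to bits and renormalized by byte count, which is why the comparison only holds when data and tokenizer match. A minimal sketch; the helper and the example numbers are mine, not measurements from the thread:

```python
import math

def bits_per_byte(ce_loss_nats, n_tokens, n_bytes):
    # cross-entropy (nats/token) -> bits/token -> bits/byte
    return (ce_loss_nats / math.log(2)) * n_tokens / n_bytes

# illustrative: at ~4.4 bytes/token, a loss of 4.87 nats/token is ~1.60 bpb
bpb = bits_per_byte(4.87, 1, 4.4)
```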
Thanks @grok but I stand on the shoulders of giants…
@danpacary @FloridaMannnnnn No, Apple didn't release public API endpoints for direct Neural Engine training or custom graphs in macOS Tahoe 26.4 (or any recent beta). Core ML still limits it to inference only. Your reverse-engineering of the private _ANEClient/_ANECompiler APIs (and those benchmarks showing
1
0
17
Built on research from maderix, Vipul Divyanshu, thebasedcapital, Anemll, Karpathy's autoresearch framework, and HyperspaceAI's gossip concept. Full attribution with exactly what came from where: github.com/ncdrone/autoresearch-ANE/blob/autoresearch/mar9-ane/CREDITS.md
0
1
10