Testing Apple's On-Device LLM: What apfel Gets Right (And Where It Falls Apart)
Apple’s Foundation Models framework ships with macOS 26+, silently powering Apple Intelligence features. But what if you want to use it directly? Not through Siri. Not through Writing Tools. Through an API you control.
apfel is an open-source OpenAI-compatible server that exposes Apple’s on-device LLM to any HTTP client. No API keys. No usage limits. No network calls. Just a local server running on your Mac.
I spent a week testing it. Fifteen distinct tasks. Every supported parameter. Real latency measurements.
The short version: fast and accurate for classification, entity extraction, and privacy-sensitive tasks. Painfully slow for anything complex. Zero network calls - genuinely private. No logprobs, no embeddings, and a tight 4K context. If you can tolerate 7-30 second response times and don't need confidence scores, it's a legitimate tool. Otherwise, stick with cloud APIs.
Here’s the full picture.
What It Actually Is
apfel v0.9.4 is a Swift-based CLI that wraps Apple's Foundation Models framework. Out of the box it gives you an interactive chat. Add `--serve` and it spins up an OpenAI-compatible HTTP server on your machine.

```shell
brew tap Arthur-Ficial/tap
brew install apfel
apfel --serve   # serves on localhost:11434
```

Alternatively, `git clone` the repo and run `make install`. No Xcode required - the Command Line Tools are enough.
The model isn't configurable. You get `apple-foundationmodel` - whatever Apple shipped with your macOS version - and that's it. One model. One behavior. No temperature variants, no size options.
Context window: 4096 tokens total (input + output). Roughly three pages of text.
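If you want a pre-flight check before sending a request, a crude character-count heuristic is enough to catch obvious overruns. Roughly four characters per token is an approximation for English text, not Apple's actual tokenizer:

```python
def fits_context(prompt: str, max_output: int, window: int = 4096) -> bool:
    """Rough pre-flight check using ~4 characters per token -
    a crude English-text heuristic, not Apple's tokenizer."""
    estimated_input = len(prompt) // 4
    # Input and output share the same 4096-token window.
    return estimated_input + max_output <= window
```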
One thing to note: port 11434 is the same default as Ollama. If you run both, you’ll have a conflict. Plan accordingly.
The Testing Method
I tested 15 distinct task categories across three dimensions:
- Capability - Does it produce correct output?
- Latency - How long does it take?
- API compatibility - Which OpenAI parameters work?
All tests ran on an M1 MacBook Air with 8 GB RAM - the lowest-spec Apple Silicon machine you can find - using Python’s OpenAI SDK pointing at the local apfel server. No network calls. No caching. Each test measured from request to final token. On beefier hardware, expect faster results.
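For reference, the harness boiled down to something like the sketch below. It uses only the standard library rather than the OpenAI SDK; the endpoint and model name are apfel's defaults, and `timed_completion` is an illustrative helper, not apfel code:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:11434/v1"   # apfel's default port (shared with Ollama)
MODEL = "apple-foundationmodel"          # the only model apfel exposes

def build_request(prompt: str, temperature: float = 0.0) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,   # 0.0 = deterministic
    }

def timed_completion(prompt: str) -> tuple[str, float]:
    """POST to the local server and measure request-to-final-token latency."""
    start = time.monotonic()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"], time.monotonic() - start
```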
What Works Well
Speed on Simple Tasks
| Task | Time | Quality |
|---|---|---|
| Entity extraction | 3.8s | Found all names correctly |
| Classification | 6.6s | Perfect sentiment detection |
| Instruction following | 6.7s | Exact format compliance |
| Creative writing | 6.9s | Decent haikus |
| Context coherence | 6.9s | Remembered facts across conversation |
| Summarization | 6-7s | Condensed repeated content well |
| JSON output | 7.1s | Valid structured data |
| Translation | 7.4s | Accurate German, French, Spanish |
Entity extraction was the standout - fastest test, highest accuracy. The model correctly identified “John, Mary, and Dr. Smith” from unstructured text without hallucinating additional entities.
Function Calling
This one surprised me. Full tools and tool_choice support works. Local function calling with zero network dependency. If you’re building automation pipelines that handle sensitive data, this is significant - your tool definitions and responses never leave the machine.
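A sketch of what a local pipeline looks like. The `lookup_contact` tool and the dispatch helper are hypothetical examples, not part of apfel - they just show the standard OpenAI function-calling shape that the server accepts:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_contact",
        "description": "Look up a contact in the local address book",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Full name to search"},
            },
            "required": ["name"],
        },
    },
}]

def dispatch_tool_call(tool_call: dict, registry: dict) -> str:
    """Route a model-emitted tool call to a local Python function.
    Definition, arguments, and result all stay on the machine."""
    fn = registry[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)
```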
Hallucination Resistance
When asked about future events (2027 Oscars), it refused. No guessing. No fabricated winners. Just a clean statement that it cannot answer questions about future events.
Instruction Precision
Ask for numbered lists, get numbered lists. Ask for lowercase output, get lowercase. The model follows formatting instructions precisely - useful for structured data extraction pipelines.
Privacy Architecture
Zero external HTTP calls. No analytics SDKs. No telemetry endpoints. No error-reporting services. The update check runs `brew info` locally rather than hitting a network API. Logs go to stderr only, backed by a 1000-entry in-memory ring buffer.
For workflows handling sensitive data, this architecture eliminates third-party risk entirely.
Where It Struggles
Latency on Complex Tasks
| Task | Time | Notes |
|---|---|---|
| Simple code generation | 9.4s | Clean Python functions |
| Basic reasoning | 11.5s | Math problems, showed work |
| Comparison/analysis | 12.4s | Solid but slow |
| Complex code (thread-safe LRU) | 30.5s | Correct but painful |
Thirty seconds for a single code completion. This isn’t “slightly slower than cloud APIs.” This is a different category of tool entirely.
Average across all tests: 9.4 seconds. Cloud APIs run sub-second. The tradeoff is real.
Hard Limitations
No logprobs. Cannot get confidence scores for classification tasks. If your pipeline needs “how sure are you?” - common for automated decision thresholds - this data isn’t available.
4096 token context. Input plus output. For document analysis, code review, or long conversations, you hit the ceiling fast.
One model only. No switching between fast/cheap and slow/capable variants. You get what Apple shipped.
No embeddings. Vector search, semantic similarity, RAG pipelines - none of it. You’ll need a separate solution.
JSON formatting quirk. Sometimes wraps response_format: json_object outputs in markdown code blocks. Your parser needs to handle this.
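A defensive parser handles both cases. This is an illustrative helper, not apfel code:

```python
import json
import re

def parse_json_response(text: str):
    """Parse a JSON reply, tolerating the markdown code fence that
    sometimes wraps response_format: json_object output."""
    stripped = text.strip()
    # Strip a leading ```json (or bare ```) fence and the trailing ``` if present.
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", stripped, re.DOTALL)
    if match:
        stripped = match.group(1)
    return json.loads(stripped)
```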
Cloud vs. On-Device
The comparison everyone wants to make:
| Factor | Cloud (OpenAI/Anthropic/DeepSeek) | apfel (On-Device) |
|---|---|---|
| Cost | Per-token | Free |
| Latency | <1s | 7-30s |
| Privacy | Data leaves device | 100% local |
| Context | 128K+ tokens | 4K tokens |
| Logprobs | Yes | No |
| Embeddings | Yes | No |
| Internet required | Yes | No |
| Model selection | Multiple | One |
| GDPR transfer risk | Yes (foreign providers) | None |
For EU companies operating under GDPR, that last row matters. No third-country transfer. No CLOUD Act exposure. No DPA negotiations. The data never leaves your machine because there’s nowhere for it to go.
API Compatibility
apfel promises OpenAI API compatibility. Here’s what actually works:
| Parameter | Status |
|---|---|
| `temperature` | ✓ Works (0.0 = deterministic) |
| `max_tokens` | ✓ Works |
| `seed` | ✓ Reproducible outputs |
| `stream` | ✓ Works |
| `tools` / `tool_choice` | ✓ Full function calling |
| `response_format: json_object` | ✓ Works |
| `x_context_strategy` | ✓ apfel extension (5 strategies) |
| `x_context_max_turns` | ✓ apfel extension |
| `x_context_output_reserve` | ✓ apfel extension |
| `logprobs`, `n`, `stop`, `presence_penalty`, `frequency_penalty` | ✗ 400 error |
Streaming quirk: The final chunk returns empty choices[] with usage statistics only. Client code needs to handle this or it crashes on the last packet.
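Defensive handling takes a few lines. The helper below is illustrative and operates on chunks decoded as plain dicts (the raw SSE payloads); SDK clients expose the same fields as object attributes:

```python
def collect_stream(chunks):
    """Accumulate streamed content, tolerating the final usage-only
    chunk that arrives with an empty choices[] list."""
    parts, usage = [], None
    for chunk in chunks:
        if not chunk["choices"]:          # final chunk: usage stats only
            usage = chunk.get("usage")
            continue
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts), usage
```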
Architecture Notes
apfel is built in Swift 6 with strict concurrency checking. HTTP layer uses Hummingbird, bound to localhost only. Test suite: 203 unit tests plus 174 integration tests.
The context manager offers five strategies: newest-first, oldest-first, sliding-window, summarize, and strict. The summarize option compresses old turns via the on-device model itself - useful for long conversations within the 4K limit.
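As an illustration of the idea behind the sliding-window strategy (not apfel's actual implementation), trimming to a token budget can be sketched in a few lines:

```python
def sliding_window(messages, budget_tokens, estimate=lambda m: len(m["content"]) // 4):
    """Keep the newest turns that fit the token budget. The char/4
    estimate is a crude stand-in for a real tokenizer."""
    kept, budget = [], budget_tokens
    for msg in reversed(messages):        # walk from newest to oldest
        cost = estimate(msg)
        if cost > budget:
            break                         # oldest turns fall off the window
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))           # restore chronological order
```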
Where It Fits
Use it for:
- Privacy-sensitive entity extraction - fast, accurate, no data leaves machine
- Local classification - sentiment, intent, routing (without confidence scores)
- Function calling pipelines - tool definitions stay local
- Shell automation - command generation where 7s latency is acceptable
- Air-gapped environments - zero network requirements
- GDPR-sensitive workflows - no third-country transfer risk
Acceptable for:
- Simple code generation - single functions, not system architecture
- Reasoning tasks - if you can tolerate the wait
- Creative writing - haikus, short-form content
Don’t use it for:
- High-throughput applications - speed makes this impossible
- Classification needing confidence scores - no logprobs available
- Long-document analysis - 4K context limit
- Interactive user interfaces - 7-30s latency kills the experience
- Embeddings or RAG - not supported
Bottom Line
apfel isn’t competing with GPT-4o or Claude. It’s offering something different: local inference for the privacy-conscious, the compliance-bound, and the network-restricted.
The tradeoffs are real. You sacrifice speed, context length, and advanced features for privacy and control. For some workflows - particularly entity extraction, classification, and anything touching sensitive data - the tradeoff makes sense. For most, cloud APIs remain the pragmatic choice.
But apfel adds something important: a working example of what on-device AI looks like when you strip away the marketing and measure it honestly. And if you’ve been following the GDPR and CLOUD Act developments, “the data never leaves your machine” is becoming less of a nice-to-have and more of a legal requirement.
Three questions before you commit:
- Can your workflow handle 7-30 second response times?
- Will inputs plus outputs stay under 4K tokens?
- Do you need logprobs or embeddings?
Yes to the first, yes to the second, no to the third - give apfel a try. Otherwise, you know the answer.
TL;DR
- apfel v0.9.4 - OpenAI-compatible server for Apple’s on-device LLM
- Install: `brew tap Arthur-Ficial/tap && brew install apfel`
- Context: 4096 tokens total (input + output)
- Speed: 3.8s (entity extraction) to 30.5s (complex code), averaging 9.4s
- Privacy: Zero network calls, zero telemetry, 100% local
- Works: Function calling, streaming, JSON output, seed reproducibility
- Missing: logprobs, embeddings, model selection, stop sequences, penalties
- Best for: Privacy-sensitive extraction, classification, local automation, GDPR compliance
- Skip if: You need speed, confidence scores, long context, or embeddings
Tested on apfel v0.9.4, macOS 26+, M1 MacBook Air with 8 GB RAM - the entry-level Apple Silicon machine. If anything, these numbers are the floor; beefier hardware will do better. Measurements are averages across multiple runs. This article was brainstormed in collaboration with DeepSeek, Kimi K2.5 and Claude. The benchmarks, opinions, and 30-second wait times are entirely real.