Testing Apple's On-Device LLM: What apfel Gets Right (And Where It Falls Apart)

Apple’s Foundation Models framework ships with macOS 26+, silently powering Apple Intelligence features. But what if you want to use it directly? Not through Siri. Not through Writing Tools. Through an API you control.

apfel is an open-source OpenAI-compatible server that exposes Apple’s on-device LLM to any HTTP client. No API keys. No usage limits. No network calls. Just a local server running on your Mac.

I spent a week testing it. Fifteen distinct tasks. Every supported parameter. Real latency measurements.

The short version: Fast and accurate for classification, entity extraction, and privacy-sensitive tasks. Painfully slow for anything complex. Zero network calls - genuinely private. Missing logprobs, embeddings, and context depth. If you can tolerate 7-30 second response times and don’t need confidence scores, it’s a legitimate tool. Otherwise, stick to cloud APIs.

Here’s the full picture.


What It Actually Is

apfel v0.9.4 is a Swift-based CLI that wraps Apple’s Foundation Models framework. Out of the box it gives you an interactive chat. Add --serve and it spins up an OpenAI-compatible HTTP server on your machine.

brew tap Arthur-Ficial/tap
brew install apfel
apfel --serve  # localhost:11434

Alternatively: git clone the repo and make install. No Xcode required - Command Line Tools are enough.

The model isn’t configurable. You get apple-foundationmodel - whatever Apple shipped with your macOS version - and that’s it. One model. One behavior. No temperature variants, no size options.
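Because the server speaks the OpenAI wire format, a request is just a standard chat-completion body naming that one model. A minimal sketch of the request payload (the endpoint path is the OpenAI-compatible default; only the model name comes from apfel itself):

```python
import json

# Minimal OpenAI-style chat completion body for the local apfel server
# (assumed endpoint: http://localhost:11434/v1/chat/completions).
payload = {
    "model": "apple-foundationmodel",  # the only model apfel exposes
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Classify the sentiment: 'Great product!'"},
    ],
    "temperature": 0.0,  # 0.0 = deterministic, per apfel's parameter support
}

body = json.dumps(payload)
```

Any OpenAI SDK pointed at `http://localhost:11434/v1` with a dummy API key produces the same shape.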

Context window: 4096 tokens total (input + output). Roughly three pages of text.
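Since input and output share that budget, it's worth a pre-flight check before sending long prompts. A rough sketch, assuming the common ~4-characters-per-token heuristic (Apple's actual tokenizer may count differently):

```python
# Pre-flight estimate against the 4096-token shared budget.
# The 4-chars-per-token ratio is a generic heuristic, not Apple's
# tokenizer; treat the result as an estimate only.
CONTEXT_LIMIT = 4096

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_output_tokens: int) -> bool:
    return estimate_tokens(prompt) + max_output_tokens <= CONTEXT_LIMIT

fits_context("Summarize this paragraph.", 512)
```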

One thing to note: port 11434 is the same default as Ollama. If you run both, you’ll have a conflict. Plan accordingly.


The Testing Method

I tested 15 distinct task categories across three dimensions:

  1. Capability - Does it produce correct output?
  2. Latency - How long does it take?
  3. API compatibility - Which OpenAI parameters work?

All tests ran on an M1 MacBook Air with 8 GB RAM - the lowest-spec Apple Silicon machine you can find - using Python’s OpenAI SDK pointing at the local apfel server. No network calls. No caching. Each test measured from request to final token. On beefier hardware, expect faster results.
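The measurement itself needs nothing fancy: wall-clock time around the client call, from request start to final token. A sketch of the harness (the stand-in lambda replaces a live SDK call):

```python
import time

# Wall-clock timing wrapper: elapsed seconds from request to result.
# `call` stands in for any client invocation, e.g. an OpenAI SDK
# chat.completions.create pointed at the local apfel server.
def timed(call, *args, **kwargs):
    start = time.perf_counter()
    result = call(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Demo with a stand-in function instead of a live request:
result, seconds = timed(lambda: "ok")
```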


What Works Well

Speed on Simple Tasks

| Task | Time | Quality |
| --- | --- | --- |
| Entity extraction | 3.8s | Found all names correctly |
| Classification | 6.6s | Perfect sentiment detection |
| Instruction following | 6.7s | Exact format compliance |
| Creative writing | 6.9s | Decent haikus |
| Context coherence | 6.9s | Remembered facts across conversation |
| Summarization | 6-7s | Condensed repeated content well |
| JSON output | 7.1s | Valid structured data |
| Translation | 7.4s | Accurate German, French, Spanish |

Entity extraction was the standout - fastest test, highest accuracy. The model correctly identified “John, Mary, and Dr. Smith” from unstructured text without hallucinating additional entities.

Function Calling

This one surprised me. Full tools and tool_choice support works. Local function calling with zero network dependency. If you’re building automation pipelines that handle sensitive data, this is significant - your tool definitions and responses never leave the machine.
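The flow is the standard OpenAI one: declare tools as JSON schemas, the model returns a `tool_call`, and your code dispatches it locally. A minimal sketch (the `get_local_time` tool is invented for illustration, not part of apfel):

```python
import json
from datetime import datetime

# An OpenAI-format tool declaration; `get_local_time` is a hypothetical
# example tool for this sketch.
tools = [{
    "type": "function",
    "function": {
        "name": "get_local_time",
        "description": "Return the current local time as an ISO string.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute a tool call returned by the model, entirely on-device."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"] or "{}")
    if name == "get_local_time":
        return datetime.now().isoformat()
    raise ValueError(f"unknown tool: {name}")

# Shape of a tool call as an OpenAI-compatible server returns it:
fake_call = {"function": {"name": "get_local_time", "arguments": "{}"}}
reply = dispatch(fake_call)
```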

Hallucination Resistance

When asked about future events (2027 Oscars), it refused. No guessing. No fabricated winners. Just a clean statement that it cannot answer questions about future events.

Instruction Precision

Ask for numbered lists, get numbered lists. Ask for lowercase output, get lowercase. The model follows formatting instructions precisely - useful for structured data extraction pipelines.

Privacy Architecture

Zero external HTTP calls. No analytics SDKs. No telemetry endpoints. No error reporting services. The update check runs brew info locally, not via network API. Logs go to stderr only, stored in a 1000-entry in-memory ring buffer.

For workflows handling sensitive data, this architecture eliminates third-party risk entirely.
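The ring-buffer behavior is simple to picture: once full, the oldest entry is evicted. apfel itself is Swift, but the same idea in Python is a one-liner with `collections.deque`:

```python
from collections import deque

# A bounded in-memory log: at most 1000 entries, oldest evicted first.
# Illustrative sketch of the ring-buffer idea, not apfel's actual code.
log_buffer = deque(maxlen=1000)

for i in range(1500):
    log_buffer.append(f"log entry {i}")

# Only the most recent 1000 entries survive; entries 0-499 are gone.
```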


Where It Struggles

Latency on Complex Tasks

| Task | Time | Notes |
| --- | --- | --- |
| Simple code generation | 9.4s | Clean Python functions |
| Basic reasoning | 11.5s | Math problems, showed work |
| Comparison/analysis | 12.4s | Solid but slow |
| Complex code (thread-safe LRU) | 30.5s | Correct but painful |

Thirty seconds for a single code completion. This isn’t “slightly slower than cloud APIs.” This is a different category of tool entirely.

Average across all tests: 9.4 seconds. Cloud APIs run sub-second. The tradeoff is real.

Hard Limitations

No logprobs. Cannot get confidence scores for classification tasks. If your pipeline needs “how sure are you?” - common for automated decision thresholds - this data isn’t available.

4096 token context. Input plus output. For document analysis, code review, or long conversations, you hit the ceiling fast.

One model only. No switching between fast/cheap and slow/capable variants. You get what Apple shipped.

No embeddings. Vector search, semantic similarity, RAG pipelines - none of it. You’ll need a separate solution.

JSON formatting quirk. Sometimes wraps response_format: json_object outputs in markdown code blocks. Your parser needs to handle this.
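A small defensive parser covers the quirk: strip the fence if present, then parse. A sketch of one way to handle it:

```python
import json

def parse_model_json(raw: str) -> dict:
    """Parse a JSON response, tolerating markdown code-block wrapping."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional "json" tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return json.loads(text)

wrapped = "```json\n{\"sentiment\": \"positive\"}\n```"
data = parse_model_json(wrapped)
```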


Cloud vs. On-Device

The comparison everyone wants to make:

| Factor | Cloud (OpenAI/Anthropic/DeepSeek) | apfel (On-Device) |
| --- | --- | --- |
| Cost | Per-token | Free |
| Latency | <1s | 7-30s |
| Privacy | Data leaves device | 100% local |
| Context | 128K+ tokens | 4K tokens |
| Logprobs | Yes | No |
| Embeddings | Yes | No |
| Internet required | Yes | No |
| Model selection | Multiple | One |
| GDPR transfer risk | Yes (foreign providers) | None |

For EU companies operating under GDPR, that last row matters. No third-country transfer. No CLOUD Act exposure. No DPA negotiations. The data never leaves your machine because there’s nowhere for it to go.


API Compatibility

apfel promises OpenAI API compatibility. Here’s what actually works:

| Parameter | Status |
| --- | --- |
| temperature | ✓ Works (0.0 = deterministic) |
| max_tokens | ✓ Works |
| seed | ✓ Reproducible outputs |
| stream | ✓ Works |
| tools / tool_choice | ✓ Full function calling |
| response_format: json_object | ✓ Works |
| x_context_strategy | ✓ apfel extension (5 strategies) |
| x_context_max_turns | ✓ apfel extension |
| x_context_output_reserve | ✓ apfel extension |
| logprobs, n, stop, presence_penalty, frequency_penalty | ✗ 400 error |

Streaming quirk: The final chunk returns empty choices[] with usage statistics only. Client code needs to handle this or it crashes on the last packet.
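Handling it is one guard clause: skip any chunk whose `choices` list is empty, optionally capturing the usage stats it carries. A sketch with plain dicts mirroring the streaming wire format:

```python
# Collect streamed content while tolerating the final usage-only chunk,
# whose `choices` list is empty. Chunks are plain dicts mirroring the
# OpenAI streaming wire format.
def collect_stream(chunks):
    parts = []
    usage = None
    for chunk in chunks:
        if not chunk.get("choices"):          # final chunk: usage only
            usage = chunk.get("usage")
            continue
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts), usage

fake_chunks = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
    {"choices": [], "usage": {"total_tokens": 12}},
]
text, usage = collect_stream(fake_chunks)
```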


Architecture Notes

apfel is built in Swift 6 with strict concurrency checking. HTTP layer uses Hummingbird, bound to localhost only. Test suite: 203 unit tests plus 174 integration tests.

The context manager offers five strategies: newest-first, oldest-first, sliding-window, summarize, and strict. The summarize option compresses old turns via the on-device model itself - useful for long conversations within the 4K limit.
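To make the sliding-window idea concrete, here is a rough Python illustration that keeps the newest turns within a token budget. apfel's real strategies run in Swift and may count tokens differently; the ~4-chars-per-token estimate is an assumption:

```python
def sliding_window(messages, budget_tokens=3584):
    """Keep the newest messages whose estimated tokens fit the budget.

    Illustrative only: not apfel's actual Swift implementation.
    ~4 characters per token is a rough heuristic.
    """
    kept = []
    used = 0
    for msg in reversed(messages):            # walk newest-first
        cost = max(1, len(msg["content"]) // 4)
        if used + cost > budget_tokens:
            break                             # window is full
        kept.append(msg)
        used += cost
    return list(reversed(kept))               # restore chronological order

history = [
    {"role": "user", "content": "x" * 8000},           # old, oversized turn
    {"role": "assistant", "content": "short answer"},
    {"role": "user", "content": "latest question"},
]
trimmed = sliding_window(history, budget_tokens=100)
```

The oversized old turn is dropped; the two recent turns survive in order.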


Where It Fits

Use it for:

  - Entity extraction, classification, and structured-data extraction on sensitive inputs
  - Workflows where data must never leave the machine - GDPR-bound, compliance-heavy, or offline
  - Local function-calling automation over private data

Acceptable for:

  - Summarization, translation, and simple code generation, if 7-10 second latencies are tolerable
  - Prototyping OpenAI-compatible clients without API costs

Don't use it for:

  - Latency-sensitive or interactive applications
  - Long documents or conversations beyond the 4K-token ceiling
  - Anything that needs embeddings, logprobs, or a choice of models


Bottom Line

apfel isn’t competing with GPT-4o or Claude. It’s offering something different: local inference for the privacy-conscious, the compliance-bound, and the network-restricted.

The tradeoffs are real. You sacrifice speed, context length, and advanced features for privacy and control. For some workflows - particularly entity extraction, classification, and anything touching sensitive data - the tradeoff makes sense. For most, cloud APIs remain the pragmatic choice.

But apfel adds something important: a working example of what on-device AI looks like when you strip away the marketing and measure it honestly. And if you’ve been following the GDPR and CLOUD Act developments, “the data never leaves your machine” is becoming less of a nice-to-have and more of a legal requirement.

Three questions before you commit:

  1. Can your workflow handle 7-30 second response times?
  2. Will inputs plus outputs stay under 4K tokens?
  3. Do you need logprobs or embeddings?

Yes to the first, yes to the second, no to the third - give apfel a try. Otherwise, you know the answer.


TL;DR

  - Free, fully local, OpenAI-compatible server for Apple's on-device LLM
  - Fast and accurate on extraction and classification; 7-30 seconds on complex tasks
  - Hard limits: 4K context, one model, no logprobs, no embeddings
  - Best fit: privacy-sensitive and compliance-bound workflows; cloud APIs remain faster

Tested on apfel v0.9.4, macOS 26+, M1 MacBook Air with 8 GB RAM - the entry-level Apple Silicon machine. If anything, these numbers are the floor; beefier hardware will do better. Measurements are averages across multiple runs. This article was brainstormed in collaboration with DeepSeek, Kimi K2.5 and Claude. The benchmarks, opinions, and 30-second wait times are entirely real.

#En #Ai #Privacy #Apple #Local-Llm