Testing Apple's On-Device LLM: What apfel Gets Right (And Where It Falls Apart)

Apple’s Foundation Models framework ships with macOS 26+, silently powering Apple Intelligence features. But what if you want to use it directly? Not through Siri. Not through Writing Tools. Through an API you control.

apfel is an open-source OpenAI-compatible server that exposes Apple’s on-device LLM to any HTTP client. No API keys. No usage limits. No network calls. Just a local server running on your Mac.

I spent a week testing it. Fifteen distinct tasks. Every supported parameter. Real latency measurements.

The short version: Fast and accurate for classification, entity extraction, and privacy-sensitive tasks. Painfully slow for anything complex. Zero network calls - genuinely private. Missing logprobs, embeddings, and context depth. If you can tolerate 7-30 second response times and don’t need confidence scores, it’s a legitimate tool. Otherwise, stick to cloud APIs.

Here’s the full picture.


What It Actually Is

apfel v0.9.4 is a Swift-based CLI that wraps Apple’s Foundation Models framework. Out of the box it gives you an interactive chat. Add --serve and it spins up an OpenAI-compatible HTTP server on your machine.

brew tap Arthur-Ficial/tap
brew install apfel
apfel --serve  # localhost:11434

Alternatively: git clone the repo and make install. No Xcode required - Command Line Tools are enough.

The model isn’t configurable. You get apple-foundationmodel - whatever Apple shipped with your macOS version - and that’s it. One model. One behavior. No temperature variants, no size options.
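Because the server speaks the OpenAI wire format, a request is just a standard chat-completion body naming that one model. A minimal sketch of the request payload (the endpoint path is the OpenAI-compatible default; only the model name comes from apfel itself):

```python
import json

# Minimal OpenAI-style chat completion body for the local apfel server
# (assumed endpoint: http://localhost:11434/v1/chat/completions).
payload = {
    "model": "apple-foundationmodel",  # the only model apfel exposes
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Classify the sentiment: 'Great product!'"},
    ],
    "temperature": 0.0,  # 0.0 = deterministic, per apfel's parameter support
}

body = json.dumps(payload)
```

Any OpenAI SDK pointed at `http://localhost:11434/v1` with a dummy API key produces the same shape.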

Context window: 4096 tokens total (input + output). Roughly three pages of text.
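Since input and output share that budget, it's worth a pre-flight check before sending long prompts. A rough sketch, assuming the common ~4-characters-per-token heuristic (Apple's actual tokenizer may count differently):

```python
# Pre-flight estimate against the 4096-token shared budget.
# The 4-chars-per-token ratio is a generic heuristic, not Apple's
# tokenizer; treat the result as an estimate only.
CONTEXT_LIMIT = 4096

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_output_tokens: int) -> bool:
    return estimate_tokens(prompt) + max_output_tokens <= CONTEXT_LIMIT

fits_context("Summarize this paragraph.", 512)
```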

One thing to note: port 11434 is the same default as Ollama. If you run both, you’ll have a conflict. Plan accordingly.


The Testing Method

I tested 15 distinct task categories across three dimensions:

  1. Capability - Does it produce correct output?
  2. Latency - How long does it take?
  3. API compatibility - Which OpenAI parameters work?

All tests ran on an M1 MacBook Air with 8 GB RAM - the lowest-spec Apple Silicon machine you can find - using Python’s OpenAI SDK pointing at the local apfel server. No network calls. No caching. Each test measured from request to final token. On beefier hardware, expect faster results.
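The measurement itself needs nothing fancy: wall-clock time around the client call, from request start to final token. A sketch of the harness (the stand-in lambda replaces a live SDK call):

```python
import time

# Wall-clock timing wrapper: elapsed seconds from request to result.
# `call` stands in for any client invocation, e.g. an OpenAI SDK
# chat.completions.create pointed at the local apfel server.
def timed(call, *args, **kwargs):
    start = time.perf_counter()
    result = call(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Demo with a stand-in function instead of a live request:
result, seconds = timed(lambda: "ok")
```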


What Works Well

Speed on Simple Tasks

| Task | Time | Quality |
| --- | --- | --- |
| Entity extraction | 3.8s | Found all names correctly |
| Classification | 6.6s | Perfect sentiment detection |
| Instruction following | 6.7s | Exact format compliance |
| Creative writing | 6.9s | Decent haikus |
| Context coherence | 6.9s | Remembered facts across conversation |
| Summarization | 6-7s | Condensed repeated content well |
| JSON output | 7.1s | Valid structured data |
| Translation | 7.4s | Accurate German, French, Spanish |

Entity extraction was the standout - fastest test, highest accuracy. The model correctly identified “John, Mary, and Dr. Smith” from unstructured text without hallucinating additional entities.

Function Calling

This one surprised me. Full tools and tool_choice support works. Local function calling with zero network dependency. If you’re building automation pipelines that handle sensitive data, this is significant - your tool definitions and responses never leave the machine.
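The flow is the standard OpenAI one: declare tools as JSON schemas, the model returns a `tool_call`, and your code dispatches it locally. A minimal sketch (the `get_local_time` tool is invented for illustration, not part of apfel):

```python
import json
from datetime import datetime

# An OpenAI-format tool declaration; `get_local_time` is a hypothetical
# example tool for this sketch.
tools = [{
    "type": "function",
    "function": {
        "name": "get_local_time",
        "description": "Return the current local time as an ISO string.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute a tool call returned by the model, entirely on-device."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"] or "{}")
    if name == "get_local_time":
        return datetime.now().isoformat()
    raise ValueError(f"unknown tool: {name}")

# Shape of a tool call as an OpenAI-compatible server returns it:
fake_call = {"function": {"name": "get_local_time", "arguments": "{}"}}
reply = dispatch(fake_call)
```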

Hallucination Resistance

When asked about future events (2027 Oscars), it refused. No guessing. No fabricated winners. Just a clean statement that it cannot answer questions about future events.

Instruction Precision

Ask for numbered lists, get numbered lists. Ask for lowercase output, get lowercase. The model follows formatting instructions precisely - useful for structured data extraction pipelines.

Privacy Architecture

Zero external HTTP calls. No analytics SDKs. No telemetry endpoints. No error reporting services. The update check runs brew info locally, not via network API. Logs go to stderr only, stored in a 1000-entry in-memory ring buffer.

For workflows handling sensitive data, this architecture eliminates third-party risk entirely.
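The ring-buffer behavior is simple to picture: once full, the oldest entry is evicted. apfel itself is Swift, but the same idea in Python is a one-liner with `collections.deque`:

```python
from collections import deque

# A bounded in-memory log: at most 1000 entries, oldest evicted first.
# Illustrative sketch of the ring-buffer idea, not apfel's actual code.
log_buffer = deque(maxlen=1000)

for i in range(1500):
    log_buffer.append(f"log entry {i}")

# Only the most recent 1000 entries survive; entries 0-499 are gone.
```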


Where It Struggles

Latency on Complex Tasks

| Task | Time | Notes |
| --- | --- | --- |
| Simple code generation | 9.4s | Clean Python functions |
| Basic reasoning | 11.5s | Math problems, showed work |
| Comparison/analysis | 12.4s | Solid but slow |
| Complex code (thread-safe LRU) | 30.5s | Correct but painful |

Thirty seconds for a single code completion. This isn’t “slightly slower than cloud APIs.” This is a different category of tool entirely.

Average across all tests: 9.4 seconds. Cloud APIs run sub-second. The tradeoff is real.

Hard Limitations

No logprobs. Cannot get confidence scores for classification tasks. If your pipeline needs “how sure are you?” - common for automated decision thresholds - this data isn’t available.

4096 token context. Input plus output. For document analysis, code review, or long conversations, you hit the ceiling fast.

One model only. No switching between fast/cheap and slow/capable variants. You get what Apple shipped.

No embeddings. Vector search, semantic similarity, RAG pipelines - none of it. You’ll need a separate solution.

JSON formatting quirk. Sometimes wraps response_format: json_object outputs in markdown code blocks. Your parser needs to handle this.
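A small defensive parser covers the quirk: strip the fence if present, then parse. A sketch of one way to handle it:

```python
import json

def parse_model_json(raw: str) -> dict:
    """Parse a JSON response, tolerating markdown code-block wrapping."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional "json" tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return json.loads(text)

wrapped = "```json\n{\"sentiment\": \"positive\"}\n```"
data = parse_model_json(wrapped)
```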


Cloud vs. On-Device

The comparison everyone wants to make:

| Factor | Cloud (OpenAI/Anthropic/DeepSeek) | apfel (On-Device) |
| --- | --- | --- |
| Cost | Per-token | Free |
| Latency | <1s | 7-30s |
| Privacy | Data leaves device | 100% local |
| Context | 128K+ tokens | 4K tokens |
| Logprobs | Yes | No |
| Embeddings | Yes | No |
| Internet required | Yes | No |
| Model selection | Multiple | One |
| GDPR transfer risk | Yes (foreign providers) | None |

For EU companies operating under GDPR, that last row matters. No third-country transfer. No CLOUD Act exposure. No DPA negotiations. The data never leaves your machine because there’s nowhere for it to go.


API Compatibility

apfel promises OpenAI API compatibility. Here’s what actually works:

| Parameter | Status |
| --- | --- |
| temperature | ✓ Works (0.0 = deterministic) |
| max_tokens | ✓ Works |
| seed | ✓ Reproducible outputs |
| stream | ✓ Works |
| tools / tool_choice | ✓ Full function calling |
| response_format: json_object | ✓ Works |
| x_context_strategy | ✓ apfel extension (5 strategies) |
| x_context_max_turns | ✓ apfel extension |
| x_context_output_reserve | ✓ apfel extension |
| logprobs, n, stop, presence_penalty, frequency_penalty | ✗ 400 error |

Streaming quirk: The final chunk returns empty choices[] with usage statistics only. Client code needs to handle this or it crashes on the last packet.
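Handling it is one guard clause: skip any chunk whose `choices` list is empty, optionally capturing the usage stats it carries. A sketch with plain dicts mirroring the streaming wire format:

```python
# Collect streamed content while tolerating the final usage-only chunk,
# whose `choices` list is empty. Chunks are plain dicts mirroring the
# OpenAI streaming wire format.
def collect_stream(chunks):
    parts = []
    usage = None
    for chunk in chunks:
        if not chunk.get("choices"):          # final chunk: usage only
            usage = chunk.get("usage")
            continue
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts), usage

fake_chunks = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
    {"choices": [], "usage": {"total_tokens": 12}},
]
text, usage = collect_stream(fake_chunks)
```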


Architecture Notes

apfel is built in Swift 6 with strict concurrency checking. HTTP layer uses Hummingbird, bound to localhost only. Test suite: 203 unit tests plus 174 integration tests.

The context manager offers five strategies: newest-first, oldest-first, sliding-window, summarize, and strict. The summarize option compresses old turns via the on-device model itself - useful for long conversations within the 4K limit.
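To make the sliding-window idea concrete, here is a rough Python illustration that keeps the newest turns within a token budget. apfel's real strategies run in Swift and may count tokens differently; the ~4-chars-per-token estimate is an assumption:

```python
def sliding_window(messages, budget_tokens=3584):
    """Keep the newest messages whose estimated tokens fit the budget.

    Illustrative only: not apfel's actual Swift implementation.
    ~4 characters per token is a rough heuristic.
    """
    kept = []
    used = 0
    for msg in reversed(messages):            # walk newest-first
        cost = max(1, len(msg["content"]) // 4)
        if used + cost > budget_tokens:
            break                             # window is full
        kept.append(msg)
        used += cost
    return list(reversed(kept))               # restore chronological order

history = [
    {"role": "user", "content": "x" * 8000},           # old, oversized turn
    {"role": "assistant", "content": "short answer"},
    {"role": "user", "content": "latest question"},
]
trimmed = sliding_window(history, budget_tokens=100)
```

The oversized old turn is dropped; the two recent turns survive in order.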


Where It Fits

Use it for:

  - Entity extraction, classification, and structured-data extraction on sensitive inputs
  - Workflows where data must never leave the machine - GDPR-bound, compliance-heavy, or offline
  - Local function-calling automation over private data

Acceptable for:

  - Summarization, translation, and simple code generation, if 7-10 second latencies are tolerable
  - Prototyping OpenAI-compatible clients without API costs

Don't use it for:

  - Latency-sensitive or interactive applications
  - Long documents or conversations beyond the 4K-token ceiling
  - Anything that needs embeddings, logprobs, or a choice of models


Bottom Line

apfel isn’t competing with GPT-4o or Claude. It’s offering something different: local inference for the privacy-conscious, the compliance-bound, and the network-restricted.

The tradeoffs are real. You sacrifice speed, context length, and advanced features for privacy and control. For some workflows - particularly entity extraction, classification, and anything touching sensitive data - the tradeoff makes sense. For most, cloud APIs remain the pragmatic choice.

But apfel adds something important: a working example of what on-device AI looks like when you strip away the marketing and measure it honestly. And if you’ve been following the GDPR and CLOUD Act developments, “the data never leaves your machine” is becoming less of a nice-to-have and more of a legal requirement.

Three questions before you commit:

  1. Can your workflow handle 7-30 second response times?
  2. Will inputs plus outputs stay under 4K tokens?
  3. Do you need logprobs or embeddings?

Yes to the first, yes to the second, no to the third - give apfel a try. Otherwise, you know the answer.


TL;DR

  - Free, fully local, OpenAI-compatible server for Apple's on-device LLM
  - Fast and accurate on extraction and classification; 7-30 seconds on complex tasks
  - Hard limits: 4K context, one model, no logprobs, no embeddings
  - Best fit: privacy-sensitive and compliance-bound workflows; cloud APIs remain faster

Tested on apfel v0.9.4, macOS 26+, M1 MacBook Air with 8 GB RAM - the entry-level Apple Silicon machine. If anything, these numbers are the floor; beefier hardware will do better. Measurements are averages across multiple runs. This article was brainstormed in collaboration with DeepSeek, Kimi K2.5 and Claude. The benchmarks, opinions, and 30-second wait times are entirely real.

#En #Ai #Privacy #Apple #Local-Llm