Apple Bypasses On-Device AI Memory Limits with New Architecture
Apple has introduced a groundbreaking architecture at WWDC26 for on-device AI, overcoming the long-standing DRAM memory limit. Its new AFM 3 Core Advanced model stores 20 billion parameters in NAND flash, using a unique Instruction-Following Pruning (IFP) method to dynamically load expert modules into DRAM. This innovation significantly boosts local AI capabilities for agentic workloads.

Apple has unveiled a significant architectural breakthrough for on-device artificial intelligence, addressing a persistent challenge that has constrained the capabilities of local AI agents. Announced at WWDC26, the company's third-generation Apple Foundation Models (AFM 3) introduce a novel approach that stores large model weights in NAND flash memory instead of the much more limited DRAM, effectively enabling a 20-billion-parameter model to run directly on consumer devices. This innovation, particularly with the AFM 3 Core Advanced model, bypasses the traditional memory wall that has forced enterprises to choose between cloud-dependent, powerful models and less capable local alternatives.
The core problem for on-device AI has been the necessity for an entire model's weight set to reside in DRAM, severely capping parameter counts. This limitation has historically kept practical on-device models significantly smaller than their server-side counterparts. Apple's new architecture, developed with its research team, redefines the storage paradigm, treating flash memory as the model's permanent home and DRAM as a dynamic buffer for active "experts" needed for a specific task.
Redefining On-Device AI Architecture
The AFM 3 family, a collaborative effort with Google, includes both on-device and server-based models operating within Apple's Private Cloud Compute boundary. While server-side models like AFM 3 Cloud Pro leverage Nvidia GPUs in Google Cloud for complex agentic tasks, Apple’s on-device architecture is distinct. The AFM 3 Core Advanced model, boasting 20 billion parameters, fundamentally alters how these parameters are managed on a device.
Awni Hannun, a researcher at Anthropic and former Apple scientist, highlighted the exotic nature of this approach, noting that a 20-billion-parameter model cannot fit into device RAM at reasonable precision. Apple’s solution involves a smaller model predicting which expert modules to load from NAND into RAM based on the user's query or prompt. This prediction-and-load mechanism is meticulously designed around the hardware limitations of consumer silicon.
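A rough back-of-the-envelope calculation shows why the full model cannot sit in device RAM while the active subset can. The sketch below counts weight storage only and uses illustrative precisions of 16, 8, and 4 bits per weight. These precisions are assumptions, since Apple's actual weight format is not given in the article. The 20-billion total and the 1-to-4-billion active range are the article's figures.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only,
    ignoring activations, KV cache, and runtime overhead)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 20e9        # full parameter pool kept in NAND flash (from the article)
ACTIVE_RANGE = (1e9, 4e9)  # active parameters per request (from the article)

# Illustrative precisions only; not Apple's disclosed quantization format.
for bits in (16, 8, 4):
    full = weight_footprint_gb(TOTAL_PARAMS, bits)
    low = weight_footprint_gb(ACTIVE_RANGE[0], bits)
    high = weight_footprint_gb(ACTIVE_RANGE[1], bits)
    print(f"{bits:>2}-bit: full model {full:.1f} GB, active set {low:.1f}-{high:.1f} GB")
```

Under these assumptions the full weight set needs tens of gigabytes at 16-bit precision, while the active set stays within a few gigabytes, which is the gap the flash-resident design is meant to bridge.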
How Apple's IFP Architecture Works
Apple’s Instruction-Following Pruning (IFP) architecture hinges on three critical components to circumvent memory constraints:
Firstly, the complete 20-billion-parameter weight set is stored in NAND flash memory, not DRAM. Unlike standard on-device deployments requiring models to fit entirely within DRAM, Apple’s method uses flash as the primary repository. DRAM then serves as a working buffer, temporarily holding only the specific expert modules required by a given prompt.
Secondly, expert routing occurs once per prompt rather than token by token. Conventional Mixture of Experts (MoE) models typically select different experts for each token generated. However, the bandwidth between NAND flash and DRAM is too low to sustain such continuous weight transfers at inference speeds. AFM 3 Core Advanced addresses this by making a single routing decision at prompt time and loading a fixed set of experts into DRAM for the entire generation. Every token for that query is therefore produced with the same expert configuration.
Finally, the architecture dynamically adjusts the active parameter count based on task complexity. Instead of utilizing a fixed model size for every request, AFM 3 Core Advanced can activate anywhere from 1 billion parameters for simpler operations up to 4 billion for more demanding tasks. All these parameters are drawn from the larger 20-billion-parameter pool residing in flash memory.
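The three components above can be sketched as a toy simulation. This is not Apple's implementation. The `Expert` type, the relevance scores, the 0.5-billion expert size, the greedy selection rule, and the linear complexity-to-budget mapping are all hypothetical choices made for illustration. Only the 20-billion pool, the 1-to-4-billion active range, and the once-per-prompt routing come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str
    params_b: float  # size in billions of parameters (hypothetical)
    score: float     # router's relevance score for this prompt (hypothetical)


def budget_for(complexity: float) -> float:
    """Map a complexity estimate in [0, 1] to a 1B-4B active-parameter budget."""
    clamped = min(max(complexity, 0.0), 1.0)
    return 1.0 + 3.0 * clamped


def select_experts(experts, budget_b: float):
    """Greedily keep the highest-scoring experts that fit the budget."""
    chosen, used = [], 0.0
    for expert in sorted(experts, key=lambda e: e.score, reverse=True):
        if used + expert.params_b <= budget_b:
            chosen.append(expert)
            used += expert.params_b
    return chosen


def flash_loads(num_experts: int, num_tokens: int, per_token: bool) -> int:
    """Count expert transfers from flash to DRAM.

    per_token=True models the worst case where every token swaps in a
    fresh expert set; per_token=False models one load at prompt time.
    """
    return num_experts * (num_tokens if per_token else 1)


# 40 hypothetical experts of 0.5B parameters each: a 20B pool "in flash".
pool = [Expert(f"expert_{i}", 0.5, ((i * 7) % 40) / 40) for i in range(40)]

simple = select_experts(pool, budget_for(0.0))  # 1B budget
hard = select_experts(pool, budget_for(1.0))    # 4B budget
print(len(simple), "experts loaded for a simple prompt")
print(len(hard), "experts loaded for a demanding prompt")
print("loads, routed once per prompt:", flash_loads(len(hard), 200, per_token=False))
print("loads, routed per token (worst case):", flash_loads(len(hard), 200, per_token=True))
```

The load counts make the design trade-off visible: routing once per prompt keeps flash traffic constant regardless of output length, whereas per-token routing can scale it with every generated token.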
Unanswered Questions and Enterprise Implications
While Apple's research team has detailed the memory design and sparse activation mechanism, certain practical deployment constraints remain undisclosed. Marco Abis, creator of the local AI profiler Ziraph, pointed out the absence of key metrics like energy consumption, memory bandwidth, or thermal performance in Apple’s documentation. These factors are crucial for determining the viability of on-device performance at scale.
Furthermore, Apple has not publicly specified when an on-device request might transparently offload to its Private Cloud Compute or whether this routing is visible to developers or end-users. This lack of clarity poses a direct compliance challenge for regulated enterprises that must meticulously document where their AI inference processes occur.
For enterprise architects evaluating agentic AI deployments, Apple's new architecture presents significant shifts:
The DRAM barrier for on-device agents has been substantially lowered, offering a 20-billion-parameter local option previously unavailable. The primary constraint now shifts from inherent model capability to the specific hardware on the device.
The private/cloud boundary is now a conscious architectural decision, not a default. Simple requests can stay on-device, while complex agentic tasks can route to AFM 3 Cloud Pro. However, the transparency of this routing mechanism is still a critical missing piece for policy decisions.
Lastly, the server-side agentic tier, AFM 3 Cloud Pro, remains dependent on Google Cloud, even with Apple's Private Cloud Compute ensuring data privacy.
AFM 3 Core Advanced marks a significant leap, providing enterprises with a powerful on-device AI option. How widely it can be deployed, however, depends on details still to come: Apple has indicated that a full technical report with benchmarks is expected later this summer.
FAQ
Q: What is the primary problem Apple's new architecture solves for on-device AI?
A: Apple's new architecture, particularly with the AFM 3 Core Advanced model, solves the problem of on-device AI models being limited by DRAM capacity. Previously, the entire model's weight set had to fit into DRAM, restricting parameter counts and model complexity. Apple now stores the 20-billion-parameter model's weights in NAND flash memory, enabling larger, more capable AI to run locally.
Q: How does Apple's architecture manage to use NAND flash instead of DRAM for model weights?
A: The architecture, called Instruction-Following Pruning (IFP), stores the full 20-billion-parameter model in NAND flash. It then uses a small model to predict which specific "expert" modules from that larger pool are needed for a given prompt. These selected experts are loaded into DRAM for processing, making the overall system efficient despite NAND-to-DRAM bandwidth limitations by routing only once per prompt.
Q: What are the key implications for enterprises considering Apple's new on-device AI?
A: Enterprises now have access to a 20-billion-parameter local AI option, shifting the constraint from model capability to device hardware. The decision between on-device and cloud processing becomes an explicit architectural choice. However, details regarding energy consumption, offload transparency to the cloud, and developer visibility into routing decisions are still pending, which can impact compliance and operational planning.