Nolwen Brosson · 5 min read

Local-first: AI leaves the cloud and runs on your PC thanks to NPUs

For years, “doing AI” basically meant “sending data to a server” (Microsoft, Google, OpenAI, and others). It made sense: models were huge, GPUs were expensive, and latency was acceptable.

But that’s changing. Hardware has caught up with software, and not only through GPUs. NPUs (Neural Processing Units) are becoming common in PCs and smartphones, designed to run neural networks with better energy efficiency. For example, Microsoft is pushing a new category of machines, Copilot+ PCs, with explicit hardware requirements on the NPU side. The result: local-first AI that runs on your device, and only “goes out” to the cloud when necessary.

Local-first AI: what actually changes

“Local-first” isn’t just “offline.” It’s a product philosophy: your data lives on the device first, and sync or cloud processing becomes a bonus, not a prerequisite. With AI, the stakes are even more concrete: the prompt, documents, your screen, your voice… everything you feed an assistant can be sensitive.

Why it’s becoming possible now

1) NPUs are becoming a standard

PC and chip makers are converging on the NPU, a dedicated AI accelerator that sits alongside the CPU and GPU. Microsoft lists clear minimum requirements for Copilot+ PCs:

  • 40+ TOPS NPU (40 TOPS = roughly 40 trillion operations per second)
  • 16 GB of RAM
  • 256 GB of storage
  • and compatible chip families (Snapdragon X, AMD Ryzen AI, Intel Core Ultra 200V).

Apple, on its side, is highlighting increasingly strong Neural Engines (e.g., the M4 announced at 38 TOPS).

2) “Small” models are getting good

Small Language Models (SLMs) are no longer toys. For example, Microsoft’s Phi-3-mini (3.8B parameters) is designed to run locally, with reported benchmark scores close to those of much larger models. It can even run on a phone.

Mistral also released “edge/on-device” models (Ministral 3B and 8B) explicitly positioned for this kind of execution.

3) Vendors are pushing “on-device by default”

Microsoft highlights features that run locally on Copilot+ PCs (e.g., Live Captions, super resolution in the Photos app). Apple, with Apple Intelligence, takes an on-device-by-default approach and switches to its cloud infrastructure only for heavy requests. Google, for its part, has announced a similar approach called “Private AI Compute.”

Are we heading toward the end of the SaaS subscription model for AI?

Probably not. But the “all subscription, all cloud” model is going to lose ground.

What is likely to move to local

  • Simple writing assistance, rewriting, summarizing personal notes
  • Local search across your files & emails (see the sketch after this list)
  • Personal automations (meeting summaries, action extraction) when data stays on-device
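
To make the local-search item concrete, here is a minimal sketch of semantic search over a few personal notes. It assumes an Ollama instance running locally with an embedding model already pulled; the endpoint and the model name nomic-embed-text are assumptions to adapt to your own setup, and nothing leaves the machine.

    # Minimal local semantic search over personal notes (illustrative sketch).
    # Assumes a local Ollama server and an embedding model already pulled.
    import numpy as np
    import requests

    OLLAMA_EMBED_URL = "http://localhost:11434/api/embeddings"  # localhost only
    MODEL = "nomic-embed-text"  # example local embedding model

    def embed(text: str) -> np.ndarray:
        resp = requests.post(OLLAMA_EMBED_URL, json={"model": MODEL, "prompt": text})
        resp.raise_for_status()
        return np.asarray(resp.json()["embedding"], dtype=np.float32)

    notes = [
        "Meeting with ACME on Tuesday, follow up on the Q3 budget.",
        "Ideas for a blog post about NPUs and local inference.",
        "Flight to Berlin booked, hotel still to confirm.",
    ]
    index = np.vstack([embed(n) for n in notes])            # tiny in-memory index
    index /= np.linalg.norm(index, axis=1, keepdims=True)   # normalize for cosine similarity

    query = embed("what do I still need to book for the trip?")
    query /= np.linalg.norm(query)
    print(notes[int(np.argmax(index @ query))])             # expected: the Berlin note

The same pattern scales to files and emails: embed them once, store the index locally, and query it without any network call beyond localhost.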

What will remain SaaS or hybrid for a while

  • Tasks that require a lot of context, external tools, or “premium” reliability
  • Anything that depends on always up-to-date data (news, web, pricing, …)

The most likely scenario is a hybrid model: local for 60–80% of everyday actions, and cloud for “big requests” (complex reasoning, huge contexts, agent orchestration).

Apple formalizes exactly this logic with Apple Intelligence: on-device when possible, switching to Private Cloud Compute only for heavier requests.
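
As a rough illustration of that routing logic, here is a minimal sketch; the task list, the token threshold, and the two stubbed run_local / run_cloud functions are assumptions, not any vendor’s actual implementation.

    # Hybrid routing sketch: local by default, cloud only for "big requests".
    # Thresholds, task names, and the two backends are illustrative assumptions.

    LOCAL_CONTEXT_LIMIT = 8_000   # rough token budget we trust the local SLM with
    SIMPLE_TASKS = {"rewrite", "summarize", "classify", "extract"}

    def run_local(prompt: str) -> str:
        # e.g., a call to a local runtime such as Ollama; stubbed here
        return f"[local model] {prompt[:40]}..."

    def run_cloud(prompt: str) -> str:
        # e.g., a call to a hosted API; stubbed here
        return f"[cloud model] {prompt[:40]}..."

    def answer(task: str, prompt: str, estimated_tokens: int) -> str:
        """Route everyday actions locally; fall back to the cloud for heavy ones."""
        if task in SIMPLE_TASKS and estimated_tokens <= LOCAL_CONTEXT_LIMIT:
            return run_local(prompt)
        return run_cloud(prompt)  # complex reasoning, huge contexts, agent orchestration

    print(answer("summarize", "Notes from today's meeting...", estimated_tokens=900))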

“Privacy by design”: local-first as a competitive advantage

Sending data to a server often means:

  • expanding the attack surface
  • making compliance more complex
  • creating user distrust (especially in B2B, healthcare, finance, legal)

By contrast, local-first naturally aligns with data minimization: what never leaves the device is far easier to protect.

Concretely, “privacy by design” becomes simpler

  • Prompts and documents never leave the device
  • Fewer contracts/subprocessors to manage
  • The ability to offer a true offline mode (useful while traveling, in industrial sites, etc.)

A quick warning: “local” doesn’t mean “magically secure.” You still need encryption, access control, and careful thinking about what’s stored (e.g., indexes, embeddings, histories). But reducing outbound data reduces risk.
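
For instance, the artifacts an assistant accumulates locally (embedding indexes, chat histories) can at least be encrypted at rest. A minimal sketch with the cryptography package; the file names are illustrative, and in a real product the key would live in the OS keychain rather than on disk.

    # Encrypting a local history/index file at rest (file names are illustrative).
    from pathlib import Path
    from cryptography.fernet import Fernet

    key_path = Path("assistant.key")
    if not key_path.exists():
        key_path.write_bytes(Fernet.generate_key())  # better: store in the OS keychain
    fernet = Fernet(key_path.read_bytes())

    history = b'{"role": "user", "content": "draft an email to the bank"}'
    Path("history.enc").write_bytes(fernet.encrypt(history))

    # Later, when the assistant needs the history again:
    restored = fernet.decrypt(Path("history.enc").read_bytes())
    assert restored == history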

Local SLMs vs cloud giants: a real performance comparison

When we say “performance,” we should talk about four things: quality, latency, cost, and product constraints.

1) Output quality

  • Cloud (large models): higher average level, better long-form reasoning, stronger robustness across domains.
  • Local SLMs: very good on targeted tasks (writing, extraction, classification, Q&A within a defined scope), but they can fall short on ambiguous or open-ended cases.

2) Latency and user experience

  • Cloud: variable latency (network + load), but can still be fast on optimized infrastructure.
  • Local: instant start, no network round-trip, smoother for micro-actions (rewrite, summarize, fix).

On mobile, Google publishes high tokens-per-second figures for Gemini Nano (varying by version and device), which hints at what on-device inference can deliver.

3) Cost

  • Cloud: pay-per-use (API) or subscription. At scale, costs can blow up… or become unpredictable.
  • Local: mainly hardware + energy. And most importantly, the marginal cost per request trends toward zero.
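
A back-of-envelope comparison makes the point; every number below is a placeholder, not a real quote.

    # Back-of-envelope cost comparison (all numbers are illustrative placeholders).
    requests_per_day = 200
    tokens_per_request = 1_500
    cloud_price_per_million_tokens = 2.00  # hypothetical API price, USD

    cloud_per_year = (requests_per_day * 365 * tokens_per_request / 1_000_000
                      * cloud_price_per_million_tokens)

    npu_hardware_premium = 300.0  # hypothetical extra cost of an AI-capable machine
    energy_per_year = 15.0        # hypothetical marginal electricity cost, USD

    print(f"cloud: ~${cloud_per_year:.0f}/year, recurring")
    print(f"local: ~${npu_hardware_premium + energy_per_year:.0f} the first year, "
          f"then ~${energy_per_year:.0f}/year")

With these placeholder numbers the cloud bill recurs every year while the local setup amortizes its hardware once; the crossover point obviously depends on your real volumes and prices.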

4) Constraints

  • Cloud = privacy/compliance constraints and vendor dependency.
  • Local = memory/energy/thermal constraints, so you often need quantization (compressing a model so it’s smaller and faster).
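
Quantization itself is conceptually simple: store the weights on fewer bits and keep a scale to map them back. Here is a toy sketch of symmetric 8-bit quantization of one weight tensor; real toolchains use per-channel scales, calibration, and smarter formats.

    # Toy symmetric int8 quantization of a single weight tensor.
    import numpy as np

    weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for a layer's weights

    scale = np.abs(weights).max() / 127.0                # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

    dequantized = q.astype(np.float32) * scale
    print(f"storage: {weights.nbytes} bytes -> {q.nbytes} bytes")      # 4x smaller
    print(f"max absolute error: {np.abs(weights - dequantized).max():.4f}")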

What this implies for products and businesses

The new “must-haves” for an AI model/product

  • An offline mode (even partial) becomes a selling point.
  • Transparency: “what stays on-device” vs “what goes to the cloud.”
  • A hybrid-ready UX: automatic fallback, user control, predictable costs.

A different stack for local AI vs cloud AI

When AI runs locally, you’re executing a model on the user’s hardware. That requires runtimes that can leverage the available hardware (CPU, GPU, and especially the NPU) and handle constraints like memory limits or device heat. That’s the role of building blocks like Core ML on Apple’s side, or Windows ML and ONNX Runtime with their hardware-specific execution providers on Microsoft’s side: standardize on-device inference with real optimizations so it stays smooth day to day.
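
To give an idea of what “leverage the available hardware” looks like, here is a sketch using ONNX Runtime execution providers; the model file and the preference order are assumptions, and which NPU provider exists depends on the machine.

    # Picking a hardware backend with ONNX Runtime execution providers.
    # "model.onnx" and the preference order are illustrative.
    import onnxruntime as ort

    preferred = [
        "QNNExecutionProvider",  # Qualcomm NPU (e.g., Snapdragon X), if present
        "DmlExecutionProvider",  # DirectML (GPU) on Windows
        "CPUExecutionProvider",  # always-available fallback
    ]
    available = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("model.onnx", providers=available)
    print(session.get_providers())  # the backend(s) actually selected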

And on the usage side, tools like Ollama make local far more accessible: what used to mean wrangling weights, runtimes, and drivers is now a single command to pull a model and a local HTTP API to call it.
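
For example, once a model has been pulled with the Ollama CLI, a local completion is one HTTP request to localhost; the model name below is just an example of something you might have installed.

    # One local completion via Ollama's HTTP API (everything stays on localhost).
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "phi3:mini",  # example; use whatever model you pulled locally
            "prompt": "Summarize in one sentence: local-first AI keeps data on the device.",
            "stream": False,       # return one JSON object instead of a token stream
        },
    )
    print(resp.json()["response"])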

Conclusion

The real change isn’t “we won’t use the cloud anymore.” It’s that, more and more, the cloud becomes the exception, not the rule.

  • NPUs are making local inference mainstream.
  • SLMs are becoming good enough for a large share of use cases.
  • Privacy is becoming a major product criterion, and local-first naturally aligns with “privacy by design.”
