ยท4 min read

Edge AI in 2024: Why On-Device Inference Changes Everything

edge-ai · opinion

Back in January 2020, I wrote a post called "Why Edge ML Is the Future." I was at Myelin, squeezing models onto phones and Raspberry Pis, arguing that the best model isn't the one with the highest benchmark score but the one that actually runs where your users are.

Four years later, I just finished a co-op at Honeywell building computer vision systems. And the edge AI landscape is almost unrecognizable from where it was when I started.

The edge AI deployment pipeline: from cloud training to on-device inference, with export and optimization in between.

What Actually Changed

NPUs are now standard hardware. Apple's Neural Engine has shipped in every iPhone since the A11, and by 2024 it delivers 35 TOPS. Qualcomm's Hexagon NPU ships in every flagship Android. Google's Tensor chips run on-device LLMs. Samsung has its own NPU. The dedicated neural processing silicon I was dreaming about in 2020 is now a checkbox feature on spec sheets.

Small models got dramatically better. This is the real story. Distillation techniques, advanced quantization (GPTQ, AWQ -- a huge leap from the INT8 quantization workflows I was using at Myelin), and architectural innovations like depth-wise separable convolutions have pushed the quality ceiling for small models way up. A 4-bit quantized model in 2024 can match what a full-precision model did in 2020. That's not incremental progress. That's a paradigm shift in what's possible on constrained hardware.
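To make the quantization part concrete, here's a minimal symmetric per-tensor INT8 round-trip in NumPy. This is an illustrative sketch only: real toolchains (and 4-bit methods like GPTQ and AWQ) use calibration data, per-channel or per-group scales, and error-compensating updates, but the core idea of trading precision for footprint is the same.

```python
import numpy as np

np.random.seed(0)

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

# A toy weight matrix stands in for a real layer.
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Storage drops 4x (float32 -> int8) while reconstruction error stays small.
rel_err = float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))
print(f"relative error: {rel_err:.4f}")
```

Going from INT8 to 4-bit halves the footprint again, which is why the per-group scaling tricks in GPTQ/AWQ matter: a single per-tensor scale, as above, loses too much at 4 bits.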

On-device LLMs are real. Google's Gemini Nano runs on Pixel phones. Apple Intelligence processes requests locally. Meta released Llama models small enough for mobile. The idea of running a language model on a phone would have sounded absurd in 2020. Now it's a product feature.

What Hasn't Changed

Runtime fragmentation is still a mess. CoreML for Apple, NNAPI for Android, TensorRT for NVIDIA, OpenVINO for Intel, ONNX Runtime trying to bridge everything. I did a detailed comparison of TFLite vs ONNX Runtime that captures the practical trade-offs. If you want to deploy one model across platforms, you're still maintaining multiple conversion pipelines and dealing with operator compatibility issues. This was my biggest complaint in 2020 and it's still my biggest complaint.
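The operator-compatibility problem is easy to sketch: before targeting a runtime, you diff the ops in your model's graph against what that runtime supports, and anything missing forces a rewrite, a fallback, or a different conversion path. The op sets below are hypothetical stand-ins, not real coverage tables for these runtimes.

```python
# Hypothetical per-runtime operator coverage (illustrative only).
RUNTIME_OPS = {
    "coreml":      {"Conv", "Relu", "MatMul", "Softmax"},
    "nnapi":       {"Conv", "Relu", "MatMul"},
    "onnxruntime": {"Conv", "Relu", "MatMul", "Softmax", "GridSample"},
}

def unsupported_ops(model_ops: set, runtime: str) -> set:
    """Ops the target runtime can't execute natively (would need a fallback)."""
    return model_ops - RUNTIME_OPS[runtime]

model_ops = {"Conv", "Relu", "Softmax", "GridSample"}
for rt in RUNTIME_OPS:
    missing = unsupported_ops(model_ops, rt)
    print(f"{rt:12s} {'ok' if not missing else 'missing ' + str(sorted(missing))}")
```

One model, three answers: that's why a single architecture still means maintaining multiple conversion pipelines.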

Profiling and debugging on-device is still painful. You can profile a cloud model with a few clicks. Profiling inference on an actual edge device? You're dealing with thermal throttling, memory pressure, OS-level scheduling, and tools that are either proprietary or half-baked. The developer experience gap between cloud and edge ML is still wide.
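Even without good vendor tools, the basic harness you end up writing looks the same everywhere: warm up (to stabilize caches and frequency governors), then record per-call latency and report percentiles, because thermal throttling shows up in the tail, not the median. A sketch, with a stand-in for the real model call:

```python
import time
import statistics

def benchmark(run_inference, warmup: int = 10, iters: int = 100) -> dict:
    """Warm up, then collect per-call latencies and report tail percentiles."""
    for _ in range(warmup):
        run_inference()
    latencies_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    latencies_ms.sort()
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[int(0.95 * len(latencies_ms))],
        "max_ms": latencies_ms[-1],
    }

# Stand-in workload for a real inference call (e.g. a TFLite interpreter invoke).
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

On a throttling device, p50 can look fine while p95 and max drift upward over a long run -- which is exactly the behavior cloud profilers never show you.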

The Longitudinal View

Having worked on edge AI since 2020, first at Myelin, then through my MS research, then at Honeywell, I've watched this space go from "interesting niche" to "default deployment strategy" for a growing class of applications. The trajectory is clear.

At Honeywell, the CV systems I worked on had to run in environments where cloud connectivity was unreliable. That's not a theoretical constraint. It's a factory floor reality. Edge inference wasn't a nice-to-have. It was the only option.

Ambient Intelligence

Here's the bigger picture. We're entering the era of ambient intelligence, where AI processing happens continuously on the devices around you. Your phone understands your photos without uploading them. Your car processes sensor data locally. Your smart home makes decisions without phoning a server.

This isn't science fiction. Every piece of the stack is in place: capable hardware, efficient models, mature frameworks. The bottleneck now is software integration and developer tooling, not fundamental capability.

I wrote in 2020 that "where does inference happen?" would become the first question every ML team asks. It took longer than I expected, but we're finally there.