Blog
Benchmarking TurboQuant+ KV Cache Compression on Apple Silicon
I tested TurboQuant+ KV cache compression across 1.5B, 7B, and 14B models on an M4 MacBook Air. The speed gains are real, but there are sharp cliffs you need to know about.
2026
40 posts
Claude Code Isn't a Code Editor. It's a New Way to Use a Computer.
After a month of writing about Claude Code, here's the thing I keep coming back to: this isn't a developer tool. It's a new interface for computing.
Permissions, Security, and Trusting an AI with Your Codebase
Claude Code can edit files, run commands, and push to GitHub. The permission model determines what it can do and when. Here's how I think about trusting an AI agent with my code.
What 400+ Sessions Taught Me About Working with Claude Code
After hundreds of Claude Code sessions across personal projects and production codebases, here are the lessons that took the longest to learn.
Custom Commands and Slash Commands: Building Your Own Claude Code CLI
Slash commands turn Claude Code into a personalized CLI. A markdown file becomes a reusable workflow you invoke with a single slash. Here's how to build them.
Subagents and Parallel Execution: Making Claude Code 5x Faster
Claude Code can spawn autonomous worker agents that run in parallel. Here's how subagents work, when to use them, and why they make complex tasks dramatically faster.
Shipping a Feature in 45 Minutes: My Claude Code Workflow End to End
From memory recall to brainstorm to plan to execution to review to commit. Here's every step of building a real feature with Claude Code, with the actual workflow that makes it fast.
Hooks, Statuslines, and the Automation Layer Nobody Talks About
Hooks let you run shell commands when Claude Code starts, stops, or uses tools. Combined with a custom statusline, they turn Claude Code into a self-monitoring, self-correcting system.
MCP Servers Are the Glue Between Claude Code and the Real World
Model Context Protocol turns Claude Code from a code editor into something that can read Slack, control browsers, and talk to any service. Here's how it works in practice.
NotebookLM from the Terminal: Querying Your Docs with Claude Code
A Claude Code skill that queries Google NotebookLM notebooks directly from the terminal. Source-grounded answers from Gemini, with citations, without opening a browser.
I Track Calories and Plan Groceries from My Terminal
Claude Code isn't just for writing software. I built skills that track nutrition and automate grocery shopping at Wegmans, all from the terminal.
Skills Are Just Markdown Files and That's What Makes Them Powerful
Claude Code skills have no SDK, no build step, no runtime. They're markdown files with instructions. That simplicity is exactly why they work.
The PR Review Toolkit: Five Agents Reviewing Your Code at Once
One command spawns five specialized review agents that check your PR for code quality, silent failures, type design, test coverage, and comment accuracy, all in parallel.
Ralph Loop: Running Claude Code Autonomously for Hours
The Ralph Wiggum technique turns Claude Code into an autonomous agent that keeps working until the job is done. Here's how it works, when to use it, and when it's a terrible idea.
Superpowers and GSD: How Two Plugins Gave Claude Code a Development Methodology
Without structure, Claude Code is a fast but chaotic coder. Superpowers and GSD impose a methodology (brainstorm, plan, execute, review) that makes the output dramatically better.
Claude-Mem Gave Claude Code a Memory and It Changed Everything
Every Claude Code session starts from zero. Claude-mem fixes that by capturing observations, compressing them with AI, and injecting relevant context into future sessions.
My Exact Claude Code Setup: Plugins, Skills, and Config
A full walkthrough of my .claude/ directory: 19 plugins, 23 skills, custom hooks, and the config that ties it all together. This is the setup I use every day.
Plan Mode, Context Windows, and Not Wasting Tokens
The 200k context window is your most valuable resource in Claude Code. Here's how to manage it, why plan mode is the most important habit, and what happens when you run out.
CLAUDE.md Is the Most Important File in Your Repo
A single markdown file determines whether Claude Code understands your project or fumbles through it. Here's how to write one that actually works.
Getting Started with Claude Code: Installation to First Real Output
The getting started guide I wish existed when I first installed Claude Code. From npm install to your first real feature, with the mental model that makes everything click.
A Month of Claude Code: Why I'm Writing This Series
I've been using Claude Code daily for months. This is the first of 20 posts breaking down everything I've learned, from setup to skills to running autonomous agents from my terminal.
Red/Green TDD with Coding Agents: Why Test-First Matters More
When AI writes your code, tests become the spec. Red/green TDD isn't just a practice anymore. It's the interface between intent and implementation.
LangGraph vs CrewAI vs AutoGen: Building the Same Pipeline Three Ways
Three agent frameworks, one task. I built a research-and-report pipeline in each to compare developer experience, flexibility, and production readiness.
LLM Guardrails in Practice: Input Validation to Output Filtering
A three-layer guardrail pipeline: validate inputs, constrain execution, filter outputs. Here's what each layer catches and how to build them.
Evaluating AI Agents: Beyond 'Does It Work?'
Only 52% of organizations run offline evals for their agents. Here's the multi-layered evaluation strategy that production teams actually use.
Function Calling Patterns for Production LLM Agents
Function calling connects LLMs to the real world. Here are the patterns that survive production: permission models, error handling, and human-in-the-loop checkpoints.
Self-Hosting Qdrant: From Docker Compose to Production
Qdrant gives you the fastest open-source vector search. Here's how to go from docker-compose up to production-ready deployment.
Pinecone vs Qdrant vs Weaviate: An Engineer's Decision Framework
Not another feature matrix. Here are three real deployment scenarios and which vector database fits each one.
Reranking: The 20-Line Fix for Bad RAG Results
If your RAG pipeline retrieves the wrong chunks, adding a cross-encoder reranker between retrieval and generation can fix it in 20 lines of code.
Hybrid Search RAG with Weaviate: Vectors + BM25
Pure vector search misses exact matches. Pure keyword search misses semantics. Hybrid search combines both, and Weaviate makes it native.
Chunking Strategies That Actually Matter for RAG
Your RAG pipeline is only as good as your chunks. Recursive, semantic, and late chunking each have trade-offs that most tutorials skip.
The LLM Inference Stack in 2026: From API Call to Response
The stack for serving LLMs has matured dramatically. Here's the full picture from API gateway to GPU, and where each layer is heading.
Context Engineering Is Not Prompt Engineering
Prompt engineering was the 2023 skill. Context engineering is the 2026 skill. The difference matters more than you think.
Structured Output That Actually Works: JSON Mode vs Function Calling
Getting reliable JSON from LLMs has been a pain point since GPT-3. Here's the current state of the art and what actually works in production.
Prompt Caching: How Anthropic and OpenAI Cut Costs by 90%
Prompt caching reuses pre-computed KV tensors for identical prompt prefixes. It's the easiest cost reduction you're not using yet.
Tracking Token Costs Before They Blow Up Your Bill
Output tokens cost 4-8x more than input tokens. If you're not tracking usage by query type and user segment, you're flying blind.
LangSmith vs Langfuse vs Braintrust: Picking Your LLM Observability Stack
Three platforms, three philosophies. Here's how to choose between LangSmith, Langfuse, and Braintrust for your LLM observability stack.
OpenTelemetry for LLM Apps: Tracing Prompts and Tokens
You wouldn't run a web service without tracing. LLM apps, especially those with guardrails pipelines and multi-step agent loops, shouldn't be different. Here's how OpenTelemetry's GenAI conventions make it work.
Building an LLM Gateway with LiteLLM
One API to call OpenAI, Anthropic, and self-hosted models. LiteLLM handles routing, fallbacks, and cost tracking so you don't have to.
Self-Hosting LLMs with Ollama: When It Makes Sense
Ollama makes running LLMs locally dead simple. But simple and production-ready are different things. Here's where it shines and where it doesn't.
vLLM PagedAttention: Why It's the Default for LLM Serving
vLLM's PagedAttention manages GPU memory like an OS manages virtual memory. Here's why it's become the standard for serving LLMs.
2025
10 posts
Vibe Coding Is Real but Not What You Think
Everyone's talking about vibe coding. After years of using AI to write code, here's what it actually is, what it isn't, and why understanding the code still matters.
From Hackathon to Production: What Changes When Prototypes Get Real
After years of hackathons and production systems, I've learned the gap between a winning demo and a reliable product is mostly about what you choose to worry about.
How We Won 1st Place at the MIT LLM Hackathon
Building Catalyze, a multi-agent system for chemistry research, and winning first place at MIT.
MCP Is the USB of AI: Why Model Context Protocol Matters
Anthropic's Model Context Protocol is doing for AI integrations what USB did for hardware. If you're building agents, this changes everything.
DeepSeek Shocked Everyone: What Open-Source AI Means Now
A Chinese lab just matched GPT-4 performance with open weights at a fraction of the cost. The implications go way beyond model benchmarks.
The FastAPI + vLLM + Docker Stack for Serving LLMs
The production stack for self-hosted LLM serving is maturing fast. Here's the architecture I've landed on after putting models into production at BulkMagic.
Sub-200ms Voice AI: The Engineering Behind Real-Time Agents
A technical deep-dive into achieving sub-200ms response times in voice AI. Where the latency budget goes and how to claw back every millisecond.
Voice AI Architecture: Building Conversational Agents at Scale
The full architecture behind voice AI systems. Pipeline design, latency budgets, and why voice is a fundamentally different engineering challenge than chat.
Multi-Agent Systems in Production: What Nobody Tells You
Lessons from building multi-agent systems that actually run in production. What works, what doesn't, and what the hype skips over.
Building an LLM Microservice with FastAPI and Llama 3.2 on AWS ECS
How I built a production LLM microservice for product summarization at BulkMagic. FastAPI, Llama 3.2, Docker, and AWS ECS.
2024
10 posts
Multimodal Models Are the New Default: GPT-4V, Gemini, and Beyond
In 2024, the best AI models understand text, images, audio, and video natively. As someone with a CV background, this convergence feels like a turning point.
The EU AI Act Is Here: What Developers Need to Know
The EU AI Act was finalized this year. As an engineer who builds CV and AI systems, here's my practical take on what it actually means for us.
OpenAI o1 and Reasoning Models: A New Paradigm?
OpenAI o1 doesn't just generate text. It thinks first. That distinction might be more important than any scaling breakthrough since GPT-3.
What an MS in CS Taught Me About the Gap Between Research and Production
With my MS at Northeastern nearly done, here's what I actually learned about the space between reading papers and shipping models.
Edge AI in 2024: Why On-Device Inference Changes Everything
Four years after I called edge ML the future, on-device inference is finally mainstream. Here's what changed, what didn't, and where we're headed.
YOLOv9 vs RT-DETR: The Transformer Takeover in Object Detection
YOLO's anchor-based speed against DETR's end-to-end elegance. As someone deploying detection models in production, here's how I see the landscape.
From PyTorch to Production: The Optimization Pipeline Nobody Talks About
Research papers stop at accuracy metrics. Production starts at deployment constraints. Here's the pipeline that bridges the gap.
TFLite vs ONNX Runtime: A Practical Edge AI Comparison
I deploy models with both TFLite and ONNX Runtime. Here's an honest comparison from someone who deals with the rough edges daily.
Transformers for Image Enhancement: Beyond Classification
Vision Transformers aren't just for classification anymore. They're rewriting the rules for low-level vision tasks like enhancement and restoration.
Pushing Object Detection to 98.6% mAP: Lessons from Production CV
The last 2% of accuracy is where 80% of the engineering effort goes. Here's what that actually looks like.
2023
10 posts
What Teaching 200 Students Taught Me About Explaining Complex Ideas
A semester as a TA for Intro to Data Science changed how I think about communication, patience, and what it really means to understand something.
The Academic Integrity Crisis Nobody Knows How to Solve
As a TA grading 200+ students, I've seen the full spectrum of how ChatGPT is reshaping academic honesty. The problem isn't cheating. It's that we're testing the wrong things.
Vector Databases Explained: Pinecone, Chroma, and Beyond
Vector databases are becoming as fundamental as relational databases. Here's what they are, how they work, and which one to pick for your project.
LangChain from Scratch: Building Your First LLM App
A step-by-step guide to building a document Q&A app with LangChain. Full code, honest opinions, and a look at where LLM app development is heading.
5 Python Libraries Every Data Science Student Should Know in 2023
The Python data science stack is evolving fast. Some sacred cows are being challenged, and your coursework might not cover the tools that actually matter.
The Open-Source LLM Revolution: Why Llama 2 Matters
Meta is about to release Llama 2 with a commercial license. This changes the game for anyone building with LLMs.
LoRA Fine-Tuning on a Student Budget: Llama on a Single GPU
You don't need a GPU cluster to fine-tune an LLM anymore. LoRA makes it possible on a single GPU, and I did it on a grad student's budget.
A Beginner's Guide to RAG: Making LLMs Actually Useful
LLMs hallucinate because they don't know your data. Retrieval-Augmented Generation fixes that. Here's how it works and how to build one.
Teaching EDA in the Age of ChatGPT: What Still Matters
ChatGPT can generate a pandas plot in seconds. It cannot tell you which plot to generate. That distinction matters more than people think.
How ChatGPT Changed My Data Science Classroom Overnight
I'm TAing a 200-student data science course and ChatGPT just rewrote the rules. Watching it happen in real time is something else.
2022
10 posts
How to Win Hackathons: Lessons from 3 Wins in 2 Months
I won Best Google Cloud at HackHarvard, Best AI/ML at HackUMass, and mentored at HackGT9 in two months. Here's everything I learned about winning hackathons.
ChatGPT Just Dropped and Everything Is About to Change
ChatGPT launched five days ago and the Northeastern CS Slack hasn't calmed down since. As someone who wrote about GPT-3 two years ago, this feels like the sequel.
Whisper by OpenAI: Finally Good Open-Source Speech Recognition
OpenAI released Whisper and suddenly open-source speech recognition is actually good. I tried it on Hindi and English and here's what I found.
Why I Mentor at Hackathons: Lessons from HackGT9
After back-to-back hackathon wins, I flew to Atlanta to mentor at Georgia Tech's HackGT9. It taught me more than competing ever did.
Best AI/ML Hack at HackUMass: Building Meta-Identity
We built a system that clones your voice and face to create a digital twin for the metaverse. Then we won Best AI/ML hack with it.
How We Won Best Google Cloud at HackHarvard: Building ReAlive
36 hours, a team of four, and an AI that brings old photographs to life with sound. Here's how we built ReAlive.
Computer Vision in 2022: The Year Transformers Won
From ViT curiosity to Swin dominance, how transformers overtook CNNs as the default backbone for vision in a single year.
Building AI Prototypes Fast: My Hackathon Tech Stack
The exact tools and libraries I use to go from idea to working AI demo in 24 hours.
Diffusion Models Demystified: From DALL-E 2 to Stable Diffusion
Breaking down how diffusion models actually work, from the math to the magic, as someone who spent two years building CV models.
From Industry ML Engineer to Grad Student: What Changes
After two years shipping models at a startup, I'm going back to school. Here's what I think will change.
2021
10 posts
Why I'm Leaving Industry for Grad School
After two years as an ML engineer in Bangalore, I'm going back to being a student. Here's the honest version of how I got to this decision.
MLOps Is Not Optional Anymore: Lessons from Production
After two years of shipping ML models, I'm convinced that most ML projects fail not because of bad models but because of bad infrastructure around them.
Real-ESRGAN Changed Super-Resolution Forever
Real-ESRGAN handles real-world degradation in a way previous models never could. As someone who built SR models at Myelin, this one hit different.
GitHub Copilot Is Wild: First Impressions from a Working ML Engineer
I got early access to GitHub Copilot and spent a week using it for actual ML work. Here's what it's like when the AI writes the AI code.
CLIP and the Vision-Language Revolution
OpenAI connected text and images in a way that makes zero-shot classification actually work. This changes everything about how we think about vision models.
Two Years as an ML Engineer: From Research to Production
What I've learned going from fresh graduate to production ML engineer. Spoiler: the models are the easy part.
Building a Real-Time Anomaly Detection Pipeline for IoT
From sensor data to alerts in under 2 seconds. Here's the full architecture we built at Myelin for industrial monitoring.
Model Quantization in Practice: 4x Speedup Without Losing Accuracy
Our super-resolution model went from 45MB to 11MB. Here's exactly how, with code and real numbers.
FPGA vs GPU vs Edge TPU: Choosing the Right ML Hardware
I tried deploying ML models to all three. Here's an honest comparison from someone who actually suffered through FPGA toolchains.
Deploying Anomaly Detection Models on Raspberry Pi
Running anomaly detection on a tiny board with 1GB RAM. Here's what worked, what crashed, and what I learned at 2am over SSH.
2020
14 posts
AlphaFold Solved Protein Folding and I Can't Stop Thinking About It
DeepMind just cracked a 50-year-old biology problem with deep learning. This might be the most important ML result of the decade.
Vision Transformers Are Coming for CNNs
Google just showed that a pure transformer, no convolutions at all, can match the best CNNs on image classification. The implications are huge.
GPT-3 Just Dropped and I Have Thoughts
OpenAI released a 175 billion parameter language model and the demos are unreal. But as someone who deploys models to phones for a living, I have a slightly different take.
From Python to Production: Deploying ML Models with TensorFlow.js
The gap between a trained model in a Jupyter notebook and a working product in someone's browser is bigger than you think. Here's how to bridge it.
Docker 101
A beginner-friendly introduction to Docker: what containers are, why they matter, and how to start using them today.
Getting Started with Jekyll
A beginner's guide to Jekyll, the static site generator that turns Markdown into beautiful websites without the complexity.
Oh-my-zsh!
Transform your terminal from boring to beautiful with Oh My Zsh. Autosuggestions, syntax highlighting, and history search in minutes.
Best Coding Practices
Foundational development practices every programmer should adopt early, from choosing the right editor to writing proper documentation.
Running Super-Resolution in the Browser with TensorFlow.js
How to take a trained super-resolution model and run it at interactive speeds in the browser, no server required.
Demystifying WebGL for ML Engineers
You're running ML models in the browser and it's fast, but do you know why? A look at how WebGL makes GPU-accelerated inference possible on the web.
A Practical Guide to Model Optimization for Mobile
Your model works great on a V100. Now make it run on a phone. Here's what actually works for shrinking and speeding up neural networks.
How COVID Is Accelerating On-Device AI
A pandemic that shut down the world is quietly pushing the ML industry towards on-device intelligence faster than any roadmap planned.
Image Super-Resolution in 2020: From SRCNN to ESRGAN
A practitioner's overview of how image super-resolution evolved from a 3-layer CNN to photorealistic upscaling with GANs.
Why Edge ML Is the Future
Cloud inference is great until it isn't. Here's why running ML models on-device is going to matter way more than people think.