Blog

Latest · 5 min read

Benchmarking TurboQuant+ KV Cache Compression on Apple Silicon

I tested TurboQuant+ KV cache compression across 1.5B, 7B, and 14B models on an M4 MacBook Air. The speed gains are real, but there are sharp cliffs you need to know about.

llm · benchmarks · apple-silicon · quantization

2026

40 posts
·17 min read

Claude Code Isn't a Code Editor. It's a New Way to Use a Computer.

After a month of writing about Claude Code, here's the thing I keep coming back to: this isn't a developer tool. It's a new interface for computing.

claude-code · opinion · personal
·18 min read

Permissions, Security, and Trusting an AI with Your Codebase

Claude Code can edit files, run commands, and push to GitHub. The permission model determines what it can do and when. Here's how I think about trusting an AI agent with my code.

claude-code · security · opinion
·18 min read

What 400+ Sessions Taught Me About Working with Claude Code

After hundreds of Claude Code sessions across personal projects and production codebases, here are the lessons that took the longest to learn.

claude-code · opinion · personal
·15 min read

Custom Commands and Slash Commands: Building Your Own Claude Code CLI

Slash commands turn Claude Code into a personalized CLI. A markdown file becomes a reusable workflow you invoke with a single slash. Here's how to build them.

claude-code · tutorial · tools
·16 min read

Subagents and Parallel Execution: Making Claude Code 5x Faster

Claude Code can spawn autonomous worker agents that run in parallel. Here's how subagents work, when to use them, and why they make complex tasks dramatically faster.

claude-code · engineering · tutorial
·18 min read

Shipping a Feature in 45 Minutes: My Claude Code Workflow End to End

From memory recall to brainstorm to plan to execution to review to commit. Here's every step of building a real feature with Claude Code, with the actual workflow that makes it fast.

claude-code · personal · engineering
·17 min read

Hooks, Statuslines, and the Automation Layer Nobody Talks About

Hooks let you run shell commands when Claude Code starts, stops, or uses tools. Combined with a custom statusline, they turn Claude Code into a self-monitoring, self-correcting system.

claude-code · tutorial · engineering
·18 min read

MCP Servers Are the Glue Between Claude Code and the Real World

Model Context Protocol turns Claude Code from a code editor into something that can read Slack, control browsers, and talk to any service. Here's how it works in practice.

claude-code · mcp · tutorial
·15 min read

NotebookLM from the Terminal: Querying Your Docs with Claude Code

A Claude Code skill that queries Google NotebookLM notebooks directly from the terminal. Source-grounded answers from Gemini, with citations, without opening a browser.

claude-code · tools · tutorial
·16 min read

I Track Calories and Plan Groceries from My Terminal

Claude Code isn't just for writing software. I built skills that track nutrition and automate grocery shopping at Wegmans, all from the terminal.

claude-code · tools · personal
·19 min read

Skills Are Just Markdown Files and That's What Makes Them Powerful

Claude Code skills have no SDK, no build step, no runtime. They're markdown files with instructions. That simplicity is exactly why they work.

claude-code · tutorial · tools
·15 min read

The PR Review Toolkit: Five Agents Reviewing Your Code at Once

One command spawns five specialized review agents that check your PR for code quality, silent failures, type design, test coverage, and comment accuracy, all in parallel.

claude-code · tools · engineering
·15 min read

Ralph Loop: Running Claude Code Autonomously for Hours

The Ralph Wiggum technique turns Claude Code into an autonomous agent that keeps working until the job is done. Here's how it works, when to use it, and when it's a terrible idea.

claude-code · tools · opinion
·19 min read

Superpowers and GSD: How Two Plugins Gave Claude Code a Development Methodology

Without structure, Claude Code is a fast but chaotic coder. Superpowers and GSD impose a methodology (brainstorm, plan, execute, review) that makes the output dramatically better.

claude-code · tools · engineering
·17 min read

Claude-Mem Gave Claude Code a Memory and It Changed Everything

Every Claude Code session starts from zero. Claude-mem fixes that by capturing observations, compressing them with AI, and injecting relevant context into future sessions.

claude-code · tools · opinion
·18 min read

My Exact Claude Code Setup: Plugins, Skills, and Config

A full walkthrough of my .claude/ directory: 19 plugins, 23 skills, custom hooks, and the config that ties it all together. This is the setup I use every day.

claude-code · tools · personal
·13 min read

Plan Mode, Context Windows, and Not Wasting Tokens

The 200k context window is your most valuable resource in Claude Code. Here's how to manage it, why plan mode is the most important habit, and what happens when you run out.

claude-code · tutorial · tools
·14 min read

CLAUDE.md Is the Most Important File in Your Repo

A single markdown file determines whether Claude Code understands your project or fumbles through it. Here's how to write one that actually works.

claude-code · tutorial · opinion
·15 min read

Getting Started with Claude Code: Installation to First Real Output

The getting started guide I wish existed when I first installed Claude Code. From npm install to your first real feature, with the mental model that makes everything click.

claude-code · tutorial · beginner
·11 min read

A Month of Claude Code: Why I'm Writing This Series

I've been using Claude Code daily for months. This is the first of 20 posts breaking down everything I've learned, from setup to skills to running autonomous agents from my terminal.

claude-code · personal · opinion
·8 min read

Red/Green TDD with Coding Agents: Why Test-First Matters More

When AI writes your code, tests become the spec. Red/green TDD isn't just a practice anymore. It's the interface between intent and implementation.

ai · coding · opinion
·8 min read

LangGraph vs CrewAI vs AutoGen: Building the Same Pipeline Three Ways

Three agent frameworks, one task. I built a research-and-report pipeline in each to compare developer experience, flexibility, and production readiness.

agents · tools
·8 min read

LLM Guardrails in Practice: Input Validation to Output Filtering

A three-layer guardrail pipeline: validate inputs, constrain execution, filter outputs. Here's what each layer catches and how to build them.

safety · patterns
·9 min read

Evaluating AI Agents: Beyond 'Does It Work?'

Only 52% of organizations run offline evals for their agents. Here's the multi-layered evaluation strategy that production teams actually use.

agents · evaluation
·11 min read

Function Calling Patterns for Production LLM Agents

Function calling connects LLMs to the real world. Here are the patterns that survive production: permission models, error handling, and human-in-the-loop checkpoints.

agents · patterns
·7 min read

Self-Hosting Qdrant: From Docker Compose to Production

Qdrant gives you the fastest open-source vector search. Here's how to go from docker-compose up to production-ready deployment.

vector-databases · infrastructure
·9 min read

Pinecone vs Qdrant vs Weaviate: An Engineer's Decision Framework

Not another feature matrix. Here are three real deployment scenarios and which vector database fits each one.

vector-databases · tools
·6 min read

Reranking: The 20-Line Fix for Bad RAG Results

If your RAG pipeline retrieves the wrong chunks, adding a cross-encoder reranker between retrieval and generation can fix it in 20 lines of code.

rag · patterns
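The retrieve-then-rerank pattern the teaser above describes can be sketched without a model at all. In this toy sketch, a word-overlap score stands in for a real cross-encoder (a production pipeline would replace `score` with a model call, for example a hypothetical `cross_encoder.predict(query, chunk)`):

```python
def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk.
    A real reranker would run a cross-encoder model here instead."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Re-order retrieved chunks by query relevance, keeping the best top_k."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:top_k]

# Chunks as they might come back from first-stage vector retrieval.
chunks = [
    "The cache stores key and value tensors.",
    "Rerankers score query and chunk jointly.",
    "Vector search returns approximate neighbors.",
]
best = rerank("how do rerankers score a query", chunks, top_k=1)
```

The key design point is the two-stage shape: a cheap retriever narrows millions of chunks to dozens, then an expensive scorer that sees query and chunk together picks the winners.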
·6 min read

Hybrid Search RAG with Weaviate: Vectors + BM25

Pure vector search misses exact matches. Pure keyword search misses semantics. Hybrid search combines both, and Weaviate makes it native.

rag · vector-databases
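The blending behind hybrid search can be sketched as a weighted sum of the two retrieval scores. This is a minimal sketch of the idea, assuming both scores are pre-normalized; Weaviate exposes the same knob natively as its `alpha` parameter:

```python
def hybrid_score(vec_score: float, bm25_score: float, alpha: float = 0.5) -> float:
    """Blend a vector-similarity score with a keyword (BM25) score.
    alpha=1.0 is pure vector search, alpha=0.0 is pure keyword search.
    Assumes both inputs are already normalized to [0, 1]."""
    return alpha * vec_score + (1 - alpha) * bm25_score

# A doc with an exact keyword match but weak semantic similarity
# still outranks a semantically-close doc at alpha=0.5.
exact_match = hybrid_score(vec_score=0.2, bm25_score=0.9)
semantic = hybrid_score(vec_score=0.8, bm25_score=0.1)
```

Tuning `alpha` per query type (IDs and part numbers want keyword weight, open-ended questions want vector weight) is where most of the practical gain comes from.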
·7 min read

Chunking Strategies That Actually Matter for RAG

Your RAG pipeline is only as good as your chunks. Recursive, semantic, and late chunking each have trade-offs that most tutorials skip.

rag · patterns
·11 min read

The LLM Inference Stack in 2026: From API Call to Response

The stack for serving LLMs has matured dramatically. Here's the full picture from API gateway to GPU, and where each layer is heading.

infrastructure · opinion
·10 min read

Context Engineering Is Not Prompt Engineering

Prompt engineering was the 2023 skill. Context engineering is the 2026 skill. The difference matters more than you think.

ai · opinion
·9 min read

Structured Output That Actually Works: JSON Mode vs Function Calling

Getting reliable JSON from LLMs has been a pain point since GPT-3. Here's the current state of the art and what actually works in production.

llm-serving · patterns
·7 min read

Prompt Caching: How Anthropic and OpenAI Cut Costs by 90%

Prompt caching reuses pre-computed KV tensors for identical prompt prefixes. It's the easiest cost reduction you're not using yet.

cost · llm-serving
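The reuse mechanism in the teaser above can be illustrated with a toy content-addressed cache. This is only a sketch of the shape of the optimization: real systems cache KV tensors inside the model rather than strings, and eligibility rules and pricing vary by provider:

```python
import hashlib

class PrefixCache:
    """Toy stand-in for KV-cache reuse: expensive work for a prompt prefix
    (system prompt, few-shot examples, long documents) is done once and
    looked up by content hash on subsequent calls."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix: str) -> str:
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            # Placeholder for the expensive prefill computation.
            self._store[key] = f"<precomputed state for {len(prefix)} chars>"
        return self._store[key]

cache = PrefixCache()
system_prompt = "You are a helpful assistant. " * 50  # long shared prefix
for _ in range(3):
    cache.get_or_compute(system_prompt)
# First call misses and computes; the next two calls hit the cache.
```

The practical implication is structural: put the stable content (instructions, examples, documents) first and the variable content (the user's question) last, so the shared prefix is as long as possible.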
·8 min read

Tracking Token Costs Before They Blow Up Your Bill

Output tokens cost 4-8x more than input tokens. If you're not tracking usage by query type and user segment, you're flying blind.

cost · infrastructure
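The input/output price asymmetry mentioned above is easy to fold into a per-request cost function. The rates below are placeholder numbers chosen to show a 4x output premium, not any provider's real pricing:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_1k: float = 0.001,
                 out_price_per_1k: float = 0.004) -> float:
    """Cost of a single LLM call, with output tokens priced 4x input tokens,
    mirroring the asymmetry most providers charge."""
    return (input_tokens / 1000) * in_price_per_1k + \
           (output_tokens / 1000) * out_price_per_1k

# Summarization: large input, small output.
summarize = request_cost(input_tokens=8000, output_tokens=500)
# Generation: far fewer total tokens, yet it costs more,
# because the spend is concentrated in output tokens.
generate = request_cost(input_tokens=500, output_tokens=4000)
```

This is why tracking by query type matters: two requests with similar total token counts can differ severely in cost depending on which side of the input/output split the tokens land on.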
·9 min read

LangSmith vs Langfuse vs Braintrust: Picking Your LLM Observability Stack

Three platforms, three philosophies. Here's how to choose between LangSmith, Langfuse, and Braintrust for your LLM observability stack.

observability · tools
·6 min read

OpenTelemetry for LLM Apps: Tracing Prompts and Tokens

You wouldn't run a web service without tracing. LLM apps, especially those with [guardrails pipelines](/blog/llm-guardrails-practice) and multi-step agent loops, shouldn't be different. Here's how OpenTelemetry's GenAI conventions make it work.

observability · infrastructure
·6 min read

Building an LLM Gateway with LiteLLM

One API to call OpenAI, Anthropic, and self-hosted models. LiteLLM handles routing, fallbacks, and cost tracking so you don't have to.

llm-serving · infrastructure
·9 min read

Self-Hosting LLMs with Ollama: When It Makes Sense

Ollama makes running LLMs locally dead simple. But simple and production-ready are different things. Here's where it shines and where it doesn't.

llm-serving · tools
·6 min read

vLLM PagedAttention: Why It's the Default for LLM Serving

vLLM's PagedAttention manages GPU memory like an OS manages virtual memory. Here's why it's become the standard for serving LLMs.

llm-serving · infrastructure

2025

10 posts
·5 min read

Vibe Coding Is Real but Not What You Think

Everyone's talking about vibe coding. After years of using AI to write code, here's what it actually is, what it isn't, and why understanding the code still matters.

ai · coding · opinion
·5 min read

From Hackathon to Production: What Changes When Prototypes Get Real

After years of hackathons and production systems, I've learned the gap between a winning demo and a reliable product is mostly about what you choose to worry about.

engineering · architecture · personal
·5 min read

How We Won 1st Place at the MIT LLM Hackathon

Building Catalyze, a multi-agent system for chemistry research, and winning first place at MIT.

hackathon · multi-agent · project
·5 min read

MCP Is the USB of AI: Why Model Context Protocol Matters

Anthropic's Model Context Protocol is doing for AI integrations what USB did for hardware. If you're building agents, this changes everything.

mcp · agents · opinion
·4 min read

DeepSeek Shocked Everyone: What Open-Source AI Means Now

A Chinese lab just matched GPT-4 performance with open weights at a fraction of the cost. The implications go way beyond model benchmarks.

deepseek · open-source · opinion
·4 min read

The FastAPI + vLLM + Docker Stack for Serving LLMs

The production stack for self-hosted LLM serving is maturing fast. Here's the architecture I've landed on after putting models into production at BulkMagic.

fastapi · vllm · architecture
·5 min read

Sub-200ms Voice AI: The Engineering Behind Real-Time Agents

A technical deep-dive into achieving sub-200ms response times in voice AI. Where the latency budget goes and how to claw back every millisecond.

voice-ai · latency · engineering
·5 min read

Voice AI Architecture: Building Conversational Agents at Scale

The full architecture behind voice AI systems. Pipeline design, latency budgets, and why voice is a fundamentally different engineering challenge than chat.

voice-ai · architecture · deep-dive
·5 min read

Multi-Agent Systems in Production: What Nobody Tells You

Lessons from building multi-agent systems that actually run in production. What works, what doesn't, and what the hype skips over.

agents · architecture · engineering
·5 min read

Building an LLM Microservice with FastAPI and Llama 3.2 on AWS ECS

How I built a production LLM microservice for product summarization at BulkMagic. FastAPI, Llama 3.2, Docker, and AWS ECS.

fastapi · llama · aws · tutorial

2024

10 posts
·5 min read

Multimodal Models Are the New Default: GPT-4V, Gemini, and Beyond

In 2024, the best AI models understand text, images, audio, and video natively. As someone with a CV background, this convergence feels like a turning point.

multimodal · llm · survey
·4 min read

The EU AI Act Is Here: What Developers Need to Know

The EU AI Act was finalized this year. As an engineer who builds CV and AI systems, here's my practical take on what it actually means for us.

regulation · ai · opinion
·4 min read

OpenAI o1 and Reasoning Models: A New Paradigm?

OpenAI o1 doesn't just generate text. It thinks first. That distinction might be more important than any scaling breakthrough since GPT-3.

o1 · reasoning · llm · opinion
·4 min read

What an MS in CS Taught Me About the Gap Between Research and Production

With my MS at Northeastern nearly done, here's what I actually learned about the space between reading papers and shipping models.

personal · career · education
·4 min read

Edge AI in 2024: Why On-Device Inference Changes Everything

Four years after I called edge ML the future, on-device inference is finally mainstream. Here's what changed, what didn't, and where we're headed.

edge-ai · opinion
·6 min read

YOLOv9 vs RT-DETR: The Transformer Takeover in Object Detection

YOLO's anchor-based speed against DETR's end-to-end elegance. As someone deploying detection models in production, here's how I see the landscape.

yolo · detection · survey
·5 min read

From PyTorch to Production: The Optimization Pipeline Nobody Talks About

Research papers stop at accuracy metrics. Production starts at deployment constraints. Here's the pipeline that bridges the gap.

deployment · optimization · engineering
·6 min read

TFLite vs ONNX Runtime: A Practical Edge AI Comparison

I deploy models with both TFLite and ONNX Runtime. Here's an honest comparison from someone who deals with the rough edges daily.

tflite · onnx · edge-ai
·4 min read

Transformers for Image Enhancement: Beyond Classification

Vision Transformers aren't just for classification anymore. They're rewriting the rules for low-level vision tasks like enhancement and restoration.

transformers · computer-vision · deep-learning
·5 min read

Pushing Object Detection to 98.6% mAP: Lessons from Production CV

The last 2% of accuracy is where 80% of the engineering effort goes. Here's what that actually looks like.

object-detection · computer-vision · engineering

2023

10 posts
·5 min read

What Teaching 200 Students Taught Me About Explaining Complex Ideas

A semester as a TA for Intro to Data Science changed how I think about communication, patience, and what it really means to understand something.

teaching · personal · data-science
·5 min read

The Academic Integrity Crisis Nobody Knows How to Solve

As a TA grading 200+ students, I've seen the full spectrum of how ChatGPT is reshaping academic honesty. The problem isn't cheating. It's that we're testing the wrong things.

ai · education · opinion
·6 min read

Vector Databases Explained: Pinecone, Chroma, and Beyond

Vector databases are becoming as fundamental as relational databases. Here's what they are, how they work, and which one to pick for your project.

vector-db · rag · tutorial
·5 min read

LangChain from Scratch: Building Your First LLM App

A step-by-step guide to building a document Q&A app with LangChain. Full code, honest opinions, and a look at where LLM app development is heading.

langchain · llm · tutorial
·5 min read

5 Python Libraries Every Data Science Student Should Know in 2023

The Python data science stack is evolving fast. Some sacred cows are being challenged, and your coursework might not cover the tools that actually matter.

python · data-science · resources
·5 min read

The Open-Source LLM Revolution: Why Llama 2 Matters

Meta is about to release Llama 2 with a commercial license. This changes the game for anyone building with LLMs.

llama · open-source · opinion
·5 min read

LoRA Fine-Tuning on a Student Budget: Llama on a Single GPU

You don't need a GPU cluster to fine-tune an LLM anymore. LoRA makes it possible on a single GPU, and I did it on a grad student's budget.

llama · fine-tuning · tutorial
·6 min read

A Beginner's Guide to RAG: Making LLMs Actually Useful

LLMs hallucinate because they don't know your data. Retrieval-Augmented Generation fixes that. Here's how it works and how to build one.

rag · llm · tutorial
·5 min read

Teaching EDA in the Age of ChatGPT: What Still Matters

ChatGPT can generate a pandas plot in seconds. It cannot tell you which plot to generate. That distinction matters more than people think.

data-science · teaching · opinion
·4 min read

How ChatGPT Changed My Data Science Classroom Overnight

I'm TAing a 200-student data science course and ChatGPT just rewrote the rules. Watching it happen in real time is something else.

chatgpt · education · teaching

2022

10 posts
·5 min read

How to Win Hackathons: Lessons from 3 Wins in 2 Months

I won Best Google Cloud at HackHarvard, Best AI/ML at HackUMass, and mentored at HackGT9 in two months. Here's everything I learned about winning hackathons.

hackathon · advice
·4 min read

ChatGPT Just Dropped and Everything Is About to Change

ChatGPT launched five days ago and the Northeastern CS Slack hasn't calmed down since. As someone who wrote about GPT-3 two years ago, this feels like the sequel.

chatgpt · llm · opinion
·4 min read

Whisper by OpenAI: Finally Good Open-Source Speech Recognition

OpenAI released Whisper and suddenly open-source speech recognition is actually good. I tried it on Hindi and English and here's what I found.

whisper · speech · tools
·4 min read

Why I Mentor at Hackathons: Lessons from HackGT9

After back-to-back hackathon wins, I flew to Atlanta to mentor at Georgia Tech's HackGT9. It taught me more than competing ever did.

hackathon · mentoring · personal
·4 min read

Best AI/ML Hack at HackUMass: Building Meta-Identity

We built a system that clones your voice and face to create a digital twin for the metaverse. Then we won Best AI/ML hack with it.

hackathon · ai · project
·4 min read

How We Won Best Google Cloud at HackHarvard: Building ReAlive

36 hours, a team of four, and an AI that brings old photographs to life with sound. Here's how we built ReAlive.

hackathon · google-cloud · project
·4 min read

Computer Vision in 2022: The Year Transformers Won

From ViT curiosity to Swin dominance, how transformers overtook CNNs as the default backbone for vision in a single year.

computer-vision · transformers · survey
·4 min read

Building AI Prototypes Fast: My Hackathon Tech Stack

The exact tools and libraries I use to go from idea to working AI demo in 24 hours.

hackathon · streamlit · tools
·4 min read

Diffusion Models Demystified: From DALL-E 2 to Stable Diffusion

Breaking down how diffusion models actually work, from the math to the magic, as someone who spent two years building CV models.

diffusion-models · deep-learning · tutorial
·4 min read

From Industry ML Engineer to Grad Student: What Changes

After two years shipping models at a startup, I'm going back to school. Here's what I think will change.

personal · career · education

2021

10 posts
·4 min read

Why I'm Leaving Industry for Grad School

After two years as an ML engineer in Bangalore, I'm going back to being a student. Here's the honest version of how I got to this decision.

personal · career · education
·5 min read

MLOps Is Not Optional Anymore: Lessons from Production

After two years of shipping ML models, I'm convinced that most ML projects fail not because of bad models but because of bad infrastructure around them.

mlops · engineering · opinion
·4 min read

Real-ESRGAN Changed Super-Resolution Forever

Real-ESRGAN handles real-world degradation in a way previous models never could. As someone who built SR models at Myelin, this one hit different.

super-resolution · deep-learning
·4 min read

GitHub Copilot Is Wild: First Impressions from a Working ML Engineer

I got early access to GitHub Copilot and spent a week using it for actual ML work. Here's what it's like when the AI writes the AI code.

copilot · ai · tools
·5 min read

CLIP and the Vision-Language Revolution

OpenAI connected text and images in a way that makes zero-shot classification actually work. This changes everything about how we think about vision models.

clip · multimodal · deep-learning
·5 min read

Two Years as an ML Engineer: From Research to Production

What I've learned going from fresh graduate to production ML engineer. Spoiler: the models are the easy part.

personal · career · ml
·6 min read

Building a Real-Time Anomaly Detection Pipeline for IoT

From sensor data to alerts in under 2 seconds. Here's the full architecture we built at Myelin for industrial monitoring.

iot · anomaly-detection · architecture
·5 min read

Model Quantization in Practice: 4x Speedup Without Losing Accuracy

Our super-resolution model went from 45MB to 11MB. Here's exactly how, with code and real numbers.

quantization · optimization · tutorial
·5 min read

FPGA vs GPU vs Edge TPU: Choosing the Right ML Hardware

I tried deploying ML models to all three. Here's an honest comparison from someone who actually suffered through FPGA toolchains.

fpga · edge-ai · hardware
·5 min read

Deploying Anomaly Detection Models on Raspberry Pi

Running anomaly detection on a tiny board with 1GB RAM. Here's what worked, what crashed, and what I learned at 2am over SSH.

raspberry-pi · edge-ai · anomaly-detection

2020

14 posts
·4 min read

AlphaFold Solved Protein Folding and I Can't Stop Thinking About It

DeepMind just cracked a 50-year-old biology problem with deep learning. This might be the most important ML result of the decade.

deep-learning · science · opinion
·4 min read

Vision Transformers Are Coming for CNNs

Google just showed that a pure transformer, no convolutions at all, can match the best CNNs on image classification. The implications are huge.

computer-vision · transformers · deep-learning
·3 min read

GPT-3 Just Dropped and I Have Thoughts

OpenAI released a 175 billion parameter language model and the demos are unreal. But as someone who deploys models to phones for a living, I have a slightly different take.

nlp · gpt-3 · deep-learning · opinion
·4 min read

From Python to Production: Deploying ML Models with TensorFlow.js

The gap between a trained model in a Jupyter notebook and a working product in someone's browser is bigger than you think. Here's how to bridge it.

tensorflow-js · deployment · machine-learning · web
·3 min read

Docker 101

A beginner-friendly introduction to Docker: what containers are, why they matter, and how to start using them today.

docker · devops · coding
·2 min read

Getting Started with Jekyll

A beginner's guide to Jekyll, the static site generator that turns Markdown into beautiful websites without the complexity.

jekyll · web-development · coding · beginner
·2 min read

Oh-my-zsh!

Transform your terminal from boring to beautiful with Oh My Zsh. Autosuggestions, syntax highlighting, and history search in minutes.

terminal · linux · macos · coding
·2 min read

Best Coding Practices

Foundational development practices every programmer should adopt early, from choosing the right editor to writing proper documentation.

coding · best-practices · beginner
·4 min read

Running Super-Resolution in the Browser with TensorFlow.js

How to take a trained super-resolution model and run it at interactive speeds in the browser, no server required.

tensorflow-js · super-resolution · webgl · deep-learning
·4 min read

Demystifying WebGL for ML Engineers

You're running ML models in the browser and it's fast, but do you know why? A look at how WebGL makes GPU-accelerated inference possible on the web.

webgl · tensorflow-js · deep-learning · web
·4 min read

A Practical Guide to Model Optimization for Mobile

Your model works great on a V100. Now make it run on a phone. Here's what actually works for shrinking and speeding up neural networks.

machine-learning · optimization · mobile · tflite
·3 min read

How COVID Is Accelerating On-Device AI

A pandemic that shut down the world is quietly pushing the ML industry towards on-device intelligence faster than any roadmap planned.

edge-ai · machine-learning · opinion
·4 min read

Image Super-Resolution in 2020: From SRCNN to ESRGAN

A practitioner's overview of how image super-resolution evolved from a 3-layer CNN to photorealistic upscaling with GANs.

deep-learning · computer-vision · super-resolution
·3 min read

Why Edge ML Is the Future

Cloud inference is great until it isn't. Here's why running ML models on-device is going to matter way more than people think.

machine-learning · edge-ai · opinion