Things we figured out the hard way

Practical, technical writing from engineers who ship production AI systems every week.

🤖
AI Engineering

Reasoning Loops in Production: What Actually Fails

We've shipped 50+ agent systems. Here's a taxonomy of the failure modes nobody talks about — infinite loops, hallucinated tool calls, and tool schema drift.

Sajesh·Jan 5, 2026·12 min read

🐳
Cloud

Kubernetes Autoscaling for AI Workloads: KEDA vs HPA

KEDA's event-driven scaling changed how we handle inference workloads. Side-by-side comparison with real latency numbers from a GPU node pool.

Arjun·Dec 28, 2025·8 min read

🛡️
Security

Threat Modelling LLM Applications with STRIDE

Applying STRIDE to AI systems surfaces threats that traditional AppSec misses entirely. Here's our full threat model template, open for use.

Priya·Dec 19, 2025·11 min read

⚙️
AI Engineering

Building Tool Context for LLMs with Matimo YAML Schemas

How we define tool schemas once in YAML and auto-generate OpenAI function definitions, LangChain tools, and Claude tool_use blocks — with zero duplication.

Sajesh·Dec 10, 2025·7 min read

💸
Cloud

Multi-Region Active-Active on AWS: Our $0-downtime Playbook

Route 53 latency routing + Aurora Global DB + DynamoDB Global Tables: the exact Terraform modules we use for enterprise clients who can't afford even a 60-second RTO.

Arjun·Nov 28, 2025·14 min read

🚀
Product

Why We Open-Sourced Matimo (And What We Learnt)

We debated keeping Matimo proprietary. Here's the business and philosophical reasoning behind the MIT licence decision, and the numbers 12 months later.

Sajesh·Nov 14, 2025·6 min read
