I Built an AI That Audits My AI Infrastructure. Its Best Find Was My Own Blind Spots.

TL;DR — I run a pile of personal AI infrastructure: crawlers, a RAG knowledge base, scheduled jobs, small skills. It rots quietly. So I built fukuro (Japanese for owl) — a tool that audits AI infrastructure: it scores plans, flags risks, and tells me where I’m about to step on a rake. This is a week of hardening it, built around three ideas — audit itself, improve itself on a schedule, teach itself daily. The punchline: the tool’s most valuable output wasn’t catching dramatic bugs. It was catching the quiet gap between “I did that” and “that is actually true” — five times, every one a thing I’d missed.

I run a growing pile of personal AI infrastructure: crawlers that track model pricing, a RAG knowledge base, scheduled jobs, a dozen small skills. Like any pile of infra, it rots. A price page moves. A model gets deprecated. A path breaks after a refactor. None of it screams when it fails — it just quietly stops being true.

So I built a tool to watch it. I call it fukuro (Japanese for owl — it watches in the dark). It’s an auditor for AI infrastructure: point it at an idea, a plan, or a running system, and it scores the design, flags the risks, and tells me where I’m about to step on a rake.

This is a log of a week spent hardening fukuro. The interesting part wasn’t the code. It was watching the auditor turn on itself — and catch mistakes that both I and the AI helping me had missed.

Three ideas did most of the work. All three are variations on the same theme: a system that maintains itself has to be able to look at itself.

1. The meta move: a tool that audits its own plans

Fukuro scores plans on a rubric — job-fit, scope creep, dependency conflicts, ROI, citations, and so on. Normally you point it at someone else’s plan.

That week, every time I wrote a plan to upgrade fukuro, I ran fukuro on that plan first.

It was uncomfortable in the way good feedback is. My first migration plan scored 63/100, a yellow. It caught that I was re-deriving a hosting decision I’d already made weeks earlier and forgotten, and that a “$0/month” cost claim was uncited and wrong. I fixed those and the plan re-scored at 83. A later plan went from 84 to 91 the same way.

The number isn’t the point. The point is the loop: write → let the tool grade it → fix the specific things it flags → rebuild. An auditor that can’t survive being pointed at its own homework isn’t an auditor; it’s a vibe. Making fukuro grade its own plans forced its rubric to be honest, because the first victim of a sloppy rubric was me.

Lesson: if you build an evaluator, dogfood it on your own work before you trust it on anything else. The friction is the feature.

2. Self-improvement on a schedule — and the punchline

Manual audits don’t scale. So fukuro got a monthly “radar”: a scheduled job that runs six health checks over the whole system — is new research changing the best practices? are the crawlers alive? are any file paths dead? is the scoring drifting? are the model choices still current? is technical debt piling up?

The critical design decision was auto-detect, human-approve. The radar finds things and files a report. It does not patch anything on its own. I was very deliberate here: a tool that can silently rewrite its own rules is a tool that can silently drift away from what you wanted. The moment you let an AI system modify itself without a gate, you’ve built something you can no longer reason about. So the radar proposes; a human disposes.
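A minimal sketch of that gate, not fukuro's actual code: checks only return proposals, and a separate step applies a proposal only after a human has flipped its approved flag. The names here (Proposal, run_radar, apply_approved, and the two stand-in checks) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# A check returns (problem, suggested_fix) pairs; an empty list means healthy.
Check = Callable[[], list[tuple[str, str]]]


@dataclass
class Proposal:
    check: str
    problem: str
    suggested_fix: str
    approved: bool = False  # only a human review step sets this to True


def run_radar(checks: dict[str, Check]) -> list[Proposal]:
    """Run every check and collect proposals. Nothing is modified here."""
    report = []
    for name, check in checks.items():
        for problem, fix in check():
            report.append(Proposal(name, problem, fix))
    return report


def apply_approved(report: list[Proposal],
                   apply_fix: Callable[[Proposal], None]) -> int:
    """Apply only proposals a human has approved; return how many ran."""
    applied = 0
    for proposal in report:
        if proposal.approved:
            apply_fix(proposal)
            applied += 1
    return applied


# Hypothetical checks standing in for two of the six health checks.
def dead_paths() -> list[tuple[str, str]]:
    return [("notes/old.md is referenced but missing", "drop the reference")]


def crawlers_alive() -> list[tuple[str, str]]:
    return []


report = run_radar({"dead_paths": dead_paths, "crawlers_alive": crawlers_alive})
applied_log: list[Proposal] = []

print(apply_approved(report, applied_log.append))  # 0: nothing approved yet
report[0].approved = True                          # the human gate
print(apply_approved(report, applied_log.append))  # 1
```

The design choice worth noticing is that run_radar has no way to change anything: the only code path that acts goes through the approved flag.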

Then I triggered it end-to-end to make sure it actually worked. It did. And here’s the punchline that made the whole week worth it:

The radar caught five real problems — and every single one was something I had missed.

A set of data sources I’d added were silently crawling empty — the pages were JavaScript-rendered, so my scraper was saving whitespace and I hadn’t noticed.
A pricing source was pointed at the wrong URL, so one vendor’s prices were never actually being tracked.
A batch of documents I thought I’d committed to git was sitting uncommitted, so the references to them were dead.
A permission scope I believed I’d fixed hadn’t actually been granted.
A pile of file deletions I’d made locally had never been pushed, so the files still existed in the remote repo.

Not one of these was exotic. They were the ordinary, boring failures that pile up when you move fast — the kind no human or AI notices in the moment, because in the moment everything looks done. The radar noticed because noticing is the only thing it does.
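The first item on that list, a crawl that saves an empty JavaScript shell, is the kind of thing a very small check can catch. The sketch below is an assumption about how such a check could look, not the radar's real implementation: it strips markup and script/style content with the standard library's HTMLParser and flags a saved page whose visible text is shorter than a threshold. The 200-character default is arbitrary.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_empty(html: str, min_chars: int = 200) -> bool:
    """True when a saved page has too little visible text to be a real crawl."""
    return len(visible_text(html)) < min_chars


js_shell = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
real_page = "<html><body><p>" + "pricing row " * 50 + "</p></body></html>"

print(looks_empty(js_shell))   # True
print(looks_empty(real_page))  # False
```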

Lesson: the highest-value thing a self-checking system does is not to catch dramatic bugs. It is to catch the quiet, embarrassing gap between “I did that” and “that is actually true.”

3. Self-learning: a tool that reads the news so its judgments don’t go stale

An auditor is only as good as its knowledge of the world. A model-pricing check written in January is wrong by March. So fukuro refreshes its own knowledge base with a daily crawl of vendor pricing and model pages, plus fresh research papers, and feeds that into every judgment. When it evaluates a design that says “use model X,” it’s checking against this week’s prices and capabilities, not last quarter’s.
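One way to enforce "this week's, not last quarter's" is to stamp every stored fact with the time it was last fetched and refuse to judge against anything older than a cutoff. This is a sketch under that assumption; the Fact shape, the keys, the seven-day window, and the example date are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Fact:
    key: str              # hypothetical key, e.g. "vendor-a/model-x/price"
    value: str
    fetched_at: datetime  # when the crawler last confirmed this value


def split_by_age(facts, now, max_age=timedelta(days=7)):
    """Split facts into (fresh, stale) by their age at `now`."""
    fresh, stale = [], []
    for fact in facts:
        (fresh if now - fact.fetched_at <= max_age else stale).append(fact)
    return fresh, stale


now = datetime(2025, 3, 10, tzinfo=timezone.utc)  # arbitrary example date
facts = [
    Fact("vendor-a/model-x/price", "placeholder", now - timedelta(days=2)),
    Fact("vendor-b/model-y/price", "placeholder", now - timedelta(days=60)),
]

fresh, stale = split_by_age(facts, now)
print([f.key for f in fresh])  # ['vendor-a/model-x/price']
print([f.key for f in stale])  # ['vendor-b/model-y/price']
```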

Two things surprised me here.

First, the crawler itself had been failing silently for sixteen days. A path had broken during an earlier refactor, and the error was being swallowed instead of raised — so the job ran “green” every morning while writing exactly zero files. It looked healthy right up until I actually inspected the output. The fix was philosophical, not technical: fail loud. A job that can’t do its work should crash and page you, not smile and return success. Silent success is the most expensive kind of failure, because you pay for it in weeks.
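In code, fail loud mostly means two things: do not wrap the work in a catch-all that hides errors, and treat "wrote nothing" as a failure in its own right. The sketch below assumes a hypothetical save_pages step and CrawlError type; it is an illustration of the idea, not the crawler's real code.

```python
import tempfile
from pathlib import Path


class CrawlError(RuntimeError):
    """Raised when a crawl run produced nothing usable."""


def save_pages(pages: dict[str, str], out_dir: Path) -> int:
    """Write non-empty pages to out_dir and return how many were written."""
    # No blanket try/except: a broken path should stop the job, not be logged and ignored.
    if not out_dir.is_dir():
        raise CrawlError(f"output directory does not exist: {out_dir}")

    written = 0
    for name, body in pages.items():
        if body.strip():  # skip whitespace-only pages
            (out_dir / name).write_text(body, encoding="utf-8")
            written += 1

    # A run that wrote zero files is a failed run, even if nothing threw.
    if written == 0:
        raise CrawlError("crawl finished but wrote zero files")
    return written


with tempfile.TemporaryDirectory() as tmp:
    count = save_pages({"pricing.html": "<p>example pricing table</p>"}, Path(tmp))
    print(count)  # 1
```

If CrawlError is left uncaught at the top level, the Python process exits with a non-zero status, which is what lets a scheduler mark the run as failed instead of green.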

Second, before expanding what fukuro learns from, I ran a deep research pass over candidate sources — and it found that two of the “authoritative” benchmarks I was about to trust had quietly died. One had entered maintenance mode; another had been retired over a year earlier and was frozen. If I’d wired them in from memory, I’d have been feeding my auditor stale rankings for months and never known. Research before build isn’t ceremony — it’s how you avoid pouring garbage into a system that then confidently repeats it.

The thread that ties it together

Self-audit, self-improve, self-learn. Written out, they sound like three features. They’re really one commitment: the checker has to be checkable.

The recursive nature of it is what stuck with me. I built an AI tool to audit AI infrastructure. I then used an AI assistant to build and harden that tool. And the tool’s most useful output was a list of the things the human-plus-AI process that built it had gotten wrong. Layers all the way down — and the value came precisely from the layers being able to look at each other.

A few lessons I’m taking forward, none of them specific to my setup:

Dogfood your evaluator on your own work first. If it can’t grade your homework, it can’t grade anyone’s.
Separate finding from scoring. Let the fuzzy AI part find issues; let a fixed formula score them. Otherwise the same input gives you different numbers on different days, and a number you can’t reproduce is a number you can’t trust.
Auto-detect, human-approve. A self-modifying system without a gate is a system you’ve stopped understanding.
Fail loud. Silent success is a bug that bills you weekly.
The best self-check catches the gap between “done” and “true.” That gap is where real systems actually break.
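To make "separate finding from scoring" concrete, here is a minimal sketch. It assumes the fuzzy step has already produced a list of findings, and that the score is a fixed penalty table applied to them. The penalty values and the Finding shape are hypothetical, not fukuro's real rubric.

```python
from dataclasses import dataclass

# Hypothetical penalty table: points deducted per finding, by severity.
PENALTIES = {"low": 2, "medium": 5, "high": 10}


@dataclass(frozen=True)
class Finding:
    category: str  # e.g. "citations", "job-fit"
    severity: str  # "low", "medium", or "high"
    note: str


def score(findings: list[Finding], start: int = 100) -> int:
    """Deterministic: the same findings always produce the same number."""
    total = start - sum(PENALTIES[f.severity] for f in findings)
    return max(0, total)


findings = [
    Finding("citations", "high", "cost claim has no source"),
    Finding("job-fit", "medium", "re-derives an earlier decision"),
]

print(score(findings))  # 85
```

The model can still disagree with itself about what it finds, but any given list of findings maps to exactly one score, so a changed number always traces back to a changed finding.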

The owl’s job was never to be smarter than me. It was to be awake when I wasn’t. That week, that was more than enough.