How a Retool Sheet Became an AI System
QC Audit · MainStory · 2025
100% automation · 92% time reduction (3h → 15min per center) · 98% audit compliance (target: 95%)
The Setup
In early 2025, MainStory had 10 daycare centers and a quality problem. Every morning before the children arrived, three operations people would open a Retool dashboard and spend three hours per center reviewing CCTV footage, validating timeline posts, and chasing Center Managers about missing photos and meal entries.
Then an incident happened. The kind that makes everyone in the room go quiet. Manual audits had missed something they should have caught.
We were planning to scale. We couldn't scale this way — not safely, not affordably, not at all.
The brief I gave myself was simple:
- 100% automation, no manual auditing
- ≥50% reduction in audit time
- ≥95% compliance rate across all centers
- Zero compromise on safety signal
This is how we built it.
The Insight
The Retool process wasn't slow because it was manual. It was slow because it was checking the wrong thing.
Auditors were validating whether a meal had been logged. The actual quality signal was whether the meal log was meaningful: whether "bubur ayam" (chicken porridge) matched what the child actually ate, whether the caption "sunbathing dengan teman-teman, Aira sangat antusias" ("sunbathing with friends, Aira is very enthusiastic") reflected real observation versus the lazy "ok" that signals a tired CM at 4:55pm.
You can write rules to validate timestamps. You can't write rules to validate meaning.
So we wouldn't try to.
The Architecture
I designed the audit logic as three layers:
Layer 1 — Deterministic checks. Timestamps (created_at <= 13:00 for Session 1), required fields present, mandatory measurements on Fridays.
Layer 2 — Pattern matching. Does the logged meal exist in our menu config? Does the post caption contain the expected keyword (sunbathing, kinderactivity, baby massage)?
Layer 3 — KinderGPT validation. If a meal isn't in config: is it still a valid food item? (Yes for "kerupuk" (crackers), yes for "bubur ayam", no for "asdf".) If a caption exists: is it meaningful, or is it rubbish like "ok", "-", "gpp" (Indonesian shorthand for "no problem"), "good"?
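A minimal sketch of how the three layers compose for a single meal entry. Everything named here is illustrative — MENU_CONFIG, MealEntry, and validate_with_kindergpt are placeholders I'm inventing for the sketch, not the production identifiers, and the KinderGPT call is stubbed:

```python
from dataclasses import dataclass
from datetime import time

# Illustrative menu config; the real one was per-center and per-day.
MENU_CONFIG = {"bubur ayam", "nasi tim", "sup sayur"}

@dataclass
class MealEntry:
    food: str
    logged_at: time  # wall-clock time the CM created the post

def validate_with_kindergpt(food: str) -> bool:
    """Layer 3 stub: in production this was a KinderGPT prompt asking
    whether the string names a real food item ("kerupuk" passes,
    "asdf" fails)."""
    raise NotImplementedError

def audit_meal(entry: MealEntry) -> tuple[bool, str]:
    # Layer 1: deterministic. Session 1 meals must be logged by 13:00.
    if entry.logged_at > time(13, 0):
        return False, "logged after the 13:00 Session 1 cutoff"
    # Layer 2: pattern match against the menu config.
    if entry.food.strip().lower() in MENU_CONFIG:
        return True, "matched menu config"
    # Layer 3: off-menu items fall through to LLM validation.
    if validate_with_kindergpt(entry.food):
        return True, "off-menu but validated by KinderGPT"
    return False, "not a recognizable food item"
```

The ordering matters: cheap deterministic checks first, the LLM only for the residue that rules can't classify.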
The scoring was progressive, not binary. A perfect Session 1 audit scored +100 across attendance, meals, snacks, and the 1pm post, but each component score was itself built from weighted criteria (+50 for criterion 1, +30 for criterion 2, +20 for criterion 3).
This was deliberate. Binary pass/fail tells you that something is wrong; weighted scoring tells you what's wrong. A CM improving from a 60 to an 85 isn't a victory in a binary system — but it's exactly the signal we wanted to celebrate. The scoring framework was my own — I built it from first principles, then validated each weight against the failure modes Ops had been catching manually.
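To make "progressive, not binary" concrete, here is a hedged sketch using the +50/+30/+20 weights from above. The Criterion structure and the criterion names are my own illustration, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: int   # points awarded when the criterion passes
    passed: bool

def score_component(criteria: list[Criterion]) -> tuple[int, list[str]]:
    """Return the component score plus the line-by-line breakdown
    that the dashboard surfaces to CMs."""
    total, lines = 0, []
    for c in criteria:
        earned = c.weight if c.passed else 0
        total += earned
        lines.append(f"{'pass' if c.passed else 'fail'}  +{earned}/{c.weight}  {c.name}")
    return total, lines

# Example: a Session 1 meal with a valid food, an on-time log, a lazy caption.
meal = [
    Criterion("valid food item", 50, True),
    Criterion("logged before 13:00", 30, True),
    Criterion("meaningful caption", 20, False),  # "ok" fails the LLM check
]
score, breakdown = score_component(meal)
print(score)                   # 80, not a bare pass/fail
print("\n".join(breakdown))
```

The return value is deliberately a pair: the number feeds dashboards, the breakdown earns trust.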
The Human-in-the-Loop Decision
We had the option to run fully autonomously from day one. I chose not to.
For the first phase, every audit result still routed through CM review. Not because the AI couldn't decide — but because the AI hadn't yet earned the right to. Manual approval was how we generated the training signal: every CM correction, every flag for a false positive, every time KinderGPT decided "kerupuk" wasn't a valid food and we said yes it is — that was data that made the next month's audit smarter.
The plan was always to taper this. Start with 100% review, then sample-based, then exception-only. Trust compounds. So does training data.
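The taper can be read as a routing policy. A minimal sketch under assumed names; ReviewMode, the 95-point threshold, and the 10% sample rate are placeholders, not the values we shipped:

```python
import random
from enum import Enum

class ReviewMode(Enum):
    FULL = "full"            # phase 1: every audit routed to CM review
    SAMPLED = "sampled"      # phase 2: failures plus a random sample
    EXCEPTION = "exception"  # phase 3: only failures and flags

def needs_cm_review(mode: ReviewMode, score: int, flagged: bool,
                    threshold: int = 95, sample_rate: float = 0.10) -> bool:
    if mode is ReviewMode.FULL:
        return True
    if flagged or score < threshold:
        return True  # exceptions always get human eyes, in every phase
    if mode is ReviewMode.SAMPLED:
        return random.random() < sample_rate
    return False  # EXCEPTION mode: clean audits flow through unreviewed
```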
This is a pattern I now apply to every AI-integrated system I design.
The Alignment Work
The KinderGPT bet itself was easy. Engineering, AI, and leadership all aligned quickly that LLM-assisted validation was the only way to scale efficiently. There was no argument about whether to do it.
The harder alignment was with Ops.
Ops cares about exception handling. Where designers see "98% accuracy," Ops sees "the 2% that becomes a parent complaint." Mapping edge cases together — what the system would and wouldn't catch, and what human escalation looked like for the things it wouldn't — was the work that took the longest.
[TODO: insert the specific moment with Ops — the disagreement and how it resolved. Will fill in.]
The output was a better system. Ops pushed me to design the audit not as a verdict but as a triage tool — surface what's likely problematic, let humans confirm. That framing made the product more honest than what I'd originally proposed.
The Interface
The dashboard was designed around three views:
Audit Per Center. A grid showing every center's compliance score for the day, with red/yellow/green status. Ops's morning glance.
Audit Per Child. Drill into a center, see each child's audit breakdown across both sessions. Find anomalies fast.
Detail Child. The full scoring breakdown — every criterion, every weighted point, every KinderGPT validation result. The breakdown is the design. CMs trust the system because they can see exactly why a score landed where it did.
[Visual: dashboard screens to be inserted — Audit Per Center, Audit Per Child, Detail Child, Info Popup]
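To give a sense of what the Detail Child view actually rendered, here is a hypothetical payload shape. The field names are mine; what the view guaranteed is the structure: every criterion, its weight, the points earned, and the KinderGPT verdict where one was involved.

```python
# Hypothetical shape of the data behind the Detail Child view.
# Field names are illustrative; the traceability is the point.
detail_child_audit = {
    "child": "Aira",
    "session": 1,
    "total": 80,
    "components": [
        {
            "name": "meals",
            "criteria": [
                {"rule": "valid food item", "weight": 50, "earned": 50,
                 "kindergpt": {"input": "kerupuk", "verdict": "valid food"}},
                {"rule": "logged before 13:00", "weight": 30, "earned": 30},
                {"rule": "meaningful caption", "weight": 20, "earned": 0,
                 "kindergpt": {"input": "ok", "verdict": "not meaningful"}},
            ],
        },
    ],
}
```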
Outcomes
| Metric | Target | Actual |
|---|---|---|
| Automation | 100% | 100% |
| Audit time per center | ≥50% reduction | 92% reduction (3h → 15min) |
| Audit compliance rate | ≥95% | 98% |
Three hours became fifteen minutes. Three people became zero people. Compliance went up, not down.
The Ripple Effects
The numbers above are the easy story. The harder story is what changed downstream:
It became the foundation for Monthly Wrapped. The KinderGPT integration we built for QC Audit — recognizing meaningful captions, parsing nanny observations, distinguishing valid food items from junk — was reused almost entirely for our AI-generated parent updates. Infrastructure compounds.
It changed how Ops thinks about quality. Quality stopped being "did someone check?" and started being "what's our compliance distribution this week?" The shift from binary auditing to scored compliance changed the operational vocabulary.
It unlocked center expansion. Centers we couldn't have audited safely under the manual model became cheap to monitor. The QC system removed itself as a scaling constraint.
It freed CMs for higher-value work. Three hours of audit time per CM per day was redirected toward the things only humans can do — parent relationships, edge case handling, training new caregivers.
The Pattern I Now Apply
Trust is built through transparent breakdowns.
The unlock for QC Audit wasn't the AI — it was the scoring breakdown. When CMs could see exactly why an audit passed or failed (+50 for valid food, +20 for meaningful caption, etc.), they trusted the system.
I now treat "show your work" as a non-negotiable for any system that affects someone's livelihood — which is why payslip transparency, tipping breakdowns, and overtime attribution all use the same pattern at MainStory. The math is always visible, always traceable, always explainable in one screen.
What I'd Do Differently
I'd ship the KinderGPT SPIKE before finalizing the design.
We sequenced it traditionally: design first, validate the AI assumptions during build. That worked, but it would have worked better the other way. LLM behavior surprises you — what holds in pilot prompts doesn't always hold at production scale, and what feels like a clean validation rule on paper can fall apart against real CM input. If I were starting over, I'd treat the SPIKE as a discovery artifact, not an engineering ticket. The design would have been sharper, and we would have shipped faster.
MainStory · Indonesia's premium childcare platform · 10+ centers · 200+ caregivers