Case Study · ADR-001
Blueface: Six-Agent Email Response System
- Status: Accepted, Ready to be Deployed
- Built: 2026 Q1/Q2
- Client: Blueface
- Sector: B2B SaaS · AI/ML Services
- Platform: Cassidy AI
- Author: Joseph Iyofor
The problem
30 to 60 emails per day, no consistent response system
Blueface receives 30 to 60 customer emails daily spanning four categories:
- Sales inquiries: 40%
- Technical questions: 30%
- Support tickets: 20%
- General inquiries: 10%
Manual response time averaged 2 to 4 hours per email, causing delayed sales cycles, inconsistent quality, and significant context switching across the team.
The challenge was not simply automating a response pipeline. It required a system that could draft accurately, maintain brand voice, allow human oversight on every email, and handle revision feedback intelligently without routing every edit back through the full pipeline. Once built, the system also needed a continuous quality layer that could evaluate every run automatically instead of relying on manual spot checks.
The solution
Two communicating workflows plus independent evaluation
- Workflow 1 (W1) handles email ingestion and initial draft generation across six specialised agents.
- Workflow 2 (W2) handles the human feedback loop through five conditional paths: GO, three REVISE variants, and IGNORE.
- Evy is a third independent evaluation workflow, triggered by webhook after each run, providing Step Eval scoring per agent and Run Eval scoring across the full six-agent output.
Workflow 1
Email Handler agents
| Agent | Role |
|---|---|
| GabbySenti | Sentiment analysis and initial email characterisation |
| Router Coach | Orchestration gatekeeper: classifies, routes, and briefs downstream agents |
| Hatches (shared) | Internal KB Expert: deep knowledge base research covering case studies, pricing, and technical specs |
| Scratches (shared) | Web Researcher: fills knowledge gaps that Hatches cannot resolve from the internal KB |
| Pen Pusher (shared) | Final draft generation, used in W1 and all W2 REVISE paths |
| Jesses | QA Validator: checks draft quality, accuracy, tone, and completeness before human review |
W1 output: Draft + QA score → Slack notification → Sheets log (40-column record).
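The W1 hand-off between agents can be sketched as a simple sequential pipeline. This is a minimal illustration, not the Cassidy AI implementation: the execution order is assumed to follow the table order, every agent is a hypothetical stub, and the sketch omits any conditional logic (for example, invoking Scratches only when Hatches leaves a gap).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class EmailContext:
    """Accumulates each agent's output as the email moves through W1."""
    body: str
    notes: Dict[str, str] = field(default_factory=dict)


Agent = Callable[[EmailContext], EmailContext]

# Assumed order: the same order in which the agents appear in the table.
W1_ORDER: List[str] = [
    "GabbySenti", "Router Coach", "Hatches",
    "Scratches", "Pen Pusher", "Jesses",
]


def make_stub(name: str) -> Agent:
    """Hypothetical stand-in; a real agent would call a model or tool."""
    def agent(ctx: EmailContext) -> EmailContext:
        ctx.notes[name] = f"{name} output"
        return ctx
    return agent


AGENTS: List[Tuple[str, Agent]] = [(n, make_stub(n)) for n in W1_ORDER]


def run_w1(body: str) -> EmailContext:
    """Pass the email through every agent in order and return the context."""
    ctx = EmailContext(body=body)
    for _name, agent in AGENTS:
        ctx = agent(ctx)
    return ctx
```

Keeping a single context object that each agent appends to mirrors the idea of bounded responsibilities: each stage reads what came before and adds only its own result.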
Workflow 2
Human Feedback Loop
| Path | Trigger | What happens |
|---|---|---|
| Path 1, GO | No revision required | Extracts context, sends approved email via Gmail, updates W1 status, confirms in Slack, logs to W2 sheet |
| Path 2, Revise A | Tone/wording change | Router Coach briefs → Pen Pusher V2 redrafts → present in Slack |
| Path 3, Revise B | New KB content needed | Router Coach → Hatches V2 → Pen Pusher V2 → present + log |
| Path 4, Revise C | KB + external research needed | Router Coach → Hatches V2 → Scratches V2 → Pen Pusher V2 |
| Path 5, IGNORE | Bot message / irrelevant input | Immediate termination |
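The path table above can be expressed as a small routing map. The sketch below is illustrative only: the agent sequences are copied from the table, while `select_path` and its boolean inputs are hypothetical, since the source does not describe how feedback is classified.

```python
from enum import Enum
from typing import Dict, List


class Path(Enum):
    GO = "Path 1, GO"
    REVISE_A = "Path 2, Revise A"
    REVISE_B = "Path 3, Revise B"
    REVISE_C = "Path 4, Revise C"
    IGNORE = "Path 5, IGNORE"


# Agent sequence per path, taken from the table.
# GO and IGNORE invoke no drafting agents.
AGENT_SEQUENCE: Dict[Path, List[str]] = {
    Path.GO: [],
    Path.REVISE_A: ["Router Coach", "Pen Pusher V2"],
    Path.REVISE_B: ["Router Coach", "Hatches V2", "Pen Pusher V2"],
    Path.REVISE_C: ["Router Coach", "Hatches V2", "Scratches V2", "Pen Pusher V2"],
    Path.IGNORE: [],
}


def select_path(is_bot: bool, needs_revision: bool,
                needs_kb: bool, needs_web: bool) -> Path:
    """Hypothetical classifier: the flags are assumed inputs, not source fields."""
    if is_bot:
        return Path.IGNORE          # bot message or irrelevant input
    if not needs_revision:
        return Path.GO              # approved as drafted
    if needs_web:
        return Path.REVISE_C        # KB plus external research
    if needs_kb:
        return Path.REVISE_B        # new KB content only
    return Path.REVISE_A            # tone or wording change only
```

Checking the cheapest exits first (IGNORE, then GO) means only genuine revision requests reach the agent sequences, which matches the goal of not routing every edit back through the full pipeline.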
Evy
Continuous Evaluation Pipeline
Evy is an independent 9-step workflow triggered by webhook after each run. It performs two evaluation types:
- Step Eval (per agent): accuracy, completeness, format, Evy confidence, and a Bridge Recommendation.
- Run Eval (full six-agent output): overall quality, Ship Decision, accuracy, completeness, deliverability, Bridge Recommendation, and revision count.
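The two evaluation types could be represented as records like the ones below. This is a sketch under stated assumptions: the field names follow the dimensions listed above, but the 0 to 1 score scale, the boolean Ship Decision, and the `PASS_THRESHOLD` value are hypothetical, as the source does not specify them.

```python
from dataclasses import dataclass
from typing import List

PASS_THRESHOLD = 0.8  # assumed pass mark; not stated in the source


@dataclass
class StepEval:
    """Per-agent evaluation record (scores assumed to lie in 0..1)."""
    agent: str
    accuracy: float
    completeness: float
    format: float
    evy_confidence: float
    bridge_recommendation: str


@dataclass
class RunEval:
    """Evaluation of the full six-agent output."""
    overall_quality: float
    ship_decision: bool  # assumed to be a yes/no decision
    accuracy: float
    completeness: float
    deliverability: float
    bridge_recommendation: str
    revision_count: int


def below_pass(evals: List[StepEval],
               threshold: float = PASS_THRESHOLD) -> List[str]:
    """Return the agents whose lowest quality score falls under the threshold."""
    return [
        e.agent for e in evals
        if min(e.accuracy, e.completeness, e.format) < threshold
    ]
```

Flagging on the minimum score rather than an average is one possible design choice: a single weak dimension is enough to surface the agent for a Bridge Recommendation.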
Risk mitigations
Built into the architecture
| Risk | Mitigation |
|---|---|
| Cascading errors from one agent's bad output | Bounded responsibilities limit blast radius. Human-in-the-loop gate on every email: zero dispatches without human approval. |
| Prompt injection via inbound email content | Hardened system instructions at every agent layer. Injection scenarios are built into the Evals regression suite as a standing automated check. |
| Sensitive customer data leakage | Output scanning for sensitive content before a draft enters the dispatch queue. Encryption in transit. |
| HIPAA / compliance violation | Compliance review on every draft. Any compliance flag triggers immediate human escalation (zero-tolerance threshold). |
| Silent failures producing valid-looking but wrong outputs | Evy Step Eval catches per-agent failures on every production run. Bridge Recommendation generates a specific fix instruction for every below-pass score. |
| Repeated mistakes on a specific email category | Run Eval surfaces whether the full pipeline output is shippable. Monitoring dashboards with real-time alerts on score degradation. |
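The human-in-the-loop gate from the first row can be sketched as a guard in front of the send step. The names here (`Draft`, `approved_by`, `dispatch`, `NotApprovedError`) are hypothetical and exist only for illustration; the source does not describe how approval is recorded.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class NotApprovedError(Exception):
    """Raised when a draft reaches dispatch without human approval."""


@dataclass
class Draft:
    email_id: str
    body: str
    approved_by: Optional[str] = None  # set only when a human approves


def dispatch(draft: Draft, send: Callable[[Draft], None]) -> None:
    """Send the draft only if a human approver is recorded on it."""
    if not draft.approved_by:
        raise NotApprovedError(f"Draft {draft.email_id} has no human approval")
    send(draft)
```

Placing the check inside the only function that can send means no upstream agent failure can bypass the gate.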
Target outcomes
What the system is built to deliver
| Metric | Target |
|---|---|
| Draft approval rate on first human review | 80%+ |
| System latency (email receipt to draft in Slack) | Under 90 seconds |
| Factual accuracy (scored by Evy vs. KB) | 98%+ |
| Cost per email across all agents and evaluation | Under $0.65 |
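The targets in the table can be turned into a simple automated check. The thresholds below come from the table; the `Observed` record, its field names, and the treatment of rates as fractions are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Observed:
    """Hypothetical measurements for a reporting period."""
    first_review_approval_rate: float  # fraction, e.g. 0.85 for 85%
    latency_seconds: float             # email receipt to draft in Slack
    factual_accuracy: float            # fraction, scored by Evy vs. KB
    cost_per_email_usd: float          # all agents plus evaluation


def check_targets(o: Observed) -> Dict[str, bool]:
    """Compare observed values with the targets listed in the table."""
    return {
        "approval_rate": o.first_review_approval_rate >= 0.80,
        "latency": o.latency_seconds < 90,
        "factual_accuracy": o.factual_accuracy >= 0.98,
        "cost_per_email": o.cost_per_email_usd < 0.65,
    }
```

A per-metric result, rather than a single pass or fail, makes it clear which target slipped when only one does.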
Key principle
Intelligence supports people; it does not replace them. Zero emails are dispatched without human approval.