Initial screening calls were consuming 40% of recruiter time — 45 minutes per candidate, 6 to 8 calls per day, available only during business hours, in one time zone. As sourcing volume grew, the bottleneck became acute: qualified candidates were aging in the pipeline while waiting for a slot.
This is how GoGloby replaced that bottleneck with a production real-time voice AI system — and what it took to build it right.
Achievements After Partnering With GoGloby
| Metric | Result |
|---|---|
| Screening Time Per Candidate | ↓ 85% |
| Cost Per Screening | $35.00 → $1.38 (↓ 96%) |
| Screening Availability | 8 hrs/day → 24/7 across all time zones |
| Evaluation Consistency Score | 65% → 94% |
| False Positive Rate on Shortlists | 35% → 2.1% |
| Information Extraction Accuracy | 94% |
| Candidate Satisfaction Score | 8.9 / 10 |
| System Uptime | 99.6% |
The Situation at a Glance
| Aspect | Detail |
|---|---|
| Service | Agentic SDLC |
| Stack | LiveKit · Deepgram · GPT-4 · spaCy · OpenAI TTS |
| Cost Per Screening | $1.38 total |
| Availability | 24/7 · All time zones |
| Core Problem | Manual phone screening consuming 40% of recruiter time with no path to scale |
| Solution | End-to-end real-time voice AI pipeline with hybrid LLM + NLP architecture |
The Problem
GoGloby’s initial candidate screening process relied entirely on human recruiters conducting phone calls. Every candidate required 45 minutes of direct recruiter time before a single qualified profile reached a client.
Three structural problems made this unsustainable at scale.
1. Volume and Capacity
Screening was available only during business hours in the recruiter’s time zone. Candidates in different regions waited days for a slot. As open roles multiplied, the bottleneck became acute — the team couldn’t screen fast enough to keep up with sourcing output, so qualified candidates aged in the pipeline while waiting.
2. Inconsistent Evaluation Quality
Two recruiters running the same screening script would arrive at different conclusions 35% of the time. Fatigue, subjectivity, and variations in probing technique meant the signal quality of each call was inherently unreliable. Clients received inconsistent shortlists as a result.
3. A Cost Structure That Didn’t Scale
At $35 per screening — recruiter time fully loaded — and with hundreds of candidates moving through the pipeline monthly, screening was one of the largest operational cost lines in the business. There was no mechanism to bring that cost down without reducing headcount or cutting quality.
The Engineering Challenge
Building a voice AI system that conducts real recruiting conversations is fundamentally different from building a chatbot. The requirements were strict and non-negotiable.
- Sub-second STT latency: Any pause longer than 300ms breaks conversational flow. Candidates notice immediately — the experience degrades from a screening call to a broken phone line.
- Natural TTS voice: A robotic voice triggers disengagement within the first 10 seconds. Candidates form a judgment about the company before the first question is asked.
- Accurate intent classification: The system must know in real time when a candidate has finished answering, when they're asking for clarification, and when they're deflecting, and respond accordingly.
- Multi-turn context retention: A screening runs 12–15 questions. The system must remember what was said 8 turns ago to ask coherent follow-ups and avoid repeating itself.
- Graceful interruption handling: Candidates interrupt. They go off-script. They ask questions mid-answer. The system must handle this without breaking state (see the turn-state sketch after this list).
- 24/7 availability at low cost: The economic case depends on keeping per-screening cost well below the $35 manual baseline — at any hour, any volume.
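To make the interruption requirement concrete, here is a minimal sketch of a barge-in turn-state machine. The names (`TurnManager`, `on_candidate_speech`) and the three-state design are illustrative assumptions, not GoGloby's production code:

```python
from enum import Enum, auto

class TurnState(Enum):
    LISTENING = auto()   # candidate is speaking
    THINKING = auto()    # generating a response
    SPEAKING = auto()    # TTS audio is playing

class TurnManager:
    """Barge-in handling: candidate speech always wins over agent speech."""

    def __init__(self):
        self.state = TurnState.LISTENING

    def on_candidate_speech(self, cancel_playback, resume_listening):
        # Voice activity detected. If the agent is mid-utterance, this is a
        # barge-in: stop TTS immediately and hand the floor back.
        if self.state is TurnState.SPEAKING:
            cancel_playback()
            self.state = TurnState.LISTENING
            resume_listening()

    def on_turn_complete(self, generate_and_speak):
        # The intent classifier decided the candidate finished their turn.
        self.state = TurnState.THINKING
        generate_and_speak()   # LLM response, then TTS playback
        self.state = TurnState.SPEAKING
```

The key property is that candidate audio can pre-empt the `SPEAKING` state at any moment without corrupting conversation history, which is what "without breaking state" demands.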
Technology Stack Decisions
Five independent technology decisions shaped the architecture. Every selection was tested in production-like conditions before being committed.
1. Real-Time Communication Platform
The foundation of the entire system. Instability here cascades into every other layer.
| Platform | Stability | Audio Quality | Dev Experience | Status |
|---|---|---|---|---|
| LiveKit | 9.5/10 | 9.2/10 | 9.5/10 | ✓ Selected |
| Agora | 9.0/10 | 9.0/10 | 8.5/10 | Rejected |
| Twilio Video | 8.5/10 | 8.8/10 | 9.0/10 | Rejected |
| Jitsi | 7.5/10 | 7.0/10 | 7.0/10 | Rejected |
| Mediasoup | 8.0/10 | 8.5/10 | 6.5/10 | Rejected |
LiveKit was open source with no vendor lock-in, shipped native SDKs for web, mobile, and server, and included built-in Voice Activity Detection, which the team extended with custom energy-based filtering. Self-hosted deployment gave full infrastructure control. No other platform came close on the combination of developer experience and configurability.
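For illustration, here is a minimal sketch of the kind of energy gate that can sit on top of a platform VAD, assuming 16-bit little-endian mono PCM frames. The thresholds are placeholder values; the article does not publish the team's actual parameters:

```python
import math
import struct

FRAME_MS = 20
ENERGY_THRESHOLD = 500.0   # RMS floor; anything below counts as silence
HANGOVER_FRAMES = 15       # ~300 ms of silence before declaring end of turn

def frame_rms(pcm_frame: bytes) -> float:
    """RMS energy of a 16-bit little-endian mono PCM frame."""
    if not pcm_frame:
        return 0.0
    samples = struct.unpack(f"<{len(pcm_frame) // 2}h", pcm_frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

class EnergyGate:
    """Second-stage filter applied to the platform VAD's speech frames."""

    def __init__(self):
        self.silent_frames = 0

    def is_speech(self, pcm_frame: bytes) -> bool:
        if frame_rms(pcm_frame) >= ENERGY_THRESHOLD:
            self.silent_frames = 0
            return True
        self.silent_frames += 1
        return False

    def end_of_turn(self) -> bool:
        return self.silent_frames >= HANGOVER_FRAMES
```

The hangover counter is what keeps a mid-sentence breath from being mistaken for a finished answer.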
2. Speech-to-Text (STT)
The most latency-sensitive component. Every millisecond here is felt by the candidate.
| Provider | Accuracy | Latency | Cost / min | Status |
|---|---|---|---|---|
| Deepgram | 94.5% | 150ms | $0.0059 | ✓ Selected |
| OpenAI Whisper | 93.2% | 300ms | $0.006 | Rejected |
| Google Speech-to-Text | 92.8% | 200ms | $0.024 | Rejected |
| Azure Speech | 92.5% | 180ms | $0.020 | Rejected |
Deepgram’s 150ms average latency was the decisive factor — 100ms faster than the nearest competitor. That gap is the difference between a conversation that feels natural and one that feels broken. Its accuracy on accented English was also superior, which matters when screening candidates across Latin America and Eastern Europe.
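A hedged sketch of what streaming into Deepgram looks like over its raw websocket endpoint (the official SDK wraps the same API). The query parameters shown are documented Deepgram options, but treat the exact set, and the `frames` source, as assumptions to verify against current docs:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

DG_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=16000&interim_results=true"
)

async def transcribe(frames):
    """Stream raw PCM frames (an async iterator) and print final transcripts."""
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # Note: websockets >= 13 renames extra_headers to additional_headers.
    async with websockets.connect(DG_URL, extra_headers=headers) as ws:

        async def sender():
            async for frame in frames:  # e.g. audio chunks out of LiveKit
                await ws.send(frame)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                data = json.loads(message)
                alt = data.get("channel", {}).get("alternatives", [{}])[0]
                if data.get("is_final") and alt.get("transcript"):
                    print("final:", alt["transcript"])

        await asyncio.gather(sender(), receiver())
```

Because results stream back as interim hypotheses, downstream intent classification can begin before the candidate finishes a sentence, which is where the 150ms advantage is actually felt.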
3. Text-to-Speech (TTS)
Voice quality determines whether candidates take the screening seriously within the first 10 seconds.
| Provider | Naturalness | Speed | Configurability | Status |
|---|---|---|---|---|
| OpenAI TTS | 9.2/10 | 8.5/10 | 8.0/10 | ✓ Selected |
| Azure TTS | 8.8/10 | 7.8/10 | 8.0/10 | Rejected |
| Google WaveNet | 9.0/10 | 7.5/10 | 7.5/10 | Rejected |
| Amazon Polly | 8.5/10 | 8.0/10 | 8.5/10 | Rejected |
OpenAI TTS scored highest on naturalness across all test panels: candidates in blind tests could not reliably distinguish the voice from a human in the first 30 seconds of a call. Integration synergy with the existing OpenAI stack simplified the architecture, and multi-language voice switching worked reliably out of the box.
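For reference, synthesis through the openai Python SDK (v1.x) is a single call. The model and voice below are illustrative defaults, not necessarily the profile GoGloby locks per session:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def synthesize(text: str, voice: str = "alloy") -> bytes:
    response = client.audio.speech.create(
        model="tts-1",          # the low-latency tier, suited to real time
        voice=voice,            # locked per session for a consistent persona
        input=text,
        response_format="pcm",  # raw audio, ready to stream to the client
    )
    return response.read()
```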
The Critical Optimisation: Routing LLM Work to NLP
This was the most consequential engineering decision in the entire project — and the one that made the economics work.
The initial architecture used GPT-4 for everything: response generation, intent classification, turn management, and clarification detection. Quality was high, but the system hit a hard ceiling. Running GPT-4's 1.2-second calls for every intent classification left too much silence between turns, and infrastructure costs capped the system at 50 concurrent sessions.
The team identified that intent classification — deciding whether a candidate has finished speaking, is asking for clarification, or is giving an incomplete answer — didn’t require the full reasoning power of a large language model. It was a classification problem. spaCy’s en_core_web_md model could do it faster, cheaper, and more accurately.
| Metric | GPT-4 | spaCy | Improvement |
|---|---|---|---|
| Response time | 1.2 seconds | 10 milliseconds | 99.2% faster |
| Memory usage | 5 GB | 70 MB | 98.6% reduction |
| Accuracy | 78% | 99% | +21 points |
| Cost per request | $0.002 | $0.000001 | 99.95% cheaper |
| Concurrent sessions | 50 | 1,000+ | 20× capacity |
The routing rule: GPT-4 handles what requires reasoning — generating responses, asking follow-up questions, summarizing candidate answers. spaCy handles what requires classification — detecting intent, managing turn transitions, identifying when a candidate has finished speaking. Routing accuracy between the two: 96%. The result is a system that delivers GPT-4 quality at near-spaCy cost.
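A minimal sketch of what such a router can look like. Note that `en_core_web_md` has no built-in intent classifier, so this assumes the common pattern of vector similarity against labelled exemplar utterances; GoGloby's actual classifier and routing thresholds are not published:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # ~70 MB model with word vectors

# Hypothetical exemplar utterances per intent; a production system would use
# many per class, or a trained text classifier.
INTENT_EXEMPLARS = {
    "experience_response":   "I worked with Python for five years at my last job",
    "clarification_request": "Sorry, could you explain what you mean by that?",
    "incomplete_answer":     "Well, I guess, um, it depends",
    "deflection":            "I'd rather not go into the details of why I left",
}
EXEMPLAR_DOCS = {label: nlp(text) for label, text in INTENT_EXEMPLARS.items()}

def classify_intent(utterance: str) -> tuple[str, float]:
    """Vector-similarity classification: milliseconds, no LLM call."""
    doc = nlp(utterance)
    scores = {label: doc.similarity(ex) for label, ex in EXEMPLAR_DOCS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def route(utterance: str, generate_with_gpt4):
    intent, _confidence = classify_intent(utterance)
    if intent == "clarification_request":
        # Rephrasing a question needs reasoning: this is the GPT-4 path.
        return generate_with_gpt4(utterance, task="rephrase_question")
    # Turn management and intent bookkeeping stay on the cheap spaCy path.
    return intent
```

The economics follow directly: the expensive model is invoked only when an utterance actually requires generation, so per-call LLM cost scales with conversational substance, not with turn count.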
System Architecture
Seven components, each with a single responsibility, compose into a real-time conversation system that processes audio, classifies intent, generates responses, and manages turn-taking, all within the latency budget of a natural conversation.
| Layer | Component | Technology | Responsibility |
|---|---|---|---|
| 1. Audio I/O | Real-Time Transport | LiveKit + custom VAD | Audio streaming, voice activity detection, turn management, interruption handling |
| 2. Transcription | Speech-to-Text | Deepgram streaming API | Continuous real-time transcription at 150ms latency — streaming, no wait for utterance completion |
| 3. Classification | Intent Detection | spaCy en_core_web_md | Classifies candidate intent in ~10 ms: experience response, clarification request, incomplete answer, deflection |
| 4. Context | Conversation State | In-memory + PostgreSQL | Full conversation history, candidate profile in progress, current question position |
| 5. Generation | Response Engine | GPT-4 + fine-tuned prompts | Contextual responses, follow-up questions, rephrasing on clarification requests |
| 6. Voice | Text-to-Speech | OpenAI TTS | Natural-sounding audio in real time. Voice profile locked per session for consistency |
| 7. Output | Screening Report | Structured JSON → PDF | Auto-generated candidate assessment with score breakdown, key quotes, recruiter recommendation |
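As one illustration of the context layer (component 4), here is a minimal in-memory state object, flushed to PostgreSQL at session end. Field names are assumptions; the article only specifies what the layer must hold:

```python
from dataclasses import dataclass, field

@dataclass
class ScreeningState:
    candidate_id: str
    question_index: int = 0                       # position in the 12-15 script
    transcript: list[tuple[str, str]] = field(default_factory=list)
    profile: dict = field(default_factory=dict)   # facts extracted so far

    def record_turn(self, speaker: str, text: str) -> None:
        self.transcript.append((speaker, text))

    def context_window(self, turns: int = 8) -> list[tuple[str, str]]:
        # What the generator sees: enough history (8 turns, per the
        # requirements above) to ask coherent follow-ups without repetition.
        return self.transcript[-turns:]
```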
Cost Per Screening Breakdown
| Component | Monthly (400 screenings) | Per Screening |
|---|---|---|
| Infrastructure (LiveKit + AWS) | $180 | $0.45 |
| TTS Processing (OpenAI) | $120 | $0.30 |
| STT Processing (Deepgram) | $95 | $0.24 |
| LLM Processing (GPT-4) | $140 | $0.35 |
| NLP Processing (spaCy) | $15 | $0.04 |
| TOTAL | $550 | $1.38 |
$1.38 versus $35.00. The system runs 24 hours a day, in any time zone, with perfectly consistent evaluation criteria on every call — and paid for its entire development cost within the first 90 days of operation.
Results
| Metric | Before | After | Change |
|---|---|---|---|
| Average screening time | 45 minutes | 12 minutes | ↓ 73% |
| Recruiter time per candidate | 45 minutes | 7 minutes | ↓ 85% |
| Screening cost per candidate | $35.00 | $1.38 | ↓ 96% |
| Screening availability | 8 hrs/day | 24 hrs/day | ↑ 200% |
| Evaluation consistency score | 65% | 94% | ↑ 45% |
| False positive rate on shortlists | 35% | 2.1% | ↓ 94% |
| Candidate satisfaction score | — | 8.9 / 10 | — |
| System uptime | — | 99.6% | vs. 99% target |
Every quality target set before the build was met or exceeded in production.
What Clients Say
“GoGloby built a voice AI screening system for us and plugged it into our hiring pipeline. I was not sure how candidates would react. Turns out 94% of them said they appreciated the availability — they could screen at 9pm after work, not during a Tuesday morning slot they had to take off for. Our time-to-shortlist dropped by more than half and the assessment reports the system generates are more consistent than what our internal team was producing.”
— Founder & CEO, Professional Services Firm (120 employees)
“We were running 30 to 40 screening calls a month ourselves. GoGloby integrated their voice AI system into our workflow and within the first month we had cut that to 8 calls — only the ones that actually needed a human. The system handles everything else. The cost saving was immediate but the bigger thing was consistency. Every candidate gets the same quality of screening regardless of who’s having a bad day.”
— Co-Founder, Growth-Stage SaaS Startup (80 employees)
What We’d Tell Engineers Starting This
Use the right model for the right job
GPT-4 for intent classification is engineering overkill — like using an excavator to plant a seed. spaCy does intent classification faster, cheaper, and more accurately because it was built for that specific task. Decompose your AI pipeline into components and evaluate the right model class for each one independently. Don’t default to the most capable model across the board.
Latency compounds across the stack
Every component in a real-time voice pipeline contributes to perceived naturalness. A 150ms STT, an 8ms intent classifier, and a 1.2s response generator add up to a specific conversational experience. Optimise each layer individually — shaving 100ms from STT is worth more than shaving 100ms from response generation because it comes earlier in the interaction loop.
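To make the compounding concrete: assuming the stages run strictly in sequence with no streaming overlap (the worst case), a single turn's silent gap is roughly 150 ms (STT) + 10 ms (intent) + 1,200 ms (generation) ≈ 1.36 s before the first TTS audio begins, so every layer's contribution is audible.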
Voice quality is a first impression you can’t undo
Candidates form an opinion about the company within the first 10 seconds of audio. A robotic voice signals cheap automation. The additional cost of OpenAI TTS over cheaper alternatives is negligible per call — but the impact on candidate experience is substantial. Never compromise on the first thing a candidate hears.
Interruption handling is not an edge case
In a real screening call, candidates interrupt constantly. They ask for clarification mid-sentence. They circle back to earlier questions. Building robust interruption and context recovery logic from the start — not as a patch — was the difference between a system that felt natural and one that felt fragile.
Consistency is the product’s deepest value
The explicit selling point was speed and cost. The deeper value turned out to be consistency. Every candidate gets the same questions, the same probing follow-ups, the same evaluation framework — regardless of time zone, day of week, or recruiter mood. When pitching AI screening to clients, lead with consistency, not automation.
$1.38 changes the business model
At $35 per screening, volume was a constraint. At $1.38, it disappears. Instead of managing screening as a cost to minimize, you can afford to screen more candidates more thoroughly — which improves hire quality at every stage downstream. Cost reduction unlocks strategic options that weren’t available before.