Initial screening calls were consuming 40% of recruiter time — 45 minutes per candidate, 6 to 8 calls per day, available only during business hours, in one time zone. As sourcing volume grew, the bottleneck became acute: qualified candidates were aging in the pipeline while waiting for a slot.
This is how GoGloby replaced that bottleneck with a production real-time voice AI system — and what it took to build it right.
Achievements After Partnering With GoGloby
| Metric | Result |
|---|---|
| Screening Time Per Candidate | ↓ 85% |
| Cost Per Screening | $35.00 → $1.38 (↓ 96%) |
| Screening Availability | 8 hrs/day → 24/7 across all time zones |
| Evaluation Consistency Score | 65% → 94% |
| False Positive Rate on Shortlists | 35% → 2.1% |
| Information Extraction Accuracy | 94% |
| Candidate Satisfaction Score | 8.9 / 10 |
| System Uptime | 99.6% |
The Situation at a Glance
| Aspect | Detail |
|---|---|
| Service | Agentic SDLC |
| Stack | LiveKit · Deepgram · GPT-4 · spaCy · OpenAI TTS |
| Cost Per Screening | $1.38 total |
| Availability | 24/7 · All time zones |
| Core Problem | Manual phone screening consuming 40% of recruiter time with no path to scale |
| Solution | End-to-end real-time voice AI pipeline with hybrid LLM + NLP architecture |
The Problem
GoGloby’s initial candidate screening process relied entirely on human recruiters conducting phone calls. Every candidate required 45 minutes of direct recruiter time before a single qualified profile reached a client.
Three structural problems made this unsustainable at scale.
1. Volume and Capacity
Screening was available only during business hours in the recruiter’s time zone. Candidates in different regions waited days for a slot. As open roles multiplied, the bottleneck became acute — the team couldn’t screen fast enough to keep up with sourcing output, so qualified candidates aged in the pipeline while waiting.
2. Inconsistent Evaluation Quality
Two recruiters running the same screening script would arrive at different conclusions 35% of the time. Fatigue, subjectivity, and variations in probing technique meant the signal quality of each call was inherently unreliable. Clients received inconsistent shortlists as a result.
3. A Cost Structure That Didn’t Scale
At $35 per screening — recruiter time fully loaded — and with hundreds of candidates moving through the pipeline monthly, screening was one of the largest operational cost lines in the business. There was no mechanism to bring that cost down without reducing headcount or cutting quality.
The Engineering Challenge
Building a voice AI system that conducts real recruiting conversations is fundamentally different from building a chatbot. The requirements were strict and non-negotiable.
- Sub-second STT latency: Any pause longer than 300ms breaks conversational flow. Candidates notice immediately — the experience degrades from a screening call to a broken phone line.
- Natural TTS voice: A robotic voice triggers disengagement within the first 10 seconds. Candidates form a judgment about the company before the first question is asked.
- Accurate intent classification: The system must know in real time when a candidate has finished answering, when they're asking for clarification, and when they're deflecting, and respond accordingly.
- Multi-turn context retention: A screening runs 12–15 questions. The system must remember what was said 8 turns ago to ask coherent follow-ups and avoid repeating itself.
- Graceful interruption handling: Candidates interrupt. They go off-script. They ask questions mid-answer. The system must handle this without breaking state (see the turn-state sketch after this list).
- 24/7 availability at low cost: The economic case depends on keeping per-screening cost well below the $35 manual baseline — at any hour, any volume.
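To make the interruption requirement concrete, here is a minimal sketch of a barge-in turn-state machine. The names (`TurnManager`, `on_candidate_speech`) and the three-state design are illustrative assumptions, not GoGloby's production code:

```python
from enum import Enum, auto

class TurnState(Enum):
    LISTENING = auto()   # candidate is speaking
    THINKING = auto()    # generating a response
    SPEAKING = auto()    # TTS audio is playing

class TurnManager:
    """Barge-in handling: candidate speech always wins over agent speech."""

    def __init__(self):
        self.state = TurnState.LISTENING

    def on_candidate_speech(self, cancel_playback, resume_listening):
        # Voice activity detected. If the agent is mid-utterance, this is a
        # barge-in: stop TTS immediately and hand the floor back.
        if self.state is TurnState.SPEAKING:
            cancel_playback()
            self.state = TurnState.LISTENING
            resume_listening()

    def on_turn_complete(self, generate_and_speak):
        # The intent classifier decided the candidate finished their turn.
        self.state = TurnState.THINKING
        generate_and_speak()   # LLM response, then TTS playback
        self.state = TurnState.SPEAKING
```

The key property is that candidate audio can pre-empt the `SPEAKING` state at any moment without corrupting conversation history, which is what "without breaking state" demands.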
Technology Stack Decisions
Five independent technology decisions shaped the architecture. Every selection was tested in production-like conditions before being committed.
1. Real-Time Communication Platform
The foundation of the entire system. Instability here cascades into every other layer.
| Platform | Stability | Audio Quality | Dev Experience | Status |
|---|---|---|---|---|
| LiveKit | 9.5/10 | 9.2/10 | 9.5/10 | ✓ Selected |
| Agora | 9.0/10 | 9.0/10 | 8.5/10 | Rejected |
| Twilio Video | 8.5/10 | 8.8/10 | 9.0/10 | Rejected |
| Jitsi | 7.5/10 | 7.0/10 | 7.0/10 | Rejected |
| Mediasoup | 8.0/10 | 8.5/10 | 6.5/10 | Rejected |
LiveKit was open source with no vendor lock-in, shipped native SDKs for web, mobile, and server, and included built-in Voice Activity Detection, which the team extended with custom energy-based filtering. Self-hosted deployment gave full infrastructure control. No other platform came close on the combination of developer experience and configurability.
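For illustration, here is a minimal sketch of the kind of energy gate that can sit on top of a platform VAD, assuming 16-bit little-endian mono PCM frames. The thresholds are placeholder values; the article does not publish the team's actual parameters:

```python
import math
import struct

FRAME_MS = 20
ENERGY_THRESHOLD = 500.0   # RMS floor; anything below counts as silence
HANGOVER_FRAMES = 15       # ~300 ms of silence before declaring end of turn

def frame_rms(pcm_frame: bytes) -> float:
    """RMS energy of a 16-bit little-endian mono PCM frame."""
    if not pcm_frame:
        return 0.0
    samples = struct.unpack(f"<{len(pcm_frame) // 2}h", pcm_frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

class EnergyGate:
    """Second-stage filter applied to the platform VAD's speech frames."""

    def __init__(self):
        self.silent_frames = 0

    def is_speech(self, pcm_frame: bytes) -> bool:
        if frame_rms(pcm_frame) >= ENERGY_THRESHOLD:
            self.silent_frames = 0
            return True
        self.silent_frames += 1
        return False

    def end_of_turn(self) -> bool:
        return self.silent_frames >= HANGOVER_FRAMES
```

The hangover counter is what keeps a mid-sentence breath from being mistaken for a finished answer.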
2. Speech-to-Text (STT)
The most latency-sensitive component. Every millisecond here is felt by the candidate.
| Provider | Accuracy | Latency | Cost / min | Status |
|---|---|---|---|---|
| Deepgram | 94.5% | 150ms | $0.0059 | ✓ Selected |
| OpenAI Whisper | 93.2% | 300ms | $0.006 | Rejected |
| Google Speech-to-Text | 92.8% | 200ms | $0.024 | Rejected |
| Azure Speech | 92.5% | 180ms | $0.020 | Rejected |
Deepgram’s 150ms average latency was the decisive factor — 100ms faster than the nearest competitor. That gap is the difference between a conversation that feels natural and one that feels broken. Its accuracy on accented English was also superior, which matters when screening candidates across Latin America and Eastern Europe.
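A hedged sketch of what streaming into Deepgram looks like over its raw websocket endpoint (the official SDK wraps the same API). The query parameters shown are documented Deepgram options, but treat the exact set, and the `frames` source, as assumptions to verify against current docs:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

DG_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=16000&interim_results=true"
)

async def transcribe(frames):
    """Stream raw PCM frames (an async iterator) and print final transcripts."""
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # Note: websockets >= 13 renames extra_headers to additional_headers.
    async with websockets.connect(DG_URL, extra_headers=headers) as ws:

        async def sender():
            async for frame in frames:  # e.g. audio chunks out of LiveKit
                await ws.send(frame)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                data = json.loads(message)
                alt = data.get("channel", {}).get("alternatives", [{}])[0]
                if data.get("is_final") and alt.get("transcript"):
                    print("final:", alt["transcript"])

        await asyncio.gather(sender(), receiver())
```

Because results stream back as interim hypotheses, downstream intent classification can begin before the candidate finishes a sentence, which is where the 150ms advantage is actually felt.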
3. Text-to-Speech (TTS)
Voice quality determines whether candidates take the screening seriously within the first 10 seconds.
| Provider | Naturalness | Speed | Configurability | Status |
|---|---|---|---|---|
| OpenAI TTS | 9.2/10 | 8.5/10 | 8.0/10 | ✓ Selected |
| Azure TTS | 8.8/10 | 7.8/10 | 8.0/10 | Rejected |
| Google WaveNet | 9.0/10 | 7.5/10 | 7.5/10 | Rejected |
| Amazon Polly | 8.5/10 | 8.0/10 | 8.5/10 | Rejected |
OpenAI TTS scored highest on naturalness across all test panels: candidates in blind tests could not reliably distinguish the voice from a human in the first 30 seconds of a call. Integration synergy with the existing OpenAI stack simplified the architecture, and multi-language voice switching worked reliably out of the box.
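For reference, synthesis through the openai Python SDK (v1.x) is a single call. The model and voice below are illustrative defaults, not necessarily the profile GoGloby locks per session:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def synthesize(text: str, voice: str = "alloy") -> bytes:
    response = client.audio.speech.create(
        model="tts-1",          # the low-latency tier, suited to real time
        voice=voice,            # locked per session for a consistent persona
        input=text,
        response_format="pcm",  # raw audio, ready to stream to the client
    )
    return response.read()
```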
The Critical Optimisation: Routing LLM Work to NLP
This was the most consequential engineering decision in the entire project — and the one that made the economics work.
The initial architecture used GPT-4 for everything: response generation, intent classification, turn management, and clarification detection. Quality was high, but the system hit a hard ceiling. Running GPT-4's 1.2-second calls for every intent classification left too much silence between turns, and infrastructure costs capped the system at 50 concurrent sessions.
The team identified that intent classification — deciding whether a candidate has finished speaking, is asking for clarification, or is giving an incomplete answer — didn’t require the full reasoning power of a large language model. It was a classification problem. spaCy’s en_core_web_md model could do it faster, cheaper, and more accurately.
| Metric | GPT-4 | spaCy | Improvement |
|---|---|---|---|
| Response time | 1.2 seconds | 10 milliseconds | 99.2% faster |
| Memory usage | 5 GB | 70 MB | 98.6% reduction |
| Accuracy | 78% | 99% | +21 points |
| Cost per request | $0.002 | $0.000001 | 99.95% cheaper |
| Concurrent sessions | 50 | 1,000+ | 20× capacity |
The routing rule: GPT-4 handles what requires reasoning — generating responses, asking follow-up questions, summarizing candidate answers. spaCy handles what requires classification — detecting intent, managing turn transitions, identifying when a candidate has finished speaking. Routing accuracy between the two: 96%. The result is a system that delivers GPT-4 quality at near-spaCy cost.
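A minimal sketch of what such a router can look like. Note that `en_core_web_md` has no built-in intent classifier, so this assumes the common pattern of vector similarity against labelled exemplar utterances; GoGloby's actual classifier and routing thresholds are not published:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # ~70 MB model with word vectors

# Hypothetical exemplar utterances per intent; a production system would use
# many per class, or a trained text classifier.
INTENT_EXEMPLARS = {
    "experience_response":   "I worked with Python for five years at my last job",
    "clarification_request": "Sorry, could you explain what you mean by that?",
    "incomplete_answer":     "Well, I guess, um, it depends",
    "deflection":            "I'd rather not go into the details of why I left",
}
EXEMPLAR_DOCS = {label: nlp(text) for label, text in INTENT_EXEMPLARS.items()}

def classify_intent(utterance: str) -> tuple[str, float]:
    """Vector-similarity classification: milliseconds, no LLM call."""
    doc = nlp(utterance)
    scores = {label: doc.similarity(ex) for label, ex in EXEMPLAR_DOCS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def route(utterance: str, generate_with_gpt4):
    intent, _confidence = classify_intent(utterance)
    if intent == "clarification_request":
        # Rephrasing a question needs reasoning: this is the GPT-4 path.
        return generate_with_gpt4(utterance, task="rephrase_question")
    # Turn management and intent bookkeeping stay on the cheap spaCy path.
    return intent
```

The economics follow directly: the expensive model is invoked only when an utterance actually requires generation, so per-call LLM cost scales with conversational substance, not with turn count.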
System Architecture
Seven components, each with a single responsibility, compose into a real-time conversation system that processes audio, classifies intent, generates responses, and manages turn-taking, all within the latency budget of a natural conversation.
| Layer | Component | Technology | Responsibility |
|---|---|---|---|
| 1. Audio I/O | Real-Time Transport | LiveKit + custom VAD | Audio streaming, voice activity detection, turn management, interruption handling |
| 2. Transcription | Speech-to-Text | Deepgram streaming API | Continuous real-time transcription at 150ms latency — streaming, no wait for utterance completion |
| 3. Classification | Intent Detection | spaCy en_core_web_md | Classifies candidate intent in ~10 ms: experience response, clarification request, incomplete answer, deflection |
| 4. Context | Conversation State | In-memory + PostgreSQL | Full conversation history, candidate profile in progress, current question position |
| 5. Generation | Response Engine | GPT-4 + fine-tuned prompts | Contextual responses, follow-up questions, rephrasing on clarification requests |
| 6. Voice | Text-to-Speech | OpenAI TTS | Natural-sounding audio in real time. Voice profile locked per session for consistency |
| 7. Output | Screening Report | Structured JSON → PDF | Auto-generated candidate assessment with score breakdown, key quotes, recruiter recommendation |
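As one illustration of the context layer (component 4), here is a minimal in-memory state object, flushed to PostgreSQL at session end. Field names are assumptions; the article only specifies what the layer must hold:

```python
from dataclasses import dataclass, field

@dataclass
class ScreeningState:
    candidate_id: str
    question_index: int = 0                       # position in the 12-15 script
    transcript: list[tuple[str, str]] = field(default_factory=list)
    profile: dict = field(default_factory=dict)   # facts extracted so far

    def record_turn(self, speaker: str, text: str) -> None:
        self.transcript.append((speaker, text))

    def context_window(self, turns: int = 8) -> list[tuple[str, str]]:
        # What the generator sees: enough history (8 turns, per the
        # requirements above) to ask coherent follow-ups without repetition.
        return self.transcript[-turns:]
```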
Cost Per Screening Breakdown
| Component | Monthly (400 screenings) | Per Screening |
|---|---|---|
| Infrastructure (LiveKit + AWS) | $180 | $0.45 |
| TTS Processing (OpenAI) | $120 | $0.30 |
| STT Processing (Deepgram) | $95 | $0.24 |
| LLM Processing (GPT-4) | $140 | $0.35 |
| NLP Processing (spaCy) | $15 | $0.04 |
| TOTAL | $550 | $1.38 |
$1.38 versus $35.00. The system runs 24 hours a day, in any time zone, with perfectly consistent evaluation criteria on every call — and paid for its entire development cost within the first 90 days of operation.
Results
| Metric | Before | After | Change |
|---|---|---|---|
| Average screening time | 45 minutes | 12 minutes | ↓ 73% |
| Recruiter time per candidate | 45 minutes | 7 minutes | ↓ 85% |
| Screening cost per candidate | $35.00 | $1.38 | ↓ 96% |
| Screening availability | 8 hrs/day | 24 hrs/day | ↑ 200% |
| Evaluation consistency score | 65% | 94% | ↑ 45% |
| False positive rate on shortlists | 35% | 2.1% | ↓ 94% |
| Candidate satisfaction score | — | 8.9 / 10 | — |
| System uptime | — | 99.6% | vs. 99% target |
Every quality target set before the build was met or exceeded in production.
What Clients Say
“GoGloby built a voice AI screening system for us and plugged it into our hiring pipeline. I was not sure how candidates would react. Turns out 94% of them said they appreciated the availability — they could screen at 9pm after work, not during a Tuesday morning slot they had to take off for. Our time-to-shortlist dropped by more than half and the assessment reports the system generates are more consistent than what our internal team was producing.”
— Founder & CEO, Professional Services Firm (120 employees)
“We were running 30 to 40 screening calls a month ourselves. GoGloby integrated their voice AI system into our workflow and within the first month we had cut that to 8 calls — only the ones that actually needed a human. The system handles everything else. The cost saving was immediate but the bigger thing was consistency. Every candidate gets the same quality of screening regardless of who’s having a bad day.”
— Co-Founder, Growth-Stage SaaS Startup (80 employees)
What We’d Tell Engineers Starting This
Use the right model for the right job
GPT-4 for intent classification is engineering overkill — like using an excavator to plant a seed. spaCy does intent classification faster, cheaper, and more accurately because it was built for that specific task. Decompose your AI pipeline into components and evaluate the right model class for each one independently. Don’t default to the most capable model across the board.
Latency compounds across the stack
Every component in a real-time voice pipeline contributes to perceived naturalness. A 150ms STT, an 8ms intent classifier, and a 1.2s response generator add up to a specific conversational experience. Optimise each layer individually — shaving 100ms from STT is worth more than shaving 100ms from response generation because it comes earlier in the interaction loop.
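To make the compounding concrete: assuming the stages run strictly in sequence with no streaming overlap (the worst case), a single turn's silent gap is roughly 150 ms (STT) + 10 ms (intent) + 1,200 ms (generation) ≈ 1.36 s before the first TTS audio begins, so every layer's contribution is audible.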
Voice quality is a first impression you can’t undo
Candidates form an opinion about the company within the first 10 seconds of audio. A robotic voice signals cheap automation. The additional cost of OpenAI TTS over cheaper alternatives is negligible per call — but the impact on candidate experience is substantial. Never compromise on the first thing a candidate hears.
Interruption handling is not an edge case
In a real screening call, candidates interrupt constantly. They ask for clarification mid-sentence. They circle back to earlier questions. Building robust interruption and context recovery logic from the start — not as a patch — was the difference between a system that felt natural and one that felt fragile.
Consistency is the product’s deepest value
The explicit selling point was speed and cost. The deeper value turned out to be consistency. Every candidate gets the same questions, the same probing follow-ups, the same evaluation framework — regardless of time zone, day of week, or recruiter mood. When pitching AI screening to clients, lead with consistency, not automation.
$1.38 changes the business model
At $35 per screening, volume was a constraint. At $1.38, it disappears. Instead of managing screening as a cost to minimize, you can afford to screen more candidates more thoroughly — which improves hire quality at every stage downstream. Cost reduction unlocks strategic options that weren’t available before.