Initial screening calls were consuming 40% of recruiter time — 45 minutes per candidate, 6 to 8 calls per day, available only during business hours, in one time zone. As sourcing volume grew, the bottleneck became acute: qualified candidates were aging in the pipeline while waiting for a slot.

This is how GoGloby replaced that bottleneck with a production real-time voice AI system — and what it took to build it right.

Achievements After Partnering With GoGloby

| Metric | Result |
|---|---|
| Screening Time Per Candidate | ↓ 85% |
| Cost Per Screening | $35.00 → $1.38 (↓ 96%) |
| Screening Availability | 8 hrs/day → 24/7 across all time zones |
| Evaluation Consistency Score | 65% → 94% |
| False Positive Rate on Shortlists | 35% → 2.1% |
| Information Extraction Accuracy | 94% |
| Candidate Satisfaction Score | 8.9 / 10 |
| System Uptime | 99.6% |

The Situation at a Glance

| Service | Agentic SDLC |
|---|---|
| Stack | LiveKit · Deepgram · GPT-4 · spaCy · OpenAI TTS |
| Cost Per Screening | $1.38 total |
| Availability | 24/7 · All time zones |
| Core Problem | Manual phone screening consuming 40% of recruiter time with no path to scale |
| Solution | End-to-end real-time voice AI pipeline with hybrid LLM + NLP architecture |

The Problem

GoGloby’s initial candidate screening process relied entirely on human recruiters conducting phone calls. Every candidate required 45 minutes of direct recruiter time before a single qualified profile reached a client.

Three structural problems made this unsustainable at scale.

1. Volume and Capacity

Screening was available only during business hours in the recruiter’s time zone. Candidates in different regions waited days for a slot. As open roles multiplied, the bottleneck became acute — the team couldn’t screen fast enough to keep up with sourcing output, so qualified candidates aged in the pipeline while waiting.

2. Inconsistent Evaluation Quality

Two recruiters running the same screening script would arrive at different conclusions 35% of the time. Fatigue, subjectivity, and variations in probing technique meant the signal quality of each call was inherently unreliable. Clients received inconsistent shortlists as a result.

3. A Cost Structure That Didn’t Scale

At $35 per screening — recruiter time fully loaded — and with hundreds of candidates moving through the pipeline monthly, screening was one of the largest operational cost lines in the business. There was no mechanism to bring that cost down without reducing headcount or cutting quality.

The Engineering Challenge

Building a voice AI system that conducts real recruiting conversations is fundamentally different from building a chatbot. The requirements were strict and non-negotiable.

  • Sub-second STT latency: Any pause longer than 300ms breaks conversational flow. Candidates notice immediately — the experience degrades from a screening call to a broken phone line.
  • Natural TTS voice: A robotic voice triggers disengagement within the first 10 seconds. Candidates form a judgment about the company before the first question is asked.
  • Accurate intent classification: The system must know in real time when a candidate has finished answering, when they’re asking for clarification, and when they’re deflecting — and respond accordingly.
  • Multi-turn context retention: A screening runs 12–15 questions. The system must remember what was said 8 turns ago to ask coherent follow-ups and avoid repeating itself.
  • Graceful interruption handling: Candidates interrupt. They go off-script. They ask questions mid-answer. The system must handle this without breaking state.
  • 24/7 availability at low cost: The economic case depends on keeping per-screening cost well below the $35 manual baseline — at any hour, any volume.
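The interruption requirement in particular is easy to underestimate. A minimal sketch of barge-in handling as a state machine — class and method names here are illustrative, not the production API:

```python
from enum import Enum, auto

class TurnState(Enum):
    AGENT_SPEAKING = auto()
    CANDIDATE_SPEAKING = auto()
    AWAITING_RESPONSE = auto()

class TurnManager:
    """Minimal turn-taking state machine with barge-in handling."""

    def __init__(self):
        self.state = TurnState.AGENT_SPEAKING
        self.interrupted = False

    def on_candidate_voice(self):
        # Candidate started speaking. If the agent is mid-utterance,
        # flag an interruption so TTS playback can be cancelled.
        if self.state == TurnState.AGENT_SPEAKING:
            self.interrupted = True
        self.state = TurnState.CANDIDATE_SPEAKING

    def on_candidate_silence(self):
        # Candidate stopped; hand the turn back to the agent.
        if self.state == TurnState.CANDIDATE_SPEAKING:
            self.state = TurnState.AWAITING_RESPONSE

    def on_agent_response(self):
        self.interrupted = False
        self.state = TurnState.AGENT_SPEAKING
```

The key point is that an interruption is a normal transition, not an error path — state survives the barge-in, so the conversation can resume coherently.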

Technology Stack Decisions

Five independent technology decisions shaped the architecture. Every selection was tested in production-like conditions before being committed.

1. Real-Time Communication Platform

The foundation of the entire system. Instability here cascades into every other layer.

| Platform | Stability | Audio Quality | Dev Experience | Status |
|---|---|---|---|---|
| LiveKit | 9.5/10 | 9.2/10 | 9.5/10 | ✓ Selected |
| Agora | 9.0/10 | 9.0/10 | 8.5/10 | Rejected |
| Twilio Video | 8.5/10 | 8.8/10 | 9.0/10 | Rejected |
| Jitsi | 7.5/10 | 7.0/10 | 7.0/10 | Rejected |
| Mediasoup | 8.0/10 | 8.5/10 | 6.5/10 | Rejected |

LiveKit was open-source with no vendor lock-in, native SDKs for web, mobile, and server, and built-in Voice Activity Detection that the team extended with custom energy-based filtering. Self-hosted deployment gave full infrastructure control. No other platform came close on the combination of developer experience and configurability.
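The custom energy-based filtering layered on top of LiveKit's built-in VAD can be sketched in a few lines — the threshold value here is illustrative, not the production setting:

```python
import math

def rms_energy(frame):
    """Root-mean-square energy of one audio frame (float samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=0.02):
    # Frames below the energy threshold are treated as silence or line
    # noise, so breathing and hiss don't trigger a false turn transition.
    return rms_energy(frame) >= threshold
```

In practice a filter like this also needs hangover logic (keep the speech flag up for a few frames after energy drops) so mid-word pauses aren't misread as end of turn.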

2. Speech-to-Text (STT)

The most latency-sensitive component. Every millisecond here is felt by the candidate.

| Provider | Accuracy | Latency | Cost / min | Status |
|---|---|---|---|---|
| Deepgram | 94.5% | 150ms | $0.0059 | ✓ Selected |
| OpenAI Whisper | 93.2% | 300ms | $0.006 | Rejected |
| Google Speech-to-Text | 92.8% | 200ms | $0.024 | Rejected |
| Azure Speech | 92.5% | 180ms | $0.020 | Rejected |

Deepgram’s 150ms average latency was the decisive factor — 30ms ahead of Azure, the nearest competitor, and half of Whisper’s 300ms. That gap is the difference between a conversation that feels natural and one that feels broken. Its accuracy on accented English was also superior, which matters when screening candidates across Latin America and Eastern Europe.

3. Text-to-Speech (TTS)

Voice quality determines whether candidates take the screening seriously within the first 10 seconds.

| Provider | Naturalness | Speed | Configurability | Status |
|---|---|---|---|---|
| OpenAI TTS | 9.2/10 | 8.5/10 | 8.0/10 | ✓ Selected |
| Azure TTS | 8.8/10 | 7.8/10 | 8.0/10 | Rejected |
| Google WaveNet | 9.0/10 | 7.5/10 | 7.5/10 | Rejected |
| Amazon Polly | 8.5/10 | 8.0/10 | 8.5/10 | Rejected |

Naturalness scored highest across all test panels — candidates in blind tests could not reliably distinguish the voice from a human in the first 30 seconds of a call. Integration synergy with the existing OpenAI stack simplified the architecture, and multi-language voice switching worked reliably out of the box.

The Critical Optimization: Routing From LLM to NLP

This was the most consequential engineering decision in the entire project — and the one that made the economics work.

The initial architecture used GPT-4 for everything: response generation, intent classification, turn management, and clarification detection. The quality was high, but the system had a hard ceiling: GPT-4 at 1.2 seconds per call, running for every intent classification, meant conversations felt slightly off — too much silence between turns, and a hard cap of 50 concurrent sessions before infrastructure costs became prohibitive.

The team identified that intent classification — deciding whether a candidate has finished speaking, is asking for clarification, or is giving an incomplete answer — didn’t require the full reasoning power of a large language model. It was a classification problem. spaCy’s en_core_web_md model could do it faster, cheaper, and more accurately.

| Metric | GPT-4 | spaCy | Improvement |
|---|---|---|---|
| Response time | 1.2 seconds | 10 milliseconds | 99.2% faster |
| Memory usage | 5 GB | 70 MB | 98.6% reduction |
| Accuracy | 78% | 99% | +21 points |
| Cost per request | $0.002 | $0.000001 | 99.95% cheaper |
| Concurrent sessions | 50 | 1,000+ | 20× capacity |

The routing rule: GPT-4 handles what requires reasoning — generating responses, asking follow-up questions, summarizing candidate answers. spaCy handles what requires classification — detecting intent, managing turn transitions, identifying when a candidate has finished speaking. Routing accuracy between the two: 96%. The result is a system that delivers GPT-4 quality at near-spaCy cost.
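The routing rule reduces to a cheap classifier sitting in front of the LLM. A pure-Python sketch of the idea — `classify_intent` stands in for the spaCy model using toy keyword rules, and the action names are hypothetical, not the production vocabulary:

```python
def classify_intent(utterance: str) -> str:
    """Stand-in for the spaCy intent classifier (toy keyword rules)."""
    text = utterance.lower().strip()
    if text.endswith("?") or "what do you mean" in text:
        return "clarification_request"
    if text.endswith(("...", "um", "uh")):
        return "answer_incomplete"
    return "answer_complete"

def route(utterance: str):
    """Cheap classifier decides first; only reasoning work reaches the LLM."""
    intent = classify_intent(utterance)
    if intent == "clarification_request":
        return ("llm", "rephrase_question")   # needs generation
    if intent == "answer_incomplete":
        return ("nlp", "wait_for_more")       # no LLM call at all
    return ("llm", "generate_follow_up")      # reasoning path
```

The economics follow from the shape of this function: the majority of turns resolve in the classifier branch, so the expensive model is invoked only when its reasoning is actually needed.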

System Architecture

Seven components, each with a single responsibility, compose into a real-time conversation system that processes audio, classifies intent, generates responses, and manages turn-taking — all within the latency budget of a natural conversation.

| Layer | Component | Technology | Responsibility |
|---|---|---|---|
| 1. Audio I/O | Real-Time Transport | LiveKit + custom VAD | Audio streaming, voice activity detection, turn management, interruption handling |
| 2. Transcription | Speech-to-Text | Deepgram streaming API | Continuous real-time transcription at 150ms latency — streaming, no wait for utterance completion |
| 3. Classification | Intent Detection | spaCy en_core_web_md | Classifies candidate intent in 8ms: experience response, clarification request, incomplete answer, deflection |
| 4. Context | Conversation State | In-memory + PostgreSQL | Full conversation history, candidate profile in progress, current question position |
| 5. Generation | Response Engine | GPT-4 + fine-tuned prompts | Contextual responses, follow-up questions, rephrasing on clarification requests |
| 6. Voice | Text-to-Speech | OpenAI TTS | Natural-sounding audio in real time. Voice profile locked per session for consistency |
| 7. Output | Screening Report | Structured JSON → PDF | Auto-generated candidate assessment with score breakdown, key quotes, recruiter recommendation |
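Layer 4's session state can be sketched as a small dataclass — field and method names here are illustrative, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """In-memory session state; the real system also persists this
    to PostgreSQL between turns."""
    session_id: str
    question_index: int = 0
    history: list = field(default_factory=list)   # (speaker, text) turns
    profile: dict = field(default_factory=dict)   # facts extracted so far

    def record_turn(self, speaker: str, text: str):
        self.history.append((speaker, text))

    def recent_context(self, turns: int = 8):
        # The response generator only needs the last few turns to ask
        # coherent follow-ups and avoid repeating itself.
        return self.history[-turns:]
```

Capping the context window per turn keeps the GPT-4 prompt bounded even on long screenings, while the full history remains available for the final report.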

Cost Per Screening Breakdown

| Component | Monthly (400 screenings) | Per Screening |
|---|---|---|
| Infrastructure (LiveKit + AWS) | $180 | $0.45 |
| TTS Processing (OpenAI) | $120 | $0.30 |
| STT Processing (Deepgram) | $95 | $0.24 |
| LLM Processing (GPT-4) | $140 | $0.35 |
| NLP Processing (spaCy) | $15 | $0.04 |
| TOTAL | $550 | $1.38 |

$1.38 versus $35.00. The system runs 24 hours a day, in any time zone, with perfectly consistent evaluation criteria on every call — and paid for its entire development cost within the first 90 days of operation.
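The line items reconcile; a quick sanity check of the arithmetic:

```python
# Monthly cost line items from the table above, at 400 screenings/month.
monthly = {
    "infrastructure": 180, "tts": 120, "stt": 95, "llm": 140, "nlp": 15,
}
screenings = 400

total_monthly = sum(monthly.values())        # 550
per_screening = total_monthly / screenings   # 1.375, rounds to $1.38
savings = 1 - per_screening / 35.00          # vs. the $35 manual baseline

print(total_monthly, round(per_screening, 2), f"{savings:.0%}")
# prints: 550 1.38 96%
```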

Results

| Metric | Before | After | Change |
|---|---|---|---|
| Average screening time | 45 minutes | 12 minutes | ↓ 73% |
| Recruiter time per candidate | 45 minutes | 7 minutes | ↓ 85% |
| Screening cost per candidate | $35.00 | $1.38 | ↓ 96% |
| Screening availability | 8 hrs/day | 24 hrs/day | ↑ 200% |
| Evaluation consistency score | 65% | 94% | ↑ 45% |
| False positive rate on shortlists | 35% | 2.1% | ↓ 94% |
| Candidate satisfaction score | — | 8.9 / 10 | — |
| System uptime | — | 99.6% | vs. 99% target |

Every quality target set before the build was met or exceeded in production.

What Clients Say

“GoGloby built a voice AI screening system for us and plugged it into our hiring pipeline. I was not sure how candidates would react. Turns out 94% of them said they appreciated the availability — they could screen at 9pm after work, not during a Tuesday morning slot they had to take off for. Our time-to-shortlist dropped by more than half and the assessment reports the system generates are more consistent than what our internal team was producing.”

— Founder & CEO, Professional Services Firm (120 employees)

“We were running 30 to 40 screening calls a month ourselves. GoGloby integrated their voice AI system into our workflow and within the first month we had cut that to 8 calls — only the ones that actually needed a human. The system handles everything else. The cost saving was immediate but the bigger thing was consistency. Every candidate gets the same quality of screening regardless of who’s having a bad day.”

— Co-Founder, Growth-Stage SaaS Startup (80 employees)

What We’d Tell Engineers Starting This

Use the right model for the right job

GPT-4 for intent classification is engineering overkill — like using an excavator to plant a seed. spaCy does intent classification faster, cheaper, and more accurately because it was built for that specific task. Decompose your AI pipeline into components and evaluate the right model class for each one independently. Don’t default to the most capable model across the board.

Latency compounds across the stack

Every component in a real-time voice pipeline contributes to perceived naturalness. A 150ms STT, an 8ms intent classifier, and a 1.2s response generator add up to a specific conversational experience. Optimize each layer individually — shaving 100ms from STT is worth more than shaving 100ms from response generation because it comes earlier in the interaction loop.
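That budget can be written down explicitly. TTS time-to-first-audio is excluded here, so the real gap a candidate hears is slightly longer:

```python
# Illustrative per-turn latency budget, using the figures quoted above.
BUDGET_MS = {"stt": 150, "intent_nlp": 8, "generation_llm": 1200}

def perceived_gap_ms(budget):
    """Silence between the candidate finishing and the reply starting
    (stages run sequentially in the worst case)."""
    return sum(budget.values())
```

With these numbers the gap is about 1.36 seconds before TTS even starts — which is why every milliseconds saved in the early perception stages matters more than the same saving later in the loop.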

Voice quality is a first impression you can’t undo

Candidates form an opinion about the company within the first 10 seconds of audio. A robotic voice signals cheap automation. The additional cost of OpenAI TTS over cheaper alternatives is negligible per call — but the impact on candidate experience is substantial. Never compromise on the first thing a candidate hears.

Interruption handling is not an edge case

In a real screening call, candidates interrupt constantly. They ask for clarification mid-sentence. They circle back to earlier questions. Building robust interruption and context recovery logic from the start — not as a patch — was the difference between a system that felt natural and one that felt fragile.

Consistency is the product’s deepest value

The explicit selling point was speed and cost. The deeper value turned out to be consistency. Every candidate gets the same questions, the same probing follow-ups, the same evaluation framework — regardless of time zone, day of week, or recruiter mood. When pitching AI screening to clients, lead with consistency, not automation.

$1.38 changes the business model

At $35 per screening, volume was a constraint. At $1.38, it disappears. Instead of managing screening as a cost to minimize, you can afford to screen more candidates more thoroughly — which improves hire quality at every stage downstream. Cost reduction unlocks strategic options that weren’t available before.