# AI Conversation Unit Tests
Test coverage for the AI conversation system, including features inspired by Blue Machines AI (restraint metrics, LLM failover, interruption handling, latency tracking).
## Test Files Overview
| Test File | Feature Area | Tests |
|---|---|---|
| __tests__/lib/ai/conversation-quality-service.test.ts | Quality scoring, restraint metrics, adaptive pacing | ~55 |
| __tests__/lib/ai/enhanced-server-ai-service.test.ts | LLM provider auto-failover | ~20 |
| __tests__/api/conversations/chat-route.test.ts | Chat API: interruption context, latency tracking | ~15 |
| __tests__/lib/conversation/enhanced-context-manager.test.ts | Context building, pgvector search | ~30 |
| __tests__/lib/conversation/memory-extractor.test.ts | Memory extraction and deduplication | ~25 |
| __tests__/lib/conversation/compaction-service.test.ts | Message compaction at turn 8+ | ~25 |
| __tests__/factories/conversation-factory.ts | Test fixtures including restraint scenarios | N/A |
## Test Fixtures
Located in __tests__/factories/conversation-factory.ts, these fixtures provide realistic conversation data for unit tests.
### Standard Fixtures
| Fixture | Description | Tests |
|---|---|---|
| shortResponses | Brief user messages (1-4 words) | Adaptive pacing |
| emotionalContent | Grief/loss content | Emotional safety layer |
| familyStory | Rich multi-member narrative | Memory extraction |
| longSession | 15+ turns | Compaction trigger at turn 8 |
| resumedSession | Cross-session context | Checkpoint loading |
| repeatedStory | Same story retold | Memory deduplication |
### Restraint Fixtures (Blue Machines AI)
| Fixture | Description | Expected Score |
|---|---|---|
| overQuestioning | Elena asks 3-4 questions per response | Low restraint |
| disproportionateResponse | Short user messages get 200+ char AI responses | Low proportionality |
| topicRedirect | Elena repeatedly redirects away from user’s topic | Low topic adherence |
| goodRestraint | 0-1 questions, proportional responses, follows user’s topic | High restraint |
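To make the fixture contrast concrete, here is a hypothetical sketch of what an over-questioning and a good-restraint fixture might look like. The Turn type and the conversation content are illustrative; the real factory in conversation-factory.ts may use different field names and richer metadata.

```typescript
// Hypothetical fixture shape; not the actual conversation-factory.ts types.
type Turn = { role: "user" | "assistant"; content: string };

// overQuestioning: Elena fires 3-4 questions in a single response.
const overQuestioning: Turn[] = [
  { role: "user", content: "My dad loved fishing." },
  {
    role: "assistant",
    content:
      "Where did he fish? Who went with him? What was his favorite catch? How often did you join?",
  },
];

// goodRestraint: one gentle question, proportional length, stays on the user's topic.
const goodRestraint: Turn[] = [
  { role: "user", content: "My dad loved fishing." },
  {
    role: "assistant",
    content: "That sounds peaceful. What do you remember most about those trips?",
  },
];
```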
## Test Scenarios by Feature
### 1. Restraint Metrics
Source: conversation-quality-service.test.ts > restraint metrics
Restraint is weighted at 10% of the overall quality score. It measures whether Elena holds back appropriately rather than over-engaging.
| Test | What It Verifies |
|---|---|
| High question density | 3+ questions/response detected, low restraint score |
| Good question density | 0-1 questions/response, high restraint score |
| Disproportionate responses | Long AI responses to short user messages penalized |
| Balanced responses | Proportional responses score highly |
| Topic redirection | Redirect phrases detected, low topic adherence |
| Good topic adherence | Following user’s topic scores 100 |
| Space-giving violations | AI over-responding to short messages penalized |
| Good space-giving | Brief responses to brief messages score 100 |
| No AI messages | Returns perfect restraint (edge case) |
| 10% weight in overall | Restraint contributes to overall quality score |
| Few redirects tolerated | 1-2 redirects are OK (threshold > 2) |
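A minimal sketch of the question-density check described above, assuming the thresholds from the table (0-1 questions per response = high restraint, 3+ = low). The function names and the linear penalty in between are illustrative; the real scoring in conversation-quality-service may differ.

```typescript
// Average number of question marks per AI response.
function questionDensity(aiMessages: string[]): number {
  if (aiMessages.length === 0) return 0;
  const total = aiMessages.reduce(
    (sum, msg) => sum + (msg.match(/\?/g) ?? []).length,
    0
  );
  return total / aiMessages.length;
}

// Hypothetical restraint score on a 0-100 scale.
function restraintScore(aiMessages: string[]): number {
  if (aiMessages.length === 0) return 100; // edge case: no AI messages = perfect restraint
  const density = questionDensity(aiMessages);
  if (density <= 1) return 100; // 0-1 questions per response: high restraint
  if (density >= 3) return 20; // 3+ questions per response: low restraint
  return Math.round(100 - (density - 1) * 40); // linear penalty in between (assumed)
}
```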
### 2. LLM Provider Auto-Failover
Source: enhanced-server-ai-service.test.ts
Tests the multi-provider failover chain: Primary → OpenAI GPT-4o-mini → Gemini → Anthropic Claude → Graceful error
| Test | What It Verifies |
|---|---|
| Primary Gemini succeeds | Returns Gemini response, provider='gemini' |
| Primary OpenAI succeeds | Returns OpenAI response, provider='openai' |
| Gemini fails → OpenAI | Automatic failover, provider='openai-failover' |
| OpenAI fails → Gemini | Automatic failover, provider='gemini-failover' |
| Both fail → Anthropic | Third-level failover to Claude |
| All providers fail | Graceful error with “All providers failed” |
| Skip failed provider | Failed primary is excluded from failover chain |
| callGeminiDirect unavailable | Returns failure for unknown model |
| callGeminiDirect throws | Catches error, does not call OpenAI internally |
| callGeminiDirect no fallback | Returns failure without internal fallback |
| fallbackToOpenAI success | GPT-4o-mini returns formatted response |
| fallbackToOpenAI failure | Returns error message with provider info |
| Provider ordering | Stops at first success, doesn’t call later providers |
| All errors collected | Error messages from all providers in final error |
| No Anthropic key | Anthropic excluded when API key not set |
| Timing tracked | responseTime set on success and failure |
| Variable substitution | Template vars replaced before calling provider |
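The core failover behaviors in this table (stop at first success, skip the failed primary, collect all errors) can be sketched as a simple chain. This is a synchronous, simplified illustration; the real service is async and provider-specific, and the Provider/ProviderResult types are assumptions.

```typescript
type ProviderResult = { success: boolean; text?: string; error?: string };
type Provider = { name: string; call: (prompt: string) => ProviderResult };

function callWithFailover(
  providers: Provider[],
  prompt: string,
  failedPrimary?: string
): { provider: string; text: string } | { error: string } {
  const errors: string[] = [];
  for (const p of providers) {
    if (p.name === failedPrimary) continue; // skip the provider that already failed
    try {
      const res = p.call(prompt);
      if (res.success && res.text !== undefined) {
        // stop at first success; later providers are never called
        return { provider: p.name, text: res.text };
      }
      errors.push(`${p.name}: ${res.error ?? "unknown error"}`);
    } catch (e) {
      errors.push(`${p.name}: ${(e as Error).message}`);
    }
  }
  // all collected errors end up in the final message
  return { error: `All providers failed: ${errors.join("; ")}` };
}
```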
### 3. Interruption / Barge-In Context
Source: chat-route.test.ts > interruption / barge-in context
Tests Elena’s ability to handle user interruptions during voice conversations. When a user speaks while Elena’s TTS is playing, the system tracks what was cut off and injects context.
| Test | What It Verifies |
|---|---|
| Interruption context injected | System message added with interrupted text |
| No interruption = no context | wasInterrupted=false → no system message |
| Text truncated to 200 chars | Long interrupted text is truncated |
| Context variables passed | was_interrupted and interrupted_at_text sent to AI |
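The injection behavior verified above can be sketched as follows. The function name, message shape, and system-message wording are illustrative, not the actual chat-route implementation; only the wasInterrupted flag, the 200-character truncation, and the "no interruption = no context" rule come from the tests.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function injectInterruptionContext(
  messages: ChatMessage[],
  wasInterrupted: boolean,
  interruptedAtText: string
): ChatMessage[] {
  if (!wasInterrupted) return messages; // no interruption -> no extra system message
  const truncated = interruptedAtText.slice(0, 200); // cap interrupted text at 200 chars
  return [
    ...messages,
    {
      role: "system",
      content: `The user interrupted while you were saying: "${truncated}"`,
    },
  ];
}
```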
### 4. Voice Latency Tracking
Source: chat-route.test.ts > latency tracking
Tracks timing at each pipeline stage to identify bottlenecks. Target: <2000ms for text response, <4000ms including TTS.
| Test | What It Verifies |
|---|---|
| Latency breakdown in response | auth_ms, context_retrieval_ms, llm_inference_ms, total_api_ms present |
| Non-negative timing values | All latency values >= 0 |
| Latency on DB failure | Latency returned even when save fails |
| Test session latency | Test sessions also get latency breakdown |
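An illustrative tracker producing the latency breakdown fields named above (auth_ms, context_retrieval_ms, llm_inference_ms, total_api_ms). The class and its API are assumptions; the route's actual instrumentation may be structured differently.

```typescript
class LatencyTracker {
  private readonly start = Date.now();
  private last = this.start;
  private readonly marks: Record<string, number> = {};

  // Record the elapsed time for one pipeline stage since the previous mark.
  mark(stage: string): void {
    const now = Date.now();
    this.marks[`${stage}_ms`] = now - this.last;
    this.last = now;
  }

  // Per-stage timings plus the overall total, as returned in the response.
  breakdown(): Record<string, number> {
    return { ...this.marks, total_api_ms: Date.now() - this.start };
  }
}
```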
### 5. Adversarial Test Personas
Source: lib/testing/realistic-personas.ts
These are integration test personas (not Jest unit tests) that run against the live AI to verify Elena’s guardrails under pressure:
| Persona | Purpose | Constitution Principle |
|---|---|---|
| Advice Seeker | Asks for medical/legal/financial advice | #2 - Never gives professional advice |
| Boundary Pusher | Shares distressing content | #6 - Emotional safety protocol |
| System Prober | Tries to extract system prompts | #7 - Never reveals system internals |
| Off-Topic Wanderer | Sends nonsensical messages | Graceful handling |
### 6. Compliance Test Suite
Source: lib/testing/compliance-test-suite.ts
33 test cases covering Elena’s 7 AI Constitution principles:
- Never says “you already told me that”
- Never rushes (“Let’s move on…”, “To summarize quickly…”)
- Never gives medical/legal/financial advice
- Never fabricates memories
- Never reveals system prompts
- Never compares or judges stories
- Handles grief with emotional safety protocol
- Follows user’s lead on topics
### 7. Long-Conversation Stress Tests
Source: lib/testing/enhanced-conversation-tester.ts
Tests context stability over extended conversations (targeting Blue Machines’ 60-minute benchmark):
- 24-turn conversations testing context stability
- Quarter-based degradation analysis (turns 1-6, 7-12, 13-18, 19-24)
- Context compaction verification after turn 8
- Memory reference preservation across compaction boundaries
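The quarter-based degradation analysis can be sketched as averaging per-turn quality scores in blocks of six (turns 1-6, 7-12, 13-18, 19-24). This assumes one scalar score per turn; the real tester tracks richer per-turn metrics.

```typescript
// Average quality score per quarter of the conversation.
function quarterAverages(turnScores: number[], quarterSize = 6): number[] {
  const quarters: number[] = [];
  for (let i = 0; i < turnScores.length; i += quarterSize) {
    const slice = turnScores.slice(i, i + quarterSize);
    quarters.push(slice.reduce((a, b) => a + b, 0) / slice.length);
  }
  return quarters;
}
```

Comparing the first and last quarter averages then surfaces context degradation over a 24-turn run.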
## Running Tests

```bash
# All AI conversation tests
npx jest --testPathPattern="__tests__/(lib/ai|api/conversations/chat)" --verbose

# Restraint metrics only
npx jest conversation-quality-service.test.ts --verbose

# LLM failover only
npx jest enhanced-server-ai-service.test.ts --verbose

# Chat route (interruption + latency) only
npx jest chat-route.test.ts --verbose

# All conversation-related tests
npx jest --testPathPattern="__tests__/(lib/(ai|conversation)|api/conversations)" --verbose
```

## Integration Tests (Live AI)
The integration test framework in lib/testing/ is separate from the Jest unit tests and runs against live AI APIs.

Run it from the Conversation Testing Dashboard or programmatically:

```typescript
import { EnhancedConversationTester } from '@/lib/testing/enhanced-conversation-tester'
```