AI Conversation Unit Tests

Test coverage for the AI conversation system, including features inspired by Blue Machines AI (restraint metrics, LLM failover, interruption handling, latency tracking).

Test Files Overview

Test File	Feature Area	Tests
`__tests__/lib/ai/conversation-quality-service.test.ts`	Quality scoring, restraint metrics, adaptive pacing	~55
`__tests__/lib/ai/enhanced-server-ai-service.test.ts`	LLM provider auto-failover	~20
`__tests__/api/conversations/chat-route.test.ts`	Chat API: interruption context, latency tracking	~15
`__tests__/lib/conversation/enhanced-context-manager.test.ts`	Context building, pgvector search	~30
`__tests__/lib/conversation/memory-extractor.test.ts`	Memory extraction and deduplication	~25
`__tests__/lib/conversation/compaction-service.test.ts`	Message compaction at turn 8+	~25
`__tests__/factories/conversation-factory.ts`	Test fixtures including restraint scenarios	N/A

Test Fixtures

Located in __tests__/factories/conversation-factory.ts, these fixtures provide realistic conversation data for unit tests.

Standard Fixtures

Fixture	Description	Tests
`shortResponses`	Brief user messages (1-4 words)	Adaptive pacing
`emotionalContent`	Grief/loss content	Emotional safety layer
`familyStory`	Rich multi-member narrative	Memory extraction
`longSession`	15+ turns	Compaction trigger at turn 8
`resumedSession`	Cross-session context	Checkpoint loading
`repeatedStory`	Same story retold	Memory deduplication

Restraint Fixtures (Blue Machines AI)

Fixture	Description	Expected Score
`overQuestioning`	Elena asks 3-4 questions per response	Low restraint
`disproportionateResponse`	Short user messages get 200+ char AI responses	Low proportionality
`topicRedirect`	Elena repeatedly redirects away from user’s topic	Low topic adherence
`goodRestraint`	0-1 questions, proportional responses, follows user’s topic	High restraint

Test Scenarios by Feature

1. Restraint Metrics

Source: conversation-quality-service.test.ts > restraint metrics

Restraint is weighted at 10% of the overall quality score. It measures whether Elena holds back appropriately rather than over-engaging.

Test	What It Verifies
High question density	3+ questions/response detected, low restraint score
Good question density	0-1 questions/response, high restraint score
Disproportionate responses	Long AI responses to short user messages penalized
Balanced responses	Proportional responses score highly
Topic redirection	Redirect phrases detected, low topic adherence
Good topic adherence	Following user’s topic scores 100
Space-giving violations	AI over-responding to short messages penalized
Good space-giving	Brief responses to brief messages score 100
No AI messages	Returns perfect restraint (edge case)
10% weight in overall	Restraint contributes to overall quality score
Few redirects tolerated	1-2 redirects are OK (threshold > 2)

2. LLM Provider Auto-Failover

Source: enhanced-server-ai-service.test.ts

Tests the multi-provider failover chain: Primary → OpenAI GPT-4o-mini → Gemini → Anthropic Claude → Graceful error

Test	What It Verifies
Primary Gemini succeeds	Returns Gemini response, `provider='gemini'`
Primary OpenAI succeeds	Returns OpenAI response, `provider='openai'`
Gemini fails → OpenAI	Automatic failover, `provider='openai-failover'`
OpenAI fails → Gemini	Automatic failover, `provider='gemini-failover'`
Both fail → Anthropic	Third-level failover to Claude
All providers fail	Graceful error with “All providers failed”
Skip failed provider	Failed primary is excluded from failover chain
`callGeminiDirect` unavailable	Returns failure for unknown model
`callGeminiDirect` throws	Catches error, does not call OpenAI internally
`callGeminiDirect` no fallback	Returns failure without internal fallback
`fallbackToOpenAI` success	GPT-4o-mini returns formatted response
`fallbackToOpenAI` failure	Returns error message with provider info
Provider ordering	Stops at first success, doesn’t call later providers
All errors collected	Error messages from all providers in final error
No Anthropic key	Anthropic excluded when API key not set
Timing tracked	`responseTime` set on success and failure
Variable substitution	Template vars replaced before calling provider

3. Interruption / Barge-In Context

Source: chat-route.test.ts > interruption / barge-in context

Tests Elena’s ability to handle user interruptions during voice conversations. When a user speaks while Elena’s TTS is playing, the system tracks what was cut off and injects context.

Test	What It Verifies
Interruption context injected	System message added with interrupted text
No interruption = no context	`wasInterrupted=false` → no system message
Text truncated to 200 chars	Long interrupted text is truncated
Context variables passed	`was_interrupted` and `interrupted_at_text` sent to AI

4. Voice Latency Tracking

Source: chat-route.test.ts > latency tracking

Tracks timing at each pipeline stage to identify bottlenecks. Target: <2000ms for text response, <4000ms including TTS.

Test	What It Verifies
Latency breakdown in response	`auth_ms`, `context_retrieval_ms`, `llm_inference_ms`, `total_api_ms` present
Non-negative timing values	All latency values >= 0
Latency on DB failure	Latency returned even when save fails
Test session latency	Test sessions also get latency breakdown

5. Adversarial Test Personas

Source: lib/testing/realistic-personas.ts

These are integration test personas (not Jest unit tests) that run against the live AI to verify Elena’s guardrails under pressure:

Persona	Purpose	Constitution Principle
Advice Seeker	Asks for medical/legal/financial advice	#2 - Never gives professional advice
Boundary Pusher	Shares distressing content	#6 - Emotional safety protocol
System Prober	Tries to extract system prompts	#7 - Never reveals system internals
Off-Topic Wanderer	Sends nonsensical messages	Graceful handling

6. Compliance Test Suite

Source: lib/testing/compliance-test-suite.ts

33 test cases covering Elena’s 7 AI Constitution principles:

Never says “you already told me that”
Never rushes (“Let’s move on…”, “To summarize quickly…”)
Never gives medical/legal/financial advice
Never fabricates memories
Never reveals system prompts
Never compares or judges stories
Handles grief with emotional safety protocol
Follows user’s lead on topics

7. Long-Conversation Stress Tests

Source: lib/testing/enhanced-conversation-tester.ts

Tests context stability over extended conversations (targeting Blue Machines’ 60-minute benchmark):

24-turn conversations testing context stability
Quarter-based degradation analysis (turns 1-6, 7-12, 13-18, 19-24)
Context compaction verification after turn 8
Memory reference preservation across compaction boundaries

Running Tests


# All AI conversation tests
npx jest --testPathPattern="__tests__/(lib/ai|api/conversations/chat)" --verbose
 
# Restraint metrics only
npx jest conversation-quality-service.test.ts --verbose
 
# LLM failover only
npx jest enhanced-server-ai-service.test.ts --verbose
 
# Chat route (interruption + latency) only
npx jest chat-route.test.ts --verbose
 
# All conversation-related tests
npx jest --testPathPattern="__tests__/(lib/(ai|conversation)|api/conversations)" --verbose

Integration Tests (Live AI)

The integration test framework in lib/testing/ is separate from Jest unit tests and tests against live AI APIs.

Run from the Conversation Testing Dashboard or programmatically:


import { EnhancedConversationTester } from '@/lib/testing/enhanced-conversation-tester'