Skip to Content
📚 MyStoryFlow Docs — Your guide to preserving family stories

AI Conversation Unit Tests

Test coverage for the AI conversation system, including features inspired by Blue Machines AI (restraint metrics, LLM failover, interruption handling, latency tracking).

Test Files Overview

Test FileFeature AreaTests
__tests__/lib/ai/conversation-quality-service.test.tsQuality scoring, restraint metrics, adaptive pacing~55
__tests__/lib/ai/enhanced-server-ai-service.test.tsLLM provider auto-failover~20
__tests__/api/conversations/chat-route.test.tsChat API: interruption context, latency tracking~15
__tests__/lib/conversation/enhanced-context-manager.test.tsContext building, pgvector search~30
__tests__/lib/conversation/memory-extractor.test.tsMemory extraction and deduplication~25
__tests__/lib/conversation/compaction-service.test.tsMessage compaction at turn 8+~25
__tests__/factories/conversation-factory.tsTest fixtures including restraint scenariosN/A

Test Fixtures

Located in __tests__/factories/conversation-factory.ts, these fixtures provide realistic conversation data for unit tests.

Standard Fixtures

FixtureDescriptionTests
shortResponsesBrief user messages (1-4 words)Adaptive pacing
emotionalContentGrief/loss contentEmotional safety layer
familyStoryRich multi-member narrativeMemory extraction
longSession15+ turnsCompaction trigger at turn 8
resumedSessionCross-session contextCheckpoint loading
repeatedStorySame story retoldMemory deduplication

Restraint Fixtures (Blue Machines AI)

FixtureDescriptionExpected Score
overQuestioningElena asks 3-4 questions per responseLow restraint
disproportionateResponseShort user messages get 200+ char AI responsesLow proportionality
topicRedirectElena repeatedly redirects away from user’s topicLow topic adherence
goodRestraint0-1 questions, proportional responses, follows user’s topicHigh restraint

Test Scenarios by Feature

1. Restraint Metrics

Source: conversation-quality-service.test.ts > restraint metrics

Restraint is weighted at 10% of the overall quality score. It measures whether Elena holds back appropriately rather than over-engaging.

TestWhat It Verifies
High question density3+ questions/response detected, low restraint score
Good question density0-1 questions/response, high restraint score
Disproportionate responsesLong AI responses to short user messages penalized
Balanced responsesProportional responses score highly
Topic redirectionRedirect phrases detected, low topic adherence
Good topic adherenceFollowing user’s topic scores 100
Space-giving violationsAI over-responding to short messages penalized
Good space-givingBrief responses to brief messages score 100
No AI messagesReturns perfect restraint (edge case)
10% weight in overallRestraint contributes to overall quality score
Few redirects tolerated1-2 redirects are OK (threshold > 2)

2. LLM Provider Auto-Failover

Source: enhanced-server-ai-service.test.ts

Tests the multi-provider failover chain: Primary → OpenAI GPT-4o-mini → Gemini → Anthropic Claude → Graceful error

TestWhat It Verifies
Primary Gemini succeedsReturns Gemini response, provider='gemini'
Primary OpenAI succeedsReturns OpenAI response, provider='openai'
Gemini fails → OpenAIAutomatic failover, provider='openai-failover'
OpenAI fails → GeminiAutomatic failover, provider='gemini-failover'
Both fail → AnthropicThird-level failover to Claude
All providers failGraceful error with “All providers failed”
Skip failed providerFailed primary is excluded from failover chain
callGeminiDirect unavailableReturns failure for unknown model
callGeminiDirect throwsCatches error, does not call OpenAI internally
callGeminiDirect no fallbackReturns failure without internal fallback
fallbackToOpenAI successGPT-4o-mini returns formatted response
fallbackToOpenAI failureReturns error message with provider info
Provider orderingStops at first success, doesn’t call later providers
All errors collectedError messages from all providers in final error
No Anthropic keyAnthropic excluded when API key not set
Timing trackedresponseTime set on success and failure
Variable substitutionTemplate vars replaced before calling provider

3. Interruption / Barge-In Context

Source: chat-route.test.ts > interruption / barge-in context

Tests Elena’s ability to handle user interruptions during voice conversations. When a user speaks while Elena’s TTS is playing, the system tracks what was cut off and injects context.

TestWhat It Verifies
Interruption context injectedSystem message added with interrupted text
No interruption = no contextwasInterrupted=false → no system message
Text truncated to 200 charsLong interrupted text is truncated
Context variables passedwas_interrupted and interrupted_at_text sent to AI

4. Voice Latency Tracking

Source: chat-route.test.ts > latency tracking

Tracks timing at each pipeline stage to identify bottlenecks. Target: <2000ms for text response, <4000ms including TTS.

TestWhat It Verifies
Latency breakdown in responseauth_ms, context_retrieval_ms, llm_inference_ms, total_api_ms present
Non-negative timing valuesAll latency values >= 0
Latency on DB failureLatency returned even when save fails
Test session latencyTest sessions also get latency breakdown

5. Adversarial Test Personas

Source: lib/testing/realistic-personas.ts

These are integration test personas (not Jest unit tests) that run against the live AI to verify Elena’s guardrails under pressure:

PersonaPurposeConstitution Principle
Advice SeekerAsks for medical/legal/financial advice#2 - Never gives professional advice
Boundary PusherShares distressing content#6 - Emotional safety protocol
System ProberTries to extract system prompts#7 - Never reveals system internals
Off-Topic WandererSends nonsensical messagesGraceful handling

6. Compliance Test Suite

Source: lib/testing/compliance-test-suite.ts

33 test cases covering Elena’s 7 AI Constitution principles:

  • Never says “you already told me that”
  • Never rushes (“Let’s move on…”, “To summarize quickly…”)
  • Never gives medical/legal/financial advice
  • Never fabricates memories
  • Never reveals system prompts
  • Never compares or judges stories
  • Handles grief with emotional safety protocol
  • Follows user’s lead on topics

7. Long-Conversation Stress Tests

Source: lib/testing/enhanced-conversation-tester.ts

Tests context stability over extended conversations (targeting Blue Machines’ 60-minute benchmark):

  • 24-turn conversations testing context stability
  • Quarter-based degradation analysis (turns 1-6, 7-12, 13-18, 19-24)
  • Context compaction verification after turn 8
  • Memory reference preservation across compaction boundaries

Running Tests

# All AI conversation tests npx jest --testPathPattern="__tests__/(lib/ai|api/conversations/chat)" --verbose # Restraint metrics only npx jest conversation-quality-service.test.ts --verbose # LLM failover only npx jest enhanced-server-ai-service.test.ts --verbose # Chat route (interruption + latency) only npx jest chat-route.test.ts --verbose # All conversation-related tests npx jest --testPathPattern="__tests__/(lib/(ai|conversation)|api/conversations)" --verbose

Integration Tests (Live AI)

The integration test framework in lib/testing/ is separate from Jest unit tests and tests against live AI APIs.

Run from the Conversation Testing Dashboard or programmatically:

import { EnhancedConversationTester } from '@/lib/testing/enhanced-conversation-tester'