Voice-to-Text Architecture Research
Research Date: November 2024
Status: Reference Documentation
Overview
This document captures research on how modern voice-to-text applications (Wispr Flow, Superwhisper, MacWhisper) achieve low latency and affordable pricing, with implications for StoryFlow's conversational features.
How Wispr Flow Works
Architecture
Wispr Flow uses a hybrid cloud architecture with fine-tuned open-source models:
User Device                          Cloud (Baseten + AWS)
┌────────────────┐                   ┌───────────────────────────┐
│ Hotkey         │                   │ Speech Recognition        │
│ activation     │ ───── audio ────► │ (ASR Model)               │
│                │                   ├───────────────────────────┤
│                │                   │ Llama (fine-tuned)        │
│                │ ◄── formatted ─── │ - Remove filler words     │
│ Text output    │       text        │ - Grammar correction      │
└────────────────┘                   │ - Context formatting      │
                                     └───────────────────────────┘
Key Technical Details
| Metric | Value |
|---|---|
| End-to-end latency | < 700ms |
| LLM token processing | 100+ tokens in < 250ms |
| Languages supported | 100+ |
| Infrastructure | Baseten + AWS |
| LLM Framework | TensorRT-LLM |
| Base Model | Fine-tuned Llama |
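The architecture reduces to a simple two-stage call: stream audio to a hosted ASR model, then pass the raw transcript through the fine-tuned Llama for cleanup. A minimal TypeScript sketch of that shape follows; the endpoint URLs, request bodies, and response fields are placeholder assumptions, not Wispr Flow's actual API.

```typescript
// Two-stage dictation pipeline: ASR first, LLM cleanup second.
// Both endpoints are hypothetical stand-ins for self-hosted services.
async function dictate(audio: Blob): Promise<string> {
  // Stage 1: speech -> raw transcript (hosted ASR model)
  const asrRes = await fetch("https://asr.example.com/transcribe", {
    method: "POST",
    headers: { "Content-Type": "audio/wav" },
    body: audio,
  });
  const { text: rawTranscript } = await asrRes.json();

  // Stage 2: raw transcript -> formatted text (fine-tuned Llama)
  const llmRes = await fetch("https://llm.example.com/format", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: rawTranscript,
      tasks: ["remove_filler_words", "fix_grammar", "match_context_format"],
    }),
  });
  const { text: formatted } = await llmRes.json();
  return formatted;
}
```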
Why Open-Source Llama (Not OpenAI)
- Complete customization - Can fine-tune for dictation-specific tasks
- No per-token costs - Fixed infrastructure cost only
- Latency control - Own the inference stack
- Full ownership - No dependency on third-party API changes
Cost Analysis: Why $10-12/month Works
Transcription Costs Comparison
| Approach | Cost per Audio Hour | Notes |
|---|---|---|
| OpenAI Whisper API | $0.36/hour | Managed, easy |
| Self-hosted GPU (T4) | ~$0.07/hour | Requires DevOps |
| SaladCloud (distributed) | ~$0.005/hour | Cheapest cloud |
| whisper.cpp on CPU | Near-zero | On-device |
| Modal (L40S serverless) | ~$0.01/hour | Serverless GPU |
Break-Even Analysis
- API vs self-hosted: self-hosting becomes cheaper above ~460 audio hours/month
- With DevOps overhead factored in: break-even rises to 3,000+ hours/month
- Power user (5 hrs/day): 150 hours/month, roughly $10-15/month self-hosted
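As a sanity check on the break-even figure; the GPU fixed cost below is an assumption back-calculated from the ~460-hour number, so substitute a real quote to recompute:

```typescript
// Break-even between pay-per-use API and a dedicated GPU instance.
// GPU_FIXED_MONTHLY is an assumed figure (roughly a reserved T4).
const API_RATE = 0.36;         // $ per audio hour (OpenAI Whisper API)
const GPU_FIXED_MONTHLY = 165; // $ per month (assumption)

const breakEvenHours = GPU_FIXED_MONTHLY / API_RATE; // ~458 audio hours/month
console.log(`Self-hosting wins above ~${Math.round(breakEvenHours)} audio hours/month`);
```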
Open-Source Speech Recognition Models
Top Models (2024-2025)
| Model | Speed | Size | Notes |
|---|---|---|---|
| NVIDIA Canary Qwen 2.5B | 418x RTFx | 2.5B params | Hybrid ASR + LLM |
| Whisper Large V3 Turbo | 216x RTFx | Reduced decoder | 5.4x speedup over Large V3 |
| Distil-Whisper | ~6x faster than Whisper | 50% smaller | Within 1% WER of Whisper |
| Moonshine | Edge-optimized | ~27M params | For mobile/edge |
| Vosk | Lightweight | Small | Fully offline |
For Mobile/On-Device
- Distil-Whisper small.en (166M params) - Best for resource-constrained devices
- whisper.cpp - C++ port for native apps
- Whisper Android (TFLite) - Android-specific implementation
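For a quick on-device experiment without native code, Transformers.js can run Whisper-family models in the browser or Node. A sketch below; the model id is an assumption, and any ONNX-converted Whisper/Distil-Whisper build can be substituted.

```typescript
// On-device transcription with Transformers.js (browser or Node).
// Model id is an assumption; swap in an ONNX Distil-Whisper build as needed.
import { pipeline } from "@xenova/transformers";

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-small.en"
);

// Accepts a URL/path to audio; long files are processed in 30 s chunks.
const out = await transcriber("meeting.wav", { chunk_length_s: 30 });
console.log((out as { text: string }).text);
```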
On-Device Implementation Examples
Superwhisper / MacWhisper Architecture
┌───────────────────────────────────────────────┐
│                macOS / iOS App                │
│  ┌─────────────────────────────────────────┐  │
│  │              whisper.cpp                │  │
│  │        (Metal/GPU accelerated)          │  │
│  └─────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────┐  │
│  │             Model Options               │  │
│  │  - Nano  (fastest, less accurate)       │  │
│  │  - Fast                                 │  │
│  │  - Pro                                  │  │
│  │  - Ultra (most accurate, slower)        │  │
│  └─────────────────────────────────────────┘  │
│                                               │
│      100% Local - No data leaves device       │
└───────────────────────────────────────────────┘
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | Apple M1 / Intel i7 | M2/M3 or Ryzen 7+ |
| RAM | 8GB | 16GB+ |
| Storage | 5GB free | 10GB+ |
| iOS | iPhone 13+ | iPhone 15+ |
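Outside of Swift/Objective-C apps, one way to wrap the same engine is to shell out to a whisper.cpp build. A minimal Node sketch; the binary name and model path are assumptions about the local build, and the chosen ggml model file is what selects the speed/accuracy tier (the Nano..Ultra options above).

```typescript
// Shelling out to a local whisper.cpp build from Node.
// Binary name and model path are assumptions about your build.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const { stdout } = await run("./whisper-cli", [
  "-m", "models/ggml-base.en.bin", // model file picks the speed/accuracy tier
  "-f", "recording.wav",           // expects 16 kHz mono WAV
  "--no-timestamps",               // plain text output
]);
console.log(stdout.trim());
```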
Browser-Based LLM (WebLLM)
Architecture
Browser (WebGPU enabled)
┌───────────────────────────────────────────────┐
│ ServiceWorkerMLCEngine                        │
│  ├── OpenAI-compatible API                    │
│  ├── Streaming responses                      │
│  └── JSON mode support                        │
│                                               │
│ MLCEngine (Web Worker)                        │
│  ├── Background thread processing             │
│  └── WebGPU kernel execution                  │
│                                               │
│ Compiled WebGPU Kernels (AOT)                 │
│  └── TVM-optimized operations                 │
└───────────────────────────────────────────────┘
Performance
- Retains 80% of native GPU performance
- Models supported: Llama, Phi, Qwen, Gemma
- Realistic limit: ~7B parameter models
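In code, the engine exposes the OpenAI-style chat surface shown in the diagram. A minimal sketch using WebLLM's simple in-page engine; the model id must match one of WebLLM's prebuilt builds and is an assumption here.

```typescript
// Minimal WebLLM usage: OpenAI-compatible chat, streamed, fully in-browser.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Model id must be one of WebLLM's prebuilt ids (assumption here).
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (p) => console.log(p.text), // download/compile progress
});

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Summarize this chapter in one line." }],
  stream: true,
});

let reply = "";
for await (const chunk of stream) {
  reply += chunk.choices[0]?.delta?.content ?? "";
}
console.log(reply);
```

For the persistent-background use case listed below, CreateServiceWorkerMLCEngine provides the same chat surface behind a service worker.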
Use Cases
- Chrome extensions with persistent background
- Privacy-preserving local inference
- Offline-capable applications
Memory & Personalization Solutions
Open-Source Memory Layers
| Solution | Storage | Best For |
|---|---|---|
| Mem0 | Vector + any LLM | General personalization |
| MemMachine | Graph + SQL | Agent memory |
| Memary | Compressed context | Long conversations |
| Second Me | PEFT fine-tuning | Deep personalization |
Memory Architecture Pattern
┌───────────────────────────────────────────────┐
│                 Memory System                 │
│                                               │
│  ┌─────────────┐    ┌──────────────────────┐  │
│  │     STM     │    │         LTM          │  │
│  │ (Short-term)│    │ (Long-term)          │  │
│  │             │    │                      │  │
│  │ Current     │    │ Historical data      │  │
│  │ conversation│    │ User preferences     │  │
│  │ context     │    │ Past interactions    │  │
│  └─────────────┘    └──────────────────────┘  │
│                                               │
│  ┌─────────────────────────────────────────┐  │
│  │              User Profile               │  │
│  │  - Dialect/language preferences         │  │
│  │  - Writing style                        │  │
│  │  - Topic interests                      │  │
│  │  - Conversational patterns              │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
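A minimal sketch of the LTM half of this pattern on Supabase pgvector, the store proposed for StoryFlow below. The memories table and match_memories RPC are hypothetical names following the pattern in Supabase's pgvector guide.

```typescript
// LTM on Supabase pgvector: write turns with embeddings, read back by similarity.
// Table "memories" and RPC "match_memories" are hypothetical names.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Persist one conversation turn into long-term memory.
export async function remember(userId: string, content: string, embedding: number[]) {
  const { error } = await supabase
    .from("memories")
    .insert({ user_id: userId, content, embedding });
  if (error) throw error;
}

// Retrieve the k nearest memories to hydrate the prompt context.
export async function recall(userId: string, queryEmbedding: number[], k = 5) {
  const { data, error } = await supabase.rpc("match_memories", {
    query_embedding: queryEmbedding,
    match_count: k,
    p_user_id: userId,
  });
  if (error) throw error;
  return data as { content: string; similarity: number }[];
}
```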
Recommended Architecture for StoryFlow
Phase 1: Quick Wins (Current Priority: RAG/Memory)
Focus on contextual memory using existing infrastructure - see Contextual Memory & RAG
Phase 2: Voice Integration (Future)
┌───────────────────────────────────────────────────────────┐
│                        User Device                        │
│  ┌───────────────────┐   ┌──────────────────────────────┐ │
│  │ Whisper (local)   │   │ Small LLM (quick responses)  │ │
│  │ distil-small.en   │   │ Phi-3-mini / Qwen-1.5B       │ │
│  └─────────┬─────────┘   └──────────────┬───────────────┘ │
└────────────┼────────────────────────────┼─────────────────┘
             │                            │
             ▼                            ▼
┌───────────────────────────────────────────────────────────┐
│                     StoryFlow Server                      │
│  ┌──────────────────────────────────────────────────────┐ │
│  │          Memory Layer (Supabase pgvector)            │ │
│  │   ┌──────────────┐    ┌──────────────────────────┐   │ │
│  │   │ User Profile │    │   Story Context (RAG)    │   │ │
│  │   └──────────────┘    └──────────────────────────┘   │ │
│  └──────────────────────────────────────────────────────┘ │
│  ┌──────────────────────────────────────────────────────┐ │
│  │       LLM (Groq for speed / Local for privacy)       │ │
│  └──────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
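Tying the pieces together, the fast path reduces to: local transcript in, memory recall, one completion call out. A sketch assuming Groq's OpenAI-compatible endpoint; recall() is the helper from the memory sketch above, embed() is an assumed embedding helper, and the model id is an assumption.

```typescript
// Phase 2 fast path: transcript -> memory recall -> Groq completion.
// embed() is an assumed helper; recall() is sketched in the memory section.
declare function embed(text: string): Promise<number[]>;
declare function recall(
  userId: string,
  queryEmbedding: number[],
  k?: number
): Promise<{ content: string }[]>;

export async function respond(userId: string, transcript: string): Promise<string> {
  const memories = await recall(userId, await embed(transcript));

  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.1-8b-instant", // assumed Groq model id
      messages: [
        {
          role: "system",
          content: `Relevant context:\n${memories.map((m) => m.content).join("\n")}`,
        },
        { role: "user", content: transcript },
      ],
    }),
  });

  const json = await res.json();
  return json.choices[0].message.content;
}
```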