Voice-to-Text Architecture Research
Research Date: November 2024
Status: Reference Documentation
Overview
This document captures research on how modern voice-to-text applications (Wispr Flow, Superwhisper, MacWhisper) achieve low latency and affordable pricing, with implications for StoryFlow's conversational features.
How Wispr Flow Works
Architecture
Wispr Flow uses a hybrid cloud architecture with fine-tuned open-source models:
User Device                          Cloud (Baseten + AWS)
┌────────────────┐                   ┌───────────────────────────┐
│ Hotkey         │                   │ Speech Recognition        │
│ activation     │ ───── audio ────► │ (ASR Model)               │
│                │                   ├───────────────────────────┤
│                │                   │ Llama (fine-tuned)        │
│                │ ◄── formatted ─── │ - Remove filler words     │
│ Text output    │       text        │ - Grammar correction      │
└────────────────┘                   │ - Context formatting      │
                                     └───────────────────────────┘
Key Technical Details
| Metric | Value |
|---|---|
| End-to-end latency | < 700ms |
| LLM token processing | 100+ tokens in < 250ms |
| Languages supported | 100+ |
| Infrastructure | Baseten + AWS |
| LLM Framework | TensorRT-LLM |
| Base Model | Fine-tuned Llama |
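The architecture reduces to a simple two-stage call: stream audio to a hosted ASR model, then pass the raw transcript through the fine-tuned Llama for cleanup. A minimal TypeScript sketch of that shape follows; the endpoint URLs, request bodies, and response fields are placeholder assumptions, not Wispr Flow's actual API.

```typescript
// Two-stage dictation pipeline: ASR first, LLM cleanup second.
// Both endpoints are hypothetical stand-ins for self-hosted services.
async function dictate(audio: Blob): Promise<string> {
  // Stage 1: speech -> raw transcript (hosted ASR model)
  const asrRes = await fetch("https://asr.example.com/transcribe", {
    method: "POST",
    headers: { "Content-Type": "audio/wav" },
    body: audio,
  });
  const { text: rawTranscript } = await asrRes.json();

  // Stage 2: raw transcript -> formatted text (fine-tuned Llama)
  const llmRes = await fetch("https://llm.example.com/format", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: rawTranscript,
      tasks: ["remove_filler_words", "fix_grammar", "match_context_format"],
    }),
  });
  const { text: formatted } = await llmRes.json();
  return formatted;
}
```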
Why Open-Source Llama (Not OpenAI)
- Complete customization - Can fine-tune for dictation-specific tasks
- No per-token costs - Fixed infrastructure cost only
- Latency control - Own the inference stack
- Full ownership - No dependency on third-party API changes
Cost Analysis: Why $10-12/month Works
Transcription Costs Comparison
| Approach | Cost per Audio Hour | Notes |
|---|---|---|
| OpenAI Whisper API | $0.36/hour | Managed, easy |
| Self-hosted GPU (T4) | ~$0.07/hour | Requires DevOps |
| SaladCloud (distributed) | ~$0.005/hour | Cheapest cloud |
| whisper.cpp on CPU | Near-zero | On-device |
| Modal (L40S serverless) | ~$0.01/hour | Serverless GPU |
Break-Even Analysis
- API vs self-hosted: self-hosting becomes cheaper above ~460 audio hours/month
- With DevOps overhead factored in: break-even rises to 3,000+ hours/month
- Power user (5 hrs/day): 150 hours/month, roughly $10-15/month self-hosted
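As a sanity check on the break-even figure; the GPU fixed cost below is an assumption back-calculated from the ~460-hour number, so substitute a real quote to recompute:

```typescript
// Break-even between pay-per-use API and a dedicated GPU instance.
// GPU_FIXED_MONTHLY is an assumed figure (roughly a reserved T4).
const API_RATE = 0.36;         // $ per audio hour (OpenAI Whisper API)
const GPU_FIXED_MONTHLY = 165; // $ per month (assumption)

const breakEvenHours = GPU_FIXED_MONTHLY / API_RATE; // ~458 audio hours/month
console.log(`Self-hosting wins above ~${Math.round(breakEvenHours)} audio hours/month`);
```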
Open-Source Speech Recognition Models
Top Models (2024-2025)
| Model | Speed | Size | Notes |
|---|---|---|---|
| NVIDIA Canary Qwen 2.5B | 418x RTFx | 2.5B params | Hybrid ASR + LLM |
| Whisper Large V3 Turbo | 216x RTFx | Reduced decoder | 5.4x speedup over Large V3 |
| Distil-Whisper | ~6x faster than Whisper | 50% smaller | Within 1% WER of Whisper |
| Moonshine | Edge-optimized | ~27M params | For mobile/edge |
| Vosk | Lightweight | Small | Fully offline |
For Mobile/On-Device
- Distil-Whisper small.en (166M params) - Best for resource-constrained devices
- whisper.cpp - C++ port for native apps
- Whisper Android (TFLite) - Android-specific implementation
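For a quick on-device experiment without native code, Transformers.js can run Whisper-family models in the browser or Node. A sketch below; the model id is an assumption, and any ONNX-converted Whisper/Distil-Whisper build can be substituted.

```typescript
// On-device transcription with Transformers.js (browser or Node).
// Model id is an assumption; swap in an ONNX Distil-Whisper build as needed.
import { pipeline } from "@xenova/transformers";

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-small.en"
);

// Accepts a URL/path to audio; long files are processed in 30 s chunks.
const out = await transcriber("meeting.wav", { chunk_length_s: 30 });
console.log((out as { text: string }).text);
```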
On-Device Implementation Examples
Superwhisper / MacWhisper Architecture
┌───────────────────────────────────────────────┐
│                macOS / iOS App                │
│  ┌─────────────────────────────────────────┐  │
│  │              whisper.cpp                │  │
│  │        (Metal/GPU accelerated)          │  │
│  └─────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────┐  │
│  │             Model Options               │  │
│  │  - Nano  (fastest, less accurate)       │  │
│  │  - Fast                                 │  │
│  │  - Pro                                  │  │
│  │  - Ultra (most accurate, slower)        │  │
│  └─────────────────────────────────────────┘  │
│                                               │
│      100% Local - No data leaves device       │
└───────────────────────────────────────────────┘
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | Apple M1 / Intel i7 | M2/M3 or Ryzen 7+ |
| RAM | 8GB | 16GB+ |
| Storage | 5GB free | 10GB+ |
| iOS | iPhone 13+ | iPhone 15+ |
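Outside of Swift/Objective-C apps, one way to wrap the same engine is to shell out to a whisper.cpp build. A minimal Node sketch; the binary name and model path are assumptions about the local build, and the chosen ggml model file is what selects the speed/accuracy tier (the Nano..Ultra options above).

```typescript
// Shelling out to a local whisper.cpp build from Node.
// Binary name and model path are assumptions about your build.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const { stdout } = await run("./whisper-cli", [
  "-m", "models/ggml-base.en.bin", // model file picks the speed/accuracy tier
  "-f", "recording.wav",           // expects 16 kHz mono WAV
  "--no-timestamps",               // plain text output
]);
console.log(stdout.trim());
```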
Browser-Based LLM (WebLLM)
Architecture
Browser (WebGPU enabled)
┌───────────────────────────────────────────────┐
│ ServiceWorkerMLCEngine                        │
│  ├── OpenAI-compatible API                    │
│  ├── Streaming responses                      │
│  └── JSON mode support                        │
│                                               │
│ MLCEngine (Web Worker)                        │
│  ├── Background thread processing             │
│  └── WebGPU kernel execution                  │
│                                               │
│ Compiled WebGPU Kernels (AOT)                 │
│  └── TVM-optimized operations                 │
└───────────────────────────────────────────────┘
Performance
- Retains 80% of native GPU performance
- Models supported: Llama, Phi, Qwen, Gemma
- Realistic limit: ~7B parameter models
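In code, the engine exposes the OpenAI-style chat surface shown in the diagram. A minimal sketch using WebLLM's simple in-page engine; the model id must match one of WebLLM's prebuilt builds and is an assumption here.

```typescript
// Minimal WebLLM usage: OpenAI-compatible chat, streamed, fully in-browser.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Model id must be one of WebLLM's prebuilt ids (assumption here).
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (p) => console.log(p.text), // download/compile progress
});

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Summarize this chapter in one line." }],
  stream: true,
});

let reply = "";
for await (const chunk of stream) {
  reply += chunk.choices[0]?.delta?.content ?? "";
}
console.log(reply);
```

For the persistent-background use case listed below, CreateServiceWorkerMLCEngine provides the same chat surface behind a service worker.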
Use Cases
- Chrome extensions with persistent background
- Privacy-preserving local inference
- Offline-capable applications
Memory & Personalization Solutions
Open-Source Memory Layers
| Solution | Storage | Best For |
|---|---|---|
| Mem0 | Vector + any LLM | General personalization |
| MemMachine | Graph + SQL | Agent memory |
| Memary | Compressed context | Long conversations |
| Second Me | PEFT fine-tuning | Deep personalization |
Memory Architecture Pattern
┌───────────────────────────────────────────────┐
│                 Memory System                 │
│                                               │
│  ┌─────────────┐    ┌──────────────────────┐  │
│  │     STM     │    │         LTM          │  │
│  │ (Short-term)│    │ (Long-term)          │  │
│  │             │    │                      │  │
│  │ Current     │    │ Historical data      │  │
│  │ conversation│    │ User preferences     │  │
│  │ context     │    │ Past interactions    │  │
│  └─────────────┘    └──────────────────────┘  │
│                                               │
│  ┌─────────────────────────────────────────┐  │
│  │              User Profile               │  │
│  │  - Dialect/language preferences         │  │
│  │  - Writing style                        │  │
│  │  - Topic interests                      │  │
│  │  - Conversational patterns              │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
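A minimal sketch of the LTM half of this pattern on Supabase pgvector, the store proposed for StoryFlow below. The memories table and match_memories RPC are hypothetical names following the pattern in Supabase's pgvector guide.

```typescript
// LTM on Supabase pgvector: write turns with embeddings, read back by similarity.
// Table "memories" and RPC "match_memories" are hypothetical names.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Persist one conversation turn into long-term memory.
export async function remember(userId: string, content: string, embedding: number[]) {
  const { error } = await supabase
    .from("memories")
    .insert({ user_id: userId, content, embedding });
  if (error) throw error;
}

// Retrieve the k nearest memories to hydrate the prompt context.
export async function recall(userId: string, queryEmbedding: number[], k = 5) {
  const { data, error } = await supabase.rpc("match_memories", {
    query_embedding: queryEmbedding,
    match_count: k,
    p_user_id: userId,
  });
  if (error) throw error;
  return data as { content: string; similarity: number }[];
}
```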
Recommended Architecture for StoryFlow
Phase 1: Quick Wins (Current Priority: RAG/Memory)
Focus on contextual memory using existing infrastructure - see Contextual Memory & RAG
Phase 2: Voice Integration (Future)
┌───────────────────────────────────────────────────────────┐
│                        User Device                        │
│  ┌───────────────────┐   ┌──────────────────────────────┐ │
│  │ Whisper (local)   │   │ Small LLM (quick responses)  │ │
│  │ distil-small.en   │   │ Phi-3-mini / Qwen-1.5B       │ │
│  └─────────┬─────────┘   └──────────────┬───────────────┘ │
└────────────┼────────────────────────────┼─────────────────┘
             │                            │
             ▼                            ▼
┌───────────────────────────────────────────────────────────┐
│                     StoryFlow Server                      │
│  ┌──────────────────────────────────────────────────────┐ │
│  │          Memory Layer (Supabase pgvector)            │ │
│  │   ┌──────────────┐    ┌──────────────────────────┐   │ │
│  │   │ User Profile │    │   Story Context (RAG)    │   │ │
│  │   └──────────────┘    └──────────────────────────┘   │ │
│  └──────────────────────────────────────────────────────┘ │
│  ┌──────────────────────────────────────────────────────┐ │
│  │       LLM (Groq for speed / Local for privacy)       │ │
│  └──────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
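Tying the pieces together, the fast path reduces to: local transcript in, memory recall, one completion call out. A sketch assuming Groq's OpenAI-compatible endpoint; recall() is the helper from the memory sketch above, embed() is an assumed embedding helper, and the model id is an assumption.

```typescript
// Phase 2 fast path: transcript -> memory recall -> Groq completion.
// embed() is an assumed helper; recall() is sketched in the memory section.
declare function embed(text: string): Promise<number[]>;
declare function recall(
  userId: string,
  queryEmbedding: number[],
  k?: number
): Promise<{ content: string }[]>;

export async function respond(userId: string, transcript: string): Promise<string> {
  const memories = await recall(userId, await embed(transcript));

  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.1-8b-instant", // assumed Groq model id
      messages: [
        {
          role: "system",
          content: `Relevant context:\n${memories.map((m) => m.content).join("\n")}`,
        },
        { role: "user", content: transcript },
      ],
    }),
  });

  const json = await res.json();
  return json.choices[0].message.content;
}
```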