Sonic Speech

Optimized speech models for Apple Silicon, powering Sonic — a local-first voice AI system. All models run entirely on-device using MLX. No cloud, no API keys, no data leaves your Mac.

ASR — Parakeet TDT (NVIDIA, ported to MLX)

State-of-the-art speech recognition, quantized with an encoder-only mixed-precision scheme.

| Model | Size | WER (LibriSpeech) | WER (TED-LIUM) | RTFx | Peak Memory |
|---|---|---|---|---|---|
| parakeet-tdt-0.6b-v3 | 1,254 MB | 0.82% | 15.1% | 73x | 3,002 MB |
| parakeet-tdt-0.6b-v3-int8 | 755 MB | 0.82% | 15.1% | 95x | 1,268 MB |
| parakeet-tdt-0.6b-v3-int4 | 489 MB | 0.82% | 15.5% | 98x | 1,003 MB |
| parakeet-tdt-0.6b-v2 | 1,222 MB | | | | |
| parakeet-tdt-0.6b-v2-int8 | 736 MB | | | | |
| parakeet-tdt-0.6b-v2-int4 | 470 MB | | | | |

v3 supports 25 languages. v2 is English-only. INT8 is recommended: on v3 it shows no WER loss on either benchmark while being 40% smaller and 30% faster than the unquantized model.
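For readers unfamiliar with the WER columns, word error rate is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below shows that standard definition only; the helper name `wer` is illustrative, and the text normalization used for the benchmarks above is not stated in this README, so results from this function will not necessarily match the table.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(f"{wer('a b c d e', 'a b x d e'):.2%}")
```

A WER of 0.82% therefore corresponds to fewer than one word error per hundred reference words.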

TTS — Kokoro 82M (MLX)

Fast text-to-speech with 32+ voices (American, British, Japanese, Chinese).

| Model | Size | Short Text | Medium Text | TTFC (streaming) | RTFx |
|---|---|---|---|---|---|
| kokoro-82m-bf16 | ~170 MB | 47 ms | 224 ms | 126 ms | 41x |
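The RTFx columns are read here under the usual definition of the inverse real-time factor: seconds of audio divided by seconds of compute, so higher is faster. This README does not spell out its definition, so treat the following as a sketch under that assumption; both helper names are illustrative.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Estimated compute time for a clip at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


if __name__ == "__main__":
    # At 41x, one minute of speech would take roughly 60 / 41 seconds to produce.
    print(f"{processing_time(60.0, 41.0):.2f} s")
```

Under this reading, 95x for the INT8 ASR model means an hour of audio is processed in about 38 seconds.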

Quantization Strategy

Only the Conformer encoder (~85% of params) is quantized — the decoder stays BF16 for token precision.
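One way to express encoder-only quantization in MLX is a class predicate passed to `mlx.nn.quantize`, which calls the predicate with each module's path and the module itself. The sketch below illustrates that idea and is not this project's conversion script: the `encoder.` path prefix, the `quantize_encoder` helper, and the group size of 64 are assumptions.

```python
def encoder_only(path: str, module: object) -> bool:
    """Select only quantizable layers that live under the encoder.

    Assumption: encoder submodules have paths starting with "encoder.".
    Layers that MLX can quantize expose a `to_quantized` method.
    """
    return path.startswith("encoder.") and hasattr(module, "to_quantized")


def quantize_encoder(model, bits: int = 8, group_size: int = 64):
    """Quantize encoder layers in place; everything else keeps its dtype."""
    import mlx.nn as nn  # imported lazily so the predicate can be used without MLX

    nn.quantize(
        model,
        group_size=group_size,
        bits=bits,
        class_predicate=encoder_only,
    )
    return model
```

Because the predicate returns False for every decoder path, those layers are skipped and stay in their original precision, which matches the BF16-decoder strategy described above.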

| Variant | Size | Speed | Memory | WER Impact |
|---|---|---|---|---|
| INT8 | -40% | +30% | -58% | None |
| INT4 | -61% | +34% | -67% | +0.4pp on real speech |
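These percentages can be reproduced from the v3 rows of the ASR benchmark table. The short check below recomputes them; the numbers are copied from that table, the dictionary layout and function names are only illustrative, and the WER delta is taken from the TED-LIUM column on the assumption that "real speech" refers to it.

```python
# v3 benchmark rows from the ASR table: size (MB), RTFx, peak memory (MB), TED-LIUM WER (%)
BASELINE = {"size": 1254, "rtfx": 73, "memory": 3002, "wer": 15.1}
VARIANTS = {
    "INT8": {"size": 755, "rtfx": 95, "memory": 1268, "wer": 15.1},
    "INT4": {"size": 489, "rtfx": 98, "memory": 1003, "wer": 15.5},
}


def pct_change(new: float, old: float) -> int:
    """Relative change from old to new, rounded to a whole percent."""
    return round(100 * (new - old) / old)


def summarize(variant: dict, baseline: dict = BASELINE) -> dict:
    """Relative size, speed, and memory changes plus the WER delta in percentage points."""
    return {
        "size": pct_change(variant["size"], baseline["size"]),
        "speed": pct_change(variant["rtfx"], baseline["rtfx"]),
        "memory": pct_change(variant["memory"], baseline["memory"]),
        "wer_pp": round(variant["wer"] - baseline["wer"], 1),
    }


if __name__ == "__main__":
    for name, row in VARIANTS.items():
        print(name, summarize(row))
```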

Quick Start

# ASR
from parakeet import from_pretrained
model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int8")

# TTS
from sonic_tts import SonicTTS
tts = SonicTTS(voice="af_heart")

All benchmarks: Apple M3 Max 64 GB, macOS Sequoia, MLX 0.30.4. Built by https://huggingface.co/flight505.