
Liquid Sound, Phase 3: From Characters to Phonemes

Achieving phoneme-level accuracy with 2M parameters and a hierarchical CNN frontend.


Author: Bagzhan Karl
Organization: Manifestro
Date: February 22, 2026
Status: Phase 3 completed (phoneme acoustic modeling), transitioning to Phase 4


Abstract

The third phase of the Sanday project tested the hypothesis: would switching from character-level to phoneme-level targets improve recognition quality while maintaining model compactness? We moved from character CTC (28 classes) to ARPAbet (39 phonemes + blank + pad = 41 classes) and introduced word-level processing via a separate VAD.

After 23 experiments with various architectures (frontends, CfC configurations, processing strategies), we selected a hierarchical Residual CNN with dilation + 2 CfC layers. The large model: 2,002,425 parameters (7.65 MB state_dict). On LibriSpeech: PER = 42.4%, CER = 28.6% (greedy decoding, no LM). Full pipeline (VAD + acoustic model + experimental LM) on a 50‑hour podcast: WER = 32%.

The switch to phonemes yielded substantial accuracy improvement (Phase 2 CER: 57% → 28.6%). However, a critical limitation emerged: the absence of a space token and the rigid length constraint (~2 sec) make the model fragile to VAD errors, which occasionally output phrases instead of isolated words. Phase 4 will address this limitation.

LM and VAD are being developed separately and will be presented in independent technical reports.


1. Introduction: Why Phonemes

1.1 The Problem with Character‑Level Approach

Phase 2 ended with a CER of 57% at the character level. Error analysis revealed a fundamental limitation: the model hears correctly but confuses orthography. Examples from Phase 2:

| Word | Prediction | Problem |
|---|---|---|
| existence | ekzistens / ekzistents | Ambiguity of "c" vs "s" |
| chlorophyll | clorifill | "ph" → "f" is phonetically correct but orthographically wrong |
| statistics | statistiks | Morpheme confusion |

The word "existence" was recognized with only 11% accuracy — not because of acoustic weakness, but because of character‑level ambiguity. The model spent capacity memorizing spelling rules instead of pure acoustic modeling.

1.2 Phonemes as a Solution

English orthography is deeply inconsistent (42 phonemes, 26 letters, 500+ spelling patterns). ARPAbet provides a one‑to‑one mapping: one phoneme = one sound. Moving from 28 classes (characters) to 41 classes (39 phonemes + blank + pad) eliminates orthographic ambiguity.

Expected effect: the model learns pure acoustics, and the phoneme‑to‑text conversion is pushed outside the acoustic model (to a future LM).
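The 39-phoneme ARPAbet inventory plus blank and pad can be sketched as a CTC label vocabulary. This is an illustrative reconstruction: the report only fixes blank at index 0 (CTC blank=0), so the position of the pad token is our assumption.

```python
# Hypothetical sketch of the 41-class CTC vocabulary: 39 stress-free
# ARPAbet phonemes + blank + pad. Only blank=0 is fixed by the report;
# pad placement is an assumption.
ARPABET = [
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH",
    "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K",
    "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH",
    "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
]

BLANK, PAD = "<blank>", "<pad>"
vocab = [BLANK] + ARPABET + [PAD]           # blank must sit at index 0 for CTC
phon2id = {p: i for i, p in enumerate(vocab)}
```

One phoneme maps to exactly one class id, so "existence" becomes a sequence of ids with no spelling ambiguity left for the acoustic model to resolve.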

1.3 Word‑Level Strategy

A further change was abandoning phrase-level processing. In Phase 2, the model processed sequences up to 12 seconds, causing long-memory problems for CfC (degradation after 8 sec).

Word‑level processing:

  • An external VAD extracts word boundaries (0.2–1.2 sec)
  • The acoustic model sees only the isolated word plus a small context (0.05 sec on each side)
  • Reduces CfC's temporal memory requirements
  • Simplifies batching and augmentation

Risk: VAD errors in which the detector outputs phrases instead of isolated words. This risk materialized fully and became the key limitation of Phase 3 (see Section 5).


2. Methodology: 23 Architectures and Selection

2.1 Selection Principle

Before choosing the final architecture, we ran 23 experiments with various configurations:

  • Frontends: SincConv + Attention, Residual CNN with different dilation (1,2,4), various kernel sizes
  • CfC configurations: 1‑3 layers, hidden size 256‑1024, backbone units 128‑512
  • Processing strategies: word‑level, phrase‑level, mixed

Selection principle: "If a model cannot overfit 20 files, it cannot generalize to 20,000."

Method: each architecture was overfitted on 20 audio files (high learning rate, many epochs). If CER <5% on these 20 files was not reached within 50 epochs, the architecture was discarded without full training.

This filter eliminated non‑viable options before expensive full‑scale training. Out of 23, 7 passed; from those, the final architecture was chosen based on quality/parameter ratio.
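The overfit filter described above can be sketched as a simple gate over a training run's CER history. The function name and return shape are illustrative, not from the report:

```python
def passes_overfit_filter(cer_history, threshold=0.05, max_epochs=50):
    """Gate from the '20-file' selection principle: keep an architecture
    only if it reaches CER < 5% on the 20-file overfit set within
    50 epochs. Returns (passed, epoch_reached)."""
    for epoch, cer in enumerate(cer_history[:max_epochs], start=1):
        if cer < threshold:
            return True, epoch
    return False, None
```

For example, a run whose per-epoch CER falls 0.9 → 0.4 → 0.04 passes at epoch 3, while a run stuck at 0.5 for 50 epochs is discarded without full-scale training.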

2.2 Final Architecture

CNN Frontend: hierarchical Residual CNN with dilation

| Block | Dilation | Captures |
|---|---|---|
| 1 | 1 | Short phonemes (20–50 ms) |
| 2 | 2 | Syllables (100–200 ms) |
| 3 | 4 | Word-level patterns (300–500 ms) |

After the third block, MaxPool along frequency: 80 → 40 mel bins. Receptive field grows without loss of detail.
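The growth of the receptive field can be checked with a small calculator. The report does not state kernel sizes or the number of convolutions per residual block, so the configuration below (two kernel-3 convs per block, stride 1) is an assumption used purely to illustrate the arithmetic:

```python
def receptive_field(layers, hop_ms=10.0):
    """Receptive field of stacked dilated 1-D convolutions.
    layers: list of (kernel_size, dilation) pairs; stride 1 assumed.
    hop_ms: frame hop in ms (hop_length=160 at 16 kHz = 10 ms)."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf, rf * hop_ms

# Assumed: two 3-tap convs per block, with block dilations 1, 2, 4.
frames, ms = receptive_field([(3, 1), (3, 1), (3, 2), (3, 2), (3, 4), (3, 4)])
```

Under these assumptions the third block sees roughly 29 frames (~290 ms), which is in the word-level range the table above attributes to it.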

Projector: Linear 1920 → 512 + LayerNorm + GELU + Dropout 0.25

CfC Backbone: 2 layers, hidden_size=512, backbone_units=256, dropout=0.25

Classifier: Linear 512 → 256 → GELU → Dropout 0.25 → Linear 41

Critical improvement: bias initialization for blank (index 0) = −3.0. This reduced blank dominance from 90% to 81% by epoch 3 and accelerated convergence by ~30%.
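The effect of the −3.0 blank bias is easy to see at initialization, when all other logits are near zero. A minimal sketch (plain softmax, no framework dependencies):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# At init, all 41 logits are ~0, so blank starts at probability 1/41.
# Setting the blank bias (index 0) to -3.0 suppresses it ~20x.
uniform = softmax([0.0] * 41)
biased  = softmax([-3.0] + [0.0] * 40)
```

Starting the blank class with a much lower prior probability counteracts CTC's early tendency to emit blank everywhere, which is the mechanism behind the faster convergence reported above.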

2.3 Large Model Parameters

| Component | Value |
|---|---|
| CNN channels | 48 |
| Hidden size | 512 |
| CfC layers | 2 |
| Backbone units | 256 |
| Dropout | 0.25 |
| Input | 80 mel filters |
| Output | 41 classes (39 phonemes + blank + pad) |
| Total parameters | 2,002,425 |
| state_dict size | 7.65 MB |

3. Data and Training

3.1 Source

LibriSpeech: ~173,500 words with ARPAbet phoneme transcriptions (obtained via G2P, stresses removed). Split:

| Split | Words |
|---|---|
| Train | 138,771 |
| Validation | 17,346 |
| Test | 17,347 |
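Stress removal from the G2P output amounts to stripping ARPAbet stress digits (e.g. "IH1" → "IH"), collapsing the stressed variants into the 39-phoneme inventory. A minimal sketch:

```python
import re

def strip_stress(phonemes):
    """Drop ARPAbet stress markers: 'IH0'/'IH1'/'IH2' -> 'IH'.
    G2P tools typically emit stressed vowels; consonants are unchanged."""
    return [re.sub(r"\d$", "", p) for p in phonemes]

# "existence" as G2P might emit it, before and after stress removal:
stripped = strip_stress(["IH0", "G", "Z", "IH1", "S", "T", "AH0", "N", "S"])
```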

3.2 VAD and Word‑Level Processing

A separate Sanday‑VAD model (lab prototype, 97% accuracy on clean recordings) extracts word boundaries. The acoustic model receives only the segment [start − 0.05s, end + 0.05s] — an isolated word with minimal context.

Constraint: fixed length of 200 frames (~2 sec at hop_length=160, 16 kHz). Longer words are truncated, shorter words are padded + mask.
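The fixed-length constraint can be sketched as a truncate-or-pad step that also produces the validity mask. Function and parameter names here are illustrative:

```python
MAX_FRAMES = 200  # ~2 s at hop_length=160, 16 kHz

def fix_length(frames, max_frames=MAX_FRAMES, pad_value=0.0):
    """Truncate or zero-pad a spectrogram (list of per-frame mel vectors)
    to a fixed length, returning the frames plus a validity mask.
    Anything beyond max_frames is silently cut -- the failure mode
    discussed in Section 5 when VAD emits multi-word segments."""
    n = min(len(frames), max_frames)
    n_mels = len(frames[0])
    out = frames[:n] + [[pad_value] * n_mels for _ in range(max_frames - n)]
    mask = [True] * n + [False] * (max_frames - n)
    return out, mask
```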

3.3 Preprocessing

  • Sampling rate: 16 kHz
  • Mel spectrogram: n_mels=80, n_fft=512, hop_length=160
  • Normalization: mean/std within the word
  • Caching: all spectrograms precomputed in pickle for speed
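The per-word normalization step can be sketched as follows. Whether the mean/std statistics are computed globally over the word or per mel bin is not stated in the report; global statistics are assumed here:

```python
import math

def normalize(spec, eps=1e-5):
    """Mean/std normalization within a single word segment.
    spec: list of per-frame mel vectors. Assumes global (not per-bin)
    statistics; eps guards against division by zero on silence."""
    values = [v for frame in spec for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) + eps
    return [[(v - mean) / std for v in frame] for frame in spec]
```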

3.4 Training

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Max LR | 5e-4 |
| Scheduler | OneCycleLR (pct_start=0.3) |
| Weight decay | 1e-4 |
| Batch size | 64 |
| Epochs | 100 (early stopping by PER) |
| Loss | CTC (blank=0, zero_infinity=True) |
| Augmentation | None (focus on architecture) |

Best epoch: 30 (by validation PER). After that, slight overfitting (train/val gap increases).

Metrics at best epoch:

  • Val Loss = 0.9392
  • Val PER = 42.19%

4. Results

4.1 Test Metrics (LibriSpeech, greedy decoding, no LM)

| Metric | Value |
|---|---|
| Test Loss | 0.9252 |
| PER (Phoneme Error Rate) | 42.39% |
| CER (Character Error Rate) | 28.63% |

Comparison with Phase 2:

| Metric | Phase 2 (characters) | Phase 3 (phonemes) | Change |
|---|---|---|---|
| CER | 57% | 28.6% | −28.4 percentage points |
| Model | 0.76M / 9.2 MB | 2.0M / 7.65 MB | +1.24M / −1.55 MB |

CER is not directly comparable (phonemes vs characters), but the trend is clear: switching to phonemes radically improved accuracy.
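PER and CER are both edit-distance rates; the only difference is the token unit (phonemes vs characters). A minimal reference implementation of the metric:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences,
    computed row by row to keep memory at O(len(hyp))."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def per(ref, hyp):
    """Phoneme Error Rate: edits divided by reference length.
    Feed character lists instead of phoneme lists to get CER."""
    return edit_distance(ref, hyp) / len(ref)
```

For example, dropping "EY" from the reference "S T EY D" costs one deletion out of four reference phonemes, i.e. PER = 25%.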

4.2 Recognition Examples

| Word | Target Phonemes | Prediction | Comment |
|---|---|---|---|
| stayed | S T EY D | S T EY D | 100% |
| statistics | S T AH T IH S T IH K S | S T AH T IH S T IH K S | 100% |
| existence | IH G Z IH S T AH N S | IH G Z IH S T AH N S | 100% (in Phase 2 — 11%) |
| mothers | M AH DH ER Z | M AH DH ER Z | 100% |
| chlorophyll | K L AO R AH F IH L | K L AO R AH F IH L | 100% (Phase 2: clorifill) |
| napkin | N AE P K IH N | N AE P K IH N | 100% (Phase 2: smalnakton when merged) |

The model predicts phonemes flawlessly for most test words. Errors occur mainly on rare words or with strong accents.
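The greedy decoding used for all metrics in this section is the standard CTC best-path rule: take the argmax class per frame, collapse consecutive repeats, then drop blanks. A minimal sketch over already-argmaxed frame ids:

```python
BLANK = 0  # CTC blank index, matching blank=0 in the loss

def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Greedy (best-path) CTC decoding, no language model:
    collapse consecutive repeats, then remove blank tokens."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Blank between two identical ids keeps them as separate emissions:
decoded = ctc_greedy_decode([0, 5, 5, 0, 5, 7, 7, 0])  # -> [5, 5, 7]
```

Note that a blank between two identical ids is what allows the model to emit the same phoneme twice in a row, e.g. the double "T" in "statistics".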

4.3 Full Pipeline (VAD + Acoustic + Experimental LM)

Additional test: 50‑hour YouTube podcast (various speakers, background noise).

  • Sanday‑VAD (lab version) → words
  • Phase 3 acoustic → phonemes
  • Experimental LM on another LNN architecture → text

Result: WER = 32%

LM: 1.8M parameters (experimental, unstable). Scaling to 10M parameters is in progress.


5. Limitations and Critical Findings

5.1 VAD Problem and Missing Space Token

Word‑level strategy works with an ideal VAD. Reality: VAD sometimes outputs phrases instead of words:

| VAD output | Should be | Reality |
|---|---|---|
| "welcome to" | "welcome" + "to" | Single "word" 1.8 sec |
| "my name is Bagzhan" | 4 separate words | One "word" 3.2 sec |

The model truncates anything beyond 2 seconds. The result: lost word endings, hallucinations, and merged phonemes.

The cause is architectural: 41 classes without a space token. The model cannot separate words within a segment. We implicitly assumed VAD would always give isolated words (0.2–1.2 sec). This was a design error.

5.2 Context Length

The rigid 200‑frame limit (~2 sec) is a compromise for stable training. Words longer than 2 sec are truncated, causing errors even with correct VAD (e.g., "uncharacteristically").

5.3 Dependence on VAD Quality

VAD accuracy of 97% is on clean LibriSpeech recordings. On real noisy podcasts, segmentation errors increase, cascading into worse final WER.


6. Conclusions

6.1 Confirmed Hypotheses

  1. Phonemes are more effective than characters. CER dropped from 57% to 28.6%; recognition quality for complex words improved dramatically (existence: 11% → 100%).

  2. Word‑level processing works — with an ideal VAD. Reduced CfC memory requirements, stable training, predictable errors.

  3. CfC networks are capable of phoneme‑level modeling. A PER of 42.4% with a purely acoustic approach, without an LM, is a competitive result for 2M parameters.

6.2 Architectural Findings

  • Hierarchical CNN with dilation outperformed SincConv+Attention. The "short phonemes → syllables → words" principle works.
  • Blank bias −3.0 — a simple trick that accelerates convergence by ~30%.
  • "20‑file" selection principle saved resources: 23 variants filtered down to 1 without full training runs.

6.3 Urgent Problems

  • Absence of a space token makes the system fragile to VAD errors.
  • Rigid length limit (2 sec) is unacceptable for the real world.
  • Dependence on an external VAD creates cascading errors.

7. Transition to Phase 4

Phase 4 moves from isolated words to contextual processing. We are expanding the time window to 8 seconds and solving the space token problem, allowing the model to handle naturally long phrases.

In parallel, we are testing the hypothesis of whether full CfC complexity is necessary: perhaps simplified dynamics can retain key tempo‑adaptation properties at lower computational cost.

Architecture details, VAD, and language model will be presented in separate technical reports after development is complete.
