
Liquid Sound, Phase 3: From Characters to Phonemes

Achieving phoneme-level accuracy with 2M parameters and a hierarchical CNN frontend.


Author: Bagzhan Karl
Organization: Manifestro
Date: February 22, 2026
Status: Phase 3 completed (phoneme acoustic modeling), transitioning to Phase 4


Abstract

The third phase of the Sanday project tested the hypothesis: would switching from character-level to phoneme-level targets improve recognition quality while maintaining model compactness? We moved from character CTC (28 classes) to ARPAbet (39 phonemes + blank + pad = 41 classes) and introduced word-level processing via a separate VAD.

After 23 experiments with various architectures (frontends, CfC configurations, processing strategies), we selected a hierarchical Residual CNN with dilation + 2 CfC layers. The large model: 2,002,425 parameters (7.65 MB state_dict). On LibriSpeech: PER = 42.4%, CER = 28.6% (greedy decoding, no LM). Full pipeline (VAD + acoustic model + experimental LM) on a 50‑hour podcast: WER = 32%.

The switch to phonemes yielded substantial accuracy improvement (Phase 2 CER: 57% → 28.6%). However, a critical limitation emerged: the absence of a space token and the rigid length constraint (~2 sec) make the model fragile to VAD errors, which occasionally output phrases instead of isolated words. Phase 4 will address this limitation.

LM and VAD are being developed separately and will be presented in independent technical reports.


1. Introduction: Why Phonemes

1.1 The Problem with Character‑Level Approach

Phase 2 ended with a CER of 57% at the character level. Error analysis revealed a fundamental limitation: the model hears correctly but confuses orthography. Examples from Phase 2:

| Word | Prediction | Problem |
|---|---|---|
| existence | ekzistens / ekzistents | Ambiguity of "c" vs "s" |
| chlorophyll | clorifill | "ph" → "f" is phonetically correct but orthographically wrong |
| statistics | statistiks | Morpheme confusion |

The word "existence" was recognized with only 11% accuracy — not because of acoustic weakness, but because of character‑level ambiguity. The model spent capacity memorizing spelling rules instead of pure acoustic modeling.

1.2 Phonemes as a Solution

English orthography is deeply inconsistent (42 phonemes, 26 letters, 500+ spelling patterns). ARPAbet provides a one‑to‑one mapping: one phoneme = one sound. Moving from 28 classes (characters) to 41 classes (39 phonemes + blank + pad) eliminates orthographic ambiguity.

Expected effect: the model learns pure acoustics, and the phoneme‑to‑text conversion is pushed outside the acoustic model (to a future LM).
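The 39-phoneme ARPAbet inventory plus blank and pad can be sketched as a CTC label vocabulary. This is an illustrative reconstruction: the report only fixes blank at index 0 (CTC blank=0), so the position of the pad token is our assumption.

```python
# Hypothetical sketch of the 41-class CTC vocabulary: 39 stress-free
# ARPAbet phonemes + blank + pad. Only blank=0 is fixed by the report;
# pad placement is an assumption.
ARPABET = [
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH",
    "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K",
    "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH",
    "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
]

BLANK, PAD = "<blank>", "<pad>"
vocab = [BLANK] + ARPABET + [PAD]           # blank must sit at index 0 for CTC
phon2id = {p: i for i, p in enumerate(vocab)}
```

One phoneme maps to exactly one class id, so "existence" becomes a sequence of ids with no spelling ambiguity left for the acoustic model to resolve.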

1.3 Word‑Level Strategy

A further change was abandoning phrase-level processing. In Phase 2, the model processed sequences up to 12 seconds, causing long-memory problems for CfC (degradation after 8 sec).

Word‑level processing:

  • An external VAD extracts word boundaries (0.2–1.2 sec)
  • The acoustic model sees only the isolated word plus a small context (0.05 sec on each side)
  • Reduces CfC's temporal memory requirements
  • Simplifies batching and augmentation

Risk: VAD errors in which the detector outputs phrases instead of isolated words. This risk materialized fully and became the key limitation of Phase 3 (see Section 5).


2. Methodology: 23 Architectures and Selection

2.1 Selection Principle

Before choosing the final architecture, we ran 23 experiments with various configurations:

  • Frontends: SincConv + Attention, Residual CNN with different dilation (1,2,4), various kernel sizes
  • CfC configurations: 1‑3 layers, hidden size 256‑1024, backbone units 128‑512
  • Processing strategies: word‑level, phrase‑level, mixed

Selection principle: "If a model cannot overfit 20 files, it cannot generalize to 20,000."

Method: each architecture was overfitted on 20 audio files (high learning rate, many epochs). If CER <5% on these 20 files was not reached within 50 epochs, the architecture was discarded without full training.

This filter eliminated non‑viable options before expensive full‑scale training. Out of 23, 7 passed; from those, the final architecture was chosen based on quality/parameter ratio.
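The overfit filter described above can be sketched as a simple gate over a training run's CER history. The function name and return shape are illustrative, not from the report:

```python
def passes_overfit_filter(cer_history, threshold=0.05, max_epochs=50):
    """Gate from the '20-file' selection principle: keep an architecture
    only if it reaches CER < 5% on the 20-file overfit set within
    50 epochs. Returns (passed, epoch_reached)."""
    for epoch, cer in enumerate(cer_history[:max_epochs], start=1):
        if cer < threshold:
            return True, epoch
    return False, None
```

For example, a run whose per-epoch CER falls 0.9 → 0.4 → 0.04 passes at epoch 3, while a run stuck at 0.5 for 50 epochs is discarded without full-scale training.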

2.2 Final Architecture

CNN Frontend: hierarchical Residual CNN with dilation

| Block | Dilation | Captures |
|---|---|---|
| 1 | 1 | Short phonemes (20–50 ms) |
| 2 | 2 | Syllables (100–200 ms) |
| 3 | 4 | Word-level patterns (300–500 ms) |

After the third block, MaxPool along frequency: 80 → 40 mel bins. Receptive field grows without loss of detail.
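The growth of the receptive field can be checked with a small calculator. The report does not state kernel sizes or the number of convolutions per residual block, so the configuration below (two kernel-3 convs per block, stride 1) is an assumption used purely to illustrate the arithmetic:

```python
def receptive_field(layers, hop_ms=10.0):
    """Receptive field of stacked dilated 1-D convolutions.
    layers: list of (kernel_size, dilation) pairs; stride 1 assumed.
    hop_ms: frame hop in ms (hop_length=160 at 16 kHz = 10 ms)."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf, rf * hop_ms

# Assumed: two 3-tap convs per block, with block dilations 1, 2, 4.
frames, ms = receptive_field([(3, 1), (3, 1), (3, 2), (3, 2), (3, 4), (3, 4)])
```

Under these assumptions the third block sees roughly 29 frames (~290 ms), which is in the word-level range the table above attributes to it.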

Projector: Linear 1920 → 512 + LayerNorm + GELU + Dropout 0.25

CfC Backbone: 2 layers, hidden_size=512, backbone_units=256, dropout=0.25

Classifier: Linear 512 → 256 → GELU → Dropout 0.25 → Linear 41

Critical improvement: bias initialization for blank (index 0) = −3.0. This reduced blank dominance from 90% to 81% by epoch 3 and accelerated convergence by ~30%.
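The effect of the −3.0 blank bias is easy to see at initialization, when all other logits are near zero. A minimal sketch (plain softmax, no framework dependencies):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# At init, all 41 logits are ~0, so blank starts at probability 1/41.
# Setting the blank bias (index 0) to -3.0 suppresses it ~20x.
uniform = softmax([0.0] * 41)
biased  = softmax([-3.0] + [0.0] * 40)
```

Starting the blank class with a much lower prior probability counteracts CTC's early tendency to emit blank everywhere, which is the mechanism behind the faster convergence reported above.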

2.3 Large Model Parameters

| Component | Value |
|---|---|
| CNN channels | 48 |
| Hidden size | 512 |
| CfC layers | 2 |
| Backbone units | 256 |
| Dropout | 0.25 |
| Input | 80 mel filters |
| Output | 41 classes (39 phonemes + blank + pad) |
| Total parameters | 2,002,425 |
| state_dict size | 7.65 MB |

3. Data and Training

3.1 Source

LibriSpeech: ~173,500 words with ARPAbet phoneme transcriptions (obtained via G2P, stresses removed). Split:

| Split | Words |
|---|---|
| Train | 138,771 |
| Validation | 17,346 |
| Test | 17,347 |
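Stress removal from the G2P output amounts to stripping ARPAbet stress digits (e.g. "IH1" → "IH"), collapsing the stressed variants into the 39-phoneme inventory. A minimal sketch:

```python
import re

def strip_stress(phonemes):
    """Drop ARPAbet stress markers: 'IH0'/'IH1'/'IH2' -> 'IH'.
    G2P tools typically emit stressed vowels; consonants are unchanged."""
    return [re.sub(r"\d$", "", p) for p in phonemes]

# "existence" as G2P might emit it, before and after stress removal:
stripped = strip_stress(["IH0", "G", "Z", "IH1", "S", "T", "AH0", "N", "S"])
```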

3.2 VAD and Word‑Level Processing

A separate Sanday‑VAD model (lab prototype, 97% accuracy on clean recordings) extracts word boundaries. The acoustic model receives only the segment [start − 0.05s, end + 0.05s] — an isolated word with minimal context.

Constraint: fixed length of 200 frames (~2 sec at hop_length=160, 16 kHz). Longer words are truncated, shorter words are padded + mask.
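The fixed-length constraint can be sketched as a truncate-or-pad step that also produces the validity mask. Function and parameter names here are illustrative:

```python
MAX_FRAMES = 200  # ~2 s at hop_length=160, 16 kHz

def fix_length(frames, max_frames=MAX_FRAMES, pad_value=0.0):
    """Truncate or zero-pad a spectrogram (list of per-frame mel vectors)
    to a fixed length, returning the frames plus a validity mask.
    Anything beyond max_frames is silently cut -- the failure mode
    discussed in Section 5 when VAD emits multi-word segments."""
    n = min(len(frames), max_frames)
    n_mels = len(frames[0])
    out = frames[:n] + [[pad_value] * n_mels for _ in range(max_frames - n)]
    mask = [True] * n + [False] * (max_frames - n)
    return out, mask
```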

3.3 Preprocessing

  • Sampling rate: 16 kHz
  • Mel spectrogram: n_mels=80, n_fft=512, hop_length=160
  • Normalization: mean/std within the word
  • Caching: all spectrograms precomputed in pickle for speed
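The per-word normalization step can be sketched as follows. Whether the mean/std statistics are computed globally over the word or per mel bin is not stated in the report; global statistics are assumed here:

```python
import math

def normalize(spec, eps=1e-5):
    """Mean/std normalization within a single word segment.
    spec: list of per-frame mel vectors. Assumes global (not per-bin)
    statistics; eps guards against division by zero on silence."""
    values = [v for frame in spec for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) + eps
    return [[(v - mean) / std for v in frame] for frame in spec]
```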

3.4 Training

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Max LR | 5e-4 |
| Scheduler | OneCycleLR (pct_start=0.3) |
| Weight decay | 1e-4 |
| Batch size | 64 |
| Epochs | 100 (early stopping by PER) |
| Loss | CTC (blank=0, zero_infinity=True) |
| Augmentation | None (focus on architecture) |

Best epoch: 30 (by validation PER). After that, slight overfitting (train/val gap increases).

Metrics at best epoch:

  • Val Loss = 0.9392
  • Val PER = 42.19%

4. Results

4.1 Test Metrics (LibriSpeech, greedy decoding, no LM)

| Metric | Value |
|---|---|
| Test Loss | 0.9252 |
| PER (Phoneme Error Rate) | 42.39% |
| CER (Character Error Rate) | 28.63% |

Comparison with Phase 2:

| Metric | Phase 2 (characters) | Phase 3 (phonemes) | Change |
|---|---|---|---|
| CER | 57% | 28.6% | −28.4 percentage points |
| Model | 0.76M / 9.2 MB | 2.0M / 7.65 MB | +1.24M / −1.55 MB |

CER is not directly comparable (phonemes vs characters), but the trend is clear: switching to phonemes radically improved accuracy.
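PER and CER are both edit-distance rates; the only difference is the token unit (phonemes vs characters). A minimal reference implementation of the metric:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences,
    computed row by row to keep memory at O(len(hyp))."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def per(ref, hyp):
    """Phoneme Error Rate: edits divided by reference length.
    Feed character lists instead of phoneme lists to get CER."""
    return edit_distance(ref, hyp) / len(ref)
```

For example, dropping "EY" from the reference "S T EY D" costs one deletion out of four reference phonemes, i.e. PER = 25%.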

4.2 Recognition Examples

| Word | Target Phonemes | Prediction | Comment |
|---|---|---|---|
| stayed | S T EY D | S T EY D | 100% |
| statistics | S T AH T IH S T IH K S | S T AH T IH S T IH K S | 100% |
| existence | IH G Z IH S T AH N S | IH G Z IH S T AH N S | 100% (in Phase 2 — 11%) |
| mothers | M AH DH ER Z | M AH DH ER Z | 100% |
| chlorophyll | K L AO R AH F IH L | K L AO R AH F IH L | 100% (Phase 2: clorifill) |
| napkin | N AE P K IH N | N AE P K IH N | 100% (Phase 2: smalnakton when merged) |

The model predicts phonemes flawlessly for most test words. Errors occur mainly on rare words or with strong accents.
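The greedy decoding used for all metrics in this section is the standard CTC best-path rule: take the argmax class per frame, collapse consecutive repeats, then drop blanks. A minimal sketch over already-argmaxed frame ids:

```python
BLANK = 0  # CTC blank index, matching blank=0 in the loss

def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Greedy (best-path) CTC decoding, no language model:
    collapse consecutive repeats, then remove blank tokens."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Blank between two identical ids keeps them as separate emissions:
decoded = ctc_greedy_decode([0, 5, 5, 0, 5, 7, 7, 0])  # -> [5, 5, 7]
```

Note that a blank between two identical ids is what allows the model to emit the same phoneme twice in a row, e.g. the double "T" in "statistics".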

4.3 Full Pipeline (VAD + Acoustic + Experimental LM)

Additional test: 50‑hour YouTube podcast (various speakers, background noise).

  • Sanday‑VAD (lab version) → words
  • Phase 3 acoustic → phonemes
  • Experimental LM on another LNN architecture → text

Result: WER = 32%

LM: 1.8M parameters (experimental, unstable). Scaling to 10M parameters is in progress.


5. Limitations and Critical Findings

5.1 VAD Problem and Missing Space Token

Word‑level strategy works with an ideal VAD. Reality: VAD sometimes outputs phrases instead of words:

| VAD output | Should be | Reality |
|---|---|---|
| "welcome to" | "welcome" + "to" | Single "word" 1.8 sec |
| "my name is Bagzhan" | 4 separate words | One "word" 3.2 sec |

The model truncates anything beyond 2 seconds. The result: lost word endings, hallucinations, and merged phonemes.

The cause is architectural: 41 classes without a space token. The model cannot separate words within a segment. We implicitly assumed VAD would always give isolated words (0.2–1.2 sec). This was a design error.

5.2 Context Length

The rigid 200‑frame limit (~2 sec) is a compromise for stable training. Words longer than 2 sec are truncated, causing errors even with correct VAD (e.g., "uncharacteristically").

5.3 Dependence on VAD Quality

VAD accuracy of 97% is on clean LibriSpeech recordings. On real noisy podcasts, segmentation errors increase, cascading into worse final WER.


6. Conclusions

6.1 Confirmed Hypotheses

  1. Phonemes are more effective than characters. CER dropped from 57% to 28.6%; recognition quality for complex words improved dramatically (existence: 11% → 100%).

  2. Word‑level processing works — with an ideal VAD. Reduced CfC memory requirements, stable training, predictable errors.

  3. CfC networks are capable of phoneme‑level modeling. A PER of 42.4% with a purely acoustic approach, without an LM, is a competitive result for 2M parameters.

6.2 Architectural Findings

  • Hierarchical CNN with dilation outperformed SincConv+Attention. The "short phonemes → syllables → words" principle works.
  • Blank bias −3.0 — a simple trick that accelerates convergence by ~30%.
  • "20‑file" selection principle saved resources: 23 variants filtered down to 1 without full training runs.

6.3 Urgent Problems

  • Absence of a space token makes the system fragile to VAD errors.
  • Rigid length limit (2 sec) is unacceptable for the real world.
  • Dependence on an external VAD creates cascading errors.

7. Transition to Phase 4

Phase 4 moves from isolated words to contextual processing. We are expanding the time window to 8 seconds and solving the space token problem, allowing the model to handle naturally long phrases.

In parallel, we are testing the hypothesis of whether full CfC complexity is necessary: perhaps simplified dynamics can retain key tempo‑adaptation properties at lower computational cost.

Architecture details, VAD, and language model will be presented in separate technical reports after development is complete.
