Abstract
The second phase of the Sanday project tested a set of hypotheses: whether data balancing would eliminate gender bias, whether CfC networks could adapt to multiple speakers without increasing parameters, and whether fine‑tuning would serve as a generalization mechanism. We fine‑tuned the model (760K parameters, 3.1 MB) on 10,000 recordings from Common Voice with a strict 50/50 gender split.
The results confirmed all three hypotheses. The gender gap in WER dropped from 35% to less than 0.5%. Character Error Rate fell from 85‑90% to 57%, indicating successful acquisition of acoustic patterns from diverse speakers. Fine‑tuning worked: the model adapted to new data without catastrophic forgetting of Phase 1.
However, training was stopped prematurely. We had planned 80 epochs but completed only 35. The decision was impulsive: the progress in CER and gender parity was perceived as sufficient to conclude the phase. Subsequent analysis revealed incomplete convergence — loss was still decreasing. A restart is impossible because the Common Voice distribution has since been updated, making Phase 2 conditions unreproducible.
WER increased from 94% to 96.8%. This is not a regression of the model but a reflection of Common Voice's complexity: regional accents, variable recording quality, spontaneous speech. The model became more accurate at the character level but made more word‑level errors. The contribution of early stopping to this deterioration is immeasurable — a fully trained model might have performed better.
We have reached the ceiling of the current architecture in real‑world conditions. Phase 3 will focus on testing alternative signal processing approaches.
1. Introduction: From Lab to Reality
Phase 1 proved that a CfC network with 760K parameters can extract phonetic patterns — but in sterile conditions: a single speaker, studio recording, no noise. That was a "laboratory" baseline necessary to verify the basic viability of the architecture.
Phase 2 aimed to move toward real‑world conditions without increasing parameters. We chose two improvement vectors:
Gender balance. LJSpeech contains only a female voice, creating critical bias. We hypothesized that a strict 50/50 balance in the new dataset would eliminate the quality gap between genders.
Speaker variability. Common Voice provides thousands of different voices with varying timbre, pace, and accent. The task: test whether a compact CfC network can generalize to multiple speakers without parameter growth, simply through fine‑tuning.
We also introduced SpecAugment to improve noise robustness, though we understood that a true test of noise resilience requires specialized datasets we did not have.
2. Methodology
2.1 Data
Common Voice: 10,000 audio clips (~28 hours), multiple speakers, diverse recording quality (from studio microphones to phone recorders), spontaneous speech with incomplete sentences and conversational reductions.
Stratification:
- Train: 8,000 (4,000 male / 4,000 female)
- Validation: 1,000 (500 / 500)
- Test: 1,000 (500 / 500)
Average duration: 4.2 seconds, maximum 12 seconds. This is an important difference from LJSpeech: longer phrases, not always complete sentences.
2.2 Architecture
Identical to Phase 1:
- Frontend: 3 Residual blocks, 1D‑CNN
- Backend: 2 CfC layers (
ncps) - Parameters: 764,956 (~760K)
- Size: 3.1 MB (state_dict), 9.2 MB (checkpoint)
Fine‑tuning with early CNN layers frozen. Only CfC layers were trained — protection against overfitting and preservation of Phase 1 phonetic representations.
2.3 Augmentation
SpecAugment: freq_mask=15, time_mask=35. Applied on‑the‑fly during training.
2.4 Training
- Optimizer: AdamW
- Learning rate: 1e-4 (same as Phase 1)
- Planned: 80 epochs
- Actual: 35 epochs
Training was stopped after reaching acceptable metrics for CER and gender parity. The decision was impulsive: progress was taken as sufficient to end the phase. Subsequent analysis of the loss curves showed incomplete convergence — loss was still decreasing. A restart is impossible because Common Voice data has been updated and the distribution has changed.
3. Results
3.1 Training Dynamics
| Epoch | Train Loss | Val Loss | Note |
|---|---|---|---|
| 1 | 2.1 | 2.05 | Fast start, transfer learning from Phase 1 |
| 10 | 1.85 | 1.82 | Stabilization |
| 20 | 1.78 | 1.77 | Plateau |
| 30 | 1.74 | 1.76 | SpecAugment effect visible |
| 35 (stop) | 1.72 | 1.7626 | Final point |
After stopping, analysis showed loss was still slowly decreasing. The optimum likely lay around 50‑60 epochs.
3.2 Comparison with Phase 1
| Metric | Phase 1 (LJSpeech) | Phase 2 (Common Voice) | Change |
|---|---|---|---|
| CER | 85‑90% | 57% | ‑28 to ‑33 pp |
| WER | 94% | 96.8% | +2.8 pp |
| Val Loss | 0.49 | 1.76 | Not comparable (different data) |
3.3 Gender Parity
| Gender | CER | WER | Gap vs Phase 1 |
|---|---|---|---|
| Male | 57.2% | 97.0% | Previously: model did not work |
| Female | 56.8% | 96.6% | Previously: 94% WER |
| Difference | 0.4% | 0.4% | Previously: ~35% |
Gender bias completely eliminated.
3.4 Long Sequence Problem
A critical limitation was discovered: for audio longer than 8 seconds, the model "swallows" the end — it stops decoding the last 10‑15% of the time. This is an architectural feature of the current CfC backend implementation, unrelated to training.
4. Error Analysis
4.1 Why WER Increased While CER Decreased
The nature of errors changed qualitatively:
Phase 1 (LJSpeech): Errors were orthographic. The model heard correctly but wrote phonetically ("threts" instead of "threats"). CER was high due to systematic substitutions; WER was relatively low because word structure was predictable.
Phase 2 (Common Voice): Errors are segmentation and contextual. The model confuses word boundaries in spontaneous speech, loses coherence in long phrases, and struggles with regional accents. CER is low — individual characters are recognized more accurately. WER is high — word integrity breaks down.
Example:
| Ground Truth | Prediction | Error Type |
|---|---|---|
| "gonna go home" | "gon ago home" | Word merging (spontaneous speech) |
| "thomas won the game" | "tommas pon the gain" | Accent + segmentation |
| "particularly interesting" | "particuly interestin" | Unstressed vowel reduction |
4.2 Impact of Early Stopping
It is impossible to separate the contribution of incomplete training from data complexity. Possible scenarios:
- Optimistic: A fully trained model would have achieved WER ~90%, retaining the CER improvement.
- Realistic: WER would have stayed at 95‑97%; Common Voice data is fundamentally harder.
- Pessimistic: Overfitting at epochs 40‑80 would have degraded generalization.
Data for verification is lost.
5. Limitations
5.1 Engineering Mistake
The main limitation of Phase 2 is premature training stop. We did not wait for convergence despite clear signs of ongoing improvement. This is not a methodological flaw but an execution error: an impulsive decision to freeze progress before reaching a stable plateau.
5.2 Architectural Limitations
- Long audio problem: Quality degrades on segments >8 seconds. The CfC backend loses temporal coherence on long sequences.
- No language model: Homophones, orthographic ambiguity, lack of semantic correction. This is a deliberate choice but caps the quality ceiling.
- Spectrogram frontend: Loss of phase information, fixed temporal resolution.
5.3 Data
Common Voice provides variability but not controlled conditions. It is impossible to separate the influence of accent from recording quality, or noise from spontaneous speech. This is realistic but complicates diagnostics.
6. Conclusions
6.1 Confirmed Hypotheses
- Gender balancing works. Gap from 35% to <0.5% — complete bias elimination through data stratification.
- CfC adapt to multiple speakers. CER 57% on multi‑speaker dataset vs 85‑90% on single speaker — evidence of generalization.
- Fine‑tuning is effective. The model retained Phase 1 knowledge and adapted to new conditions without parameter increase.
6.2 Uncertainties
- The true potential of the architecture remains unexplored due to early stopping.
- The contribution of SpecAugment to robustness was not quantitatively assessed.
- The long‑audio problem requires an architectural solution, not more training.
6.3 Practical Result
We obtained a gender‑neutral acoustic model with predictable errors. It is not production‑ready but is suitable for research purposes and as a foundation for language modeling.
7. Transition to Phase 3
Experience from Phase 2 revealed the limits of the current approach. Fine‑tuning works, but the architecture with mel‑spectrogram and a fixed CfC backend has inherent limitations: loss of phase information, degradation on long sequences, fixed temporal resolution.
Phase 3 will focus on testing alternative signal‑processing architectures. We are considering approaches that work with raw audio without intermediate representations, and streaming mechanisms that constrain context to manageable chunks.
The parameter budget will remain comparable (~2M parameters). Language modeling is not yet planned; we continue to explore the potential of purely acoustic approaches.