(gp_transcription_benchmarks=)
GlobalPhone transcription benchmarks#
MFA has some simple language model training and transcription capabilities in addition to its alignment functionality. Word Error Rate (WER) and Character Error Rate (CER) were calculated over the GlobalPhone corpora below. Note that unlike the English transcription benchmarks, these datasets were included in the training data, and so the results will not be representative of performance on your own datasets. The reason for training on GlobalPhone is that many of the languages here have very little data outside of it, so sacrificing users' alignment performance for the benefit of cleaner benchmark metrics does not seem worth it to me.
Datasets#
The following datasets were used in evaluation:
Please note that a number of fixes were applied to these corpora to clean them up and ensure that they worked properly. See Aanchan Mohan’s write-up of GlobalPhone corpus fixes. Also note that the evaluations here use the 2015 version of GlobalPhone (listed in the manual as 3.1), not the 2017 version (listed on ELRA as 1.0).
Warning
The evaluation data here was included in the training for all models, and so is not likely to be fully representative of performance on your data. See English transcription benchmarks for a transcription evaluation on unseen data.
Experimental set up#
Each language had a language model trained on just the GlobalPhone data, using the MFA lexicons as input to make the models maximally comparable (as the GlobalPhone lexicons are tailored to this particular corpus).
Mandarin was trained on Pinyin romanization and Pinyin phones in 1.0, following GlobalPhone’s lexicon, so a separate language model was trained on the romanized transcripts for the 1.0 Mandarin GP evaluation.
Models
Bulgarian
Croatian
Czech
French
German
Hausa
Korean
Mandarin
Polish
Portuguese
Russian
Spanish
Swahili
Swedish
Thai
Turkish
Ukrainian
Vietnamese
Benchmarks#
Word error rate#
In general we see improvements from the GlobalPhone lexicon-based models to the 2.0 MFA phone sets, particularly for the East Asian character-based models (Thai, Vietnamese, Korean), along with big improvements for Ukrainian and Bulgarian. Mandarin is better under GlobalPhone’s approach, however, suggesting that there are likely gaps in the Hanzi-based lexicon used in the latest versions. We can also see an increased error rate for some languages with the 2.0 trained models due to a bug in silence estimation, but 2.0a improves performance to below the 1.0 baseline.
Note
To make the points easier to see, Korean 1.0 GP results have been excluded. For word error rate, it was much higher than any other model, at 28.3%. The primary factor in its poor performance is likely the lexicon, which often has plain stops in place of tense ones.
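For reference, the word error rates reported here follow the standard definition: the Levenshtein (edit) distance between the reference and hypothesis word sequences, divided by the reference length. A minimal sketch (not MFA's internal implementation, just the standard metric):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1]

def wer(ref_words, hyp_words):
    """Word error rate: edits divided by reference length."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# One substitution in a three-word reference gives a WER of 1/3
print(wer("the cat sat".split(), "the hat sat".split()))
```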
Character error rate#
Character error rate shows similar patterns to the word error rate analysis, but it does exacerbate the discrepancies for Mandarin between Pinyin and Hanzi-based lexicons, as Hanzi character substitutions will be more costly than Pinyin substitutions.
Note
As above, Korean 1.0 GP results have been excluded. For character error rate, it was much higher than any other model, at 23.1%, for the same reasons as above.
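Character error rate is the same edit distance computed over characters rather than words. A toy sketch (with made-up strings, not actual benchmark output) of why Hanzi substitutions are costlier than Pinyin ones: a single confused syllable changes one character out of two in Hanzi, but only one character out of seven in its Pinyin romanization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits divided by reference length in characters."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# One wrong syllable: 1 of 2 Hanzi characters vs. 1 of 7 Pinyin characters
print(cer("你好", "尼好"))          # 0.5
print(cer("ni3hao3", "ni2hao3"))   # ~0.143
```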