(korean_alignment_benchmarks=)

Korean alignment benchmarks#

Dataset#

The dataset used for this benchmark is the Seoul Corpus. The Seoul Corpus was modeled off of the Buckeye Corpus to create a phonetically/phonemically hand-aligned corpus of Seoul Korean. The corpus consists of 40 speakers of Seoul Korean with 20 male speakers and 20 female speakers, along with 10 speakers each in their teens, twenties, thirties and forties. Similar to the Buckeye Corpus, socio-economic class was also not controlled, but the setting of academic sociolinguistic interviews will bias towards middle to upper class.

The corpus was transribed in Hangul and aligned in HTK, and then corrected by hand. The transcription is more phonemic than the Buckeye Corpus’s phone set (though, even the final Buckeye phone set is not as as phonetic as the original TIMIT-based set they used).

The dataset is freely available on OpenSLR. The reorganization script here is the basis of the testing data, and creates input TextGrids to align and reference textgrids to compare against in the alignment evaluation script, along with the necessary mapping files to the Seoul Corpus phone set from MFA’s phone set and GlobalPhone’s phone set.

Benchmarks#

These benchmarks were performed using MFA v2.0.0rc8, which was used to train the latest Korean MFA acoustic model v2.0.0a. The Korean 1.0 models were trained and released as part of MFA v1.0.1 and used the GlobalPhone Korean lexicon. The Korean MFA acoustic model v2.0.0 was trained as part of the MFA v2.0.0rc5 release.

Note

This benchmark is not particularly great, because the Seoul Corpus is used as training data for the Korean mfa model. The reason for this choice is due to the limited data for Korean speech over all, so the Seoul Corpus accounts for 36% of the training hours in the Korean MFA model.

I’d rather have a better model with 119 hours of training data for Korean than a more accurate benchmark for a model with 76 hours of training data in this case, but see English alignment benchmarks for an alignment benchmark on unseen data for American English.

Alignment score#

Alignment score represents the average boundary error between the reference alignment and the aligner’s output. The two phone sequences are aligned with BioPython’s pairwise2 module using the mapping files to establish “identical” phones across the different phone sets. Then alignment score is calculated as the average distance of the average start and end boundary distance to the reference phone’s start and end. Thus, it can be interpreted as the average error in seconds per boundary.

Korean 2.0 mfa model vastly outperform 1.0 GlobalPhone models for several reasons:

1.0 GlobalPhone model was not trained on the Seoul Corpus, but mfa model were. However, original testing of the alignment scores for MFA 2.0 had alignment scores ~30ms, so the later versions still would outperform the 1.0 GlobalPhone without cheating by training on the test data.
The GlobalPhone lexicon does not have great coverage of the Seoul Corpus, as it was created for the GlobalPhone corpus specifically, so a number of unknown words affect its performance.
Additionally, there are a number of errors in the GlobalPhone lexicon that lead to reduced performance:
1. Some tense stops ㅃ /p͈/, ㄸ /t͈/, and ㄲ /k͈/ transcribed as B, D, and G instead of BB, DD, and GG, respectively, based on their documentation.
2. The grapheme ㄹ /l~ɾ/ is sometimes transcribed in GlobalPhone as [N] instead of [L] or [R] (regardless of phonological context, like at the beginning of a word).
3. Some characteristics of the vowels in their documentation are not accurate. There’s no length distinction, and the closest examples in English don’t align (“IPA” e -> GP E -> English “bet” /b ε t/, “IPA” ε -> GP AE -> English “cat” /k æ t/).

Korean alignment benchmarks#

Dataset#

Benchmarks#

Alignment score#

Phone error rate#