Tatar CV dictionary v2.0.0

Maintainer: Vox Communis
Language: Tatar
Dialect: N/A
Phone set: Epitran
Number of words: 22,664
Phones: a b d e f h i j k l m n o p r s t u w x y z ø ŋ ɑ ɕ ɡ ɤ ɯ ʃ ʑ ʒ ʔ
License: CC-0
Compatible MFA version: v2.0.0
Citation:

@misc{Ahn_Chodroff_2022,
	author={Ahn, Emily and Chodroff, Eleanor},
	title={VoxCommunis Corpus},
	address={\url{https://osf.io/t957v}},
	publisher={OSF},
	year={2022},
	month={Jan}
}

If you have comments or questions about this dictionary or its phone set, you can check previous MFA model discussion posts or create a new one.

Acoustic models

Tatar CV acoustic model v2.0.0

Installation

Install from the MFA command line:

mfa model download dictionary tatar_cv

Or download from the release page.

Intended use

This dictionary is intended for forced alignment of Tatar transcripts.

This dictionary uses the Epitran phone set for Tatar, and was used in training the Tatar Epitran acoustic model. Pronunciations can be added on top of the dictionary, as long as no additional phones are introduced.
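The "no additional phones" constraint can be checked before new entries are added. The following is a minimal sketch, not part of MFA: it assumes pronunciations are written as space-separated phones (as in the examples on this page), and the helper name `unknown_phones` and the candidate entries are hypothetical.

```python
# Phone inventory copied from the "Phones" line of this page.
TATAR_CV_PHONES = set(
    "a b d e f h i j k l m n o p r s t u w x y z ø ŋ ɑ ɕ ɡ ɤ ɯ ʃ ʑ ʒ ʔ".split()
)


def unknown_phones(pronunciation: str) -> list[str]:
    """Return the phones of a space-separated pronunciation that are outside the inventory."""
    return [p for p in pronunciation.split() if p not in TATAR_CV_PHONES]


# Hypothetical candidate entries: the first reuses an example from the IPA
# charts, the second deliberately contains a phone (ɛ) that is not in the set.
candidates = [
    ("кидем", "k i d e m"),
    ("кидем", "k i d ɛ m"),
]

for word, pron in candidates:
    bad = unknown_phones(pron)
    status = "ok" if not bad else f"rejected, unknown phones: {bad}"
    print(f"{word}\t{pron}\t{status}")
```

Entries that pass this check use only phones the acoustic model was trained on; entries that fail would need to be rewritten with phones from the inventory.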

Performance Factors

Adding pronunciations generally improves alignment accuracy, especially for speech styles and dialects that the dictionary does not cover well. The largest gains usually come from adding reduced variants that delete segments or syllables, as is common in spontaneous speech. Alignment must include every phone specified in a word's pronunciation, and each phone has a minimum duration (10 ms by default). If a speaker reduces a multisyllabic word to a single syllable, MFA may be unable to fit all the segments into the available time, which causes alignment errors on adjacent words as well.
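The arithmetic behind the minimum-duration constraint can be sketched as follows. This is an illustration only, not MFA code: the function names are hypothetical, the 10 ms figure is the default stated above, and the reduced variant of "уятты" is invented for the example rather than an attested pronunciation.

```python
MIN_PHONE_DURATION_MS = 10  # default minimum duration per phone, as stated above


def min_word_duration_ms(pronunciation: str, min_phone_ms: int = MIN_PHONE_DURATION_MS) -> int:
    """Shortest time (ms) needed to fit every phone of a space-separated pronunciation."""
    return len(pronunciation.split()) * min_phone_ms


def fits(pronunciation: str, available_ms: int) -> bool:
    """True if all phones can be placed within the available time."""
    return min_word_duration_ms(pronunciation) <= available_ms


full_form = "u j ɑ t t ɤ"   # уятты, taken from the IPA chart examples
reduced_form = "u j t ɤ"    # hypothetical reduced variant, for illustration only

spoken_ms = 50  # assumed duration of a heavily reduced token

print(min_word_duration_ms(full_form), fits(full_form, spoken_ms))
print(min_word_duration_ms(reduced_form), fits(reduced_form, spoken_ms))
```

Under these assumptions the six-phone form needs at least 60 ms and cannot fit into a 50 ms token, so the aligner would have to take time from neighbouring words, while the four-phone variant fits.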

Ethical considerations

Deploying any Speech-to-Text model into any production setting has ethical implications. You should consider these implications before use.

Demographic Bias

You should assume every machine learning model has demographic bias unless proven otherwise. For pronunciation dictionaries, transcription accuracy and lexicon coverage are often better for the prestige variety modeled in the dictionary than for other varieties. If you are using this dictionary in production, you should acknowledge this as a potential issue.

IPA Charts

Consonants

Within each cell, obstruent symbols on the left are unvoiced and those on the right are voiced.

Manner	Labial	Labiodental	Alveolar	Alveopalatal	Palatal	Velar	Glottal
Nasal	m		n			ŋ	
Stop	p b		t d			k ɡ	ʔ
Sibilant			s z	ʃ ʒ	ɕ ʑ		
Fricative		f					h
Approximant	w				j		
Trill			r				
Lateral			l				

Phone	Occurrences	Examples
m	5,846	сумга: [s u m ɡ ɑ]; кидем: [k i d e m]; форум: [f o r u m]; миләш: [m i l a ʃ]
n	12,539	җыен: [ʑ ɤ e n]; кенә: [k e n a]; туена: [t u e n ɑ]; иркен: [i r k e n]
ŋ	2,356	ирең: [i r e ŋ]; туң: [t u ŋ]; сукаң: [s u k ɑ ŋ]; аеның: [ɑ e n ɤ ŋ]
p	2,019	торып: [t o r ɤ p]; кисеп: [k i s e p]; биләп: [b i l a p]; апам: [ɑ p ɑ m]
b	3,521	буаз: [b u ɑ z]; биләп: [b i l a p]; бите: [b i t e]; бәһа: [b a h ɑ]
t	8,955	уятты: [u j ɑ t t ɤ]; туена: [t u e n ɑ]; туй: [t u j]; торып: [t o r ɤ p]
d	4,150	радар: [r ɑ d ɑ r]; дияр: [d i j ɑ r]; кидем: [k i d e m]; чыдар: [ɕ ɤ d ɑ r]
k	9,337	ачлык: [ɑ ɕ l ɤ k]; кенә: [k e n a]; торак: [t o r ɑ k]; кидем: [k i d e m]
ɡ	5,630	сумга: [s u m ɡ ɑ]; азагы: [ɑ z ɑ ɡ ɤ]; ачуга: [ɑ ɕ u ɡ ɑ]; ягъни: [j ɑ ɡ ʔ n i]
ʔ	577	ягъни: [j ɑ ɡ ʔ n i]; июль: [i ɯ l ʔ]; тверь: [t w e r ʔ]; яшьне: [j ɑ ʃ ʔ n e]
s	5,631	сумга: [s u m ɡ ɑ]; кисеп: [k i s e p]; саумы: [s ɑ u m ɤ]; аскын: [ɑ s k ɤ n]
z	2,840	буаз: [b u ɑ z]; азагы: [ɑ z ɑ ɡ ɤ]; йөзә: [j ø z a]; зират: [z i r ɑ t]
ʃ	3,051	эшләү: [e ʃ l a y]; миләш: [m i l a ʃ]; кашык: [k ɑ ʃ ɤ k]; кыш: [k ɤ ʃ]
ʒ	112	пляж: [p l j ɑ ʒ]; жгут: [ʒ ɡ u t]; жуаны: [ʒ u ɑ n ɤ]; стаж: [s t ɑ ʒ]
ɕ	2,823	ачлык: [ɑ ɕ l ɤ k]; ачуга: [ɑ ɕ u ɡ ɑ]; чыдар: [ɕ ɤ d ɑ r]; мичкә: [m i ɕ k a]
ʑ	863	җыен: [ʑ ɤ e n]; әҗере: [a ʑ e r e]; җаным: [ʑ ɑ n ɤ m]; җигеп: [ʑ i ɡ e p]
f	598	форум: [f o r u m]; фани: [f ɑ n i]; туфан: [t u f ɑ n]; җәфа: [ʑ a f ɑ]
h	291	бәһа: [b a h ɑ]; шәһри: [ʃ a h r i]; мөһим: [m ø h i m]; һөнәр: [h ø n a r]
w	937	вобла: [w o b l ɑ]; ватып: [w ɑ t ɤ p]; тавис: [t ɑ w i s]; явым: [j ɑ w ɤ m]
j	4,552	уятты: [u j ɑ t t ɤ]; туй: [t u j]; дияр: [d i j ɑ r]; укый: [u k ɤ j]
r	12,208	торып: [t o r ɤ p]; торак: [t o r ɑ k]; радар: [r ɑ d ɑ r]; әҗере: [a ʑ e r e]
l	10,913	эшләү: [e ʃ l a y]; ачлык: [ɑ ɕ l ɤ k]; килен: [k i l e n]; миләш: [m i l a ʃ]

Vowels

Within each cell, vowel symbols on the left are unrounded and those on the right are rounded.

	Front	Central	Back
Close	i y		ɯ u
Close-Mid	e ø		ɤ o
Open-Mid			
Open		a	ɑ

Phone	Occurrences	Examples
i	5,455	дияр: [d i j ɑ r]; кидем: [k i d e m]; кисеп: [k i s e p]; иркен: [i r k e n]
y	1,884	эшләү: [e ʃ l a y]; үлем: [y l e m]; сөрү: [s ø r y]; күн: [k y n]
ɯ	311	июль: [i ɯ l ʔ]; бюст: [b ɯ s t]; юка: [ɯ k ɑ]; ешаю: [j e ʃ ɑ ɯ]
u	3,554	буаз: [b u ɑ z]; уятты: [u j ɑ t t ɤ]; туена: [t u e n ɑ]; туй: [t u j]
e	12,038	эшләү: [e ʃ l a y]; җыен: [ʑ ɤ e n]; кенә: [k e n a]; туена: [t u e n ɑ]
ø	1,664	йөзә: [j ø z a]; йөзе: [j ø z e]; өчен: [ø ɕ e n]; сөрү: [s ø r y]
ɤ	12,544	җыен: [ʑ ɤ e n]; ачлык: [ɑ ɕ l ɤ k]; уятты: [u j ɑ t t ɤ]; торып: [t o r ɤ p]
o	2,507	торып: [t o r ɤ p]; торак: [t o r ɑ k]; форум: [f o r u m]; йокым: [j o k ɤ m]
a	11,405	эшләү: [e ʃ l a y]; кенә: [k e n a]; әҗере: [a ʑ e r e]; йөзә: [j ø z a]
ɑ	21,754	ачлык: [ɑ ɕ l ɤ k]; буаз: [b u ɑ z]; уятты: [u j ɑ t t ɤ]; туена: [t u e n ɑ]
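This page does not state how the occurrence counts in the charts above were computed. One plausible reading, and it is only an assumption, is that each count is the number of times a phone appears across all pronunciations in the dictionary. Under that assumption, a tally over a handful of entries copied from the examples could be sketched like this:

```python
from collections import Counter

# A few word/pronunciation pairs copied from the chart examples; the full
# dictionary has 22,664 words, so these totals are tiny by comparison.
entries = {
    "сумга": "s u m ɡ ɑ",
    "кидем": "k i d e m",
    "форум": "f o r u m",
    "миләш": "m i l a ʃ",
}

# Count every phone token across all pronunciations (assumed counting scheme).
phone_counts = Counter(
    phone for pronunciation in entries.values() for phone in pronunciation.split()
)

for phone, count in phone_counts.most_common():
    print(f"{phone}\t{count}")
```

Running the same tally over a local copy of the dictionary would show which phones are rare (for example ʒ or h in the charts) and therefore backed by less training evidence.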