Corpora

| ID | Language | Dialect | License |
| --- | --- | --- | --- |
| A Scripted Pakistani English Daily-use Speech Corpus | English | India | CC BY-NC-ND 4.0 |
| African-accented French | French | N/A | Apache 2.0 |
| AI-DataTang Corpus | Mandarin | China;Erhua | CC BY-NC-ND 4.0 |
| AISHELL-3 | Mandarin | China;Erhua | Apache 2.0 |
| ALFFA Swahili | Swahili | N/A | MIT |
| ARU English corpus | English | UK | CC BY 3.0 |
| ASR-KCSC A Korean Conversational Speech Corpus | Korean | N/A | CC BY-NC-ND 4.0 |
| ASR-SKDuSC A Scripted Korean Daily-use Speech Corpus | Korean | N/A | CC BY-NC-ND 4.0 |
| Buckeye Corpus | English | US | Buckeye License |
| Common Voice Abkhaz v7_0 | Abkhaz | N/A | CC-0 |
| Common Voice Arabic v8_0 | Arabic | N/A | CC-0 |
| Common Voice Armenian v7_0 | Armenian | N/A | CC-0 |
| Common Voice Bashkir v7_0 | Bashkir | N/A | CC-0 |
| Common Voice Basque v7_0 | Basque | N/A | CC-0 |
| Common Voice Belarusian v7_0 | Belarusian | N/A | CC-0 |
| Common Voice Bulgarian v16_1 | Bulgarian | N/A | CC-0 |
| Common Voice Bulgarian v7_0 | Bulgarian | N/A | CC-0 |
| Common Voice Bulgarian v8_0 | Bulgarian | N/A | CC-0 |
| Common Voice Bulgarian v9_0 | Bulgarian | N/A | CC-0 |
| Common Voice Chinese (China) v16_1 | Mandarin | China;Erhua | CC-0 |
| Common Voice Chinese (China) v8_0 | Mandarin | China;Erhua | CC-0 |
| Common Voice Chinese (China) v9_0 | Mandarin | China;Erhua | CC-0 |
| Common Voice Chinese (Taiwan) v16_1 | Mandarin | Taiwan | CC-0 |
| Common Voice Chinese (Taiwan) v8_0 | Mandarin | Taiwan | CC-0 |
| Common Voice Chinese (Taiwan) v9_0 | Mandarin | Taiwan | CC-0 |
| Common Voice Chuvash v7_0 | Chuvash | N/A | CC-0 |
| Common Voice Czech v7_0 | Czech | N/A | CC-0 |
| Common Voice Czech v8_0 | Czech | N/A | CC-0 |
| Common Voice Czech v9_0 | Czech | N/A | CC-0 |
| Common Voice Dutch v7_0 | Dutch | N/A | CC-0 |
| Common Voice English v8_0 | English | Nigeria;UK;US | CC-0 |
| Common Voice French v16_1 | French | N/A | CC-0 |
| Common Voice French v7_0 | French | N/A | CC-0 |
| Common Voice French v8_0 | French | N/A | CC-0 |
| Common Voice Georgian v7_0 | Georgian | N/A | CC-0 |
| Common Voice German v16_1 | German | N/A | CC-0 |
| Common Voice German v7_0 | German | N/A | CC-0 |
| Common Voice German v8_0 | German | N/A | CC-0 |
| Common Voice Greek v7_0 | Greek | N/A | CC-0 |
| Common Voice Guarani v7_0 | Guarani | N/A | CC-0 |
| Common Voice Hausa v7_0 | Hausa | N/A | CC-0 |
| Common Voice Hausa v8_0 | Hausa | N/A | CC-0 |
| Common Voice Hausa v9_0 | Hausa | N/A | CC-0 |
| Common Voice Hindi v7_0 | Hindi | N/A | CC-0 |
| Common Voice Hungarian v7_0 | Hungarian | N/A | CC-0 |
| Common Voice Indonesian v7_0 | Indonesian | N/A | CC-0 |
| Common Voice Italian v7_0 | Italian | N/A | CC-0 |
| Common Voice Japanese v12_0 | Japanese | N/A | CC-0 |
| Common Voice Japanese v7_0 | Japanese | N/A | CC-0 |
| Common Voice Japanese v8_0 | Japanese | N/A | CC-0 |
| Common Voice Japanese v9_0 | Japanese | N/A | CC-0 |
| Common Voice Kazakh v7_0 | Kazakh | N/A | CC-0 |
| Common Voice Korean v16_1 | Korean | N/A | CC-0 |
| Common Voice Kurmanji v7_0 | Kurmanji | N/A | CC-0 |
| Common Voice Kyrgyz v7_0 | Kyrgyz | N/A | CC-0 |
| Common Voice Maltese v7_0 | Maltese | N/A | CC-0 |
| Common Voice Polish v7_0 | Polish | N/A | CC-0 |
| Common Voice Polish v8_0 | Polish | N/A | CC-0 |
| Common Voice Portuguese v7_0 | Portuguese | Brazil;Portugal | CC-0 |
| Common Voice Portuguese v8_0 | Portuguese | Brazil;Portugal | CC-0 |
| Common Voice Punjabi v7_0 | Punjabi | N/A | CC-0 |
| Common Voice Romanian v7_0 | Romanian | N/A | CC-0 |
| Common Voice Russian v7_0 | Russian | N/A | CC-0 |
| Common Voice Russian v8_0 | Russian | N/A | CC-0 |
| Common Voice Russian v9_0 | Russian | N/A | CC-0 |
| Common Voice Serbian v8_0 | Croatian | N/A | CC-0 |
| Common Voice Serbian v9_0 | Croatian | N/A | CC-0 |
| Common Voice Sorbian Upper v7_0 | Sorbian | Upper | CC-0 |
| Common Voice Spanish v8_0 | Spanish | Latin America;Spain | CC-0 |
| Common Voice Swahili v8_0 | Swahili | N/A | CC-0 |
| Common Voice Swahili v9_0 | Swahili | N/A | CC-0 |
| Common Voice Swedish v7_0 | Swedish | N/A | CC-0 |
| Common Voice Swedish v8_0 | Swedish | N/A | CC-0 |
| Common Voice Tamil v7_0 | Tamil | N/A | CC-0 |
| Common Voice Tatar v7_0 | Tatar | N/A | CC-0 |
| Common Voice Thai v16_1 | Thai | N/A | CC-0 |
| Common Voice Thai v7_0 | Thai | N/A | CC-0 |
| Common Voice Thai v8_0 | Thai | N/A | CC-0 |
| Common Voice Thai v9_0 | Thai | N/A | CC-0 |
| Common Voice Turkish v16_1 | Turkish | N/A | CC-0 |
| Common Voice Turkish v7_0 | Turkish | N/A | CC-0 |
| Common Voice Turkish v8_0 | Turkish | N/A | CC-0 |
| Common Voice Turkish v9_0 | Turkish | N/A | CC-0 |
| Common Voice Ukrainian v16_1 | Ukrainian | N/A | CC-0 |
| Common Voice Ukrainian v7_0 | Ukrainian | N/A | CC-0 |
| Common Voice Ukrainian v8_0 | Ukrainian | N/A | CC-0 |
| Common Voice Ukrainian v9_0 | Ukrainian | N/A | CC-0 |
| Common Voice Urdu v7_0 | Urdu | N/A | CC-0 |
| Common Voice Uyghur v7_0 | Uyghur | N/A | CC-0 |
| Common Voice Uzbek v7_0 | Uzbek | N/A | CC-0 |
| Common Voice Vietnamese v17_0 | Vietnamese | N/A | CC-0 |
| Common Voice Vietnamese v7_0 | Vietnamese | N/A | CC-0 |
| Common Voice Vietnamese v8_0 | Vietnamese | N/A | CC-0 |
| Common Voice Vietnamese v9_0 | Vietnamese | N/A | CC-0 |
| Corpus of Regional African American Language v2021_07 | English | US | CC BY-NC-SA 4.0 |
| Czech Parliament Meetings | Czech | N/A | CC BY-NC-ND 3.0 |
| Deeply Korean read speech corpus public sample | Korean | N/A | CC BY-NC-ND 4.0 |
| GlobalPhone Arabic v3_1 | Arabic | N/A | ELRA |
| GlobalPhone Bulgarian v3_1 | Bulgarian | N/A | ELRA |
| GlobalPhone Chinese-Mandarin v3_1 | Mandarin | China;Erhua | ELRA |
| GlobalPhone Croatian v3_1 | Croatian | N/A | ELRA |
| GlobalPhone Czech v3_1 | Czech | N/A | ELRA |
| GlobalPhone French v3_1 | French | N/A | ELRA |
| GlobalPhone German v3_1 | German | N/A | ELRA |
| GlobalPhone Hausa v3_1 | Hausa | N/A | ELRA |
| GlobalPhone Japanese v3_1 | Japanese | N/A | ELRA |
| GlobalPhone Korean v3_1 | Korean | N/A | ELRA |
| GlobalPhone Polish v3_1 | Polish | N/A | ELRA |
| GlobalPhone Portuguese (Brazilian) v3_1 | Portuguese | Brazil | ELRA |
| GlobalPhone Russian v3_1 | Russian | N/A | ELRA |
| GlobalPhone Spanish (Latin American) v3_1 | Spanish | Latin America | ELRA |
| GlobalPhone Swahili v3_1 | Swahili | N/A | ELRA |
| GlobalPhone Swedish v3_1 | Swedish | N/A | ELRA |
| GlobalPhone Thai v3_1 | Thai | N/A | ELRA |
| GlobalPhone Turkish v3_1 | Turkish | N/A | ELRA |
| GlobalPhone Ukrainian v3_1 | Ukrainian | N/A | ELRA |
| GlobalPhone Vietnamese v3_1 | Vietnamese | Hanoi;Ho Chi Minh City | ELRA |
| Google i18n Chile | Spanish | Latin America | CC BY-SA 4.0 |
| Google i18n Columbia | Spanish | Latin America | CC BY-SA 4.0 |
| Google i18n Peru | Spanish | Latin America | CC BY-SA 4.0 |
| Google i18n Puerto Rico | Spanish | Latin America | CC BY-SA 4.0 |
| Google i18n Venezuela | Spanish | Latin America | CC BY-SA 4.0 |
| Google Nigerian English | English | Nigeria | CC BY-SA 4.0 |
| Google UK and Ireland English | English | UK | CC BY-SA 4.0 |
| Gowajee Corpus v0_9_3 | Thai | N/A | MIT |
| ICE-Nigeria | English | Nigeria | CC BY-NC-SA 3.0 |
| Japanese Versatile Speech | Japanese | N/A | CC BY-SA 4.0 |
| Korean Single Speaker Speech Dataset | Korean | N/A | CC BY-NC-SA 4.0 |
| L2-ARCTIC | English | N/A | CC BY-NC 4.0 |
| LaboroTV Japanese v1_0d | Japanese | N/A | LaboroTV Non-commercial |
| Large Corpus of Czech Parliament Plenary Hearings | Czech | N/A | CC BY 4.0 |
| LibriSpeech English | English | US | CC BY 4.0 |
| Lotus Corpus v1_0 | Thai | N/A | CC BY-NC-SA 3.0 |
| M-AILABS Polish | Polish | N/A | M-AILABS License |
| M-AILABS Russian | Russian | N/A | M-AILABS License |
| M-AILABS Spanish | Spanish | Latin America;Spain | M-AILABS License |
| M-AILABS Ukrainian | Ukrainian | N/A | M-AILABS License |
| MediaSpeech Arabic v1_1 | Arabic | N/A | CC BY 4.0 |
| MediaSpeech Turkish v1_1 | Turkish | N/A | CC BY 4.0 |
| Microsoft Speech Language Translation Japanese | Japanese | N/A | Microsoft Research Data License |
| Multilingual LibriSpeech French | French | N/A | CC BY 4.0 |
| Multilingual LibriSpeech German | German | N/A | CC BY 4.0 |
| Multilingual LibriSpeech Polish | Polish | N/A | CC BY 4.0 |
| Multilingual LibriSpeech Portuguese | Portuguese | Portugal | CC BY 4.0 |
| Multilingual LibriSpeech Spanish | Spanish | Spain | CC BY 4.0 |
| Multilingual TEDx Portuguese | Portuguese | Portugal | CC BY-NC-ND 4.0 |
| Multilingual TEDx Russian | Russian | N/A | CC BY-NC-ND 4.0 |
| NCHLT English | English | Nigeria;UK | CC BY 3.0 |
| NST Swedish | Swedish | N/A | CC-0 |
| Pansori TEDxKR | Korean | N/A | CC BY-NC-ND 4.0 |
| ParlaSpeech | Croatian | N/A | CC BY-SA 4.0 |
| Russian LibriSpeech | Russian | N/A | Public domain in the USA |
| Seoul Corpus | Korean | N/A | CC BY-NC 2.0 |
| TEDxJP-10K v1_1 | Japanese | N/A | Apache 2.0 |
| Thai Elderly Speech dataset by Data Wow and VISAI v1_0_0 | Thai | N/A | CC BY-SA 4.0 |
| THCHS-30 | Mandarin | China;Erhua | Apache 2.0 |
| TIMIT | English | US | LDC License |
| VIVOS | Vietnamese | Ho Chi Minh City | CC BY-NC-SA 4.0 |
| VoxPopuli Croatian | Croatian | N/A | CC-0 |
| VoxPopuli Czech | Czech | N/A | CC-0 |
| VoxPopuli Polish | Polish | N/A | CC-0 |
| Zeroth Korean | Korean | N/A | CC BY 4.0 |