Datasets

Filters:
Search results for “breton”
Common Voice

Common Voice Spontaneous Speech 2.0 - Breton

A collection of spontaneous spoken phrases in Breton.
License: CC0-1.0

Locale: br

Task: ASR

Format: MP3

Size: 13.57 MB

Common Voice

Common Voice Scripted Speech 24.0 - Breton

A collection of scripted spoken phrases in Breton.
License: CC0-1.0

Locale: br

Task: ASR

Format: MP3

Size: 759.42 MB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 1.1

This dataset is an updated version of the "Bamun-French Parallel Corpus", a parallel corpus of texts in Bamun (Shupament) and French.
License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 99.78 KB

Institute of African Digital Humanities

Ewondo-French Parallel Corpus

This dataset is a parallel corpus of Ewondo and French texts. The texts were obtained by transcribing raw audio files. Translations were added to enrich the original corpus, and the Ewondo and French texts were aligned in the process of creating this dataset.
License: NOODL-1.0

Locale: ewo, fr

Task: MT

Format: TSV

Size: 137.84 KB

MDC Community Concierge

Bangor Patagonia Welsh-Spanish Corpus

A Welsh-Spanish corpus containing around 195,000 words.
License: GPL-3.0

Locale: cym, spa

Task: ASR

Format: MP3, CHA, TSV

Size: 988.02 MB

MDC Community Concierge

CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes

Over 11 million words (14.4 million tokens) from written, spoken and electronic Welsh-language sources, taken from a range of genres, language varieties and contexts.
License: CC-BY-NC-SA-4.0

Locale: cy

Task: NLP

Format: TXT, TSV

Size: 147.89 MB

MDC Community Concierge

Bangor Siarad Welsh-English Corpus

A Welsh-English bilingual speech corpus with 40 hours of recorded audio and transcriptions totalling 450,000 words.
License: GPL-3.0

Locale: cym

Task: ASR

Format: MP3, CHA, TSV

Size: 2.13 GB

Common Voice

Common Voice Scripted Speech 24.0 - Irish

A collection of scripted spoken phrases in Irish.
License: CC0-1.0

Locale: ga-IE

Task: ASR

Format: MP3

Size: 357.21 MB

Common Voice

Common Voice Spontaneous Speech 2.0 - Irish

A collection of spontaneous spoken phrases in Irish.
License: CC0-1.0

Locale: ga-IE

Task: ASR

Format: MP3

Size: 3.13 MB

Institute of African Digital Humanities

Mada-French Parallel Corpus 1.0

This dataset comprises a parallel corpus of 2,154 lines of translations of literary texts from Mada (mxu) to French.
License: NOODL-1.0

Locale: mxu

Task: TTS

Format: TSV

Size: 122.37 KB

Common Voice

Common Voice Scripted Speech 24.0 - Occitan

A collection of scripted spoken phrases in Occitan.
License: CC0-1.0

Locale: oc

Task: ASR

Format: MP3

Size: 261.41 MB

Common Voice

Common Voice Scripted Speech 24.0 - French

A collection of scripted spoken phrases in French.
License: CC0-1.0

Locale: fr

Task: ASR

Format: MP3

Size: 28.05 GB

Common Voice

Common Voice Spontaneous Speech 2.0 - French

A collection of spontaneous spoken phrases in French.
License: CC0-1.0

Locale: fr

Task: ASR

Format: MP3

Size: 25.04 MB

Institute of African Digital Humanities

Spoken-Congolese-French-Dataset

The dataset consists of paired audio and text resources on spoken French from the Republic of the Congo. The audio files were extracted from longer recordings of semi-guided interviews conducted in Brazzaville, and orthographic transcriptions were added. The long recordings and their corresponding TRJS transcription files were clipped automatically, so that each audio clip is paired with its transcription. The dataset comprises ten folders containing audio files and ten audio/text mapping files.
License: NOODL-1.0

Locale: fr-CG

Task: NLP

Format: MP3, WAV, TSV

Size: 3.44 GB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.
License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

Pro Svizra Rumantscha

Putèr Newspaper Corpus

1.3 million tokens in the Putèr variety of Romansh from the daily newspaper “La Quotidiana”.
License: CC0-1.0

Locale: rm-puter

Task: OTH

Format: TSV

Size: 8.94 MB

Common Voice

Common Voice Scripted Speech 24.0 - Manx

A collection of scripted spoken phrases in Manx.
License: CC0-1.0

Locale: gv

Task: ASR

Format: MP3

Size: 214.45 MB

Common Voice

Common Voice Spontaneous Speech 2.0 - Manx

A collection of spontaneous spoken phrases in Manx.
License: CC0-1.0

Locale: gv

Task: ASR

Format: MP3

Size: 15.35 MB

Common Voice

Common Voice Scripted Speech 24.0 - Ouldémé

A collection of scripted spoken phrases in Ouldémé.
License: CC0-1.0

Locale: udl

Task: ASR

Format: MP3

Size: 217.00 MB

Balochistan Educational and Cultural Organization

Western Balochi Literature Corpus

A cleaned literary corpus in Western Balochi (Rakhshani), including articles, research, translations, and creative literature (~1.1 million tokens).
License: CC-BY-NC-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 2.26 MB

Pro Svizra Rumantscha

Rumantsch Grischun Newspaper Corpus

6.1 million tokens in the Rumantsch Grischun variety of Romansh from the daily newspaper “La Quotidiana”.
License: CC0-1.0

Locale: rm-rumgr

Task: OTH

Format: TSV

Size: 19.03 MB

Common Voice

Common Voice Scripted Speech 24.0 - Ebrie

A collection of scripted spoken phrases in Ebrie.
License: CC0-1.0

Locale: ebr

Task: ASR

Format: MP3

Size: 61.92 MB

Balochistan Educational and Cultural Organization

Talar (تلار) Barahui Magazine Corpus

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.
License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 317.22 KB

Common Voice

Common Voice Scripted Speech 24.0 - Welsh

A collection of scripted spoken phrases in Welsh.
License: CC0-1.0

Locale: cy

Task: ASR

Format: MP3

Size: 3.87 GB