Common Voice Spontaneous Speech 3.0 - Turkish

License: CC0-1.0

Steward: Common Voice

Task: ASR

Release Date: 3/22/2026

Format: MP3

Size: 5.58 MB

Description

A collection of spontaneous responses to questions in Turkish (Türkçe).

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None provided.

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Türkçe — Turkish (`tr`)

This datasheet is for sps-corpus-3.0-2026-03-09 of the Mozilla Common Voice Spontaneous Speech dataset for Turkish [Türkçe - tr]. The dataset contains 46 clips representing 0.27 hours of recorded speech (0.14 hours validated) from 10 speakers.

Language

Turkish is the most widely spoken of the Turkic languages and has around 100 million L1 speakers, which makes it the 18th most spoken language. It is the national language of Turkey, one of the two official languages of Cyprus, and a secondary language in some neighboring countries. Many smaller groups of speakers exist in other countries, as a result of migration or as communities dating from the Ottoman era. The speech of these smaller groups should usually be categorized as a variant.

Variants

There are currently no variants defined for the Common Voice Turkish dataset. It is worth noting that, until now, this dataset has focused on literary Turkish, often called "Turkish of Turkey". There are also some L2 voices, mostly from immigrants to the country, but these can be categorized as "foreign accents".

Data splits for modelling

The dataset clips are categorised by transcription status and training-set assignment. The following tables summarise the distribution.

Audio clips

Bucket	Clips	%
Transcribed & Validated	22	47.8%
Transcribed & Pending	0	0.0%
Not transcribed	24	52.2%

Training splits

Bucket	Clips	%
Train	0	0.0%
Dev	0	0.0%
Test	0	0.0%
Unassigned	46	100.0%

Training split coverage: 0 of 22 transcribed & validated clips (0.0%)
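The percentages in the tables above follow directly from the clip counts. As a minimal sketch (the helper name `percentages` is illustrative, not part of any Common Voice tooling), the audio clip shares can be recomputed like this:

```python
# Clip counts copied from the "Audio clips" table above.
transcription_buckets = {
    "Transcribed & Validated": 22,
    "Transcribed & Pending": 0,
    "Not transcribed": 24,
}


def percentages(buckets):
    """Return each bucket's share of the total, rounded to one decimal place."""
    total = sum(buckets.values())
    if total == 0:
        # Avoid dividing by zero for an empty corpus.
        return {name: 0.0 for name in buckets}
    return {name: round(100 * count / total, 1) for name, count in buckets.items()}


shares = percentages(transcription_buckets)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The same function applied to the training split counts gives 100.0% for Unassigned, since none of the 46 clips has been assigned to Train, Dev, or Test.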

Transcriptions

Transcription status

Bucket	Clips	%
Validated	22	100.0%
Pending	0	0.0%
Edited	8	36.4%

Writing system

Turkish uses an extended Latin alphabet.

Symbol table

Official Alphabet:

Lowercase: a b c ç d e f g ğ h ı i j k l m n o ö p r s ş t u ü v y z
Uppercase: A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z

Auxiliary Characters (Arabic/Farsi loanwords): â î û Â Î Û
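The symbol table can be used to flag letters in a transcription that fall outside the expected alphabet. The following is a rough sketch only: the function name `unexpected_letters` is hypothetical, and the check ignores digits, punctuation, and whitespace. Explicit character sets are used instead of `str.lower()`, because Python's default case mapping does not follow the Turkish dotted/dotless i rules.

```python
import unicodedata

# Characters copied from the symbol table above.
LOWERCASE = "abcçdefgğhıijklmnoöprsştuüvyz"
UPPERCASE = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
AUXILIARY = "âîûÂÎÛ"
ALLOWED = set(LOWERCASE + UPPERCASE + AUXILIARY)


def unexpected_letters(text):
    """Return the set of alphabetic characters in `text` not in the symbol table."""
    # Normalise to NFC so that letters such as "ç" are single code points.
    normalised = unicodedata.normalize("NFC", text)
    return {ch for ch in normalised if ch.isalpha() and ch not in ALLOWED}


print(unexpected_letters("Valla, en önemli şey iş etiği bence"))
print(unexpected_letters("taxi quiz"))
```

Letters such as q, w, and x are not part of the official alphabet, so they are reported; whether they should be treated as errors (for example in foreign names) is a judgment left to the user.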

Samples

Questions

There follows a randomly selected sample of questions used in the corpus.

Dünyanın en büyük sorunları sence hangileri?
Yurtdışında en çok hangi şehirleri gezmek isterdin?
Farklı uçlarda düşünceleri olan insanlar sence ortak paydada buluşabilir mi?
İş hayatında en çok değer verdiğin şey nedir?
Çocukken en çok ne olmayı hayal ederdin ve bu hayalin değişti mi?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

On yıldır bir tatil yöresinde yaşıyorum zaten, o yüzden pek tatile gitme ihtiyacı hissetmiyorum. Ama geçen yıl üç günlüğüne bir otelde arkadaşımla birlikte konakladık.
Ben sabahları genelde uyurum çünkü geceleri çalışırım. Uyandıktan sonra da... en az iki, ve hatta bazı günlerde dört tane kahvede- kahve içmeden de kendime gelemeeem. Yani, enerjik olmamı sağlayacak tek şey, kahve benim için.
Valla, en önemli şey iş etiği bence... Birçok insan çok katakulli işler yapıyor iş ortamlarında ve beni çıldırtıyorlar.
Valla çok fazla bilimkurgu ve fantastik kitap ve film okudum, izledim. Bunlarda genellikle kötü adamların istediği birşey var; bütün güçlere sahip olmak! İşte onu isterdim. Çünkü bu dünya eğer o bütün dünlere- bütün güçlere sahip olmazsam düzelebilecek birşey değil. Hepsi bende olursa bu dünyayı düzeltebilirim.
Vallaha önce, atom mühendisi olmak isterdim. Benim çocukluğumda atom araştırmaları çok gündemdeydi. Hemen arkasından ee, aya gidiş söz konusu oldu ee, 1969'da, eee ve ben tabi astronot olmak istedim, ee bu kadar yıl sonra tabii bu hayeller değişti, öyle, geçmişte kalan güzel anılar haline geldi.

Fields

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker [1]
gender - gender of the speaker [1]
language - language name
split - for data modelling, which subset of the data does this clip pertain to
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated assessments of the transcription-audio pair, separated by |
- transcription-length - fewer than 3 characters per second
- speech-rate - more than 30 characters per second
- short-audio - audio length under 2 seconds
- long-audio - audio length over 5 minutes
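The field list above can be illustrated with a short sketch that reads tab-separated rows and splits `quality_tags` on `|`. The sample rows, the file names in them, and the function names `read_clips` and `derive_quality_tags` are made up for illustration. The thresholds are taken from the tag descriptions above, but how the release pipeline actually computes the tags is not documented here, so `derive_quality_tags` is an assumption rather than the official logic.

```python
import csv
import io

# Hypothetical sample with a subset of the columns; not real corpus data.
SAMPLE = (
    "audio_file\tduration_ms\ttranscription\tchar_per_sec\tquality_tags\n"
    "clip-a.mp3\t1500\tEvet.\t3.33\tshort-audio\n"
    "clip-b.mp3\t20000\tValla, en önemli şey iş etiği bence.\t1.8\ttranscription-length\n"
)


def derive_quality_tags(char_per_sec, duration_ms):
    """Recompute tags from the thresholds listed in the field descriptions."""
    tags = []
    if char_per_sec < 3:
        tags.append("transcription-length")
    if char_per_sec > 30:
        tags.append("speech-rate")
    if duration_ms < 2_000:  # under 2 seconds
        tags.append("short-audio")
    if duration_ms > 5 * 60 * 1_000:  # over 5 minutes
        tags.append("long-audio")
    return tags


def read_clips(tsv_file):
    """Yield one dict per clip, with numeric fields and tags parsed."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["duration_ms"] = int(row["duration_ms"])
        # Empty tag fields become an empty list.
        row["quality_tags"] = [t for t in row["quality_tags"].split("|") if t]
        yield row


clips = list(read_clips(io.StringIO(SAMPLE)))
for clip in clips:
    print(clip["audio_file"], clip["quality_tags"])
```

For a real file, the same reader could be given a handle opened with `open(path, encoding="utf-8", newline="")`. Clips that have not been transcribed may have an empty `char_per_sec` value, so that column is left as a string here rather than converted.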

Get involved

Community links

Main Channels:

Social media channels used during campaigns:

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC0-1.0) licence. By downloading this data you agree not to attempt to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.