Common Voice Spontaneous Speech 3.0 - Turkish
License: CC0-1.0
Steward: Common Voice
Task: ASR
Release Date: 3/22/2026
Format: MP3
Size: 5.58 MB
Description
A collection of spontaneous responses to questions in Turkish (Türkçe).
Specifics
Considerations
Restrictions/Special Constraints
None provided.
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Processes
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
Metadata
Türkçe — Turkish (tr)
This datasheet is for sps-corpus-3.0-2026-03-09 of the Mozilla Common Voice Spontaneous Speech dataset for Turkish [Türkçe - tr]. The dataset contains 46 clips representing 0.27 hours of recorded speech (0.14 hours validated) from 10 speakers.
Language
Turkish is the most widely spoken of the Turkic languages, with around 100 million L1 speakers, which makes it the 18th most spoken language. It is the national language of Turkey, one of the two official languages of Cyprus, and a secondary language in some neighboring countries. Many smaller Turkish-speaking groups exist in other countries, as a result of migration or as communities dating from the Ottoman era. These smaller groups should usually be categorized as variants.
Variants
There are currently no variants defined for the Common Voice Turkish dataset. It is worth noting that, until now, this dataset has focused on literary Turkish, often called "Turkish of Turkey". There are also some L2 voices, mostly from immigrants to the country, but these can be categorized as "foreign accents".
Data splits for modelling
The dataset clips are categorised by transcription status and training-set assignment. The following tables summarise the distribution.
Audio clips
| Bucket | Clips | % |
|---|---|---|
| Transcribed & Validated | 22 | 47.8% |
| Transcribed & Pending | 0 | 0.0% |
| Not transcribed | 24 | 52.2% |
Training splits
| Bucket | Clips | % |
|---|---|---|
| Train | 0 | 0.0% |
| Dev | 0 | 0.0% |
| Test | 0 | 0.0% |
| Unassigned | 46 | 100.0% |
Training split coverage: 0 of 22 transcribed & validated clips (0.0%)
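The percentages in the tables above follow directly from the raw clip counts. As a minimal sketch (the helper name `share` is hypothetical and not part of any Common Voice tooling), they can be reproduced like this:

```python
def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1) if whole else 0.0

# Counts copied from the tables in this datasheet.
total_clips = 46
validated = 22
not_transcribed = 24
assigned_to_splits = 0  # train + dev + test

print(f"Transcribed & Validated: {share(validated, total_clips)}%")
print(f"Not transcribed:         {share(not_transcribed, total_clips)}%")
print(f"Training split coverage: {share(assigned_to_splits, validated)}%")
```

Because every clip is currently unassigned, anyone training or evaluating a model on this release would need to define their own split over the 22 validated clips.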
Transcriptions
Transcription status
| Bucket | Clips | % |
|---|---|---|
| Validated | 22 | 100.0% |
| Pending | 0 | 0.0% |
| Edited | 8 | 36.4% |
Writing system
Turkish uses an extended Latin alphabet.
Symbol table
Official Alphabet:
Lowercase:
a b c ç d e f g ğ h ı i j k l m n o ö p r s ş t u ü v y z
Uppercase:
A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z
Auxiliary Characters (Arabic/Farsi loanwords): â î û Â Î Û
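The dotted and dotless forms of "i" in this alphabet matter when normalising transcriptions: generic, locale-unaware lowercasing maps "I" to "i" rather than "ı". The following is a minimal sketch, not part of the Common Voice tooling, of Turkish-aware lowercasing and a check for letters outside the symbol table. The function names `turkish_lower` and `unexpected_letters` are hypothetical.

```python
# Symbol table copied from this datasheet.
LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
AUXILIARY = "âîûÂÎÛ"
ALPHABET = set(LOWER) | set(UPPER) | set(AUXILIARY)

def turkish_lower(text):
    """Lowercase text using the Turkish I/ı and İ/i pairing."""
    # Handle the two special pairs first, then fall back to str.lower().
    return text.replace("I", "ı").replace("İ", "i").lower()

def unexpected_letters(text):
    """Return the set of letters in text that are not in the symbol table."""
    return {ch for ch in text if ch.isalpha() and ch not in ALPHABET}

print(turkish_lower("IŞIK"))           # Turkish-aware lowercasing
print("IŞIK".lower())                  # generic lowercasing, for comparison
print(unexpected_letters("taxi wow"))  # q, w and x are not in the alphabet
```

Letters flagged by `unexpected_letters` are not necessarily errors; foreign names and loanwords can legitimately contain them, so the check is best treated as a prompt for manual review.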
Samples
Questions
There follows a randomly selected sample of questions used in the corpus.
Dünyanın en büyük sorunları sence hangileri?
Yurtdışında en çok hangi şehirleri gezmek isterdin?
Farklı uçlarda düşünceleri olan insanlar sence ortak paydada buluşabilir mi?
İş hayatında en çok değer verdiğin şey nedir?
Çocukken en çok ne olmayı hayal ederdin ve bu hayalin değişti mi?
Responses
There follows a randomly selected sample of transcribed responses from the corpus.
On yıldır bir tatil yöresinde yaşıyorum zaten, o yüzden pek tatile gitme ihtiyacı hissetmiyorum. Ama geçen yıl üç günlüğüne bir otelde arkadaşımla birlikte konakladık.
Ben sabahları genelde uyurum çünkü geceleri çalışırım. Uyandıktan sonra da... en az iki, ve hatta bazı günlerde dört tane kahvede- kahve içmeden de kendime gelemeeem. Yani, enerjik olmamı sağlayacak tek şey, kahve benim için.
Valla, en önemli şey iş etiği bence... Birçok insan çok katakulli işler yapıyor iş ortamlarında ve beni çıldırtıyorlar.
Valla çok fazla bilimkurgu ve fantastik kitap ve film okudum, izledim. Bunlarda genellikle kötü adamların istediği birşey var; bütün güçlere sahip olmak! İşte onu isterdim. Çünkü bu dünya eğer o bütün dünlere- bütün güçlere sahip olmazsam düzelebilecek birşey değil. Hepsi bende olursa bu dünyayı düzeltebilirim.
Vallaha önce, atom mühendisi olmak isterdim. Benim çocukluğumda atom araştırmaları çok gündemdeydi. Hemen arkasından ee, aya gidiş söz konusu oldu ee, 1969'da, eee ve ben tabi astronot olmak istedim, ee bu kadar yıl sonra tabii bu hayeller değişti, öyle, geçmişte kalan güzel anılar haline geldi.
Fields
Each row of a tsv file represents a single audio clip, and contains the following information:
- client_id: hashed UUID of a given user
- audio_id: numeric id for audio file
- audio_file: audio file name
- duration_ms: duration of audio in milliseconds
- prompt_id: numeric id for prompt
- prompt: question for user
- transcription: transcription of the audio response
- votes: number of people who approved a given transcript
- age: age of the speaker [1]
- gender: gender of the speaker [1]
- language: language name
- split: for data modelling, the subset of the data this clip belongs to
- char_per_sec: how many characters of transcription per second of audio
- quality_tags: some automated assessment of the transcription-audio pair, separated by |
  - transcription-length: fewer than 3 characters per second
  - speech-rate: more than 30 characters per second
  - short-audio: audio length under 2 seconds
  - long-audio: audio length over 5 minutes
Get involved
Community links
Main Channels:
Social media channels used during campaigns:
Discussions
Contribute
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.
Footnotes
[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.