Common Voice Scripted Speech 25.0 - Chinese (Taiwan)
License:
CC0-1.0
Steward:
Common Voice
Task: ASR
Release Date: 3/23/2026
Format: MP3
Size: 2.95 GB
Description
A collection of read speech recordings in Chinese (Taiwan) (華語(台灣)).
Specifics
Considerations
Restrictions/Special Constraints
None provided.
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Processes
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
Metadata
華語(台灣) — Chinese (Taiwan) (zh-TW)
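Mozilla Common Voice cv-corpus-25.0-2026-03-09 Chinese (Taiwan) (zh-TW) scripted speech corpus.
This corpus contains 131.33 hours of recordings from 2,317 speakers, of which 79.68 hours have been validated (confirmed by two other participants), drawn from 21,763 text sentences.
Language
National language of the Republic of China / Taiwan Mandarin (cmn-TW). The corpus text is Traditional Chinese as used in Taiwan (zh-TW).
Accents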
| Code | Accent | Clips | Speakers |
|---|---|---|---|
| taipei_city | Birthplace: Taipei City | 19,646 (14.0%) | 107 (4.6%) |
| new_taipei_city | Birthplace: New Taipei City | 8,850 (6.3%) | 62 (2.7%) |
| taichung_city | Birthplace: Taichung City | 4,411 (3.1%) | 47 (2.0%) |
| kaohsiung_city | Birthplace: Kaohsiung City | 3,266 (2.3%) | 42 (1.8%) |
| taoyuan_city | Birthplace: Taoyuan City | 3,015 (2.1%) | 23 (1.0%) |
| hsinchu_city | Birthplace: Hsinchu City | 2,866 (2.0%) | 11 (0.5%) |
| yunlin_county | Birthplace: Yunlin County | 2,560 (1.8%) | 8 (0.3%) |
| nantou_county | Birthplace: Nantou County | 2,101 (1.5%) | 7 (0.3%) |
| changhua_county | Birthplace: Changhua County | 2,009 (1.4%) | 22 (0.9%) |
| tainan_city | Birthplace: Tainan City | 1,708 (1.2%) | 21 (0.9%) |
| chiayi_city | Birthplace: Chiayi City | 1,195 (0.8%) | 5 (0.2%) |
| pingtung_county | Birthplace: Pingtung County | 913 (0.6%) | 6 (0.3%) |
| hualien_county | Birthplace: Hualien County | 878 (0.6%) | 5 (0.2%) |
| yilan_county | Birthplace: Yilan County | 765 (0.5%) | 8 (0.3%) |
| hong_kong | Hong Kong | 690 (0.5%) | 26 (1.1%) |
| chiayi_county | Birthplace: Chiayi County | 379 (0.3%) | 7 (0.3%) |
| hsinchu_county | Birthplace: Hsinchu County | 343 (0.2%) | 8 (0.3%) |
| keelung_city | Birthplace: Keelung City | 141 (0.1%) | 10 (0.4%) |
| kinmen_county | Birthplace: Kinmen County | 55 (0.0%) | 1 (0.0%) |
| penghu_county | Birthplace: Penghu County | 20 (0.0%) | 2 (0.1%) |
| miaoli_county | Birthplace: Miaoli County | 15 (0.0%) | 2 (0.1%) |
| taitung_county | Birthplace: Taitung County | 10 (0.0%) | 2 (0.1%) |
| - | Other | 5,017 (3.6%) | 21 (0.9%) |
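Statistics
This dataset includes the following self-reported age and gender distributions. A coverage summary appears below each table.
Gender
Gender information as self-declared by speakers. The table shows the number of clips and the number of speakers, each with its percentage. Speakers who did not declare a gender are listed as "Unspecified". A dash (-) indicates zero.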
| Code | Gender | Clips | Speakers |
|---|---|---|---|
| male_masculine | Male, masculine | 68,527 (48.7%) | 598 (25.8%) |
| female_feminine | Female, feminine | 31,056 (22.1%) | 258 (11.1%) |
| transgender | Transgender | 100 (0.1%) | 1 (0.0%) |
| non-binary | Non-binary | - | - |
| do_not_wish_to_say | Prefer not to say | 25 (0.0%) | 2 (0.1%) |
| - | Unspecified | 40,922 (29.1%) | 1,575 (68.0%) |
Gender declared: 99,708 of 140,630 clips (70.9%), 742 of 2,317 speakers (32.0%)
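Age
Age-group information as self-declared by speakers. The table shows the number of clips and the number of speakers, each with its percentage. Speakers who did not declare an age are listed as "Unspecified". A dash (-) indicates zero.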
| Code | Age | Clips | Speakers |
|---|---|---|---|
| teens | Teens | 8,440 (6.0%) | 82 (3.5%) |
| twenties | Twenties | 41,664 (29.6%) | 451 (19.5%) |
| thirties | Thirties | 27,021 (19.2%) | 231 (10.0%) |
| fourties | Forties | 12,771 (9.1%) | 105 (4.5%) |
| fifties | Fifties | 12,587 (9.0%) | 27 (1.2%) |
| sixties | Sixties | 431 (0.3%) | 3 (0.1%) |
| seventies | Seventies | 30 (0.0%) | 4 (0.2%) |
| eighties | Eighties | - | - |
| nineties | Nineties | - | - |
| - | Unspecified | 37,686 (26.8%) | 1,540 (66.5%) |
Age declared: 102,944 of 140,630 clips (73.2%), 777 of 2,317 speakers (33.5%)
Data splits (for model training)
Clip buckets
| Bucket | Clips |
|---|---|
| Validated | 85,324 (60.7%) |
| Invalidated | 4,920 (3.5%) |
| Other | 50,386 (35.8%) |
Training splits
| Split | Clips |
|---|---|
| Train | 7,394 (8.7%) |
| Dev | 5,119 (6.0%) |
| Test | 5,119 (6.0%) |
Training split coverage: 17,632 of 85,324 validated clips (20.7%)
This dataset contains 85,324 validated clips, 4,920 invalidated clips, and 50,386 clips pending review. The average clip duration is 3.362 seconds.
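The percentages and the average duration above can be cross-checked with a little arithmetic. The following is a minimal sketch that only re-derives figures already stated in this card; the variable names are illustrative. It also makes explicit that the split percentages are relative to validated clips, not to all clips.

```python
# Counts copied from the "Clip buckets" and "Training splits" tables.
buckets = {"validated": 85_324, "invalidated": 4_920, "other": 50_386}
splits = {"train": 7_394, "dev": 5_119, "test": 5_119}
TOTAL_HOURS = 131.33  # total recorded hours stated in the Metadata section


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


total_clips = sum(buckets.values())
split_clips = sum(splits.values())

# Bucket shares are relative to all clips.
bucket_pct = {name: pct(n, total_clips) for name, n in buckets.items()}
# Split shares are relative to validated clips only.
split_pct = {name: pct(n, buckets["validated"]) for name, n in splits.items()}
coverage_pct = pct(split_clips, buckets["validated"])

# Average clip duration in seconds, from total hours and total clips.
mean_clip_seconds = round(TOTAL_HOURS * 3600 / total_clips, 3)

print(total_clips, split_clips, coverage_pct, mean_clip_seconds)
# -> 140630 17632 20.7 3.362
```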
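Text corpus
Most of the Traditional Chinese text is curated in the MozTW CC0 sentence corpus.
The statistics below describe the text corpus; see that repository for details on how they were computed:
The corpus contains 3,573 distinct characters, covering about 85.6% of the Ministry of Education (教育部) 2015 common-character list at the 99.75% level (3,593 characters).
1,046 phonetic entries are covered, about 66.75% of all phonetic entries in the CnsPhonetic2016-08v2.cin table.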
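One plausible way to compute a character-coverage figure of this kind is a set intersection between the characters used in the corpus and a reference list. The sketch below is an assumption about the general approach, not the repository's actual script: the function names, the restriction to the U+4E00 to U+9FFF block, and the ten-character stand-in for the reference list are all hypothetical.

```python
def distinct_cjk_chars(sentences):
    """Collect distinct characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return {ch for s in sentences for ch in s if "\u4e00" <= ch <= "\u9fff"}


def coverage(corpus_chars, reference_chars):
    """Fraction of the reference list that appears at least once in the corpus."""
    reference = set(reference_chars)
    return len(corpus_chars & reference) / len(reference)


# Three of the sample sentences shown in this card.
sentences = ["需要認真考慮", "要有具體的想法", "還沒調參數"]
# Hypothetical stand-in for the reference common-character list.
reference_chars = "的要有想法數一是不了"

chars = distinct_cjk_chars(sentences)
ratio = coverage(chars, reference_chars)
print(len(chars), ratio)
```

With the real reference list loaded in place of the stand-in, the same ratio would correspond to the coverage percentage reported above, provided the repository counts characters the same way.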
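We urgently need more everyday sentences. You are welcome to donate your own writing in Mandarin; see the community channels below to get in touch.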
Validated sentences: 20,786
| Category | Count |
|---|---|
| Unvalidated sentences | 977 |
| Pending sentences | 137 |
| Rejected sentences | 840 |
| Reported sentences | 179 |
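This corpus contains 21,763 sentences: 20,786 validated and 977 unvalidated (137 pending review, 840 rejected). A further 179 sentences have been reported for review.
Samples
Five randomly selected sentences from the recordings: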
需要認真考慮
要有具體的想法
與其他的議事組安全組新聞組一樣
還沒調參數
下稅後的事情
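Sources
The text corpus was built jointly by the Mozilla Taiwan community (moztw.org), the g0v community, and volunteers from other open-source projects.
The speakers are mainly individual volunteers from Taiwan.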
| Source | Sentences |
|---|---|
| sentence-collector | 15,566 (74.9%) |
| setences | 2,897 (13.9%) |
| https://github.com/moztw/cc0-sentences/commit/01033097e6b3b2b2d58354bc55760e4bdbe19166 | 666 (3.2%) |
| https://github.com/moztw/cc0-sentences/commit/e340b6dad1e08b65c3fc7d72e1afd9544c3752d5 | 451 (2.2%) |
| taipei_city_gov | 355 (1.7%) |
| chatlogs | 309 (1.5%) |
| Other | 533 (2.6%) |
Text domains
| Code | Domain | Clips | Speakers |
|---|---|---|---|
| general | General | 1,502 (1.1%) | 84 (3.6%) |
| agriculture_food | Agriculture and Food | 12 (0.0%) | 7 (0.3%) |
| automotive_transport | Automotive and Transport | 278 (0.2%) | 45 (1.9%) |
| finance | Finance | 3 (0.0%) | 3 (0.1%) |
| service_retail | Service and Retail | 151 (0.1%) | 36 (1.6%) |
| healthcare | Healthcare | 25 (0.0%) | 18 (0.8%) |
| history_law_government | History, Law and Government | 170 (0.1%) | 39 (1.7%) |
| media_entertainment | Media and Entertainment | 170 (0.1%) | 44 (1.9%) |
| nature_environment | Nature and Environment | 14 (0.0%) | 12 (0.5%) |
| news_current_affairs | News and Current Affairs | 44 (0.0%) | 19 (0.8%) |
| technology_robotics | Technology and Robotics | 777 (0.6%) | 49 (2.1%) |
| language_fundamentals | Language Fundamentals | 8 (0.0%) | 7 (0.3%) |
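Fields
Clips
Each row of each tsv file represents one audio clip and contains the following fields:
- client_id - hashed UUID of the user
- path - relative path of the audio file
- text - expected transcription of the audio
- up_votes - number of people who judged the audio to match the text
- down_votes - number of people who judged the audio not to match the text
- age - age of the speaker [1]
- gender - gender of the speaker [1]
- accents - accents of the speaker [1]
- variant - variant of the language [1]
- segment - listed here if the sentence belongs to a custom dataset segment
- prompt_upvotes - number of upvotes the sentence prompt received
- prompt_reports - number of reports the sentence prompt received
- is_edited - whether the clip's transcription has been edited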
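As an illustration of how these columns might be consumed, here is a minimal sketch that parses tab-separated clip rows with Python's standard `csv` module. The two sample rows, their values, and the `majority_approved` rule are hypothetical; in particular, the rule is not Common Voice's validation criterion. For a real file, pass an open file handle instead of the in-memory sample.

```python
import csv
import io

# Hypothetical two-row sample using a subset of the column names listed above.
SAMPLE_TSV = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\taccents\n"
    "abc123\tclip_0001.mp3\t例句一\t2\t0\ttwenties\t\ttaipei_city\n"
    "def456\tclip_0002.mp3\t例句二\t1\t2\t\t\t\n"
)


def read_clips(handle):
    """Parse a clip TSV into a list of dicts keyed by column name."""
    # QUOTE_NONE: treat quote characters in the text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def majority_approved(row):
    """Illustrative filter: more up-votes than down-votes."""
    return int(row["up_votes"]) > int(row["down_votes"])


clips = read_clips(io.StringIO(SAMPLE_TSV))
approved = [row["path"] for row in clips if majority_approved(row)]
# Demographic columns are empty when the speaker did not declare them.
undeclared_age = sum(1 for row in clips if not row["age"])
print(approved, undeclared_age)
```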
validated_sentences.tsv
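Each row of validated_sentences.tsv represents one validated sentence in the text corpus:
- sentence_id - unique identifier of the sentence
- sentence - text of the sentence
- variant - variant of the language
- sentence_domain - domain the sentence belongs to
- source - source of the sentence
- is_used - whether the sentence is still in circulation for recording
- clips_count - number of clips recorded for the sentence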
unvalidated_sentences.tsv
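Each row of unvalidated_sentences.tsv represents one unvalidated sentence in the text corpus:
- sentence_id - unique identifier of the sentence
- sentence - text of the sentence
- variant - variant of the language
- sentence_domain - domain the sentence belongs to
- source - source of the sentence
- up_votes - number of upvotes the sentence received
- down_votes - number of downvotes the sentence received
- status - current status of the sentence (pending or rejected)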
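The per-source and per-status counts reported earlier in this card could be reproduced by tallying one column of each sentence file. The sketch below assumes the column names listed above; the sample rows, the placeholder sentences, and the "1"/"0" encoding of is_used are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; only the column names come from this card.
VALIDATED_SAMPLE = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\t例句一\t\tgeneral\tsentence-collector\t1\t3\n"
    "s2\t例句二\t\t\tsentence-collector\t1\t0\n"
    "s3\t例句三\t\ttechnology_robotics\tchatlogs\t0\t5\n"
)
UNVALIDATED_SAMPLE = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tup_votes\tdown_votes\tstatus\n"
    "u1\t例句四\t\t\tchatlogs\t0\t2\trejected\n"
    "u2\t例句五\t\t\tchatlogs\t1\t0\tpending\n"
)


def tally(tsv_text, column):
    """Count how often each value of `column` occurs in a TSV string."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row[column] for row in reader)


by_source = tally(VALIDATED_SAMPLE, "source")
by_status = tally(UNVALIDATED_SAMPLE, "status")
print(by_source.most_common(), dict(by_status))
```

Run against the real files, `by_status` should contain only the two values named above, pending and rejected.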
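Get involved
Community links
Mozilla Taiwan community (MozTW) Common Voice project website: https://moztw.org/common-voice/
For questions and suggestions, help with outreach, corpus donations, or other collaboration, please reach us through the following community channels:
Discussion
Contribute
Donate your sentences - if you would like to donate text you own (for example, your own writing) for participants to record, please first contact Irvin ( irvin@moztw.org ) or raise it in the Line / Telegram groups above.
Acknowledgements
Datasheet authors
Irvin Chen (MozTW community liaison)
License
This dataset is released into the public domain under the Creative Commons Public Domain Dedication (CC0). By downloading this dataset, you agree not to identify individual participants in the dataset.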
Footnotes
[1] For the full list of age, gender, and accent options, see the demographics spec. This information is disclosed only when the speaker has agreed to provide it.