Zacatlán Tepetzintla Nahuatl ASR Dataset
License:
CC-BY-ND-4.0
Steward:
Kaltepetlahtol
Task: ASR
Release Date: 2/18/2026
Format: FLAC, TSV
Size: 789.98 MB
Description
An ASR dataset of Zacatlán-Ahuacatlán-Tepetzintla (Western Sierra Puebla) Nahuatl, ISO 639-3 nhi. This is a derivative work of the Zacatlán Tepetzintla Nahuatl Audio and Transcriptions datasets. It consists of the subset of the larger audio dataset that has transcriptions (approximately 14 hours), converted to the Mozilla Common Voice Scripted Speech format. The original stereo audio has been split and aligned with the parsed transcriptions.
Specifics
Licensing
Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)
https://spdx.org/licenses/CC-BY-ND-4.0.html
Considerations
Restrictions/Special Constraints
This derivative dataset is distributed with the permission of the original authors. It maintains the same license terms as the source material.
Forbidden Usage
NA
Processes
Intended Use
This dataset is specifically formatted to facilitate ASR model-training and evaluation.
Metadata
Zacatlán Tepetzintla Nahuatl ASR Corpus
This is a derivative work of Amith et al (2026)'s Zacatlán Tepetzintla Nahuatl Audio dataset and Zacatlán Tepetzintla Nahuatl Transcriptions dataset, optimized for ASR training and evaluation.
This corpus contains 14 hours of speech and transcriptions from Nahuatl speakers in the municipalities of Zacatlán and Tepetzintla, state of Puebla, Mexico. The specific Nahuatl variety is often referred to as "Zacatlán-Ahuacatlán-Tepetzintla Nahuatl", after the three municipalities where it is most widely spoken, or alternatively as "Western Sierra Puebla Nahuatl" (from INALI's "Náhuatl de la Sierra oeste de Puebla"). Its ISO 639-3 code is "nhi".
Processing
The original, full-length audio files that had corresponding transcriptions were segmented based on the transcription timestamps, with each channel corresponding to the appropriate speaker (in cases where there are two speakers). The segmented audio was output in .flac format. The original transcriptions are available, as well as an optional "normalized" version, which removes vowel-length marking and metalinguistic information (such as asterisks indicating that a word is a Spanish loan). Data splits were selected to ensure no speaker overlap.
The complete processing script is available in the code/ directory.
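The timestamp-based segmentation described above can be illustrated with a minimal sketch. This is not the script from the code/ directory. It assumes that timestamps are in seconds and that decoded stereo audio is available as a sequence of (channel 0, channel 1) sample pairs; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_frame_range(start_s: float, stop_s: float, sample_rate: int) -> Tuple[int, int]:
    """Convert start/stop timestamps (assumed to be seconds) to frame indices."""
    if start_s < 0 or stop_s < start_s:
        raise ValueError("expected 0 <= start <= stop")
    first = int(round(start_s * sample_rate))
    last = int(round(stop_s * sample_rate))
    return first, last


def extract_channel_segment(
    frames: Sequence[Tuple[int, int]],
    start_s: float,
    stop_s: float,
    sample_rate: int,
    channel: int,
) -> List[int]:
    """Return the mono samples of one channel between two timestamps.

    `frames` is an assumed in-memory representation: one tuple of samples
    per frame, indexed by channel number.
    """
    first, last = segment_frame_range(start_s, stop_s, sample_rate)
    return [frame[channel] for frame in frames[first:last]]
```

In this sketch, the `channel` argument stands in for the speaker-to-channel mapping, which in the real data comes from the original transcription files.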
Format
The dataset has been formatted to match the Mozilla Common Voice Scripted Speech datasets. There are three TSV files, one for each of the randomly generated data splits: "train.tsv", "dev.tsv", and "test.tsv". Each utterance has a corresponding audio file, and all audio files are in the clips/ directory. Each TSV file has the following columns:
| Column Name | Description |
|---|---|
| audio | The name of the specific audio segment file. |
| original_audio | The corresponding full-length audio file from the original dataset. |
| original_transcription | The corresponding .trs file from the original transcription dataset. |
| speaker | The unique identifier (ID) for the speaker. |
| start | The starting timestamp within the original audio file. |
| stop | The ending timestamp within the original audio file. |
| transcription | The raw text of what was spoken. |
| normalized | The normalized transcription, with vowel-length marking and metalinguistic information (such as loanword asterisks) removed. |
| split | The dataset partition (e.g., train, dev, or test). |
The train split has 28 speakers, the dev split has 4 speakers, and the test split has 5 speakers. Speaker information can be consulted in the original Zacatlán Tepetzintla Nahuatl Audio dataset.
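As a quick sanity check after downloading, the split files can be read with Python's standard `csv` module and the speaker sets compared. The sketch below assumes the three TSV files sit together in one directory and have a header row containing the `speaker` column listed above; the function names are illustrative, not part of the dataset's own tooling.

```python
import csv
from pathlib import Path
from typing import Dict, List, Set

SPLITS = ("train", "dev", "test")


def read_split(path: Path) -> List[Dict[str, str]]:
    """Read one tab-separated split file into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks in transcriptions as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def speakers_per_split(root: str) -> Dict[str, Set[str]]:
    """Collect the set of speaker IDs found in each split file under `root`."""
    return {
        name: {row["speaker"] for row in read_split(Path(root) / f"{name}.tsv")}
        for name in SPLITS
    }


def shared_speakers(speakers: Dict[str, Set[str]]) -> Set[str]:
    """Return every speaker ID that appears in more than one split."""
    shared: Set[str] = set()
    names = list(speakers)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared |= speakers[first] & speakers[second]
    return shared
```

Given the description above, calling `speakers_per_split` on the unpacked dataset directory should yield sets of 28, 4, and 5 speakers, and `shared_speakers` should return an empty set.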
Citation / Attribution
Please cite both sources, the original and the licensed derivative, when using Pugh 2026.
Pugh, Robert J. 2026. Zacatlán Tepetzintla Nahuatl ASR-Ready Corpus. Derived from Amith, Domínguez, Salgado, and Márquez (2026).
Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, and Ángeles Márquez Hernández.
Corpus of spoken Nahuatl from the municipalities of Zacatlán and Tepetzintla, state of Puebla, with transcriptions, translations, and annotations. Downloaded from Mozilla Data Collective on yyyy-mm-dd.
This dataset is a licensed derivative work (Amith 2026-02-12). To ensure proper credit is given to the original linguists and community members who recorded and transcribed this data, all publications using this version must cite both the primary source and this ASR-ready derivative work (see above). The foundational scholarship, field recordings, and transcriptions were produced by Amith, Domínguez, Salgado, and Márquez (2026):
Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, and Ángeles Márquez Hernández. 2026. Corpus of spoken Nahuatl from the municipalities of Zacatlán and Tepetzintla, state of Puebla, with transcriptions, translations, and annotations.
Example In-Text Citation
"We trained our models using the ASR-optimized version of the Zacatlán and Tepetzintla Nahuatl corpus (Amith et al. 2026; Pugh 2026)."