Zacatlán Tepetzintla Nahuatl ASR Dataset

License:

CC-BY-ND-4.0

Steward:

Kaltepetlahtol

Task: ASR

Release Date: 2/18/2026

Format: FLAC, TSV

Size: 789.98 MB

Description

An ASR dataset of Zacatlán-Ahuacatlán-Tepetzintla (Western Sierra Puebla) Nahuatl, ISO 639-3 nhi. This is a derivative work of the Zacatlán Tepetzintla Nahuatl Audio and Transcriptions datasets. It consists of the subset of the larger audio dataset that has transcriptions (approximately 14 hours), converted to the Mozilla Common Voice Scripted Speech format. The original stereo audio has been split and aligned with the parsed transcriptions.

Specifics

Licensing

Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)

https://spdx.org/licenses/CC-BY-ND-4.0.html

Considerations

Restrictions/Special Constraints

This derivative dataset is distributed with the permission of the original authors. It maintains the same license terms as the source material.

Forbidden Usage

Processes

Intended Use

This dataset is specifically formatted to facilitate ASR model-training and evaluation.

Metadata

Zacatlán Tepetzintla Nahuatl ASR Corpus

This is a derivative work of the Zacatlán Tepetzintla Nahuatl Audio dataset and the Zacatlán Tepetzintla Nahuatl Transcriptions dataset by Amith et al. (2026), optimized for ASR training and evaluation.

This corpus contains 14 hours of speech and transcriptions from Nahuatl speakers in the municipalities of Zacatlán and Tepetzintla, state of Puebla, Mexico. The specific Nahuatl variety is often referred to as "Zacatlán-Ahuacatlán-Tepetzintla Nahuatl", for the three municipalities where it is most widely spoken, or alternatively as "Western Sierra Puebla Nahuatl" (from INALI's "Náhuatl de la Sierra oeste de Puebla"). Its ISO 639-3 code is "nhi".

Processing

The original full-length audio files that had corresponding transcriptions were segmented according to the transcription timestamps, with each channel assigned to the appropriate speaker (in recordings with two speakers). The audio segments were written out in FLAC format. The original transcriptions are included, along with an optional "normalized" version that removes vowel-length marking and metalinguistic information (such as asterisks indicating that a word is a Spanish loan). Data splits were selected to ensure no speaker overlap.

The complete processing script is available in the code/ directory.
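The script in code/ remains the authoritative implementation. As a rough illustration only of the timestamp-based segmentation described above, the following sketch cuts one speaker's channel out of a stereo recording. The function name `cut_segment`, the representation of audio as a list of (left, right) sample pairs, and the use of seconds for timestamps are all assumptions made for this example, not details taken from the original script.

```python
def cut_segment(stereo_frames, sample_rate, start_s, stop_s, channel):
    """Return the mono samples of one channel between two timestamps.

    stereo_frames: sequence of (left, right) sample pairs (hypothetical layout).
    start_s, stop_s: segment boundaries, assumed here to be in seconds.
    channel: 0 for the left channel, 1 for the right channel.
    """
    if not 0 <= start_s < stop_s:
        raise ValueError("expected 0 <= start_s < stop_s")
    # Convert timestamps to sample indices and clamp the end to the recording.
    first = round(start_s * sample_rate)
    last = min(round(stop_s * sample_rate), len(stereo_frames))
    return [frame[channel] for frame in stereo_frames[first:last]]
```

In practice the frames would be read with an audio library and the result written to a FLAC file; both steps are left out here to keep the sketch free of dependencies.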

Format

The dataset has been formatted to match the Mozilla Common Voice Scripted Speech datasets. There are three TSV files, one for each of the randomly generated data splits: "train.tsv", "dev.tsv", and "test.tsv". Each utterance has a corresponding audio file, and all audio files are in the clips/ directory. Each TSV file has the following columns:

Column Name	Description
audio	The name of the specific audio segment file.
original_audio	The corresponding full-length audio file from the original dataset.
original_transcription	The corresponding `.trs` file from the original transcription dataset.
speaker	The unique identifier (ID) for the speaker.
start	The starting timestamp within the original audio file.
stop	The ending timestamp within the original audio file.
transcription	The raw text of what was spoken.
normalized	The cleaned or formatted version of the transcription.
split	The dataset partition (e.g., train, dev, or test).
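Given the layout above, a split can be read with the Python standard library. This is a minimal sketch, assuming the split files sit at the dataset root next to clips/ and that the `audio` column holds a bare file name; the helper `load_split` and the added `clip_path` key are hypothetical names introduced for this example.

```python
import csv
from pathlib import Path


def load_split(dataset_root, split_name):
    """Read <root>/<split_name>.tsv and attach the path of each clip."""
    root = Path(dataset_root)
    rows = []
    with open(root / f"{split_name}.tsv", newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps any quote characters in the transcriptions as
        # literal text (an assumption about how the files were written).
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Assumed location of the audio: <root>/clips/<audio>.
            row["clip_path"] = root / "clips" / row["audio"]
            rows.append(row)
    return rows
```

For example, `load_split("path/to/dataset", "train")` would return one dictionary per utterance, keyed by the column names in the table.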

The train split has 28 speakers, the dev split has 4 speakers, and the test split has 5 speakers. Speaker information can be consulted in the original Zacatlán Tepetzintla Nahuatl Audio dataset.
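Because the splits are meant to be speaker-disjoint, users who re-partition or filter the data may want to confirm that the property still holds. The sketch below is one way to check it; `shared_speakers` is a hypothetical helper, and it only assumes that each row is a dictionary with a `speaker` key, as in the column table.

```python
from itertools import combinations


def shared_speakers(rows_by_split):
    """Map each pair of splits that share speakers to the shared IDs.

    rows_by_split: {"train": [row, ...], "dev": [...], "test": [...]},
    where every row is a dict with a "speaker" key.
    An empty result means the splits are speaker-disjoint.
    """
    speakers = {
        name: {row["speaker"] for row in rows}
        for name, rows in rows_by_split.items()
    }
    overlaps = {}
    for first, second in combinations(sorted(speakers), 2):
        common = speakers[first] & speakers[second]
        if common:
            overlaps[(first, second)] = common
    return overlaps
```

The length of each speaker set can also be compared against the counts stated above (28, 4, and 5).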

Citation / Attribution

Please cite both sources, the original and this licensed derivative, when using Pugh 2026.

Pugh, Robert. 2026. Zacatlán Tepetzintla Nahuatl ASR-Ready Corpus. Derived from Amith, Domínguez, Salgado, and Márquez (2026).

Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, and Ángeles Márquez Hernández. 2026. Corpus of spoken Nahuatl from the municipalities of Zacatlán and Tepetzintla, state of Puebla, with transcriptions, translations, and annotations. Downloaded from Mozilla Data Collective on yyyy-mm-dd.

This dataset is a licensed derivative work (Amith 2026-02-12). To ensure proper credit is given to the original linguists and community members who recorded and transcribed this data, all publications using this version must cite both the primary source and this ASR-ready derivative work (see above). The foundational scholarship, field recordings, and transcriptions were produced by Amith, Domínguez, Salgado, and Márquez (2026).

Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, and Ángeles Márquez Hernández. 2026. Corpus of spoken Nahuatl from the municipalities of Zacatlán and Tepetzintla, state of Puebla, with transcriptions, translations, and annotations.

Example In-Text Citation

"We trained our models using the ASR-optimized version of the Zacatlán and Tepetzintla Nahuatl corpus (Amith et al. 2026; Pugh 2026)."

License Note

This derivative dataset is distributed with the permission of the original authors. It maintains the same license terms as the source material.