Zacatlán Tepetzintla Nahuatl Audio

License:

CC-BY-ND-4.0

Steward:

Ok nemi totlahtool

Task: ASR

Release Date: 2/7/2026

Format: WAV

Size: 50.19 GB


Description

This corpus contains 578 audio recordings, comprising 114 hours, from 31 speakers of Zacatlán-Ahuacatlán-Tepetzintla Nahuatl (alternatively Nahuatl de la Sierra Oeste de Puebla "Western Sierra Puebla Nahuatl", Glottocode zaca1241) from the municipalities of Zacatlán and Tepetzintla. The transcription of this audio is an ongoing effort, with periodic releases of transcriptions in a separate MDC dataset (https://datacollective.mozillafoundation.org/datasets/cmlct0jzu01s4nv07023lv3m3), "Zacatlan Tepetzintla Nahuatl Transcriptions."

Specifics

Licensing

Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)

https://spdx.org/licenses/CC-BY-ND-4.0.html

Considerations

Restrictions/Special Constraints

Derivative works, though encouraged, are not permitted without express consent of the dataset owner, Jonathan Amith.

Forbidden Usage

N/A

Processes

Intended Use

This dataset is intended as an archive of linguistic documentation materials and as a data source for speech and language technologies for Nahuatl.

Metadata

Access the corresponding transcriptions dataset at https://datacollective.mozillafoundation.org/datasets/cmlct0jzu01s4nv07023lv3m3

Credit

Financial Support

This corpus was created with the financial help of a National Science Foundation Dynamic Language Infrastructure collaborative grant to Jonathan D. Amith, PI, with Gettysburg College as the lead institution. The grant is #2123578: Collaborative Research: Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary Linguistic Corpora. The other half of the collaborative grant had Shinji Watanabe as PI at Carnegie Mellon University (#2123624).

How to cite

Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, and Ángeles Márquez Hernández. 2026. Corpus of spoken Nahuatl from the municipalities of Zacatlán and Tepetzintla, state of Puebla, with transcriptions, translations, and annotations. Downloaded from Mozilla Data Collective on yyyy-mm-dd.

Clarification of License

Although the archived version of this corpus is CC-BY-ND, the purpose of this license is simply to ensure that any use and derivatives of this corpus adhere to ethical standards for the use of Indigenous language material, including the recognition of authors: the native speakers who have generously shared their language and culture with the team that recorded, transcribed, translated, and otherwise annotated these audio files. Native speakers are experts and teachers who have agreed to share their knowledge with those who made the recordings, with their community, their schools, and the general public. A good set of best practices for the treatment of this knowledge is found in Guidelines for Respecting Cultural Knowledge, http://www.ankn.uaf.edu/publications/knowledge.html, a detailed document of protocols that respect Indigenous linguistic and cultural knowledge. Note specifically the requirement that the researcher should "Identify all primary contributors and secondary sources for a particular document, and share the authorship whenever possible." Note also the request that all custodians of local knowledge be "identified" and be considered "co-authors". These are the protocols that we follow for joint work with native experts. Anonymity is of course possible if the speaker requests it. But in 25 years of our work in Indigenous communities no speaker has ever requested anonymity and, indeed, once the goals of the project are explained, all have enthusiastically accepted their public role as custodians and teachers.

Considering the above, all efforts will be made to share the corpus with others who might want to create derivatives to enhance educational and research goals. Requests will be decided quickly on a case-by-case basis to ensure compliance with ethical practices. Please contact Jonathan D. Amith at nahuatl.biology@gmail.com.

Corpus Description

Size

This corpus comprises 578 audio files of the Zacatlán-Ahuacatlán-Tepetzintla Nahuatl language (Glottocode zaca1241) with a duration of approximately 114.25 hours. The recordings cover the following villages in the municipalities of Zacatlán and Tepetzintla; the numbers in parentheses give the number of audio files from each community. Municipality of Tepetzintla: Omitlán (140), Tenantitla (14), Tepetzintla (190), and Xochitlaxco (18); Municipality of Zacatlán: San Cristobal Xochimilco (7), San Miguel Tenango (189), and Xonotla (20).
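The per-community counts above can be checked against the stated total of 578 files. The short Python sketch below is only an illustration: the dictionary and function names are made up for this example, and the numbers are copied by hand from the paragraph above rather than read from the dataset itself.

```python
# File counts per community, copied by hand from the corpus description.
# The names FILES_BY_COMMUNITY and totals_by_municipality are illustrative only.
FILES_BY_COMMUNITY = {
    "Tepetzintla": {
        "Omitlán": 140,
        "Tenantitla": 14,
        "Tepetzintla": 190,
        "Xochitlaxco": 18,
    },
    "Zacatlán": {
        "San Cristobal Xochimilco": 7,
        "San Miguel Tenango": 189,
        "Xonotla": 20,
    },
}


def totals_by_municipality(counts):
    """Sum the community file counts within each municipality."""
    return {name: sum(villages.values()) for name, villages in counts.items()}


municipality_totals = totals_by_municipality(FILES_BY_COMMUNITY)
corpus_total = sum(municipality_totals.values())

print(municipality_totals)  # {'Tepetzintla': 362, 'Zacatlán': 216}
print(corpus_total)         # 578
```

The two municipality subtotals (362 and 216) add up to the 578 files reported for the corpus.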

Speakers

A total of 31 speakers contributed to the corpus: Miguel Antonio Adalberto Ibánez, José Adolfo Márquez, María Florencia Juana Álvarez Hernández, Josefa Fernández, Guillermo Hernández Barrios, María Ernestina Hernández Barrios, Esther Hernández Hernández, Heladio Hernández Hernández, Jaime Hernández Juárez, Natalia Hernández Luna, María Concepción Ibáñez Malpica, Elia Prisciliana Juárez Hernández, María de la Luz López Cabrera, Ángeles Márquez Hernández, Elizabeth Márquez Hernández, María Efigenia Márquez Hernández, José Alfonso Rodolfo Márquez Juárez, José Ramón Márquez Juárez, Ubaldo Márquez Pérez, María Consuelo de Carmen Márquez Rodríguez, María Aurelia Méndez Hernández, Miguel Méndez Juárez, Magdalena Méndez Pérez, Inés Mora Vázquez, Petronila Ortega Pérez, Virginia Pérez Reyes, Candelaria Ponce Aldama, Olivia Posadas Hernández, Salvadora Soto Pérez, Pascuala Téllez Allende, and Mercedes Vázquez Villalba. In addition, two native speakers from the municipality of Cuetzalan del Progreso, Amelia Domínguez Alcántara and Ceferino Salgado Castañeda, who recorded most of the corpus (see below), occasionally participated in the recordings as interviewers.

Annotations

A total of 13 hours 18 minutes of transcriptions in 55 files is archived with this first deposit in February 2026 and released as a separate MDC dataset. As work progresses, the transcriptions may be edited and more audio will be transcribed. In addition, plans are in place for adding free translations and other annotations. New versions of the transcriptions, translations, and annotations will be uploaded as they are created.
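To put the first transcription release in proportion, the figures quoted in this description (55 of 578 files, 13 hours 18 minutes of approximately 114.25 hours) can be turned into coverage percentages. The Python sketch below is a rough illustration using only those stated numbers; the variable names are invented for this example, and because the total duration is approximate, so are the results.

```python
from datetime import timedelta

# Figures quoted in this description; the total duration is approximate.
TOTAL_AUDIO = timedelta(hours=114.25)
TRANSCRIBED_AUDIO = timedelta(hours=13, minutes=18)
TOTAL_FILES = 578
TRANSCRIBED_FILES = 55

# Dividing one timedelta by another yields a plain float ratio.
hours_share = TRANSCRIBED_AUDIO / TOTAL_AUDIO
files_share = TRANSCRIBED_FILES / TOTAL_FILES
remaining_hours = (TOTAL_AUDIO - TRANSCRIBED_AUDIO) / timedelta(hours=1)

print(f"Transcribed audio: {hours_share:.1%}")       # Transcribed audio: 11.6%
print(f"Transcribed files: {files_share:.1%}")       # Transcribed files: 9.5%
print(f"Hours remaining:   {remaining_hours:.2f}")   # Hours remaining:   100.95
```

On these figures, roughly one ninth of the audio is transcribed in the first deposit, leaving about 101 hours still to be transcribed.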

Workflow

Recordings

The original plan to document the Nahuatl of northern Veracruz (Chicontepec) was changed to documentation of Zacatlán-Ahuacatlán-Tepetzintla Nahuatl in order to collaborate with Robert Pugh, who was studying the Nahuatl of various regions, most notably this area and particularly Omitlán (municipality of Tepetzintla). Amith, accompanied by Ceferino Salgado Castañeda, had in fact already made 151 recordings in June and July 2019, mostly in the communities of Tepetzintla and Tlaquimpa, as part of a previous ethnobotanical research project. Amith, Amelia Domínguez Alcántara, and Ceferino Salgado Castañeda then went to Omitlán in late May 2022. Amith made the early recordings, using the time to teach Domínguez and Salgado to use the Sound Devices 702 recorder and Countryman e6 earworn omnidirectional microphones. Subsequent recordings were made by Domínguez and Salgado, who later returned to record additional material in the municipalities of Tepetzintla and Zacatlán. In Zacatlán in particular, they were accompanied by Ángeles Márquez Hernández, who interviewed most of the speakers from San Miguel Tenango, her home village, and other localities. Márquez Hernández has been an integral member of the project since that time and continues to play an important role.

Transcriptions

At the time of archiving, January 2026, a total of 55 files with a duration of 13 hours 18 minutes had been transcribed (published separately on MDC). The next step is to use these transcriptions as a training corpus for automatic speech recognition and for transferring the ASR recipe in ESPnet that was built from a larger transcribed corpus of Nahuatl from the municipality of Cuetzalan del Progreso. The ASR output will be corrected by Ángeles Márquez and others on the research team until all 114.25 hours have been corrected by hand.

Future Plans

Once the previous step has been completed, the following years will be dedicated to free translation and, hopefully, morphological segmentation and glossing.