Zacatlán Tepetzintla Nahuatl Transcriptions

License:

CC-BY-ND-4.0

Steward:

Ok nemi totlahtool

Task: ASR

Release Date: 2/7/2026

Format: TRS

Size: 320.28 KB

Description

This corpus contains the most up-to-date version of the ongoing transcription effort corresponding to the "Zacatlan Tepetzintla Nahuatl Audio Corpus", also available on MDC. At present, approximately 13 hours of audio have been transcribed using the Transcriber software. The transcriptions will be periodically updated with new transcriptions and corrections to previous versions. See the audio datasheet for dataset details.

Specifics

Licensing

Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)

https://spdx.org/licenses/CC-BY-ND-4.0.html

Considerations

Restrictions/Special Constraints

Derivative works (of both the Audio and Transcriptions datasets) are encouraged, but require the express consent of the dataset owner, Jonathan Amith.

Forbidden Usage

N/A

Processes

Intended Use

This dataset is intended to be paired with the corresponding audio dataset and used as an archive of linguistic documentation materials and as a data source for speech and language technologies for Nahuatl.
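As a rough illustration of pairing the two datasets, the sketch below matches transcription files to audio files by shared base name. It assumes that each .trs file has the same base name as its recording; that naming convention, the function name pair_by_stem, and the example paths are hypothetical and should be checked against the actual files.

```python
from pathlib import PurePath


def pair_by_stem(trs_paths, audio_paths):
    """Match .trs files to audio files that share a base name.

    ASSUMPTION: a transcription and its recording differ only in directory
    and extension (e.g. rec1.trs and rec1.wav). Verify this against the
    real file names before relying on it.
    """
    audio_by_stem = {PurePath(p).stem: p for p in audio_paths}
    pairs = {}
    for trs in trs_paths:
        stem = PurePath(trs).stem
        if stem in audio_by_stem:
            pairs[stem] = (trs, audio_by_stem[stem])
    return pairs
```

Because only part of the audio has been transcribed so far, recordings without a matching transcription are simply left out of the result.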

Metadata

This dataset contains only transcriptions. For full details of the corpus, see the corresponding audio dataset.

Financial Support

This corpus was created with the financial help of a National Science Foundation Dynamic Language Infrastructure collaborative grant to Jonathan D. Amith, PI, with Gettysburg College as the lead institution. The grant is #2123578: Collaborative Research: Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary Linguistic Corpora. The other half of the collaborative grant had Shinji Watanabe as PI at Carnegie Mellon University (#2123624).

How to cite

Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, and Ángeles Márquez Hernández. 2026. Corpus of spoken Nahuatl from the municipalities of Zacatlán and Tepetzintla, state of Puebla, with transcriptions, translations, and annotations. Downloaded from Mozilla Data Collective on yyyy-mm-dd.

Clarification of License

Although the archived version of this corpus is CC-BY-ND, the purpose of this license is simply to ensure that any use and derivatives of this corpus adhere to ethical standards for the use of Indigenous language material, including the recognition of authors, namely the native speakers who have generously shared their language and culture with the team that recorded, transcribed, translated, and otherwise annotated these audio files. Native speakers are experts and teachers who have agreed to share their knowledge with those who made the recordings, with their community, their schools, and the general public. A good set of best practices for treatment of this knowledge is found at Guidelines for Respecting Cultural Knowledge, http://www.ankn.uaf.edu/publications/knowledge.html, which is a detailed document of protocols to follow that respect Indigenous linguistic and cultural knowledge. Note specifically the requirement that the researcher should "Identify all primary contributors and secondary sources for a particular document, and share the authorship whenever possible." Note also the request that all custodians of local knowledge be "identified" and be considered "co-authors". These are the protocols that we follow for joint work with native experts. Anonymity is of course possible if a speaker requests it. But in 25 years of our work in Indigenous communities no speaker has ever requested anonymity and, indeed, once the goals of the project are explained, all have enthusiastically accepted their public role of custodians and teachers.

Considering the above, all efforts will be made to share the corpus with others who might want to create derivatives to enhance educational and research goals. Requests will be decided quickly on a case-by-case basis to ensure compliance with ethical practices. Please contact Jonathan D. Amith at nahuatl.biology@gmail.com.

Corpus Description

Size

There are currently transcription files (.trs) for 55 of the 308 audio recordings in the corpus.

Speakers

A subset of the 31 speakers is included in the transcriptions. This subset will grow as transcriptions are added.

Annotations

A total of 13 hours 18 minutes of transcriptions in 55 files is archived with this first deposit in February 2026 and released as a separate MDC dataset. As work progresses, the transcriptions may be edited and more audio will be transcribed. In addition, plans are in place for adding free translations and other annotations. New versions of the transcriptions, translations, and annotations will be uploaded as they are created.

Format

Transcriptions are in the XML file format output by the Transcriber software.
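For readers who want to load the files programmatically, here is a minimal sketch of reading time-aligned segments from a .trs file with the Python standard library. It assumes the usual Transcriber layout, in which Turn elements carry speaker, startTime, and endTime attributes and Sync elements mark segment boundaries with a time attribute; the files in this dataset were not inspected for this example, so treat those element names and the function name read_trs_segments as assumptions to verify.

```python
import xml.etree.ElementTree as ET


def _flush(segments, start, end, speaker, parts):
    """Append a segment if the collected text is non-empty."""
    text = " ".join("".join(parts).split())  # collapse whitespace
    if text:
        segments.append((start, end, speaker, text))


def read_trs_segments(source):
    """Return (start, end, speaker, text) tuples from a Transcriber file.

    `source` is a file path or a file-like object. ASSUMPTION: the file
    uses <Turn> and <Sync> elements as in the standard Transcriber format.
    """
    root = ET.parse(source).getroot()
    segments = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker", "")
        start = float(turn.get("startTime", 0.0))
        turn_end = float(turn.get("endTime", start))
        parts = [turn.text or ""]
        for child in turn:
            if child.tag == "Sync":
                # A Sync closes the running segment and opens a new one.
                sync_time = float(child.get("time"))
                _flush(segments, start, sync_time, speaker, parts)
                start = sync_time
                parts = []
            # Text follows each child element as its "tail".
            parts.append(child.tail or "")
        _flush(segments, start, turn_end, speaker, parts)
    return segments
```

Non-speech markup inside a turn (for example noise events) is skipped and only the surrounding text is kept; speaker values are the identifiers stored in the file, not speaker names.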

Workflow

Transcriptions

At the time of archiving, January 2026, a total of 55 files with a duration of 13 hours 18 minutes had been transcribed (published separately on MDC). The next step is to use these transcriptions as a training corpus for automatic speech recognition, transferring the ESPnet ASR recipe that had been built from a larger transcribed corpus of Nahuatl from the municipality of Cuetzalan del Progreso. The ASR output will be corrected by Ángeles Márquez and others on the research team until all 114.25 hours have been corrected by human effort.

Future Plans

Once the previous step has been completed, the following years will be dedicated to free translation and, hopefully, morphological segmentation and glossing.