Tetelancingo Nahuatl

License: CC-BY-NC-4.0

Steward: Kaltepetlahtol

Task: ASR

Release Date: 11/4/2025

Format: .tsv, .wav

Size: 952.98 MB

Description

Audio of Western Sierra Puebla Nahuatl, transcribed, orthographically normalized, translated, and tagged.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Forbidden Usage

Any attempt to clone or imitate the voice of any of the speakers in this dataset is forbidden.

Processes

Intended Use

ASR, orthographic normalization, token-level language identification (from text or speech), translation.

Metadata

Tetelancingo Nahuatl Corpus

A corpus of audio and annotated transcriptions of Western Sierra Puebla Nahuatl, an endangered variety of Nahuatl spoken in Puebla, Mexico. The corpus contains recorded monologues and dialogues from a total of 5 speakers from a community in Zacatlán de las Manzanas. Each recording is associated with (1) an original transcription written in a "spontaneous orthography," (2) a normalized version written in a local written standard from San Miguel Tenango, (3) an unedited Spanish translation, and (4) word-level language tags.
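
The four annotation layers described above could be read from a .tsv file along the following lines. This is only a minimal sketch: the column names (`audio_path`, `transcription`, `normalized`, `translation`, `language_tags`), the space-separated tag format, and the assumption that tags align one-to-one with the normalized words are all hypothetical and should be checked against the actual release.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Utterance:
    # All field names mirror HYPOTHETICAL column names, not the real schema.
    audio_path: str      # path to the .wav clip
    transcription: str   # original "spontaneous orthography"
    normalized: str      # San Miguel Tenango written standard
    translation: str     # unedited Spanish translation
    language_tags: list  # one tag per word (assumed space-separated)


def read_utterances(tsv_file):
    """Parse an open TSV file object into a list of Utterance records."""
    # QUOTE_NONE keeps any quotation marks in the transcriptions literal.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    utterances = []
    for row in reader:
        utterances.append(
            Utterance(
                audio_path=row["audio_path"],
                transcription=row["transcription"],
                normalized=row["normalized"],
                translation=row["translation"],
                language_tags=row["language_tags"].split(),
            )
        )
    return utterances


def tags_aligned(utt):
    """Check the assumed one-tag-per-normalized-word alignment."""
    return len(utt.language_tags) == len(utt.normalized.split())
```

With a real file, something like `read_utterances(open("data.tsv", encoding="utf-8", newline=""))` would apply, where `data.tsv` is a placeholder file name. Running `tags_aligned` over the records is a cheap sanity check before training a token-level language identification model.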

For more information about the dataset and some preliminary experiments, see the paper "Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra Puebla Nahuatl."

Citation

If you use this dataset, please cite the following paper:

@inproceedings{pugh-etal-2025-ihquin,
    title = "Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra {P}uebla {N}ahuatl",
    author = "Pugh, Robert  and
      Wing, Cheyenne  and
      Ju{\'a}rez Huerta, Mar{\'i}a Ximena  and
      M{\'a}rquez Hernandez, {\'A}ngeles  and
      Tyers, Francis",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.181/",
    doi = "10.18653/v1/2025.naacl-long.181",
    pages = "3549--3562",
    ISBN = "979-8-89176-189-6",
    abstract = "The development of digital linguistic resources is essential for enhancing the inclusion of indigenous and marginalized languages in the digital domain. Indigenous languages of Mexico, despite representing vast typological diversity and millions of speakers, have largely been overlooked in NLP until recently. In this paper, we present a corpus of audio and annotated transcriptions of Western Sierra Puebla Nahuatl, an endangered variety of Nahuatl spoken in Puebla, Mexico. The data made available in this corpus are useful for ASR, spelling normalization, and word-level language identification. We detail the corpus-creation process, and describe experiments to report benchmark results for each of these important NLP tasks. The corpus audio and text is made freely available."
}