Taruen's Tatar Folklore Text Corpus

License:

CC0-1.0

Steward:

Taruen

Task: NLP

Release Date: 2/4/2026

Format: TXT

Size: 1.40 MB


Description

This corpus contains a curated collection of Tatar folklore texts, including fairy tales, proverbs, short songs (quatrains), and legends. To keep the language relevant for modern NLP tasks, the content was selected from five volumes of the 13-volume Tatar Folklore series, explicitly excluding archaic genres (such as Dastannar) to focus on contemporary language. The majority of the texts are 20th-century field recordings. Fairy-tale entries are tagged with the year the text was first recorded or published, so their temporal context can be checked; for the other genres this field may be absent or generalized. The corpus is structured as plain text files with YAML front matter metadata and contains 484,844 words in total (213,619 in fairy tales, 108,767 in proverbs, 71,975 in quatrains, and 90,483 in legends).

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None

Forbidden Usage

None

Metadata

Taruen's Tatar Folklore Text Corpus

Overview

This corpus contains a curated collection of Tatar folklore texts, including fairy tales, proverbs, short songs (quatrains), and legends. The texts have been digitized from academic volumes published in Kazan between 1976 and 1987.

Statistics

  • Total Word Count: 484,844

  • Files:

    • tales.txt: 213,619 words (243 items)

    • proverbs.txt: 108,767 words (18,388 items)

    • quatrains.txt: 71,975 words (5,630 items)

    • legends.txt: 90,483 words (380 items)
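
As a quick consistency check, the per-file figures above can be summed and compared with the stated total. The following is a minimal Python sketch; the dictionary and function names are illustrative and not part of the dataset, and the numbers are copied from the list above.

```python
from typing import Dict, Tuple

# Per-file figures copied from the Statistics list in this README.
WORD_COUNTS: Dict[str, int] = {
    "tales.txt": 213_619,
    "proverbs.txt": 108_767,
    "quatrains.txt": 71_975,
    "legends.txt": 90_483,
}
ITEM_COUNTS: Dict[str, int] = {
    "tales.txt": 243,
    "proverbs.txt": 18_388,
    "quatrains.txt": 5_630,
    "legends.txt": 380,
}


def corpus_totals() -> Tuple[int, int]:
    """Return (total words, total items) across the four files."""
    return sum(WORD_COUNTS.values()), sum(ITEM_COUNTS.values())


if __name__ == "__main__":
    words, items = corpus_totals()
    print(f"words: {words:,}")  # prints "words: 484,844"
    print(f"items: {items:,}")  # prints "items: 24,641"
```

The word total agrees with the stated Total Word Count. Recounting words from the files themselves may give slightly different numbers, since the tokenization used for the published figures is not specified here.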

Sources

The following academic volumes served as the source material for this corpus:

Fairy Tales (tales.txt):

  • Gatina, X. X., & Yarmi, X. X. (Comps.). (1977). Tatar xalıq icatı: Äkiätlär [Tatar Folklore: Fairy Tales] (Vol. 1). Tatarstan Kitap Näşriyatı.

  • Camaletdinov, L. Ş. (Comp.). (1981). Tatar xalıq icatı: Äkiätlär [Tatar Folklore: Fairy Tales] (Vol. 3). Tatarstan Kitap Näşriyatı.

Proverbs (proverbs.txt):

  • Mäxmütov, X. Ş. (Comp.). (1987). Tatar xalıq icatı: Mäqallär häm äytemnär [Tatar Folklore: Proverbs and Sayings]. Tatarstan Kitap Näşriyatı.

Legends (legends.txt):

  • Ğıyläcetdinov, S. M. (Comp.). (1987). Tatar xalıq icatı: Riwayatlär häm legendalar [Tatar Folklore: Myths and Legends]. Tatarstan Kitap Näşriyatı.

Quatrains (quatrains.txt):

  • Nadirov, İ. (Comp.). (1976). Tatar xalıq icatı: Qısqa cırlar (Dürtyullıqlar) [Tatar Folklore: Short Songs (Quatrains)]. Tatarstan Kitap Näşriyatı.

Data Format and Metadata

The files are provided in plain text format. Each entry within a file is introduced by a YAML front matter block (enclosed between two --- lines), which also marks the boundary between consecutive entries.

Front Matter Structure

For Fairy Tales, the metadata is granular and specific to the individual text:

---
title: "Tölke belän Büre"
lang: "tt"
year: "1956"
source_original: "Bashirov, G., & Yarmi, X. (Comps.). (1956). Tatar xalıq äkiätläre [Tatar Folk Tales]. Kazan."
source_container: "Gatina, X. X., & Yarmi, X. X. (Comps.). (1977). Tatar xalıq icatı: Äkiätlär [Tatar Folklore: Fairy Tales] (Vol. 1). Tatarstan Kitap Näşriyatı."
---

Field Definitions:

  • title: The title of the work.

  • lang: Language code ('tt' for Tatar).

  • year: The year of the original recording or first publication of the specific text.

  • source_original: The specific origin of the text (e.g., field recording or earlier anthology).

  • source_container: The volume from which this specific digital version was extracted.
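
The front matter layout described above can be read with a small state machine. This is a minimal sketch under stated assumptions, not an official loader: it assumes every entry starts with a block enclosed by two --- lines, that metadata lines are flat key: "value" pairs as in the example, and that no line of body text consists solely of ---. The function name parse_entries is hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def parse_entries(text: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, body) pairs from one corpus file's contents.

    Assumption: a line consisting only of '---' always opens or closes
    a front matter block and never occurs inside the folklore text.
    """
    meta: Dict[str, str] = {}
    body: List[str] = []
    in_front_matter = False
    seen_entry = False

    for line in text.splitlines():
        if line.strip() == "---":
            if in_front_matter:
                # Closing delimiter: the body of this entry starts next.
                in_front_matter = False
            else:
                # Opening delimiter: emit the previous entry, if any.
                if seen_entry:
                    yield meta, "\n".join(body).strip()
                meta, body = {}, []
                in_front_matter = True
                seen_entry = True
            continue

        if in_front_matter:
            # Split on the first colon only, so values that contain
            # colons (e.g. bibliographic citations) stay intact.
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip().strip('"')
        elif seen_entry:
            body.append(line)

    # Emit the final entry once its front matter has been closed.
    if seen_entry and not in_front_matter:
        yield meta, "\n".join(body).strip()
```

A typical call would be parse_entries(open("tales.txt", encoding="utf-8").read()). For metadata that goes beyond flat quoted strings, a full YAML parser would be the safer choice; the hand-rolled splitting here only covers the shape shown in the example.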

Note on Proverbs, Quatrains, and Legends: Because these files contain over 24,000 entries combined, source_original and year were not tracked for every individual item. These fields may therefore be absent or generalized in proverbs.txt, quatrains.txt, and legends.txt. The language of these texts is nevertheless contemporary standard literary Tatar, consistent with the fairy tales.
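
Since year may be missing or generalized outside tales.txt, code that consumes the metadata should treat it as optional. The helper below is a minimal sketch; the name entry_year is hypothetical, and it assumes a precise year is stored as a plain string of digits, as in the fairy-tale example.

```python
from typing import Mapping, Optional


def entry_year(meta: Mapping[str, str]) -> Optional[int]:
    """Return the year as an int, or None when it is absent or not a plain number."""
    raw = meta.get("year", "").strip()
    # Anything other than ASCII digits is treated as "no precise year".
    if raw.isascii() and raw.isdigit():
        return int(raw)
    return None


if __name__ == "__main__":
    print(entry_year({"title": "Tölke belän Büre", "year": "1956"}))  # prints 1956
    print(entry_year({"lang": "tt"}))  # prints None
```

Returning None rather than raising an error keeps filtering by date straightforward for the fairy tales while letting the other three files pass through unchanged.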

Processing Methodology

This dataset was created through the following pipeline:

  1. Scanning: Informants in Kazan supplied high-quality PDF scans of the source volumes.

  2. OCR & Extraction: Text was extracted using the tesseract-tat package and the gemini-2.5-pro LLM.

  3. Proofreading: The resulting texts were manually marked up with YAML front matter and proofread by human editors to correct recognition errors and remove any LLM hallucinations.

Copyright and License

Folklore Status: According to Article 1259 of the Civil Code of the Russian Federation, works of folk art (folklore) that have no specific authors are not subject to copyright.

Publisher Commentary: Scientific commentaries, introductions, and footnotes provided by the compilers/publishers are copyrighted material. Care was taken to exclude all such commentary from this corpus. This dataset contains only the folklore text proper.

Usage: The texts provided are in the public domain, but we would appreciate attribution where you reference this Mozilla Data Collective dataset.