Taruen's Tatar Folklore Text Corpus
License: CC0-1.0
Steward: Taruen
Task: NLP
Release Date: 2/4/2026
Format: TXT
Size: 1.40 MB
Description
This corpus contains a curated collection of Tatar folklore texts, including fairy tales, proverbs, short songs (quatrains), and legends. To ensure linguistic relevance for modern NLP tasks, the content was selected from 5 volumes of the 13-volume Tatar Folklore series, explicitly excluding archaic genres (such as Dastannar) to focus on contemporary language. The majority of the texts are 20th-century field recordings. Fairy tale entries are tagged with the year each text was first recorded or published, which documents their temporal context; for proverbs, quatrains, and legends this field may be absent or generalized. The corpus is structured in plain text files with YAML front matter metadata and includes 484,844 words total (213,619 in fairy tales, 108,767 in proverbs, 71,975 in quatrains, and 90,483 in legends).
Specifics
Considerations
Restrictions/Special Constraints
None
Forbidden Usage
None
Metadata
Taruen's Tatar Folklore Text Corpus
Overview
This corpus contains a curated collection of Tatar folklore texts, including fairy tales, proverbs, short songs (quatrains), and legends. The texts have been digitized from academic volumes published in Kazan between 1976 and 1987.
Statistics
Total Word Count: 484844
Files:
tales.txt: 213619 words (243 items)
proverbs.txt: 108767 words (18,388 items)
quatrains.txt: 71975 words (5,630 items)
legends.txt: 90483 words (380 items)
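As a quick sanity check, the per-file figures above can be summed and compared against the published total. The sketch below only re-adds the numbers listed in this section; it does not recount the files, since the tokenization used for the published word counts is not stated here. The variable names are illustrative.

```python
# Per-file counts copied from the Statistics section above.
WORD_COUNTS = {
    "tales.txt": 213_619,
    "proverbs.txt": 108_767,
    "quatrains.txt": 71_975,
    "legends.txt": 90_483,
}
ITEM_COUNTS = {
    "tales.txt": 243,
    "proverbs.txt": 18_388,
    "quatrains.txt": 5_630,
    "legends.txt": 380,
}
PUBLISHED_TOTAL = 484_844  # "Total Word Count" as published

total_words = sum(WORD_COUNTS.values())
total_items = sum(ITEM_COUNTS.values())

# The per-file word counts should add up to the published total.
assert total_words == PUBLISHED_TOTAL

# Mean words per item gives a rough sense of how entry length varies by genre.
for name, words in WORD_COUNTS.items():
    mean_len = words / ITEM_COUNTS[name]
    print(f"{name}: {words} words, {ITEM_COUNTS[name]} items, {mean_len:.1f} words/item")
print(f"total: {total_words} words, {total_items} items")
```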
Sources
The following academic volumes served as the source material for this corpus:
Fairy Tales (tales.txt):
Gatina, X. X., & Yarmi, X. X. (Comps.). (1977). Tatar xalıq icatı: Äkiätlär [Tatar Folklore: Fairy Tales] (Vol. 1). Tatarstan Kitap Näşriyatı.
Camaletdinov, L. Ş. (Comp.). (1981). Tatar xalıq icatı: Äkiätlär [Tatar Folklore: Fairy Tales] (Vol. 3). Tatarstan Kitap Näşriyatı.
Proverbs (proverbs.txt):
Mäxmütov, X. Ş. (Comp.). (1987). Tatar xalıq icatı: Mäqallär häm äytemnär [Tatar Folklore: Proverbs and Sayings]. Tatarstan Kitap Näşriyatı.
Legends (legends.txt):
Ğıyläcetdinov, S. M. (Comp.). (1987). Tatar xalıq icatı: Riwayatlär häm legendalar [Tatar Folklore: Myths and Legends]. Tatarstan Kitap Näşriyatı.
Quatrains (quatrains.txt):
Nadirov, İ. (Comp.). (1976). Tatar xalıq icatı: Qısqa cırlar (Dürtyullıqlar) [Tatar Folklore: Short Songs (Quatrains)]. Tatarstan Kitap Näşriyatı.
Data Format and Metadata
The files are provided in plain text format. Each entry within the files is delimited by a YAML Front Matter block.
Front Matter Structure
For Fairy Tales, the metadata is granular and specific to the individual text:
---
title: "Tölke belän Büre"
lang: "tt"
year: "1956"
source_original: "Bashirov, G., & Yarmi, X. (Comps.). (1956). Tatar xalıq äkiätläre [Tatar Folk Tales]. Kazan."
source_container: "Gatina, X. X., & Yarmi, X. X. (Comps.). (1977). Tatar xalıq icatı: Äkiätlär [Tatar Folklore: Fairy Tales] (Vol. 1). Tatarstan Kitap Näşriyatı."
---
Field Definitions:
title: The title of the work.
lang: Language code ('tt' for Tatar).
year: The year of the original recording or first publication of the specific text.
source_original: The specific origin of the text (e.g., field recording or earlier anthology).
source_container: The volume from which this specific digital version was extracted.
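A minimal sketch of reading entries in this layout, using only the Python standard library. It assumes that each entry consists of a front matter block between two `---` lines followed by the text body, that values are simple quoted scalars as in the example above, and that no body line consists solely of `---`. The function names and the placeholder sample are illustrative and not part of the dataset; for anything beyond simple `key: "value"` pairs, a full YAML parser would be the safer choice.

```python
from typing import Dict, Iterator, Tuple


def parse_value(raw: str) -> str:
    """Strip whitespace and one pair of surrounding double quotes."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    return raw


def iter_entries(text: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, body) pairs from the contents of one corpus file."""
    lines = text.splitlines()
    i, n = 0, len(lines)
    while i < n:
        if lines[i].strip() != "---":
            i += 1  # skip anything before the first front matter block
            continue
        i += 1  # step past the opening '---'
        meta: Dict[str, str] = {}
        while i < n and lines[i].strip() != "---":
            # Split on the first colon only, so colons inside values survive.
            key, sep, value = lines[i].partition(":")
            if sep:
                meta[key.strip()] = parse_value(value)
            i += 1
        i += 1  # step past the closing '---'
        body = []
        while i < n and lines[i].strip() != "---":
            body.append(lines[i])
            i += 1
        yield meta, "\n".join(body).strip()


# Placeholder sample: the metadata mirrors the example above, the bodies are dummies.
SAMPLE = """---
title: "Tölke belän Büre"
lang: "tt"
year: "1956"
---
(text of the first entry)

---
title: "(second title)"
lang: "tt"
---
(text of the second entry)
"""

for meta, body in iter_entries(SAMPLE):
    print(meta.get("title"), meta.get("year"), len(body.split()))
```

Reading a real file would then look like `iter_entries(open("tales.txt", encoding="utf-8").read())`, assuming the file is UTF-8 encoded.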
Note on Proverbs, Quatrains, and Legends:
Due to the high volume of items (over 24,000 combined entries),
source_original and year were not tracked individually for every
proverb or quatrain. These fields may therefore be absent or
generalized in proverbs.txt, quatrains.txt, and legends.txt.
The language of these texts is nonetheless contemporary standard
literary Tatar, consistent with the fairy tales.
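Because `year` may be missing or generalized in some files, code that filters by date should read the field defensively. The helper below is a hypothetical sketch, not part of the dataset: it assumes each entry's metadata is available as a dictionary of strings and treats any value that is not a plain integer year as unknown.

```python
from typing import Mapping, Optional


def entry_year(meta: Mapping[str, str]) -> Optional[int]:
    """Return the entry's year as an int, or None if absent or not a plain year."""
    raw = str(meta.get("year", "")).strip().strip('"')
    try:
        return int(raw)
    except ValueError:
        # Covers a missing field (empty string) and generalized values.
        return None


def recorded_since(meta: Mapping[str, str], first_year: int) -> bool:
    """True only when the year is known and not earlier than first_year."""
    year = entry_year(meta)
    return year is not None and year >= first_year
```

With this design, entries of unknown date are excluded from year-based filters instead of raising an error; whether to exclude or keep them is a choice that depends on the task.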
Processing Methodology
This dataset was created through the following pipeline:
Scanning: Informants in Kazan supplied high-quality PDF scans of the source volumes.
OCR & Extraction: Text was extracted using the tesseract-tat package and the gemini-2.5-pro LLM.
Proofreading: The resulting texts were manually marked up with YAML front matter and proofread by human editors to ensure accuracy and strictly exclude LLM hallucinations.
Copyright and License
Folklore Status: According to Article 1259 of the Civil Code of the Russian Federation, works of folklore have no authorship and are not subject to copyright.
Publisher Commentary: Scientific commentaries, introductions, and footnotes provided by the compilers/publishers are copyrighted material. Care was taken to exclude all such commentary from this corpus. This dataset contains only the folklore text proper.
Usage: The texts provided are in the public domain, but we would appreciate attribution where you reference this Mozilla Data Collective dataset.