Kyrgyz Folklore Text Corpus
License:
CC0-1.0
Steward:
Taruen
Task: NLP
Release Date: 2/17/2026
Format: TXT
Size: 1.28 MB
Description
This corpus contains a curated collection of Kyrgyz folklore texts, including fairy tales, magical tales, tales of everyday life, proverbs, sayings, and aphorisms. The content was digitized from five academic volumes published in Bishkek between 2016 and 2017, sourced from the electronic collections of the Central Scientific Library of the National Academy of Sciences of the Kyrgyz Republic. The corpus is structured as plain text files with YAML front matter metadata (titles are written in the Common Turkic Alphabet) and contains 427,527 words in total: 338,937 in tales, 71,619 in proverbs, and 16,971 in aphorisms. Text was extracted with OCR and LLM processing, then proofread by human editors to ensure accuracy and exclude LLM hallucinations.
Specifics
Considerations
Restrictions/Special Constraints
None
Forbidden Usage
None
Metadata
Kyrgyz Folklore Text Corpus
Overview
This corpus contains a curated collection of Kyrgyz folklore texts, including fairy tales, magical tales, tales of everyday life, proverbs, sayings, and aphorisms. The texts have been digitized from academic volumes published in Bishkek between 2016 and 2017.
Statistics
Total Word Count: 427527
Files:
tales.txt: 338937 words (399 items)
proverbs.txt: 71619 words (1 item block)
aphorisms.txt: 16971 words (1 item block)
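As a quick consistency check, the per-file figures can be summed and compared with the reported total. The sketch below only uses the numbers listed in this section; the variable names are illustrative and not part of the dataset.

```python
# Figures copied from the Statistics section of this card.
counts = {
    "tales.txt": 338937,
    "proverbs.txt": 71619,
    "aphorisms.txt": 16971,
}
reported_total = 427527
tale_items = 399  # number of items reported for tales.txt

# Sum the per-file counts and compare with the reported total.
computed_total = sum(counts.values())
print("totals match:", computed_total == reported_total)

# Share of the corpus contributed by each file.
for name, n in counts.items():
    print(f"{name}: {n / computed_total:.1%}")

# Average length of a tale, in words.
mean_words_per_tale = counts["tales.txt"] / tale_items
print(f"mean words per tale: {mean_words_per_tale:.0f}")
```

The card does not say how words were counted, so re-counting the files yourself (for example by splitting on whitespace) may give slightly different numbers.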
Sources
The following academic volumes served as the source material for this corpus:
Tales (tales.txt):
Sadyrbek kyzy, G. (Comp.). (2017). El adabiyaty: Jomoktor [Kyrgyz Folklore: Fairy Tales] (Vol. 21). Bishkek: Print-Express.
Osmonalieva, T. (Comp.). (2017). El adabiyaty: Keremettüü jomoktor [Kyrgyz Folklore: Magical Tales] (Vol. 22). Bishkek: Print-Express.
Mederalieva, J. (Comp.). (2017). El adabiyaty: Turmushtuk jomoktor [Kyrgyz Folklore: Tales of Everyday Life] (Vol. 23). Bishkek: Print-Express.
Proverbs (proverbs.txt):
Akmataliev, A., Kasymgeldieva, M., & Toychubek kyzy, J. (Comps.). (2016). El adabiyaty: Er Eshim, Makal-lakaptar [Kyrgyz Folklore: Er Eshim, Proverbs and Sayings] (Vol. 13). Bishkek: Print-Express.
Aphorisms (aphorisms.txt):
Soltobaeva, K. B. (Comp.). (2017). El adabiyaty: Uchkul sozdor, Chechen sozdor, Tamsilder, Myskyldar [Kyrgyz Folklore: Aphorisms, Eloquence, Fables, Satire] (Vol. 25). Bishkek: Print-Express.
Data Format and Metadata
The files are provided in plain text format. Each entry within the files is delimited by a YAML Front Matter block.
Front Matter Structure
For Tales, the metadata is granular and specific to the individual text:
---
title: "Aqılduu balapan"
lang: "ky"
source_container: "Sadyrbek kyzy, G. (Comp.). (2017). El adabiyaty: Jomoktor [Kyrgyz Folklore: Fairy Tales] (Vol. 21). Bishkek: Print-Express."
---
Field Definitions & Data Notes:
title: The title of the work, written using the Common Turkic Alphabet.
lang: Language code ('ky' for Kyrgyz).
source_container: The volume from which this specific digital version was extracted.
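The front matter layout described above can be read with a few lines of Python. This is a minimal sketch, not the tooling used to build the corpus. It assumes that each entry starts with a front matter block fenced by `---` lines, that every metadata line has the simple `key: "value"` form shown in the example, and that no line of body text consists solely of `---`. The function name `iter_entries` and the sample bodies are hypothetical placeholders. For anything beyond these flat fields, a full YAML parser would be the safer choice.

```python
import re
from typing import Dict, Iterator, Tuple

# Matches simple `key: "value"` lines; the quotes are optional.
FIELD_RE = re.compile(r'^(\w+):\s*"?(.*?)"?\s*$')


def iter_entries(text: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, body) pairs from the contents of a corpus file."""
    lines = text.splitlines()
    i, n = 0, len(lines)
    while i < n:
        if lines[i].strip() != "---":
            i += 1  # skip anything before the next front matter block
            continue
        i += 1  # step past the opening fence
        meta: Dict[str, str] = {}
        while i < n and lines[i].strip() != "---":
            match = FIELD_RE.match(lines[i])
            if match:
                meta[match.group(1)] = match.group(2)
            i += 1
        i += 1  # step past the closing fence
        body = []
        while i < n and lines[i].strip() != "---":
            body.append(lines[i])
            i += 1
        yield meta, "\n".join(body).strip()


# Illustrative input: the first header is the example from this card,
# the bodies and the second header are made-up placeholders.
SAMPLE = '''---
title: "Aqılduu balapan"
lang: "ky"
source_container: "Sadyrbek kyzy, G. (Comp.). (2017). El adabiyaty: Jomoktor [Kyrgyz Folklore: Fairy Tales] (Vol. 21). Bishkek: Print-Express."
---
placeholder body text one
---
title: "Placeholder title"
lang: "ky"
---
placeholder body two
'''

entries = list(iter_entries(SAMPLE))
for meta, body in entries:
    # Whitespace splitting is an assumed word-count method.
    print(meta.get("title"), len(body.split()))
```

To process a real file, read it as UTF-8 and pass the contents to the function, for example `iter_entries(open("tales.txt", encoding="utf-8").read())`.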
Processing Methodology
This dataset was created through the following pipeline:
Acquisition: Source PDFs were obtained from the electronic collections of the Центральная научная библиотека Национальной академии наук Кыргызской Республики (Central Scientific Library of the National Academy of Sciences of the Kyrgyz Republic) at http://cslnaskr.krena.kg/collections/ru/collection/161/.
OCR & Extraction: Text was extracted using the tesseract-kir package and an LLM.
Proofreading: The resulting texts were manually marked up with YAML front matter and proofread by human editors to ensure accuracy and strictly exclude LLM hallucinations.
Copyright and License
Folklore Status: According to Article 8 of the Law of the Kyrgyz Republic "On Copyright and Related Rights", works of folk art (folklore) are not subject to copyright.
Publisher Commentary: Scientific commentaries, introductions, and footnotes provided by the compilers/publishers are copyrighted material. Care was taken to exclude all such commentary from this corpus. This dataset contains only the folklore text proper.
Usage: The texts provided are in the public domain, but we would appreciate attribution where you reference this Mozilla Data Collective dataset.