Kyrgyz Folklore Text Corpus

License:

CC0-1.0

Steward:

Taruen

Task: NLP

Release Date: 2/17/2026

Format: TXT

Size: 1.28 MB


Description

This corpus contains a curated collection of Kyrgyz folklore texts, including fairy tales, magical tales, tales of everyday life, proverbs, sayings, and aphorisms. The content was digitized from five academic volumes published in Bishkek in 2016 and 2017, sourced from the electronic collections of the Central Scientific Library of the National Academy of Sciences of the Kyrgyz Republic. The corpus consists of plain text files with YAML front matter metadata (titles are written in the Common Turkic Alphabet) and contains 427,527 words in total: 338,937 in tales, 71,619 in proverbs, and 16,971 in aphorisms. Text was extracted with OCR and an LLM, then proofread by human editors to ensure accuracy and remove any LLM hallucinations.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None

Forbidden Usage

None

Metadata

Overview

This corpus contains a curated collection of Kyrgyz folklore texts, including fairy tales, magical tales, tales of everyday life, proverbs, sayings, and aphorisms. The texts have been digitized from academic volumes published in Bishkek between 2016 and 2017.

Statistics

  • Total Word Count: 427,527

  • Files:

    • tales.txt: 338,937 words (399 items)

    • proverbs.txt: 71,619 words (1 item block)

    • aphorisms.txt: 16,971 words (1 item block)
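
The per-file figures above add up to the stated total. The short sketch below checks that arithmetic and offers a hypothetical `count_words` helper for recounting a file locally. Whitespace tokenization is an assumption here: the card does not say how words were counted, so a local recount may differ from the published numbers.

```python
# Per-file word counts as published in the Statistics section.
PUBLISHED_COUNTS = {
    "tales.txt": 338_937,
    "proverbs.txt": 71_619,
    "aphorisms.txt": 16_971,
}
PUBLISHED_TOTAL = 427_527


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


# Sum the per-file counts and compare with the published total.
total = sum(PUBLISHED_COUNTS.values())
print(total == PUBLISHED_TOTAL)  # → True
```

Applied to a raw file, `count_words` would also count the YAML front matter lines, so strip those first if the goal is to count only the folklore text.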

Sources

The following academic volumes served as the source material for this corpus:

Tales (tales.txt):

  • Sadyrbek kyzy, G. (Comp.). (2017). El adabiyaty: Jomoktor [Kyrgyz Folklore: Fairy Tales] (Vol. 21). Bishkek: Print-Express.

  • Osmonalieva, T. (Comp.). (2017). El adabiyaty: Keremettüü jomoktor [Kyrgyz Folklore: Magical Tales] (Vol. 22). Bishkek: Print-Express.

  • Mederalieva, J. (Comp.). (2017). El adabiyaty: Turmushtuk jomoktor [Kyrgyz Folklore: Tales of Everyday Life] (Vol. 23). Bishkek: Print-Express.

Proverbs (proverbs.txt):

  • Akmataliev, A., Kasymgeldieva, M., & Toychubek kyzy, J. (Comps.). (2016). El adabiyaty: Er Eshim, Makal-lakaptar [Kyrgyz Folklore: Er Eshim, Proverbs and Sayings] (Vol. 13). Bishkek: Print-Express.

Aphorisms (aphorisms.txt):

  • Soltobaeva, K. B. (Comp.). (2017). El adabiyaty: Uchkul sozdor, Chechen sozdor, Tamsilder, Myskyldar [Kyrgyz Folklore: Aphorisms, Eloquence, Fables, Satire] (Vol. 25). Bishkek: Print-Express.

Data Format and Metadata

The files are provided in plain text format. Each entry within the files is delimited by a YAML Front Matter block.

Front Matter Structure

For Tales, the metadata is granular and specific to the individual text:

---
title: "Aqılduu balapan"
lang: "ky"
source_container: "Sadyrbek kyzy, G. (Comp.). (2017). El adabiyaty: Jomoktor [Kyrgyz Folklore: Fairy Tales] (Vol. 21). Bishkek: Print-Express."
---

Field Definitions & Data Notes:

  • title: The title of the work, written using the Common Turkic Alphabet.

  • lang: Language code ('ky' for Kyrgyz).

  • source_container: The volume from which this specific digital version was extracted.
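
A minimal sketch of how entries could be split apart, based on the front matter sample above. It assumes that every entry starts with a `---` line, that the metadata consists of simple `key: "value"` lines, that a second `---` closes the block, and that the body never contains a line consisting only of `---`. The function name `iter_entries` and the `text` key are illustrative and not part of the dataset.

```python
from typing import Iterator


def iter_entries(text: str) -> Iterator[dict]:
    """Yield one dict per entry: front matter fields plus the body under 'text'."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].strip() != "---":
            i += 1  # skip anything before the first front matter block
            continue
        # Read 'key: "value"' lines up to the closing '---'.
        meta = {}
        i += 1
        while i < len(lines) and lines[i].strip() != "---":
            # Split on the first colon only: values may contain colons.
            key, sep, value = lines[i].partition(":")
            if sep:
                meta[key.strip()] = value.strip().strip('"')
            i += 1
        i += 1  # step past the closing '---'
        # The body runs until the next '---' line or the end of the file.
        body = []
        while i < len(lines) and lines[i].strip() != "---":
            body.append(lines[i])
            i += 1
        meta["text"] = "\n".join(body).strip()
        yield meta
```

Usage might look like `entries = list(iter_entries(open("tales.txt", encoding="utf-8").read()))`, where UTF-8 encoding is also an assumption. A full YAML parser would be more robust if the front matter ever uses escapes or multi-line values.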

Processing Methodology

This dataset was created through the following pipeline:

  1. Acquisition: Source PDFs were obtained from the electronic collections of the Центральная научная библиотека Национальной академии наук Кыргызской Республики (Central Scientific Library of the National Academy of Sciences of the Kyrgyz Republic) at http://cslnaskr.krena.kg/collections/ru/collection/161/.

  2. OCR & Extraction: Text was extracted using the tesseract-kir package and an LLM.

  3. Proofreading: The resulting texts were manually marked up with YAML front matter and proofread by human editors to ensure accuracy and strictly exclude LLM hallucinations.

Copyright and License

Folklore Status: According to Article 8 of the Law of the Kyrgyz Republic "On Copyright and Related Rights", works of folk art (folklore) are not subject to copyright.

Publisher Commentary: Scientific commentaries, introductions, and footnotes provided by the compilers/publishers are copyrighted material. Care was taken to exclude all such commentary from this corpus. This dataset contains only the folklore text proper.

Usage: The texts provided are in the public domain, but we would appreciate attribution where you reference this Mozilla Data Collective dataset.