Dolgan Folklore Text Corpus
License: CC0-1.0
Steward: Taruen
Task: NLP
Release Date: 2/24/2026
Format: TXT
Size: 57.15 KB
Description
This corpus contains a curated collection of 19 Dolgan fairy tales (15,618 words) digitized from a 2000 academic volume published in Novosibirsk. The Dolgans are the northernmost Turkic-speaking people, and their language is highly endangered. This dataset provides clean, structured data designed to catalyze machine learning and NLP research for low-resource languages, empowering the community to revitalize the Dolgan language. The corpus is structured in plain text files with YAML front matter metadata. It was created with the generous contribution of digitized texts from Karina Sheifer at Dartmouth College, with additional digitization and proofreading by Taruen.
Specifics
Considerations
Restrictions/Special Constraints
None
Forbidden Usage
None
Metadata
Dolgan Folklore Text Corpus
Overview
This corpus contains a curated collection of Dolgan folklore texts, specifically fairy tales. The texts have been digitized from an academic volume published in Novosibirsk in 2000.
About the Dolgan People and Language
The Dolgans are the northernmost Turkic-speaking people in the world, primarily inhabiting the Taymyr Peninsula in the Russian Arctic. Their language, Dolgan, is closely related to Yakut (Sakha) but developed its own distinct characteristics due to geographic isolation and heavy linguistic influence from neighboring Evenki populations. With only a few thousand remaining speakers, Dolgan is considered a highly endangered language.
Beyond cultural preservation, this corpus provides clean, structured data designed to catalyze machine learning and natural language processing (NLP) research for low-resource languages. By making these texts machine-readable, we hope to enable researchers to train language models, build educational applications, and develop digital tools that will empower the community to actively learn, use, and revitalize the Dolgan language in the modern digital landscape.
Statistics & Linguistic Data
Total Word Count: 15618
Files:
`tales.txt`: 15618 words (19 items)
Corpus Alphabet: Based on programmatic analysis of the text content (excluding metadata), the following characters are utilized in this corpus:
А, Б, В, Г, Д, И, Й, К, Л, М, Н, О, П, Р, С, Т, У, Ч, Ы, Ь, Э, Ҥ, Ү, Һ, Ө, а, б, г, д, е, з, и, й, к, л, м, н, о, п, р, с, т, у, х, ч, ш, ъ, ы, ь, э, ю, ҥ, ү, һ, ө
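The exact procedure behind the word count and the character inventory above is not documented here, so the following is only a minimal sketch of one way such figures could be derived. It assumes that words are whitespace-separated tokens and that the alphabet is the set of Unicode letters found in the tale bodies; the function names `word_count` and `corpus_alphabet` are hypothetical, and results may differ from the published numbers if a different tokenization was used.

```python
from typing import List


def word_count(body_text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(body_text.split())


def corpus_alphabet(body_text: str) -> List[str]:
    """Return the distinct letters in the text, sorted by Unicode code point.

    Punctuation, digits, and whitespace are dropped via str.isalpha().
    The ordering here is by code point and need not match the listing above.
    """
    return sorted({ch for ch in body_text if ch.isalpha()})


# Tiny demonstration on two tale titles taken from this document.
demo = "ҮС УОЛ, ЛААЙКУ"
print(word_count(demo))
print(", ".join(corpus_alphabet(demo)))
```

Both functions expect the tale text only, with the front matter metadata already removed.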
Sources
The following academic volume served as the source material for this corpus:
Fairy Tales (`tales.txt`):
Yefremov, P. E. (Comp.). (2000). Фольклор долган [Dolgan Folklore]. Novosibirsk: Издательство Института археологии и этнографии Сибирского отделения Российской академии наук [Publishing House of the Institute of Archaeology and Ethnography of the Siberian Branch of the Russian Academy of Sciences] (Памятники фольклора народов Сибири и Дальнего Востока; Том 19 [Monuments of Folklore of the Peoples of Siberia and the Far East; Vol. 19]).
Acknowledgments
We extend our sincere gratitude to Karina Sheifer, Research Scientist and Lecturer of Linguistics at Dartmouth College. Her expertise in endangered Siberian Turkic languages is highly relevant to this work, and she graciously provided the digitized Dolgan source text for 17 of the 19 tales in this collection:
ДЬОЛЛООК КҮН, ҺОРДООК КҮН
КААМЫЫЛААК
КУРПААСКЫ КУОКАА ҺЭРИИТЭ
КӨТӨР КЫЫЛ ҺЭРИИТЭ
ЛААЙКУ
ЛЫЫБЫРА
ОГОННЬОР, УСКААТТАР, КЫРСАЛАР
ОГОННЬОР ОНУГА ЭМЭЭКСИН
ПААСЫНАЙ ОГОННЬОРДООК ЭМЭЭКСИН
ТААЛ ЭМЭЭКСИН
ТИМИР ҺААПКА ОНУГА ҺЭЛИИ КУР
УКУКУУТ-ЧУКУКУУТ ОГОННЬОР ОНУГА ҺАҺЫА
УОЛ ЫРААКТААГЫ ОНУГА ПААСЫНАЙ КЫЫҺА
ҮС УОЛ
ҺАҺЫЛ АЛБУН
ҺЭТТЭ УОЛЛААК УМНАҺЫТ БААҺЫНАЙ
ЭМЭЭКСИН ОНУГА ҺАҺЫЛ ТИРИИТЭ
The remaining two texts in this corpus were digitized and proofread by us (Taruen):
ОГОННЬОР ОНУГА УСКААТТАР
ЛЭҤКЭЙ
Any errors that may remain in the texts or the metadata formatting of this corpus are entirely our own responsibility.
Data Format and Metadata
The files are provided in plain text format. Each entry within the files is delimited by a YAML Front Matter block.
Front Matter Structure
The metadata is granular and specific to the individual text:
---
title: "ДЬОЛЛООК КҮН, ҺОРДООК КҮН"
lang: "dlg"
year: "1936"
source_original: "Recorded in December of the year 1936 near Lake Melkoye..."
source_container: "Yefremov, P. E. (Comp.). (2000). Фольклор долган..."
---
Field Definitions:
title: The title of the work.
lang: Language code ('dlg' for Dolgan).
year: The year of the original recording.
source_original: The specific origin of the text, including the collector and performer.
source_container: The volume from which this specific digital version was extracted.
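As an illustration of how the front matter could be read, here is a minimal sketch of a parser that uses only the Python standard library. It rests on several assumptions that the description above does not confirm: that each entry starts with a block enclosed by lines consisting solely of `---`, that every metadata line has the simple `key: "value"` shape shown in the example, and that no tale body contains a bare `---` line. The name `parse_corpus` and the placeholder bodies in the sample are hypothetical. For anything beyond this simple shape, a full YAML parser would be the safer choice.

```python
from typing import Dict, List, Tuple

Entry = Tuple[Dict[str, str], str]  # (metadata, tale body)


def parse_corpus(text: str) -> List[Entry]:
    """Split corpus text into (metadata, body) pairs.

    Assumes every entry opens with a front matter block fenced by '---' lines.
    """
    entries: List[dict] = []
    current = None
    in_meta = False
    for line in text.splitlines():
        if line.strip() == "---":
            if in_meta:
                in_meta = False  # closing fence: the body starts on the next line
            else:
                current = {"meta": {}, "body": []}  # opening fence: new entry
                entries.append(current)
                in_meta = True
            continue
        if current is None:
            continue  # ignore anything before the first front matter block
        if in_meta:
            # Split on the first colon only, so values may contain colons.
            key, sep, value = line.partition(":")
            if sep:
                current["meta"][key.strip()] = value.strip().strip('"')
        else:
            current["body"].append(line)
    return [(e["meta"], "\n".join(e["body"]).strip()) for e in entries]


# Sample input: metadata copied from the example above, bodies are placeholders.
SAMPLE = (
    '---\n'
    'title: "ДЬОЛЛООК КҮН, ҺОРДООК КҮН"\n'
    'lang: "dlg"\n'
    'year: "1936"\n'
    '---\n'
    'placeholder text of the first tale\n'
    '---\n'
    'title: "ЛААЙКУ"\n'
    'lang: "dlg"\n'
    '---\n'
    'placeholder text of the second tale\n'
)

for meta, body in parse_corpus(SAMPLE):
    print(meta.get("title"), "|", len(body.split()), "tokens")
```

To run this on the real file, read `tales.txt` as UTF-8 text (an assumption about its encoding) and pass the resulting string to `parse_corpus`.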
Copyright and License
Folklore Status: According to Article 1259 of the Civil Code of the Russian Federation, works of folklore that have no specific authors are not subject to copyright.
Note on Translations: The original academic volume (Volume 19 of Памятники фольклора народов Сибири и Дальнего Востока) includes parallel Russian translations of all these tales, which Karina Sheifer has also digitized. However, because modern literary translations are subject to copyright, we have intentionally excluded them from this archive. This dataset contains only the original Dolgan folklore texts, which are in the public domain.
Publisher Commentary: Scientific commentaries, introductions, and footnotes provided by the compilers/publishers are copyrighted material. Care was taken to exclude all such commentary from this corpus.
Usage: The texts provided are in the public domain, but we would appreciate attribution where you reference this Mozilla Data Collective dataset.