Dolgan Folklore Text Corpus

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None

Forbidden Usage

None

Metadata

Overview

This corpus contains a curated collection of Dolgan folklore texts, specifically fairy tales. The texts have been digitized from an academic volume published in Novosibirsk in 2000.

About the Dolgan People and Language

The Dolgans are the northernmost Turkic-speaking people in the world, primarily inhabiting the Taymyr Peninsula in the Russian Arctic. Their language, Dolgan, is closely related to Yakut (Sakha) but developed its own distinct characteristics due to geographic isolation and heavy linguistic influence from neighboring Evenki populations. With only a few thousand remaining speakers, Dolgan is considered a highly endangered language.

Beyond cultural preservation, this corpus provides clean, structured data designed to catalyze machine learning and natural language processing (NLP) research for low-resource languages. By making these texts machine-readable, we hope to enable researchers to train language models, build educational applications, and develop digital tools that will empower the community to actively learn, use, and revitalize the Dolgan language in the modern digital landscape.

Statistics & Linguistic Data

Total Word Count: 15618
Files:
- `tales.txt`: 15618 words (19 items)

Corpus Alphabet: Based on programmatic analysis of the text content (excluding metadata), the following characters are utilized in this corpus:

А, Б, В, Г, Д, И, Й, К, Л, М, Н, О, П, Р, С, Т, У, Ч, Ы, Ь, Э, Ҥ, Ү, Һ, Ө, а, б, г, д, е, з, и, й, к, л, м, н, о, п, р, с, т, у, х, ч, ш, ъ, ы, ь, э, ю, ҥ, ү, һ, ө
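
The card does not say how the word count and alphabet were computed. As a rough illustration only, the sketch below removes the front matter blocks, counts whitespace-separated tokens, and collects every alphabetic character. The function names, the whitespace tokenization, and the assumption that front matter is fenced by lines consisting of exactly `---` are ours, not a description of the original analysis, so the result may differ from the figures above.

```python
import re

# Assumption: each metadata block is fenced by lines that are exactly "---".
FRONT_MATTER = re.compile(r"^---\n.*?\n---\n", re.DOTALL | re.MULTILINE)


def strip_front_matter(raw: str) -> str:
    """Return the corpus text with all front matter blocks removed."""
    return FRONT_MATTER.sub("", raw)


def corpus_stats(raw: str) -> tuple[int, list[str]]:
    """Count whitespace-separated tokens and list the distinct letters used."""
    body = strip_front_matter(raw)
    words = body.split()  # hypothetical tokenization: split on whitespace
    alphabet = sorted({ch for ch in body if ch.isalpha()})  # code point order
    return len(words), alphabet
```

To try it on the corpus file, read `tales.txt` as UTF-8 and pass the contents to `corpus_stats`. Punctuation attached to a word is counted as part of that token, which is one reason a different tokenizer could report a slightly different total.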

Sources

The following academic volume served as the source material for this corpus:

Fairy Tales (`tales.txt`):

Yefremov, P. E. (Comp.). (2000). Фольклор долган [Dolgan Folklore]. Novosibirsk: Издательство Института археологии и этнографии Сибирского отделения Российской академии наук [Publishing House of the Institute of Archaeology and Ethnography of the Siberian Branch of the Russian Academy of Sciences] (Памятники фольклора народов Сибири и Дальнего Востока; Том 19 [Monuments of Folklore of the Peoples of Siberia and the Far East; Vol. 19]).

Acknowledgments

We extend our sincere gratitude to Karina Sheifer, Research Scientist and Lecturer of Linguistics at Dartmouth College. Her expertise in endangered Siberian Turkic languages is highly relevant to this work, and she graciously provided the digitized Dolgan source text for 17 of the 19 tales in this collection:

ДЬОЛЛООК КҮН, ҺОРДООК КҮН
КААМЫЫЛААК
КУРПААСКЫ КУОКАА ҺЭРИИТЭ
КӨТӨР КЫЫЛ ҺЭРИИТЭ
ЛААЙКУ
ЛЫЫБЫРА
ОГОННЬОР, УСКААТТАР, КЫРСАЛАР
ОГОННЬОР ОНУГА ЭМЭЭКСИН
ПААСЫНАЙ ОГОННЬОРДООК ЭМЭЭКСИН
ТААЛ ЭМЭЭКСИН
ТИМИР ҺААПКА ОНУГА ҺЭЛИИ КУР
УКУКУУТ-ЧУКУКУУТ ОГОННЬОР ОНУГА ҺАҺЫА
УОЛ ЫРААКТААГЫ ОНУГА ПААСЫНАЙ КЫЫҺА
ҮС УОЛ
ҺАҺЫЛ АЛБУН
ҺЭТТЭ УОЛЛААК УМНАҺЫТ БААҺЫНАЙ
ЭМЭЭКСИН ОНУГА ҺАҺЫЛ ТИРИИТЭ

The remaining two texts in this corpus were digitized and proofread by us (Taruen):

ОГОННЬОР ОНУГА УСКААТТАР
ЛЭҤКЭЙ

Any errors that may remain in the texts or the metadata formatting of this corpus are entirely our own responsibility.

Data Format and Metadata

The files are provided in plain text format. Each entry begins with a YAML front matter block, which also separates it from the preceding entry.

Front Matter Structure

The metadata is granular and specific to the individual text:

---
title: "ДЬОЛЛООК КҮН, ҺОРДООК КҮН"
lang: "dlg"
year: "1936"
source_original: "Recorded in December of the year 1936 near Lake Melkoye..."
source_container: "Yefremov, P. E. (Comp.). (2000). Фольклор долган..."
---

Field Definitions:

title: The title of the work.
lang: Language code ('dlg' for Dolgan).
year: The year of the original recording.
source_original: The specific origin of the text, including the collector and performer.
source_container: The volume from which this specific digital version was extracted.
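
A file laid out this way can be split into entries with the standard library alone. The following is a minimal sketch, assuming that every entry starts with a `---` line, that the front matter holds only flat `key: "value"` pairs as in the example above, and that no tale contains a line consisting of exactly `---`. The names `Tale`, `parse_front_matter`, and `parse_corpus` are hypothetical and not part of the dataset.

```python
import re
from dataclasses import dataclass

# Assumption: a front matter block ("---" ... "---") is followed by the tale
# text, which runs until the next front matter block or the end of the file.
ENTRY = re.compile(
    r"^---\n(.*?)\n---\n(.*?)(?=^---\n|\Z)", re.DOTALL | re.MULTILINE
)


@dataclass
class Tale:
    meta: dict[str, str]  # front matter fields such as title, lang, year
    text: str             # the Dolgan text of the tale


def parse_front_matter(block: str) -> dict[str, str]:
    """Parse flat key: "value" lines; not a general YAML parser."""
    meta = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta


def parse_corpus(raw: str) -> list[Tale]:
    """Split the raw file contents into Tale records."""
    return [
        Tale(parse_front_matter(m.group(1)), m.group(2).strip())
        for m in ENTRY.finditer(raw)
    ]
```

Calling `parse_corpus` on the UTF-8 contents of `tales.txt` should yield one record per tale if these assumptions hold. The hand-rolled parser does not handle escaped quotes or nested YAML; a real YAML library would be the more robust choice if the metadata ever becomes more complex.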

Copyright and License

Folklore Status: Under Article 1259 of the Civil Code of the Russian Federation, works of folklore that have no specific authors are not subject to copyright.

Note on Translations: The original academic volume (Volume 19 of Памятники фольклора народов Сибири и Дальнего Востока) includes parallel Russian translations of all these tales, which Karina Sheifer has also digitized. However, because modern literary translations are subject to copyright, we have intentionally excluded them from this archive. This dataset contains only the original Dolgan folklore texts, which are in the public domain.

Publisher Commentary: Scientific commentaries, introductions, and footnotes provided by the compilers/publishers are copyrighted material. Care was taken to exclude all such commentary from this corpus.

Usage: The texts provided are in the public domain, but we would appreciate attribution when you reference this Mozilla Data Collective dataset.
