KyrgyzNER: Human-Annotated NER Dataset for Kyrgyz
License:
CC-BY-NC-SA-4.0
Steward:
Akylai
Task: NLP
Release Date: 11/27/2025
Format: CoNLL-2003
Size: 585.87 KB
Description
KyrgyzNER is the first manually annotated Named-Entity Recognition (NER) dataset for the Kyrgyz language. It consists of 1,499 news articles (10,900 sentences, 140k tokens) from the 24.kg news portal, annotated with 39,075 entity mentions across 27 classes.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
You agree that you will not re-host or re-share this dataset
Processes
Intended Use
Training and fine-tuning of Named Entity Recognition models
Metadata
Overview
KyrgyzNER is the first manually annotated Named-Entity Recognition (NER) dataset for the Kyrgyz language.
It consists of 1,499 news articles (10,900 sentences, 140k tokens) from the 24.kg news portal, annotated with 39,075 entity mentions across 27 classes.
Akylai provides both the dataset and baseline models, aiming to advance NLP research for low-resource Turkic languages.
Key Features
📑 Dataset:
1,499 Kyrgyz news articles (2017–2022)
10,900 sentences, 39,075 entity mentions
27 entity categories (Person, Location, Institution, Period, etc.)
Format: CoNLL-2003
🛠 Annotation:
Annotated by 59 trained Kyrgyz linguists and students
Guidelines adapted from GROBID-NER
High inter-annotator agreement (κ = 0.89)
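Since the data is distributed in CoNLL-2003 format, a small reader is usually enough to load it. The following is a minimal sketch, not the official loader: it assumes one token per line with whitespace-separated columns, the token in the first column and the NER tag in the last, blank lines between sentences, and optional `-DOCSTART-` marker lines. The sample sentences and the label strings (`B-LOCATION`, `B-INSTITUTION`) are made up for illustration; check the actual files for the real column layout and tag names.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: the token is the first column and the NER tag is the last.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; the tag names are assumptions, not taken from the dataset.
sample = """-DOCSTART- -X- -X- O

Бишкек B-LOCATION
шаары O
. O

Кыргыз B-INSTITUTION
Республикасынын I-INSTITUTION
Өкмөтү I-INSTITUTION
""".splitlines()

parsed = read_conll(sample)
print(len(parsed), "sentences")
```

To read a real file, pass an open file handle instead of `sample`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the split files in the download.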
Dataset Statistics
| Split | Docs | Sentences | Tokens | Mentions |
|---|---|---|---|---|
| Train | 999 | 7,033 | 89,248 | 24,949 |
| Test | 500 | 3,867 | 51,118 | 14,126 |
| Total | 1,499 | 10,900 | 140,366 | 39,075 |
Most frequent classes: Person, Location, Institution, Measure
Rare classes (few samples): Award, Animal, Substance, Identifier
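Per-class mention counts like those summarized above can be recomputed from the tag sequences. The sketch below is one possible way to do it, assuming the tags follow a BIO-style scheme (`B-X` opens a mention, `I-X` continues it, `O` is outside any mention); the tagging scheme and the label strings used here are assumptions, so verify them against the files before relying on the numbers.

```python
from collections import Counter
from typing import Iterable, List


def count_mentions(tag_sequences: Iterable[List[str]]) -> Counter:
    """Count entity mentions per class from BIO-style tag sequences."""
    counts: Counter = Counter()
    for tags in tag_sequences:
        previous_class = None  # class of the mention currently open, if any
        for tag in tags:
            if tag == "O":
                previous_class = None
                continue
            prefix, _, entity_class = tag.partition("-")
            # A new mention starts at a B- tag, or at an I- tag that does
            # not continue a mention of the same class.
            if prefix == "B" or entity_class != previous_class:
                counts[entity_class] += 1
            previous_class = entity_class
    return counts


# Hypothetical tag sequences; the label names are illustrative only.
example_tags = [
    ["B-PERSON", "I-PERSON", "O", "B-LOCATION"],
    ["B-LOCATION", "O", "B-MEASURE", "I-MEASURE"],
]

mention_counts = count_mentions(example_tags)
for entity_class, n in mention_counts.most_common():
    print(entity_class, n)
```

Because an `I-` tag that follows `O` or a different class is also treated as the start of a mention, the same function gives sensible counts if the files turn out to use the older IOB1 convention.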
Contribution
We are grateful to:
59 volunteers (mainly students of KSTU) who annotated the dataset
Dr. Gulnara Kabaeva and Dr. Gulira Zhumalieva for academic support
For the list of contributors, please see the volunteers.md file included in the dataset.
Citation
If you use this dataset in your research, please cite:
@inproceedings{turatali2025kyrgyzner,
  title = {Human-Annotated NER Dataset for the Kyrgyz Language},
  author = {Turatali, Timur and Alekseev, Anton and Jumalieva, Gulira and Kabaeva, Gulnara and Nikolenko, Sergey},
  booktitle = {Proceedings of TurkLang 2025},
  year = {2025}
}
License
Dataset: CC BY-NC-SA 4.0
Code & Models: MIT license
👉 Full details are available in our paper.
