KyrgyzNER: Human-Annotated NER Dataset for Kyrgyz

License:

CC-BY-NC-SA-4.0

Steward:

Akylai

Task: NLP

Release Date: 11/27/2025

Format: CoNLL-2003

Size: 585.87 KB


Description

KyrgyzNER is the first manually annotated Named-Entity Recognition (NER) dataset for the Kyrgyz language. It consists of 1,499 news articles (10,900 sentences, 140,366 tokens) from the 24.kg news portal, annotated with 39,075 entity mentions across 27 classes.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

You agree that you will not re-host or re-share this dataset.

Processes

Intended Use

Training and fine-tuning of Named Entity Recognition models

Metadata

Paper

Overview

Akylai provides both the dataset and baseline models, aiming to advance NLP research for low-resource Turkic languages.

Key Features

  • 📑 Dataset:

    • 1,499 Kyrgyz news articles (2017–2022)

    • 10,900 sentences, 39,075 entity mentions

    • 27 entity categories (Person, Location, Institution, Period, etc.)

    • Format: CoNLL-2003

  • 🛠 Annotation:

    • Annotated by 59 trained Kyrgyz linguists and students

    • Guidelines adapted from GROBID-NER

    • Inter-annotator agreement: κ = 0.89
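Since the data is distributed in CoNLL-2003 format, a small reader is usually the first step before training. The sketch below assumes the common layout of that format: one token per line, whitespace-separated columns, a blank line between sentences, and optional `-DOCSTART-` lines between documents. It further assumes that the token is the first column and the NER tag is the last one; the exact column layout and tag names of the KyrgyzNER files are not stated here, so the sample text and labels are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-2003-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: token in the first column, NER tag in the last column.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; the tag names are illustrative, not taken from the dataset.
sample = """-DOCSTART- O

Бишкек B-LOCATION
шаары O

Айгүл B-PERSON
келди O
"""

parsed = read_conll(sample.splitlines())
print(len(parsed), parsed[0])
```

For a real file, the same function can be fed an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, where `path` is whatever location the downloaded split has on disk.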

Dataset Statistics

Split         Docs    Sentences   Tokens    Mentions
Train (999)   999     7,033       89,248    24,949
Test (500)    500     3,867       51,118    14,126
Total         1,499   10,900      140,366   39,075

  • Most frequent classes: Person, Location, Institution, Measure

  • Rare classes (few samples): Award, Animal, Substance, Identifier
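The class imbalance noted above can be checked on any split by counting mentions per class. The following is a minimal sketch, assuming the tags follow the IOB2 scheme, in which every mention begins with a `B-` tag; this card does not state which tagging scheme the files use, and under IOB1 the count would be too low. The tag sequences and label names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def count_mentions(tag_sequences: Iterable[List[str]]) -> Counter:
    """Count entity mentions per class, assuming IOB2 tags (B-X opens a mention)."""
    counts: Counter = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1  # strip the "B-" prefix to get the class name
    return counts


# Hypothetical tag sequences, one list per sentence.
tag_sequences = [
    ["B-PERSON", "I-PERSON", "O", "B-LOCATION"],
    ["B-PERSON", "O", "B-AWARD"],
]

counts = count_mentions(tag_sequences)
for label, n in counts.most_common():
    print(label, n)
```

Sorting with `most_common()` puts frequent classes first, so rare classes that may need special handling during training or evaluation appear at the end of the list.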

Contribution

We are grateful to:

  • 59 volunteers (mainly students of KSTU) who annotated the dataset

  • Dr. Gulnara Kabaeva and Dr. Gulira Zhumalieva for academic support

For the list of contributors, please see the volunteers.md file included in the dataset.

Citation

If you use this dataset in your research, please cite:

@inproceedings{turatali2025kyrgyzner,
  title     = {Human-Annotated NER Dataset for the Kyrgyz Language},
  author    = {Turatali, Timur and Alekseev, Anton and Jumalieva, Gulira and Kabaeva, Gulnara and Nikolenko, Sergey},
  booktitle = {Proceedings of TurkLang 2025},
  year      = {2025}
}

License

  • Dataset: CC BY-NC-SA 4.0

  • Code & Models: MIT license

👉 Full details are available in our paper.