RFE/RL Tatar-Bashkir News Text Corpus

License:

CC-BY-NC-SA-4.0

Steward:

RFERL

Task: NLP

Release Date: 1/16/2026

Format: TXT

Size: 102.44 MB



Description

This dataset is a longitudinal news corpus for the Tatar and Bashkir languages, sourced from Azatlıq Radiosı (azatliq.org), the Tatar-Bashkir service of Radio Free Europe/Radio Liberty (RFE/RL). It spans December 2001 to December 2025 and contains over 105,000 unique articles. Most of the data is in Tatar (~103k articles, ~30M tokens), with much smaller subsets in Bashkir (~1.3k articles) and Russian (~850 articles, mostly educational material on Tatar culture and language). The corpus covers more than two decades of the region's written language and includes content in both Cyrillic and Latin scripts. The files are separated by language and formatted as plain text with YAML front-matter metadata, which makes them suitable for linguistic analysis, search indexing, and cultural preservation research.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

- Mandatory Attribution: When using this content in full or in part, you must credit RFE/RL by including a permanent link to the original article and the following text: "Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty."

- Translation Requirements: If you translate this content, you must state the original language, provide a link to the original content, and ensure the translation does not alter or distort the meaning, name, or integrity of the content.

- Integrity: You must refrain from altering or distorting the meaning, name, or integrity of the product.

Forbidden Usage

- No AI Training: It is strictly forbidden to use this dataset to train artificial intelligence (AI) systems, including Large Language Models (LLMs), chatbots, or other machine learning models.

- No Commercial Sale: The sale of this content, in whole or in part, is prohibited.

- No Advertising: The use of this content in advertisements or endorsements is prohibited.

- No Alteration: It is prohibited to employ this content in any way that compromises its privacy, confidentiality, or integrity.

Metadata

RFE/RL Tatar-Bashkir News Text Corpus (2001–2025)

Overview

This corpus was extracted from the archives of Azatlıq Radiosı (azatliq.org), the Tatar-Bashkir Service of Radio Free Europe/Radio Liberty. It provides a rare, high-volume resource for low-resource Turkic languages.

Statistics

  • Total Articles: 105,556

  • Time Period: 2001-12 to 2025-12

  • Languages:

    • Tatar (tt): 103,425 articles (~29,602,211 tokens)

    • Bashkir (ba): 1,279 articles (~134,340 tokens)

    • Russian (ru): 852 articles (~414,478 tokens), mostly educational material on Tatar culture/language.
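
As a quick consistency check on the figures above, the following sketch sums the per-language counts and derives each language's share of articles and its average article length. The numbers are copied from the list; the variable and function names are illustrative only and are not part of the dataset.

```python
# Article and token counts copied from the Statistics list.
counts = {
    "tt": {"articles": 103425, "tokens": 29602211},
    "ba": {"articles": 1279, "tokens": 134340},
    "ru": {"articles": 852, "tokens": 414478},
}

total_articles = sum(c["articles"] for c in counts.values())
total_tokens = sum(c["tokens"] for c in counts.values())


def summarize(per_language):
    """Return each language's share of articles and mean tokens per article."""
    n_articles = sum(c["articles"] for c in per_language.values())
    return {
        lang: {
            "share": c["articles"] / n_articles,
            "tokens_per_article": c["tokens"] / c["articles"],
        }
        for lang, c in per_language.items()
    }


summary = summarize(counts)
for lang, row in summary.items():
    print(f"{lang}: {row['share']:.2%} of articles, "
          f"~{row['tokens_per_article']:.0f} tokens per article")
```

The per-language article counts add up to the stated total of 105,556. Token counts are given as approximate in the source, so the per-article averages should be read as rough figures.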

Note on Processing:

  • Language Detection: The language of each article was identified automatically using the pycld2 Python package (version 0.42).

  • Paragraph Structure: Paragraph breaks from the original HTML were preserved to the extent possible.

  • Formatting: Text has been wrapped at 80 characters for easier inspection in terminal environments. This wrapping is done strictly on whitespace; no words were split or chunked apart.
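
Because the wrapping falls only on whitespace, it can be undone for tools that expect one paragraph per line. The sketch below is a minimal illustration, not part of the dataset tooling. It assumes that paragraphs are separated by blank lines (the card says only that paragraph breaks were preserved where possible), and the function name `unwrap_paragraphs` is hypothetical.

```python
import re
import textwrap


def unwrap_paragraphs(body: str) -> list[str]:
    """Join hard-wrapped lines back into one string per paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    blocks = re.split(r"\n\s*\n", body.strip())
    # Collapse every run of whitespace (including line breaks) to one space.
    return [" ".join(block.split()) for block in blocks if block.strip()]


# Build a small wrapped sample the same way the card describes:
# 80 columns, breaking only on whitespace.
paragraphs = [
    "word " * 30 + "end",
    "A second, shorter paragraph.",
]
wrapped = "\n\n".join(
    textwrap.fill(p, width=80, break_long_words=False, break_on_hyphens=False)
    for p in paragraphs
)

restored = unwrap_paragraphs(wrapped)
print(len(restored), "paragraphs restored")
```

Any runs of multiple spaces in the original HTML are not recoverable this way, since they are collapsed to single spaces when the lines are rejoined.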

Data Format

The dataset is provided as separate text files for each language:

  • azatliq.tt.txt (Tatar)

  • azatliq.ba.txt (Bashkir)

  • azatliq.ru.txt (Russian)

Inside the files, each article is delimited by a YAML Front Matter block containing metadata, followed by the full article text.

Metadata Fields

  • url: The canonical URL of the original article.

  • title: The headline of the article.

  • date: Publication date (ISO 8601 format: YYYY-MM-DD).

  • script: The writing system used (cyrl for Cyrillic, latn for Latin).

  • lang: The detected language code (tt for Tatar, ba for Bashkir, ru for Russian).
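
The following sketch shows one way to split a corpus file into (metadata, body) pairs. It rests on several assumptions that the card does not confirm: that each front-matter block opens and closes with a line containing only `---`, that the metadata lines are simple `key: value` pairs, and that no line of article text consists solely of `---`. The function name `iter_articles` and the sample URLs are hypothetical. A full YAML parser may be the safer choice if titles contain quoting or escapes.

```python
from typing import Iterator


def iter_articles(text: str) -> Iterator[tuple[dict[str, str], str]]:
    """Yield (metadata, body) pairs from the contents of one corpus file.

    Assumes '---' delimiter lines and flat 'key: value' metadata.
    """
    lines = text.splitlines()
    i, n = 0, len(lines)
    while i < n:
        if lines[i].strip() != "---":
            i += 1  # skip anything before the first opening delimiter
            continue
        i += 1
        meta: dict[str, str] = {}
        # Read metadata lines up to the closing delimiter.
        while i < n and lines[i].strip() != "---":
            key, sep, value = lines[i].partition(":")
            if sep:
                # Split on the first colon only, so URLs stay intact.
                meta[key.strip()] = value.strip().strip("\"'")
            i += 1
        i += 1  # step past the closing '---'
        body: list[str] = []
        # The body runs until the next opening delimiter or end of file.
        while i < n and lines[i].strip() != "---":
            body.append(lines[i])
            i += 1
        yield meta, "\n".join(body).strip()


# Hypothetical sample in the assumed layout; not real corpus content.
sample = """---
url: https://www.azatliq.org/a/example-1.html
title: First example headline
date: 2001-12-05
script: latn
lang: tt
---
First paragraph, line one
and line two.

Second paragraph.
---
url: https://www.azatliq.org/a/example-2.html
title: Second example headline
date: 2025-12-01
script: cyrl
lang: tt
---
Only paragraph.
"""

articles = list(iter_articles(sample))
cyrillic = [meta["url"] for meta, _ in articles if meta["script"] == "cyrl"]
print(len(articles), "articles,", len(cyrillic), "in Cyrillic script")
```

Because dates are in YYYY-MM-DD form, they sort correctly as plain strings, so filtering by period (for example `meta["date"] >= "2010-01-01"`) needs no date parsing.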

Source & License

All content is the property of RFE/RL, Inc. and is protected by U.S. and international copyright laws.

Users of this dataset must adhere to the RFE/RL Terms of Use. Specifically, users must credit RFE/RL in any reuse:

Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty.