RFE/RL Tatar-Bashkir News Text Corpus
License:
CC-BY-NC-SA-4.0
Steward:
RFERL
Task: NLP
Release Date: 1/16/2026
Format: TXT
Size: 102.44 MB
Description
This dataset is a longitudinal news corpus for the Tatar and Bashkir languages, sourced from Azatliq Radiosi (azatliq.org), the Tatar-Bashkir service of Radio Free Europe/Radio Liberty (RFE/RL). Spanning December 2001 to December 2025, the corpus contains over 105,000 unique articles. The data is primarily in Tatar (~103k articles, ~30M tokens), with much smaller subsets in Bashkir (~1.3k articles) and Russian (~850 articles of contextual/educational content). The dataset captures the linguistic evolution of the region over more than two decades and includes content in both Cyrillic and Latin scripts. The files are separated by language and formatted as plain text with YAML front-matter metadata, making them ready for linguistic analysis, search indexing, and cultural preservation research.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
- Mandatory Attribution: When using this content in full or in part, you must credit RFE/RL by including a permanent link to the original article and the following text: "Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty."
- Translation Requirements: If you translate this content, you must state the original language, provide a link to the original content, and ensure the translation does not alter or distort the meaning, name, or integrity of the content.
- Integrity: You must refrain from altering or distorting the meaning, name, or integrity of the product.
Forbidden Usage
- No AI Training: It is strictly forbidden to use this dataset to train artificial intelligence (AI) systems, including Large Language Models (LLMs), chatbots, or other machine learning models.
- No Commercial Sale: The sale of this content, in whole or in part, is prohibited.
- No Advertising: The use of this content in advertisements or endorsements is prohibited.
- No Alteration: It is prohibited to employ this content in any way that compromises its privacy, confidentiality, or integrity.
Metadata
RFE/RL Tatar-Bashkir News Text Corpus (2001–2025)
Overview
This corpus was extracted from the archives of Azatlıq Radiosı (azatliq.org), the Tatar-Bashkir Service of Radio Free Europe/Radio Liberty. It provides a rare, high-volume resource for low-resource Turkic languages.
Statistics
Total Articles: 105556
Time Period: 2001-12 to 2025-12
Languages:
- Tatar (tt): 103425 articles (~29602211 tokens)
- Bashkir (ba): 1279 articles (~134340 tokens)
- Russian (ru): 852 articles (~414478 tokens). Mostly educational material on Tatar culture/language.
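As a quick consistency check on the figures above, the short sketch below sums the per-language counts and derives each language's share of the corpus. The numbers are copied from the Statistics list; the variable and function names are illustrative only.

```python
# Per-language counts copied from the Statistics section of this card.
counts = {
    "tt": {"articles": 103425, "tokens": 29602211},
    "ba": {"articles": 1279, "tokens": 134340},
    "ru": {"articles": 852, "tokens": 414478},
}

total_articles = sum(c["articles"] for c in counts.values())
total_tokens = sum(c["tokens"] for c in counts.values())


def share(lang: str) -> float:
    """Percentage of all articles that are in the given language."""
    return 100.0 * counts[lang]["articles"] / total_articles


if __name__ == "__main__":
    print(f"total articles: {total_articles}, total tokens: {total_tokens}")
    for lang, c in counts.items():
        mean_len = c["tokens"] / c["articles"]
        print(f"{lang}: {share(lang):.2f}% of articles, "
              f"~{mean_len:.0f} tokens per article")
```

The per-language article counts add up to the stated total of 105556, and Tatar accounts for roughly 98% of the articles.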
Note on Processing:
Language Detection: The language of each article was identified automatically using the pycld2 Python package (version 0.42).
Paragraph Structure: Paragraph breaks from the original HTML were preserved to the extent possible.
Formatting: Text has been wrapped at 80 characters for easier inspection in terminal environments. This wrapping is done strictly on whitespace; no words were split or chunked apart.
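The whitespace-only wrapping described above can be approximated with Python's standard `textwrap` module. This is a minimal sketch, not the tool used to build the corpus. `wrap_paragraph` and `unwrap_text` are hypothetical helper names, and `unwrap_text` assumes that paragraphs are separated by blank lines, which this card does not state explicitly.

```python
import textwrap


def wrap_paragraph(paragraph: str, width: int = 80) -> str:
    """Wrap on whitespace only: never split a word or break at a hyphen."""
    return textwrap.fill(
        paragraph,
        width=width,
        break_long_words=False,   # an over-long word stays on its own line
        break_on_hyphens=False,   # hyphenated words are kept whole
    )


def unwrap_text(body: str) -> list[str]:
    """Rejoin wrapped lines into paragraphs.

    Assumption: paragraphs are separated by one blank line.
    """
    paragraphs = []
    for block in body.split("\n\n"):
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return paragraphs
```

Because the wrapping never splits words, joining the lines of a paragraph with single spaces should recover the original running text, which is usually what tokenizers and search indexers expect.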
Data Format
The dataset is provided as separate text files for each language:
- azatliq.tt.txt (Tatar)
- azatliq.ba.txt (Bashkir)
- azatliq.ru.txt (Russian)
Inside the files, each article is delimited by a YAML Front Matter block containing metadata, followed by the full article text.
Metadata Fields
- url: The canonical URL of the original article.
- title: The headline of the article.
- date: Publication date (ISO 8601 format: YYYY-MM-DD).
- script: The writing system used (cyrl for Cyrillic, latn for Latin).
- lang: The detected language code (tt for Tatar, ba for Bashkir, ru for Russian).
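A reader for this layout might look like the sketch below. It rests on several assumptions that this card does not spell out: that each front-matter block is opened and closed by a line containing only `---` (the usual YAML front-matter convention), that every field is a single-line `key: value` pair, and that no article body contains a bare `---` line. The function name `iter_articles` is hypothetical. For full YAML handling (escapes, multi-line values), a real YAML parser would be the safer choice.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Field names as listed under "Metadata Fields".
FIELDS = {"url", "title", "date", "script", "lang"}


def iter_articles(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, body) pairs from an iterable of text lines.

    Assumption: a line consisting only of '---' opens and closes each
    front-matter block; everything up to the next block is article text.
    """
    meta: Optional[Dict[str, str]] = None
    body: List[str] = []
    in_header = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "---":
            if in_header:
                in_header = False          # closing delimiter
            else:
                if meta is not None:       # flush the previous article
                    yield meta, "\n".join(body).strip()
                meta, body, in_header = {}, [], True
            continue
        if in_header:
            # Split on the first colon only, so URLs and titles keep theirs.
            key, sep, value = line.partition(":")
            if sep and key.strip() in FIELDS:
                meta[key.strip()] = value.strip().strip("\"'")
        elif meta is not None:
            body.append(line)
    if meta is not None:
        yield meta, "\n".join(body).strip()
```

Passing an open file handle works directly, for example `iter_articles(open("azatliq.tt.txt", encoding="utf-8"))`, assuming the files are UTF-8 encoded. Each yielded `meta` dictionary can then be filtered on `script` or `date` before the body is processed.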
Source & License
All content is the property of RFE/RL, Inc. and is protected by U.S. and international copyright laws.
Users of this dataset must adhere to the RFE/RL Terms of Use. Specifically, users must credit RFE/RL in any reuse:
Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty.