Gawri (گاؤری) Magazine Corpus
License:
CC-BY-NC-4.0
Steward:
Collaborative Action For Research & Development (CARD)
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 146.71 KB
Description
The Gawri (گاؤری) Magazine Corpus is a curated collection of text from a monthly Gawri-language magazine, totaling approximately 67,724 tokens. It reflects contemporary community writing across recurring magazine sections and provides a natural sample of edited, published Gawri prose and poetry. The corpus is intended to support linguistic research, language technology development, and language documentation.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is intended for research, language documentation, and educational use only. Commercial use, resale/rehosting (in whole or near-verbatim form), and any use that enables profiling, surveillance, harassment, hate, or discrimination of individuals or the Gawri community are not permitted, and access may be revoked if these restrictions are violated.
Forbidden Usage
Forbidden use includes any attempt to identify or profile individuals, any hateful, harassing, discriminatory, or violent content, any deceptive impersonation or harmful misinformation, any fraud or other wrongdoing, and any redistribution or re-hosting of the corpus (or near-verbatim derivatives) without the rights holder’s permission.
Processes
Ethical Review
We ensured that contributors and rights holders understood the purpose of the corpus, its intended uses, and its potential risks, and we reviewed the text to minimize privacy and sensitive-content harms.
Intended Use
This dataset is intended for use in linguistic research and language documentation, and in developing and evaluating Gawri language technologies such as tokenizers, spellcheckers, and language models.
Metadata
Language
Gawri (also known as Kalami; ISO 639-3: gwc) is an Indo-Aryan language spoken in Swat Kohistan (Kalam and nearby valleys) in northern Pakistan. A magazine corpus captures edited community writing and supports documentation, literacy, and NLP.
Domains of Text
Editorials/opinion
Community news/announcements
Cultural articles
Short stories
Poetry
Interviews/biographies
Informational pieces (health/education)
Data Composition
Source: magazine issues/pages
Granularity: issue-level and/or article-level
Includes: titles + main text (optionally captions)
May exclude/clean: headers/footers, page numbers, ads, non-Gawri text, OCR noise
Processing
The corpus is already in plain text (.txt) format, saved as UTF-8. Since the text may contain digits and non-Gawri content (e.g., Urdu/English words, headings, ads, references), processing will:
normalize Unicode and whitespace,
standardize punctuation and digits/symbols,
remove or flag non-Gawri segments (recommended: keep a “marked” version rather than deleting).
Deliver two layers:
raw/: original text (minimal changes)
clean/: normalized + cleaned text, ready for training and analysis
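The processing steps and the two-layer delivery described above could be sketched roughly as follows. This is a minimal illustration, not the stewards' actual pipeline. The function names, the `<foreign>` marker, the choice of NFC normalization, and the mapping of Arabic-script digits to ASCII are all assumptions made for this example. Only Latin-script runs are flagged here, because Urdu shares its script with Gawri and cannot be separated by character class alone.

```python
import re
import unicodedata
from pathlib import Path

# Assumption: runs of ASCII letters mark non-Gawri (e.g., English) segments.
LATIN_RUN = re.compile(r"[A-Za-z]+(?:[ '\-][A-Za-z]+)*")

# Assumption: Arabic-Indic and Extended Arabic-Indic digits become ASCII digits.
DIGIT_MAP = {ord(c): str(i) for i, c in enumerate("٠١٢٣٤٥٦٧٨٩")}
DIGIT_MAP.update({ord(c): str(i) for i, c in enumerate("۰۱۲۳۴۵۶۷۸۹")})


def normalize(text: str) -> str:
    """Normalize Unicode form, digits, and whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(DIGIT_MAP)
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Drop spaces that surround line breaks.
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()


def flag_non_gawri(text: str) -> str:
    """Wrap Latin-script runs in a marker instead of deleting them."""
    return LATIN_RUN.sub(lambda m: f"<foreign>{m.group(0)}</foreign>", text)


def build_clean_layer(raw_dir: str, clean_dir: str) -> int:
    """Write a cleaned copy of every .txt file in raw_dir into clean_dir.

    Files in raw_dir are only read, so the raw layer stays unchanged.
    Returns the number of files written.
    """
    out = Path(clean_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(Path(raw_dir).glob("*.txt")):
        cleaned = flag_non_gawri(normalize(src.read_text(encoding="utf-8")))
        (out / src.name).write_text(cleaned + "\n", encoding="utf-8")
        count += 1
    return count
```

Calling `build_clean_layer("raw", "clean")` would then produce the marked, normalized layer alongside the untouched originals. Keeping the marker rather than deleting the span follows the recommendation above to keep a "marked" version.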
Sample
کالام کلچرل سوسائٹی اۤ ہیچھاٹ تانی گاؤری جوؤ ایں کیر نام ا رسالاں بندوباۤس کِیت۔ ان مئی مہ تھوں لاڑاگاۤل مقالاۤ، قصاۤ، اسلامی واقعات اوئے شامل کِیت۔ اُماد تھی اُوں تھاکہ ائی خوش یئی۔ آئندہ ایں کیر پا مہ دیانت اُوں تھہ ماکہ تانی چُنڑیل لاڑا۔ ان دہ آۤک تہ ماکہ تانی جِب مِڄ کۤروگ مئی فائیدہ ہوئے تے دویاۤم ای کوستینی جؤو آں مطالعہ ایں کیر ناۤم ناۤم مواد میلا ہوئے۔ اِیں اخبار مئی کہ غلطی یا کہ کۤمی بیشی ہِیت تے تھہ تانی خطونہ تے اِیمیلاں زریعاۤ دہ مُوں شید کراواں۔ مُوں خطاں پتہ اِیں تُھو:۔ شمشی خان کالا۟می