Gawri (گاؤری) Magazine Corpus

License:

CC-BY-NC-4.0

Steward:

Collaborative Action For Research & Development (CARD)

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 146.71 KB

Description

The Gawri (گاؤری) Magazine Corpus is a curated collection of text from a monthly Gawri-language magazine, totaling approximately 67,724 tokens. It reflects contemporary community writing across recurring magazine sections and provides a natural sample of edited, publishable Gawri prose and poetry. The corpus is intended to support linguistic research, language technology development, and language documentation.
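The card does not say how tokens were counted. As a rough check, the sketch below assumes a token is any whitespace-delimited string, so its result may differ from the figure above. The file name in the usage comment is hypothetical.

```python
from pathlib import Path


def count_tokens(text: str) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'token')."""
    return len(text.split())


def count_tokens_in_file(path: str) -> int:
    """Read a UTF-8 text file and count its whitespace-delimited tokens."""
    return count_tokens(Path(path).read_text(encoding="utf-8"))


# Usage with a hypothetical file name:
# print(count_tokens_in_file("gawri_magazine.txt"))
```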

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended for research, language documentation, and educational use only. Commercial use, resale/rehosting (in whole or near-verbatim form), and any use that enables profiling, surveillance, harassment, hate, or discrimination of individuals or the Gawri community are not permitted, and access may be revoked if these restrictions are violated.

Forbidden Usage

Forbidden uses include any attempt to identify or profile individuals; producing hateful, harassing, discriminatory, or violent content; deceptive impersonation or harmful misinformation; fraud or other wrongdoing; and any redistribution or re-hosting of the corpus (or near-verbatim derivatives) without the rights holder’s permission.

Processes

Ethical Review

We ensured that contributors and rights holders understood the corpus purpose, intended uses, and potential risks, and we reviewed the text to minimize privacy and sensitive-content harms.

Intended Use

This dataset is intended for use in linguistic research and language documentation, and in developing and evaluating Gawri language technologies such as tokenizers, spellcheckers, and language models.

Metadata

Language

Gawri (also called Kalami; ISO 639-3: gwc) is an Indo-Aryan language spoken in Swat Kohistan (Kalam and nearby valleys) in northern Pakistan. A magazine corpus captures edited community writing and supports documentation, literacy, and NLP.

Domains of Text

Editorials/opinion
Community news/announcements
Cultural articles
Short stories
Poetry
Interviews/biographies
Informational pieces (health/education)

Data Composition

Source: magazine issues/pages
Granularity: issue-level and/or article-level
Includes: titles + main text (optionally captions)
May exclude/clean: headers/footers, page numbers, ads, non-Gawri text, OCR noise

Processing

The corpus is already in plain text (.txt) format, saved as UTF-8.
Since the text may contain digits and non-Gawri content (e.g., Urdu/English words, headings, ads, references), processing will:
- normalize Unicode and whitespace,
- standardize punctuation and digits/symbols,
- remove or flag non-Gawri segments (recommended: keep a “marked” version rather than deleting).
Deliver two layers:
- raw/: original text (minimal changes)
- clean/: normalized + cleaned text, ready for training and analysis
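The steps above can be sketched roughly as follows. This is a minimal illustration, not the steward's actual pipeline. The function names, the <foreign> marker, the choice of NFC, and the mapping of Arabic-script digits to ASCII are all assumptions. Only Latin-script runs are flagged here; Urdu shares the Arabic script with Gawri and would need a word list or a language-identification step.

```python
import re
import unicodedata

# Assumption: map Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic
# (U+06F0..U+06F9) digits to ASCII digits. The card does not state a target form.
DIGIT_MAP = {base + i: str(i) for base in (0x0660, 0x06F0) for i in range(10)}

# Assumption: a run of ASCII-letter words is treated as non-Gawri (e.g. English).
LATIN_RUN = re.compile(r"[A-Za-z]+(?:[ '\-][A-Za-z]+)*")


def normalize_text(text: str) -> str:
    """Apply NFC, unify digits, and collapse spaces and tabs on each line."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(DIGIT_MAP)
    # Only plain spaces, tabs, and no-break spaces are collapsed; zero-width
    # joiners and non-joiners are left alone because they can affect shaping.
    lines = [re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


def flag_non_gawri(text: str) -> str:
    """Wrap Latin-script runs in a marker instead of deleting them."""
    return LATIN_RUN.sub(lambda m: f"<foreign>{m.group(0)}</foreign>", text)


def clean(text: str) -> str:
    """Produce the clean-layer text from raw-layer text."""
    return flag_non_gawri(normalize_text(text))
```

Keeping the raw text untouched and writing the output of clean() to a separate clean/ directory preserves the two-layer layout described above, so any normalization choice can be revisited later.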

Sample

کالام کلچرل سوسائٹی اۤ ہیچھاٹ تانی گاؤری جوؤ ایں کیر نام ا رسالاں بندوباۤس کِیت۔ ان مئی مہ تھوں لاڑاگاۤل مقالاۤ، قصاۤ، اسلامی واقعات اوئے شامل کِیت۔ اُماد تھی اُوں تھاکہ ائی خوش یئی۔ آئندہ ایں کیر پا مہ دیانت اُوں تھہ ماکہ تانی چُنڑیل لاڑا۔ ان دہ آۤک تہ ماکہ تانی جِب مِڄ کۤروگ مئی فائیدہ ہوئے تے دویاۤم ای کوستینی جؤو آں مطالعہ ایں کیر ناۤم ناۤم مواد میلا ہوئے۔ اِیں اخبار مئی کہ غلطی یا کہ کۤمی بیشی ہِیت تے تھہ تانی خطونہ تے اِیمیلاں زریعاۤ دہ مُوں شید کراواں۔ مُوں خطاں پتہ اِیں تُھو:۔ شمشی خان کالا۟می