Sursilvan Newspaper Corpus
License: CC0-1.0
Steward: Pro Svizra Rumantscha
Task: OTH
Release Date: 11/26/2025
Format: TSV
Size: 37.80 MB
Description
14.6 million tokens in the Sursilvan variety of Romansh from the daily newspaper “La Quotidiana”.
Specifics
Metadata
The corpus contains Sursilvan articles published in the Romansh daily newspaper La Quotidiana between 1997 and 2008. The Sursilvan texts were automatically extracted from a mixed Romansh newspaper corpus using a support vector machine (SVM) classifier trained on a smaller, manually labeled dataset.
To the extent possible under law, the newspaper’s publisher Somedia has waived all copyright and related or neighboring rights to this corpus. This work is published from Switzerland.
| Language variant | IETF BCP47 language code | Corpus size |
|---|---|---|
| Rumantsch Sursilvan | rm-sursilv | 14.6 million tokens |
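
Since the corpus is distributed as TSV and its size is given in tokens, a rough token count can be sketched as follows. This is a minimal example under stated assumptions: the column layout of the file is not documented here, so the `text_column` index and the `has_header` flag are hypothetical parameters, and whitespace splitting is only an approximation of whatever tokenization produced the 14.6 million figure.

```python
import csv
import io


def count_tokens(tsv_file, text_column=0, has_header=False):
    """Count whitespace-separated tokens in one column of a TSV stream.

    text_column and has_header are assumptions about the file layout;
    adjust them after inspecting the actual TSV.
    """
    # QUOTE_NONE keeps quotation marks in newspaper text from being
    # interpreted as CSV quoting.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    total = 0
    for row in reader:
        if len(row) > text_column:  # ignore short or empty rows
            total += len(row[text_column].split())
    return total


# Placeholder rows for illustration only, not taken from the corpus.
sample = io.StringIO("id-1\tfirst placeholder sentence\nid-2\tsecond one\n")
print(count_tokens(sample, text_column=1))  # prints 5
```

For the real file, the stream would typically be opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the download was saved.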