Sursilvan Newspaper Corpus

License:

CC0-1.0

Steward:

Pro Svizra Rumantscha

Task: OTH

Release Date: 11/26/2025

Format: TSV

Size: 37.80 MB


Description

14.6 million tokens in the Sursilvan variety of Romansh from the daily newspaper “La Quotidiana”.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Metadata

This corpus contains Sursilvan articles published in the Romansh daily newspaper La Quotidiana between 1997 and 2008. The Sursilvan texts were automatically extracted from a mixed Romansh newspaper corpus using a Support Vector Machine (SVM) classifier trained on a smaller, manually labeled dataset.

To the extent possible under law, the newspaper’s publisher Somedia has waived all copyright and related or neighboring rights to this corpus. This work is published from Switzerland.

Language variant       IETF BCP 47 language code   Corpus size
Rumantsch Sursilvan    rm-sursilv                  14.6 million tokens
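
Because the corpus is distributed as TSV, a rough token count can be obtained with the Python standard library alone. The following is a minimal sketch, not the steward's own tooling: the column layout is not documented here, so the code assumes the article text sits in the last column, and it counts whitespace-separated tokens, which may differ from the tokenization behind the 14.6 million figure. The function names and the sample rows are hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, List, TextIO


def iter_rows(handle: TextIO) -> Iterator[List[str]]:
    """Yield each TSV row as a list of fields, skipping blank lines."""
    # QUOTE_NONE treats quotation marks in newspaper text as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


def count_whitespace_tokens(rows: Iterable[List[str]], text_column: int = -1) -> int:
    """Count whitespace-separated tokens in one column.

    Assumption: the text is in the last column (index -1) unless told otherwise.
    """
    return sum(len(row[text_column].split()) for row in rows)


# Toy in-memory stand-in for the real file; the contents are placeholders.
sample = io.StringIO("doc1\tfirst placeholder text\ndoc2\tsecond one\n")
print(count_whitespace_tokens(iter_rows(sample)))
```

For the actual download, the same functions could be applied to a file opened with `open(path, encoding="utf-8", newline="")`, where `path` points to the extracted TSV file and UTF-8 is an assumed encoding. If the file has a header row, that row would need to be skipped before counting.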