Araina Text Corpus (Occitan Aranese)

License: CC0-1.0

Steward: Community

Task: LM

Release Date: 3/24/2026

Format: txt

Size: 22.97 MB


Description

This text corpus includes sentences from three sources: public domain literary texts translated by Antòni Nogués, sourced from institutestudisaranesi.cat; language educational material by Jordi Suïls Subirà; and administrative proceedings from the Conselh Generau d'Aran.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

No restrictions.

Forbidden Usage

No forbidden usages.

Processes

Intended Use

This dataset was compiled in order to launch voice data collection in Common Voice. It can also be used for language modelling.
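
As a rough illustration of the language modelling use, the sketch below reads a plain-text file and computes basic statistics. It assumes the corpus is UTF-8 text with one sentence per line, which this card does not confirm; the file name in the usage comment and the helper names `load_sentences` and `corpus_stats` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def load_sentences(path):
    """Read a UTF-8 text file and return its non-empty lines, stripped.

    Assumption: one sentence per line (not confirmed by the dataset card).
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def corpus_stats(sentences):
    """Return sentence count, whitespace-token count, and character frequencies."""
    tokens = [tok for s in sentences for tok in s.split()]
    chars = Counter(ch for s in sentences for ch in s if not ch.isspace())
    return {"sentences": len(sentences), "tokens": len(tokens), "chars": chars}


# Hypothetical usage; replace the file name with the actual downloaded file:
# stats = corpus_stats(load_sentences("araina_corpus.txt"))
# print(stats["sentences"], stats["tokens"], stats["chars"].most_common(10))
```

The character counts give a quick view of the alphabet a character-level model or tokenizer would need to cover; whitespace tokenization is only a first approximation, since elided forms such as d'Aran count as a single token.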

Metadata

The Araina Project was run by the non-profit cooperative Col·lectivaT to create a speech dataset for Aranese. These are the sentences collected and used to launch Common Voice in this variety of Occitan.

Antòni Nogués's literary works are made publicly available through Institut Aranesi under an open license and were consulted when creating this resource.

Jordi Suïls Subirà has permitted his works to be included in this corpus and was a collaborator of the Araina Project.

This corpus was prepared with support from the Department of Culture of the Catalan autonomous government (Generalitat de Catalunya) and the Aran Valley General Council (Conselh Generau d'Aran).