Corpus de llenguatge ofensiu en català
License:
CC-BY-SA-4.0
Steward:
MDC Curators
Task: NLP
Release Date: 3/24/2026
Format: TSV
Size: 57.35 KB
Description
This dataset consists of sentences tagged as offensive-language in the version 25.0 release of Mozilla Common Voice in Catalan. The sentences are provided to support the development of offensive language detection in Catalan.
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Considerations
Restrictions/Special Constraints
These sentences must not be used to generate offensive content.
Forbidden Usage
It is forbidden to use these sentences to generate offensive content.
Processes
Intended Use
The sentences are provided to support the development of offensive language detection in Catalan.
Metadata
This dataset consists of sentences tagged as offensive-language in the version 25.0 release of Mozilla Common Voice in Catalan.
The sentences are provided to support the development of offensive language detection in Catalan.
There are a total of 770 lines consisting of 8,219 tokens.
Many of them appear to have been generated from templates involving a place name. Many also contain grammatical errors (for example, "Els negras"), as is typical of the genre. The majority express bigotry towards Muslims and Black people, particularly immigrants, but some express bigotry towards immigrants from Latin America.
These sentences were uploaded via the "single sentence" upload facility in Mozilla Common Voice and are licensed CC-0.
Structure
The dataset contains a single file, offensive-language.tsv which contains four columns:
sentence_id: The hash of the sentence
sentence: The sentence text
locale: The locale (in this case ca -- Catalan)
category: The category (in this case offensive-language)
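The four-column layout above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: it assumes the file starts with a header row naming the four columns, is UTF-8 encoded, and uses no quoting. The function names and the placeholder rows are hypothetical, and the whitespace-based token count is only one possible tokenization, so it may not reproduce the 8,219 figure reported above.

```python
import csv
import io

# Column names as listed in the Structure section.
EXPECTED_COLUMNS = ["sentence_id", "sentence", "locale", "category"]


def read_offensive_sentences(stream):
    """Parse TSV text from an open text stream into a list of row dicts.

    Assumes the first line is a header row (an assumption, not confirmed
    by the dataset description).
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


def summarize(rows):
    """Count sentences and whitespace-separated tokens (assumed tokenization)."""
    token_count = sum(len(row["sentence"].split()) for row in rows)
    return {"sentences": len(rows), "tokens": token_count}


# Neutral placeholder rows standing in for the real file contents.
SAMPLE = (
    "sentence_id\tsentence\tlocale\tcategory\n"
    "hash-1\tplaceholder sentence one\tca\toffensive-language\n"
    "hash-2\tplaceholder two\tca\toffensive-language\n"
)

rows = read_offensive_sentences(io.StringIO(SAMPLE))
print(summarize(rows))  # {'sentences': 2, 'tokens': 5}
```

To read the actual file, the same function could be given a handle opened with open("offensive-language.tsv", encoding="utf-8", newline=""), again assuming UTF-8 encoding.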