Corpus de llenguatge ofensiu en català (Corpus of offensive language in Catalan)

License:

CC-BY-SA-4.0

Steward:

MDC Curators

Task: NLP

Release Date: 3/24/2026

Format: TSV

Size: 57.35 KB


Description

This dataset consists of sentences tagged as offensive-language in the version 25.0 release of Mozilla Common Voice in Catalan. The sentences are provided to support the development of offensive language detection in Catalan.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

These sentences must not be used to generate offensive content.

Forbidden Usage

It is forbidden to use these sentences to generate offensive content.

Processes

Intended Use

The sentences are provided to support the development of offensive language detection in Catalan.

Metadata

The dataset contains 770 lines, totalling 8,219 tokens.

Many of the sentences appear to have been generated from templates involving a place name. Many also contain grammatical errors (for example, "Els negras"), as is typical of the genre. The majority express bigotry towards Muslims and Black people, particularly immigrants, but some express bigotry towards immigrants from Latin America.

These sentences were uploaded via the "single sentence" upload facility in Mozilla Common Voice and are licensed CC-0.
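The line and token totals quoted in this section can be sanity-checked with a few lines of Python. This is only a sketch: the tokenizer used to produce the published figure is not stated, so splitting on whitespace is an assumption and may not reproduce 8,219 exactly. The function name `count_lines_and_tokens` and the demo strings are hypothetical placeholders, not taken from the dataset.

```python
def count_lines_and_tokens(sentences):
    """Return (number of sentences, total whitespace-separated tokens)."""
    sentences = list(sentences)
    # Assumption: a token is any run of non-whitespace characters.
    total_tokens = sum(len(s.split()) for s in sentences)
    return len(sentences), total_tokens


# Placeholder strings for illustration only; not dataset content.
demo = ["first placeholder sentence", "second one"]
print(count_lines_and_tokens(demo))  # (2, 5)
```

Applied to the values of the sentence column, the first number should match the number of data rows. Whether the published line count includes the header row is not stated.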

Structure

The dataset contains a single file, offensive-language.tsv, with four columns:

  • sentence_id: The hash of the sentence

  • sentence: The sentence text

  • locale: The locale (in this case ca -- Catalan)

  • category: The category (in this case offensive-language)
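The four-column layout described above could be read with Python's standard csv module, roughly as follows. This is a minimal sketch under assumptions: that the file has a header row with exactly these column names, and that sentences are not quoted (hence `csv.QUOTE_NONE`). The helper name `read_rows` and the sample row are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = ["sentence_id", "sentence", "locale", "category"]


def read_rows(handle):
    """Parse a tab-separated stream into a list of dicts, checking the header."""
    # Assumption: fields are unquoted, so quote characters are kept literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


# Placeholder row for illustration only; not dataset content.
sample = io.StringIO(
    "sentence_id\tsentence\tlocale\tcategory\n"
    "abc123\tplaceholder text\tca\toffensive-language\n"
)
rows = read_rows(sample)
print(rows[0]["locale"], rows[0]["category"])  # ca offensive-language
```

For the real file, the same helper can be called on `open("offensive-language.tsv", encoding="utf-8", newline="")`, assuming the file is UTF-8 encoded.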