Finance Sentences - North American Spanish

License:

CC0-1.0

Steward:

MDC Curators

Task: NLP

Release Date: 2/26/2026

Format: TSV, JSON

Size: 18.35 MB


Description

This is a public domain corpus of North American Spanish sentences in the finance domain. The corpus was collected in the second half of 2023 to support the Mozilla Common Voice project. The dataset contains 79,655 clean sentences (1,325,013 tokens) from nine distinct US federal government domains. It also contains a total of 209,061 sentences (4,125,637 tokens) that have not been cleaned.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None

Forbidden Usage

None

Metadata

This is a public domain corpus of North American Spanish sentences in the finance domain. The corpus was collected in aid of the Mozilla Common Voice project. The data was collected in the second half of 2023.

The dataset contains 79,655 clean sentences (1,325,013 tokens) from nine distinct US federal government domains. It also contains a total of 209,061 sentences (4,125,637 tokens) that have not been cleaned.

Process

We downloaded HTML and PDF files from the following US federal government domains:

  • home.treasury.gov

  • mx.usembassy.gov

  • studentaid.gov

  • www.consumerfinance.gov

  • www.fdic.gov

  • www.ftc.gov

  • www.irs.gov

  • www.sba.gov

  • www.usa.gov

We then extracted sentences and used a glossary to score them. The scoring function returned the number of terms found and a badness score based on punctuation and numeral expressions:

  • Contains [0-9]

  • Contains /%@_()▪:${}|

  • Contains unbalanced double quotes

  • The sentence starts with a lowercase letter or with one of the punctuation characters ,)!.

  • The number of URLs or other numerals

  • The number of acronyms

  • The number of words with mixed case

Sentences with a score below 0 were excluded.
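
The criteria above can be sketched roughly as follows. This is a minimal illustration, not the scoring function used to build the corpus: the name score_sentence, the regular expressions, the substring-based glossary lookup, and the weight of one point per penalty are all assumptions. The score is returned as a non-positive number because the JSON sample in this README shows negative badness values. The sketch is not expected to reproduce the published scores.

```python
import re

# Punctuation characters listed in the README as penalised.
SPECIAL_CHARS = set("/%@_()▪:${}|")
# The patterns below are assumptions made for this sketch.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
ACRONYM_RE = re.compile(r"\b[A-ZÁÉÍÓÚÑ]{2,}\b")
# A word with a lowercase letter followed later by an uppercase one, e.g. "eFile".
MIXED_CASE_RE = re.compile(r"\b\w*[a-záéíóúñ]\w*[A-ZÁÉÍÓÚÑ]\w*\b")


def score_sentence(sentence, glossary):
    """Return (terms_found, score). Each penalty costs one point (assumed)."""
    lowered = sentence.lower()
    # Naive substring lookup of glossary terms (assumption).
    terms_found = sum(1 for term in glossary if term.lower() in lowered)

    penalties = 0
    penalties += bool(re.search(r"[0-9]", sentence))          # contains a digit
    penalties += any(ch in SPECIAL_CHARS for ch in sentence)  # listed punctuation
    penalties += sentence.count('"') % 2                      # unbalanced double quotes
    first = sentence[:1]
    penalties += bool(first) and (first.islower() or first in ",)!.")
    penalties += len(URL_RE.findall(sentence))                # number of URLs
    penalties += len(ACRONYM_RE.findall(sentence))            # number of acronyms
    penalties += len(MIXED_CASE_RE.findall(sentence))         # mixed-case words
    return terms_found, -penalties


if __name__ == "__main__":
    glossary = ["multa", "préstamo"]  # hypothetical two-term glossary
    print(score_sentence("Escriba una carta solicitando que eliminemos la multa.", glossary))
```

Under these assumptions, a sentence that triggers no penalty keeps a score of 0 and would survive the filter, while any penalty pushes the score below 0.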

There are five files in this archive:

  • MANIFEST.tsv: A mapping between page and ID

  • GLOSARIO.tsv: The finance terms glossary in Spanish, collected from a number of sources, including:

  • README.md: This file

  • corpus.json: The corpus in JSON format, with the fields badness, terms, hash, sent, url and length

  • corpus.tsv: A TSV file with the corpus having the following fields:

    • score: The badness score of the sentence

    • length: The length of the sentence in space-separated tokens

    • terms_found: The number of glossary terms found

    • file_hash: An MD5 hash of the filename

    • sentence_number: The number of the sentence within the file

    • sentence: The sentence itself.
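
As an illustration of how the length and file_hash fields could be derived, here is a minimal sketch. The helper name tsv_fields and the example filename are hypothetical, and the exact string that was hashed for the corpus (full path or base name, and its encoding) is not stated in this README, so UTF-8 is an assumption.

```python
import hashlib


def tsv_fields(filename, sentence_number, sentence, score=0, terms_found=0):
    """Build one corpus.tsv-style record from a sentence (illustrative only)."""
    return {
        "score": score,
        # Length counted in whitespace-separated tokens.
        "length": len(sentence.split()),
        "terms_found": terms_found,
        # MD5 of the filename; UTF-8 encoding is an assumption.
        "file_hash": hashlib.md5(filename.encode("utf-8")).hexdigest(),
        "sentence_number": sentence_number,
        "sentence": sentence,
    }


if __name__ == "__main__":
    record = tsv_fields(
        "example/page.html",  # hypothetical filename
        61,
        "Escriba una carta solicitando que eliminemos la multa.",
    )
    print(record["length"], record["file_hash"])
```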

Sample

TSV

score	length	terms_found	file_hash	sentence_number	sentence
0	15	2	789e77a042ff6fe847fe16517e4cbf78	37	Pero también es una ocasión para hablar sobre cómo buscar un préstamo con buenas condiciones.
0	8	0	f229b77e70235378736afabb6916d888	61	Escriba una carta solicitando que eliminemos la multa.
0	11	0	4c39d6aa1217bc203eb2387e872648a4	98	Usted no está obligado a solicitar comunicación en un formato alternativo.
0	5	3	d127cc79ba3559e1490096a028e78394	79	Cumplen los requisitos de ingresos.
0	7	5	16f7ae07af3dc64b0852f184947159d5	7544	Reembolso de impuestos estatales sobre los ingresos.
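
Rows like these can be read with Python's standard csv module. The sketch below parses two of the sample rows from an in-memory string and checks that the length field matches a whitespace token count. The function name read_corpus_tsv is hypothetical, and disabling quote handling (csv.QUOTE_NONE) assumes that sentences in the file are not quote-escaped.

```python
import csv
import io

# Two rows copied from the sample above, with the header line.
SAMPLE = (
    "score\tlength\tterms_found\tfile_hash\tsentence_number\tsentence\n"
    "0\t8\t0\tf229b77e70235378736afabb6916d888\t61\t"
    "Escriba una carta solicitando que eliminemos la multa.\n"
    "0\t5\t3\td127cc79ba3559e1490096a028e78394\t79\t"
    "Cumplen los requisitos de ingresos.\n"
)

INT_FIELDS = ("score", "length", "terms_found", "sentence_number")


def read_corpus_tsv(handle):
    """Read corpus.tsv-style rows into dicts, converting numeric fields to int."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        for field in INT_FIELDS:
            row[field] = int(row[field])
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_corpus_tsv(io.StringIO(SAMPLE)):
        # The length field should equal the whitespace token count.
        print(row["length"] == len(row["sentence"].split()), row["sentence"])
```

To read the real file, the same function could be given open("corpus.tsv", encoding="utf-8", newline="") instead of the in-memory sample.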

JSON

  {
    "badness": -1,
    "terms": 2,
    "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/19",
    "sent": "Impuestos Estimados",
    "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
    "length": 2
  },
  {
    "badness": -2,
    "terms": 0,
    "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/20",
    "sent": "Multas Reembolsos Resumen ¿Dónde está mi reembolso?",
    "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
    "length": 7
  },
  {
    "badness": 0,
    "terms": 0,
    "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/23",
    "sent": "Lo que debe esperar",
    "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
    "length": 4
  },
  {
    "badness": 0,
    "terms": 2,
    "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/24",
    "sent": "Depósito directo",
    "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
    "length": 2
  }
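
A minimal sketch of working with records like these follows. It makes two assumptions that this README does not confirm: that the top level of corpus.json is a JSON array (the sample is wrapped in brackets here), and that the hash field has the form file hash, slash, sentence number. The function names load_records and keep_clean are hypothetical.

```python
import json

# Two records copied from the sample above, wrapped in an array (assumption).
SAMPLE = """
[
  {
    "badness": -1,
    "terms": 2,
    "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/19",
    "sent": "Impuestos Estimados",
    "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
    "length": 2
  },
  {
    "badness": 0,
    "terms": 0,
    "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/23",
    "sent": "Lo que debe esperar",
    "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
    "length": 4
  }
]
"""


def load_records(text):
    """Parse records and split "hash" into file_hash and sentence_number."""
    records = json.loads(text)
    for rec in records:
        file_hash, _, number = rec["hash"].partition("/")
        rec["file_hash"] = file_hash
        rec["sentence_number"] = int(number)
    return records


def keep_clean(records):
    """Keep records whose badness is not below 0, following the exclusion rule."""
    return [rec for rec in records if rec["badness"] >= 0]


if __name__ == "__main__":
    for rec in keep_clean(load_records(SAMPLE)):
        print(rec["file_hash"], rec["sentence_number"], rec["sent"])
```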

Limitations

No syntactic analysis was performed, so the corpus may contain items that are not sentences in the strict sense (for example, titles or sentence fragments).

Acknowledgements

This corpus was prepared for Mozilla Common Voice with the support of NVIDIA.