Finance Sentences - North American Spanish
License:
CC0-1.0
Steward:
MDC Curators
Task: NLP
Release Date: 2/26/2026
Format: TSV, JSON
Size: 18.35 MB
Description
This is a public domain corpus of North American Spanish sentences in the finance domain. The corpus was collected in the second half of 2023 in aid of the Mozilla Common Voice project. The dataset contains 79,655 clean sentences (1,325,013 tokens) from nine distinct US federal government domains. It also contains the uncleaned set, 209,061 sentences (4,125,637 tokens) in total.
Specifics
Considerations
Restrictions/Special Constraints
None
Forbidden Usage
None
Metadata
This is a public domain corpus of North American Spanish sentences in the finance domain. The corpus was collected in aid of the Mozilla Common Voice project. The data was collected in the second half of 2023.
The dataset contains 79,655 clean sentences (1,325,013 tokens) from nine distinct US federal government domains. It also contains the uncleaned set, 209,061 sentences (4,125,637 tokens) in total.
Process
We downloaded HTML and PDF files from the following subdomains of the US Federal Government.
home.treasury.gov
mx.usembassy.gov
studentaid.gov
www.consumerfinance.gov
www.fdic.gov
www.ftc.gov
www.irs.gov
www.sba.gov
www.usa.gov
We then extracted sentences and used a glossary to score them. The scoring function returned the number of terms found and a badness score based on punctuation and numeral expressions:
Contains [0-9]
Contains /%@_()▪:${}|
Contains unbalanced double quotes
The start of the sentence is lowercase or sentence-ending punctuation ,)!.
The number of URLs or other numerals
The number of acronyms
The number of words with mixed case
Sentences with a score below 0 were excluded.
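The criteria above can be sketched roughly as follows. This is an illustration only, not the scoring function used to build the corpus: the weights (one point per criterion or per counted item), the sign convention (0 is clean, penalties are negative), the function names, and the definitions of "acronym" and "mixed case" are all assumptions. Only the two character sets are taken from the list above.

```python
import re

# Character sets copied from the criteria list; everything else is assumed.
SPECIAL_CHARS = set("/%@_()▪:${}|")
BAD_START = set(",)!.")
STRIP_PUNCT = ".,;:!?¿¡\"'()"

# Hypothetical pattern: a URL, or any token that contains a digit.
URL_OR_NUMERAL = re.compile(r"https?://\S+|www\.\S+|\S*\d\S*")


def is_acronym(token: str) -> bool:
    """Assumed definition: two or more letters, all uppercase."""
    word = token.strip(STRIP_PUNCT)
    return len(word) >= 2 and word.isalpha() and word.isupper()


def is_mixed_case(token: str) -> bool:
    """Assumed definition: an uppercase letter after the first character,
    together with at least one lowercase letter (for example "PayPal")."""
    word = token.strip(STRIP_PUNCT)
    return any(c.isupper() for c in word[1:]) and any(c.islower() for c in word)


def badness(sentence: str) -> int:
    """Return 0 for a clean-looking sentence, or a negative penalty total."""
    tokens = sentence.split()
    penalty = 0
    if re.search(r"[0-9]", sentence):
        penalty += 1
    if any(ch in SPECIAL_CHARS for ch in sentence):
        penalty += 1
    if sentence.count('"') % 2 == 1:  # unbalanced double quotes
        penalty += 1
    if sentence and (sentence[0].islower() or sentence[0] in BAD_START):
        penalty += 1
    penalty += len(URL_OR_NUMERAL.findall(sentence))
    penalty += sum(1 for t in tokens if is_acronym(t))
    penalty += sum(1 for t in tokens if is_mixed_case(t))
    return -penalty


clean = badness("Escriba una carta solicitando que eliminemos la multa.")
noisy = badness("el IRS le debe $100.")
```

Under these assumptions the first sentence scores 0 and would be kept, while the second collects penalties for the digit, the "$" sign, the lowercase start, the numeral token, and the acronym, and would be excluded.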
There are five files in this archive:
MANIFEST.tsv: A mapping between page and ID
GLOSARIO.tsv: The finance terms glossary in Spanish, collected from a number of sources
README.md: This file
corpus.json: Contains the corpus
corpus.tsv: A TSV file with the corpus having the following fields:
    score: The badness score of the sentence
    length: The length of the sentence in space-separated tokens
    terms_found: The number of glossary terms found
    file_hash: An MD5 hash of the filename
    sentence_number: The number of the sentence within the file
    sentence: The sentence itself
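A minimal sketch of reading corpus.tsv with the Python standard library, assuming the file is UTF-8, tab-separated, and starts with a header row naming the six fields listed above. The function name read_corpus and the in-memory SAMPLE are illustrative; the two rows are taken from the TSV sample in this document.

```python
import csv
import io


def read_corpus(handle):
    """Yield one dict per sentence, with numeric fields converted to int."""
    # QUOTE_NONE keeps double quotes inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "score": int(row["score"]),
            "length": int(row["length"]),
            "terms_found": int(row["terms_found"]),
            "file_hash": row["file_hash"],
            "sentence_number": int(row["sentence_number"]),
            "sentence": row["sentence"],
        }


# Stand-in for the real file; with the archive unpacked one would use
# open("corpus.tsv", encoding="utf-8", newline="") instead.
SAMPLE = (
    "score\tlength\tterms_found\tfile_hash\tsentence_number\tsentence\n"
    "0\t8\t0\tf229b77e70235378736afabb6916d888\t61\t"
    "Escriba una carta solicitando que eliminemos la multa.\n"
    "0\t5\t3\td127cc79ba3559e1490096a028e78394\t79\t"
    "Cumplen los requisitos de ingresos.\n"
)

rows = list(read_corpus(io.StringIO(SAMPLE)))

# Keep only sentences in which at least one glossary term was found.
with_terms = [r for r in rows if r["terms_found"] > 0]

# The length field should match a plain whitespace split of the sentence.
lengths_match = all(r["length"] == len(r["sentence"].split()) for r in rows)
```

Filtering on terms_found is one way to narrow the corpus to sentences that are more clearly finance-related.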
Sample
TSV
score length terms_found file_hash sentence_number sentence
0 15 2 789e77a042ff6fe847fe16517e4cbf78 37 Pero también es una ocasión para hablar sobre cómo buscar un préstamo con buenas condiciones.
0 8 0 f229b77e70235378736afabb6916d888 61 Escriba una carta solicitando que eliminemos la multa.
0 11 0 4c39d6aa1217bc203eb2387e872648a4 98 Usted no está obligado a solicitar comunicación en un formato alternativo.
0 5 3 d127cc79ba3559e1490096a028e78394 79 Cumplen los requisitos de ingresos.
0 7 5 16f7ae07af3dc64b0852f184947159d5 7544 Reembolso de impuestos estatales sobre los ingresos.
JSON
{
"badness": -1,
"terms": 2,
"hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/19",
"sent": "Impuestos Estimados",
"url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
"length": 2
},
{
"badness": -2,
"terms": 0,
"hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/20",
"sent": "Multas Reembolsos Resumen ¿Dónde está mi reembolso?",
"url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
"length": 7
},
{
"badness": 0,
"terms": 0,
"hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/23",
"sent": "Lo que debe esperar",
"url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
"length": 4
},
{
"badness": 0,
"terms": 2,
"hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/24",
"sent": "Depósito directo",
"url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
"length": 2
}
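The JSON records can be handled along the same lines. The sketch below makes two assumptions that the sample does not confirm: that corpus.json is a single top-level JSON array, and that the "hash" field is the file hash followed by "/" and the sentence number. The helper split_hash is hypothetical, and the two records are copied from the sample above.

```python
import json

# Two records from the sample, wrapped in an array (assumed top-level shape).
SAMPLE = """
[
  {"badness": -1, "terms": 2,
   "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/19",
   "sent": "Impuestos Estimados",
   "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
   "length": 2},
  {"badness": 0, "terms": 2,
   "hash": "71fe6f85bc56253ab2d9ec8ed08d08c2/24",
   "sent": "Depósito directo",
   "url": "./www.irs.gov/es/businesses/small-businesses-self-employed/recommended-reading-for-small-businesses",
   "length": 2}
]
"""


def split_hash(value: str) -> tuple[str, int]:
    """Split "<file hash>/<sentence number>" into its two parts (assumed layout)."""
    file_hash, _, number = value.partition("/")
    return file_hash, int(number)


records = json.loads(SAMPLE)

# Apply the same threshold as the cleaning step: drop badness below 0.
kept = [r for r in records if r["badness"] >= 0]
kept_ids = [split_hash(r["hash"]) for r in kept]
```

With these two records, only "Depósito directo" passes the threshold.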
Limitations
No syntactic analysis was performed, so the corpus may contain items that are not sentences sensu stricto (for example, titles or sentence fragments).
Acknowledgements
This corpus was prepared for Mozilla Common Voice with the support of NVIDIA.