English Hausa Parallel Corpus

License: CC-BY-NC-4.0

Steward: LocaleNLP

Task: Machine translation (MT)

Release Date: 3/23/2026

Format: csv

Size: 164.32 KB

Description

This English–Hausa Parallel Corpus is a curated bilingual dataset of 5,000 aligned sentence pairs. The sentences were translated from English into Hausa and are organized in a clean sentence-level format so that alignment is reliable. The dataset supports machine translation training and evaluation, bilingual lexicon development, and broader linguistic and natural language processing (NLP) research and language technology development for Hausa.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

The dataset is intended for research and non-commercial use.

Forbidden Usage

• Generating harmful or misleading content

• Commercial use without permission

• Misrepresentation of Hausa language or culture

Metadata

Languages

Hausa

Hausa is a widely spoken Chadic language used across West Africa, particularly in Nigeria and Niger. It serves as a major lingua franca with strong cultural and linguistic importance. Hausa is primarily written using the Latin-based Boko script.

English

English serves as the source language and represents modern, general-purpose usage.

Content of the Corpus

The dataset consists of general-purpose sentences translated from English into Hausa. It is designed to support research, machine translation systems, and the development of Hausa language technologies.

Details of the Dataset

This corpus is a bilingual English–Hausa parallel dataset containing 5,000 professionally aligned sentence pairs (English → Hausa). The dataset is structured at the sentence level and formatted for direct use in NLP pipelines, including machine translation, evaluation benchmarks, and linguistic analysis.
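
As a rough illustration of using a sentence-level CSV in a pipeline, the sketch below reads source/target pairs with Python's standard csv module. The card does not document the column names, so `english` and `hausa` are assumptions, as is the helper name `read_pairs`. The two sample rows are illustrative stand-ins and are not taken from the corpus.

```python
import csv
import io
from typing import List, TextIO, Tuple


def read_pairs(handle: TextIO,
               src_col: str = "english",   # assumed column name
               tgt_col: str = "hausa"      # assumed column name
               ) -> List[Tuple[str, str]]:
    """Read (source, target) sentence pairs from an open CSV handle."""
    pairs = []
    for row in csv.DictReader(handle):
        src = (row.get(src_col) or "").strip()
        tgt = (row.get(tgt_col) or "").strip()
        # Skip rows where either side is missing.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# In-memory stand-in for the real file; with the actual download you
# would pass open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    'english,hausa\n'
    '"Good morning.","Ina kwana."\n'
    '"Thank you.","Na gode."\n'
)
pairs = read_pairs(sample)
print(len(pairs), "pairs loaded")
```

If the real header uses different names, pass them through `src_col` and `tgt_col` rather than editing the function.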

Dataset Statistics

• Sentence Pairs: 5,000

• English Words: 41,727

• Hausa Words: 45,921

• Total Words: 87,648

• Translation Direction: English → Hausa

• Content Type: Parallel sentences

• Script: English (Latin), Hausa (Latin)
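
The figures above can be sanity-checked with a few lines of Python. The card does not say how words were counted, so this sketch assumes simple whitespace tokenization and may not reproduce the published counts exactly. The function name `corpus_stats` and the sample pairs are illustrative only.

```python
from typing import Dict, Iterable, Tuple


def corpus_stats(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count sentence pairs and whitespace-separated words per side."""
    n_pairs = src_words = tgt_words = 0
    for src, tgt in pairs:
        n_pairs += 1
        src_words += len(src.split())   # assumption: whitespace tokens
        tgt_words += len(tgt.split())
    return {
        "sentence_pairs": n_pairs,
        "english_words": src_words,
        "hausa_words": tgt_words,
        "total_words": src_words + tgt_words,
    }


# Illustrative pairs, not taken from the corpus.
sample_pairs = [("Good morning.", "Ina kwana."), ("Thank you.", "Na gode.")]
stats = corpus_stats(sample_pairs)
print(stats)

# Figures as published on this card; the total should equal the sum.
published = {"english_words": 41_727, "hausa_words": 45_921,
             "total_words": 87_648, "sentence_pairs": 5_000}
consistent = (published["english_words"] + published["hausa_words"]
              == published["total_words"])
print("published totals consistent:", consistent)
print("avg English words per sentence:",
      published["english_words"] / published["sentence_pairs"])
print("avg Hausa words per sentence:",
      published["hausa_words"] / published["sentence_pairs"])
```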

Processing

• Normalize Unicode to NFC

• Standardize punctuation and spacing

• Verify sentence alignment

• Remove duplicates

• Filter noisy or corrupted text
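
The steps above could be approximated as follows. This is a minimal sketch and not the steward's actual pipeline: the card names the steps but not the rules behind them, so the whitespace and punctuation handling, the length-ratio alignment check, its `max_ratio` threshold, and the replacement-character noise test are all assumptions chosen for illustration.

```python
import re
import unicodedata
from typing import Iterable, List, Tuple

_SPACE_RUN = re.compile(r"\s+")
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")


def normalize(text: str) -> str:
    """NFC-normalize, collapse whitespace, drop spaces before punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = _SPACE_RUN.sub(" ", text).strip()
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


def is_noisy(text: str) -> bool:
    """Assumed noise test: empty text or a U+FFFD replacement character."""
    return not text or "\ufffd" in text


def plausible_alignment(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Assumed heuristic: word counts of the two sides stay within a ratio."""
    a, b = len(src.split()), len(tgt.split())
    return max(a, b) <= max_ratio * min(a, b)


def clean_pairs(pairs: Iterable[Tuple[str, str]],
                max_ratio: float = 3.0) -> List[Tuple[str, str]]:
    """Normalize, filter, and de-duplicate pairs, keeping first occurrences."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if is_noisy(src) or is_noisy(tgt):
            continue
        if not plausible_alignment(src, tgt, max_ratio):
            continue
        if (src, tgt) in seen:          # exact duplicate after normalization
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Illustrative input, not taken from the corpus.
raw = [
    ("Good  morning .", "Ina kwana."),
    ("Good morning.", "Ina  kwana ."),   # duplicate once normalized
    ("Thank you.", "Na gode\ufffd."),    # corrupted target side
    ("Thank you.", "Na gode."),
]
cleaned = clean_pairs(raw)
print(cleaned)
```

Normalizing before de-duplicating matters here: the first two rows differ only in spacing, so they collapse to one pair only after normalization. The length-ratio test is a coarse proxy and does not replace manual or model-based alignment checks.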