English Hausa Parallel Corpus
License:
CC-BY-NC-4.0
Steward:
LocaleNLP
Task: MT
Release Date: 3/23/2026
Format: csv
Size: 164.32 KB
Description
This English–Hausa Parallel Corpus is a curated bilingual dataset of 5,000 aligned sentence pairs, translated from English into Hausa and organized into a clean sentence-level format to ensure reliable alignment. The dataset is designed to support machine translation training and evaluation, bilingual lexicon development, and broader linguistic and natural language processing (NLP) research for Hausa, including data-driven language technology development.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The dataset is intended for research and non-commercial use.
Forbidden Usage
• Generating harmful or misleading content
• Commercial use without permission
• Misrepresentation of Hausa language or culture
Metadata
Languages
Hausa
Hausa is a widely spoken Chadic language used across West Africa, particularly in Nigeria and Niger. It serves as a major lingua franca with strong cultural and linguistic importance. Hausa is primarily written using the Latin-based Boko script.
English
English serves as the source language and represents modern, general-purpose usage.
Content of the Corpus
The dataset consists of general-purpose sentences translated from English into Hausa. It is designed to support research, machine translation systems, and the development of Hausa language technologies.
Details of the Dataset
This corpus is a bilingual English–Hausa parallel dataset containing 5,000 professionally aligned sentence pairs (English → Hausa). The dataset is structured at the sentence level and formatted for direct use in NLP pipelines, including machine translation, evaluation benchmarks, and linguistic analysis.
Dataset Statistics
• Sentence Pairs: 5,000
• English Words: 41,727
• Hausa Words: 45,921
• Total Words: 87,648
• Translation Direction: English → Hausa
• Content Type: Parallel sentences
• Script: English (Latin), Hausa (Latin)
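The figures above can be checked against the CSV with a short script. The following is a minimal sketch, not the steward's counting method: it assumes whitespace-delimited word counting and two hypothetical column names, `english` and `hausa`, neither of which this card documents. The in-memory sample is illustrative only and stands in for the real file.

```python
import csv
import io


def corpus_stats(rows, src_col="english", tgt_col="hausa"):
    """Count sentence pairs and whitespace-delimited words on each side.

    The column names are assumptions; adjust them to the real CSV header.
    """
    pairs = 0
    src_words = 0
    tgt_words = 0
    for row in rows:
        pairs += 1
        src_words += len(row[src_col].split())
        tgt_words += len(row[tgt_col].split())
    return {
        "pairs": pairs,
        "src_words": src_words,
        "tgt_words": tgt_words,
        "total_words": src_words + tgt_words,
    }


# Illustrative two-row sample, not taken from the corpus.
sample = io.StringIO(
    "english,hausa\n"
    "Good morning.,Ina kwana.\n"
    "Thank you very much.,Na gode sosai.\n"
)
stats = corpus_stats(csv.DictReader(sample))
print(stats)
```

To run it on the actual file, replace the sample with `open(path, newline="", encoding="utf-8")`. If the result differs from the published counts, a different tokenization rule is the most likely cause.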
Processing
• Normalize Unicode (NFC)
• Standardize punctuation and spacing
• Verify sentence alignment
• Remove duplicates
• Filter noisy or corrupted text
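Most of the steps listed above can be approximated with the Python standard library. This is a sketch under stated assumptions, not the pipeline actually used to build the corpus: the punctuation rule, the noise heuristic (empty strings and the Unicode replacement character) and the function names are all hypothetical. Alignment verification is left out because it typically needs human or model-based review.

```python
import re
import unicodedata


def clean_text(text):
    """Apply NFC normalization, then tidy whitespace and punctuation spacing."""
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove spaces before common punctuation marks (assumed rule).
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text


def is_noisy(text):
    """Flag empty strings and text containing U+FFFD (assumed heuristic)."""
    return not text or "\ufffd" in text


def clean_pairs(pairs):
    """Clean each (source, target) pair, drop noisy pairs and exact duplicates."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = clean_text(src), clean_text(tgt)
        if is_noisy(src) or is_noisy(tgt):
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Illustrative input, not taken from the corpus.
raw = [
    ("Good  morning .", "Ina kwana."),
    ("Good morning.", "Ina kwana."),   # duplicate once cleaned
    ("Thank you.", "Na gode\ufffd"),   # corrupted target side
    ("", "Sannu."),                    # empty source side
]
result = clean_pairs(raw)
print(result)
```

Deduplication runs after cleaning so that pairs differing only in spacing or Unicode composition are treated as the same pair.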