Malayalam Time-Aligned Speech Corpus

License: CC-BY-NC-4.0

Steward: Community

Task: ASR

Release Date: 3/12/2026

Format: WAV, SRT

Size: 1.50 GB

Description

This dataset is a speaker-organized Malayalam speech corpus consisting of 100 audio recordings and 100 corresponding transcription files in .srt format. The transcriptions are time-aligned and include timestamps matched to the audio. The dataset contains recordings from 5 speakers, including 3 male and 2 female speakers, and the average length of each audio file is approximately 3 minutes. The data is arranged speaker-wise, making it easy to identify and work with each speaker’s recordings and transcriptions separately. This dataset is suitable for automatic speech recognition, forced alignment, speech-text synchronization, subtitle alignment, speech segmentation, and Malayalam speech technology development.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended for research, educational, and language technology development purposes only, and users must not use it in ways that violate speaker privacy, misrepresent the content, or cause harm.

Forbidden Usage

You agree not to identify speakers, clone or imitate their voices, or use this dataset to train chatbots, large language models, or any system intended for deceptive, harmful, or privacy-violating purposes.

Processes

Ethical Review

All speakers were informed about the purpose of the data collection, the intended research and computational uses of their recordings and transcriptions, and any relevant privacy considerations before their participation.

Intended Use

This dataset is intended for use in creating automatic speech recognition systems.

Metadata

Language

Malayalam is a Dravidian language spoken mainly in Kerala and nearby regions of India, with additional speaker communities in the diaspora. It is a major literary and spoken language with broad use in education, media, and everyday communication, making it important for speech technology and language resource development.

Script

Malayalam is written in the Malayalam script, an abugida of the Brahmic family. The core alphabet includes independent vowels അ, ആ, ഇ, ഈ, ഉ, ഊ, ഋ, എ, ഏ, ഐ, ഒ, ഓ, ഔ; consonants ക, ഖ, ഗ, ഘ, ങ, ച, ഛ, ജ, ഝ, ഞ, ട, ഠ, ഡ, ഢ, ണ, ത, ഥ, ദ, ധ, ന, പ, ഫ, ബ, ഭ, മ, യ, ര, ല, വ, ശ, ഷ, സ, ഹ, ള, ഴ, റ; and additional signs such as the anusvara ം, visarga ഃ, chandrakala ്, and vowel signs used in orthographic combinations.

Source

This dataset is a Malayalam speech corpus consisting of audio recordings paired with time-aligned transcriptions. It includes 5 speakers in total, with 3 male and 2 female speakers. The data is organized speaker-wise, with separate audio and transcription files for each speaker.

Size

The dataset contains 100 audio files and 100 corresponding .srt transcription files. The average length of each audio file is approximately 3 minutes, and the total duration of the dataset is approximately 6 hours of speech.

Structure

The dataset contains two main folders: one for audio and one for transcription.
Within each main folder, the data is further organized by speaker number.
The speaker-numbered folder in the audio directory corresponds directly to the same speaker-numbered folder in the transcription directory.
There are 5 speakers in total: 3 male and 2 female.
The dataset includes 100 audio files and 100 transcription files in .srt format, one .srt file per audio file.
Each .srt file contains timestamped, time-aligned transcription segments.
Audio and transcription files are grouped by speaker, for example Speaker 1 audio with Speaker 1 transcription.
The average duration of each audio file is approximately 3 minutes, and the total audio duration is approximately 6 hours.
This structure supports speech-text alignment, speaker-based analysis, and corpus organization.
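
The folder layout described above can be walked with a short script. The following is a minimal sketch, not part of the dataset: the directory names "audio" and "transcription", the use of one subfolder per speaker, and the assumption that a WAV file and its .srt file share the same base name are all hypothetical and should be adjusted to the actual release.

```python
from pathlib import Path


def pair_files(root, audio_dir="audio", text_dir="transcription"):
    """Return (speaker, wav_path, srt_path) triples and a list of unmatched WAVs.

    Assumed layout (hypothetical names):
        root/audio/<speaker>/<name>.wav
        root/transcription/<speaker>/<name>.srt
    """
    root = Path(root)
    pairs, missing = [], []
    # Each speaker folder under the audio directory is assumed to have a
    # same-named folder under the transcription directory.
    for wav in sorted((root / audio_dir).glob("*/*.wav")):
        speaker = wav.parent.name
        srt = root / text_dir / speaker / (wav.stem + ".srt")
        if srt.exists():
            pairs.append((speaker, wav, srt))
        else:
            missing.append(wav)
    return pairs, missing
```

Returning the unmatched files separately, rather than raising an error, makes it possible to report every pairing problem in one pass before any training starts.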

Domain

The dataset contains spoken Malayalam and is useful for speech and language technology tasks such as automatic speech recognition, forced alignment, speech segmentation, subtitle alignment, and Malayalam corpus development.

Recommended Processing

Standardize audio files into a consistent computational format such as WAV, with uniform sampling rate, bit depth, and channel settings.
Match each audio file with its corresponding .srt transcription file.
Extract timestamps and transcription segments from .srt files into structured formats such as CSV or JSON.
Normalize Malayalam Unicode text to ensure consistent character representation.
Clean transcription text by removing formatting inconsistencies, extra whitespace, and subtitle artifacts where needed.
Segment the data into utterance-level units using the .srt timestamps.
Preserve speaker-wise folder structure for speaker-based modeling and analysis.
Create metadata tables linking file name, speaker ID, gender, duration, timestamps, and transcription text.
Validate file pairings and timestamp consistency before model training or corpus analysis.
Prepare the processed data for downstream tasks such as ASR, forced alignment, speech segmentation, subtitle alignment, and corpus indexing.
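
Several of the steps above (checking audio format, normalizing text, validating timestamps, and cutting utterances) can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the dataset's own tooling: the function names are hypothetical, the audio is assumed to be uncompressed PCM WAV, and each cue is assumed to be a dict with "index", "start_ms", and "end_ms" keys holding integer milliseconds. NFC normalization is used here as one reasonable choice; note that NFC does not unify legacy chillu sequences (consonant, chandrakala, zero-width joiner) with the atomic chillu characters, so a custom mapping would be needed if both occur.

```python
import re
import unicodedata
import wave


def wav_format(path):
    """Return (sample_rate, channels, bits_per_sample) for a PCM WAV file."""
    with wave.open(str(path), "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth() * 8


def clean_text(text):
    """Normalize Unicode, drop subtitle markup, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>", "", text)  # subtitle tags such as <i>...</i>
    return re.sub(r"\s+", " ", text).strip()


def check_timestamps(segments):
    """Return a list of problems; an empty list means the cues look consistent."""
    problems = []
    prev_end = 0
    for seg in segments:
        if seg["start_ms"] >= seg["end_ms"]:
            problems.append(f"cue {seg['index']}: start is not before end")
        if seg["start_ms"] < prev_end:
            problems.append(f"cue {seg['index']}: overlaps the previous cue")
        prev_end = seg["end_ms"]
    return problems


def cut_segment(wav_path, out_path, start_ms, end_ms):
    """Copy the frames between start_ms and end_ms into a new WAV file."""
    with wave.open(str(wav_path), "rb") as src:
        rate = src.getframerate()
        first = start_ms * rate // 1000
        count = end_ms * rate // 1000 - first
        src.setpos(first)
        frames = src.readframes(count)
        with wave.open(str(out_path), "wb") as dst:
            # Same rate, width, and channels as the source; the frame count
            # in the header is corrected when the file is closed.
            dst.setparams(src.getparams())
            dst.writeframes(frames)
```

Running wav_format over every file and collecting the distinct results is a quick way to see whether resampling or channel conversion is needed before training.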

Sample

1
00:00:00,016 --> 00:00:01,420
[silence]

2
00:00:01,445 --> 00:00:05,533
മനുഷ്യജീവിതത്തിൻറെ ഏറ്റവും വലിയ ശക്തിയും അടിത്തറയും കുടുംബമാണ്

3
00:00:05,558 --> 00:00:10,558
ഒരു കുഞ്ഞ് ഈ ലോകത്ത് ജനിക്കുന്ന നിമിഷം മുതൽ അതിന് സുരക്ഷയും സ്നേഹവും നൽകുന്നത് കുടുംബമാണ്.

4
00:00:10,583 --> 00:00:15,071
നമ്മുടെ ആദ്യ പാഠശാലയും ആദ്യ ഗുരുക്കന്മാരും നമ്മുടെ മാതാപിതാക്കളാണ്.
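
Cues in this form can be read into structured records for export to JSON or CSV. The sketch below is one possible parser, assuming well-formed cues separated by blank lines (a cue number, a time range, then one or more text lines); the function names and the millisecond-based field names are illustrative choices, not part of the dataset. The sample string reuses the first two cues shown above.

```python
import json
import re

TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(stamp):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp.strip()).groups())
    return h * 3_600_000 + m * 60_000 + s * 1000 + ms


def parse_srt(text):
    """Parse SRT text into a list of dicts with index, start_ms, end_ms, text."""
    segments = []
    text = text.lstrip("\ufeff")  # tolerate a byte-order mark at the start
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip blocks that do not look like a cue
        start, end = lines[1].split("-->")
        segments.append({
            "index": int(lines[0]),
            "start_ms": to_ms(start),
            "end_ms": to_ms(end),
            "text": " ".join(lines[2:]),
        })
    return segments


sample = """1
00:00:00,016 --> 00:00:01,420
[silence]

2
00:00:01,445 --> 00:00:05,533
മനുഷ്യജീവിതത്തിൻറെ ഏറ്റവും വലിയ ശക്തിയും അടിത്തറയും കുടുംബമാണ്
"""

segments = parse_srt(sample)
# ensure_ascii=False keeps the Malayalam text readable in the JSON output.
print(json.dumps(segments, ensure_ascii=False, indent=2))
```

Storing times as integer milliseconds avoids floating-point rounding when the timestamps are later converted to audio frame offsets. Markers such as [silence] are kept as ordinary text here, so a later step can decide whether to drop them.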