Malayalam Time-Aligned Speech Corpus
License: CC-BY-NC-4.0
Steward: Community
Task: ASR
Release Date: 3/12/2026
Format: WAV, SRT
Size: 1.50 GB
Description
This dataset is a speaker-organized Malayalam speech corpus consisting of 100 audio recordings and 100 corresponding transcription files in .srt format. The transcriptions are time-aligned and include timestamps matched to the audio. The dataset contains recordings from 5 speakers, including 3 male and 2 female speakers, and the average length of each audio file is approximately 3 minutes. The data is arranged speaker-wise, making it easy to identify and work with each speaker’s recordings and transcriptions separately. This dataset is suitable for automatic speech recognition, forced alignment, speech-text synchronization, subtitle alignment, speech segmentation, and Malayalam speech technology development.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is intended for research, educational, and language technology development purposes only, and users must not use it in ways that violate speaker privacy, misrepresent the content, or cause harm.
Forbidden Usage
You agree not to identify speakers, clone or imitate their voices, or use this dataset to train chatbots, large language models, or any system intended for deceptive, harmful, or privacy-violating purposes.
Processes
Ethical Review
All speakers were informed about the purpose of the data collection, the intended research and computational uses of their recordings and transcriptions, and any relevant privacy considerations before their participation.
Intended Use
This dataset is intended for use in creating automatic speech recognition systems.
Metadata
Language
Malayalam is a Dravidian language spoken mainly in Kerala and nearby regions of India, with additional speaker communities in the diaspora. It is a major literary and spoken language with broad use in education, media, and everyday communication, making it important for speech technology and language resource development.
Script
Malayalam is written in the Malayalam script, an abugida of the Brahmic family. The core alphabet includes independent vowels അ, ആ, ഇ, ഈ, ഉ, ഊ, ഋ, എ, ഏ, ഐ, ഒ, ഓ, ഔ; consonants ക, ഖ, ഗ, ഘ, ങ, ച, ഛ, ജ, ഝ, ഞ, ട, ഠ, ഡ, ഢ, ണ, ത, ഥ, ദ, ധ, ന, പ, ഫ, ബ, ഭ, മ, യ, ര, ല, വ, ശ, ഷ, സ, ഹ, ള, ഴ, റ; and additional signs such as the anusvara ം, visarga ഃ, chandrakala ്, and vowel signs used in orthographic combinations.
Source
This dataset is a Malayalam speech corpus consisting of audio recordings paired with time-aligned transcriptions. It includes 5 speakers in total, with 3 male and 2 female speakers. The data is organized speaker-wise, with separate audio and transcription files for each speaker.
Size
The dataset contains 100 audio files and 100 corresponding .srt transcription files. The average length of each audio file is approximately 3 minutes, and the total duration of the dataset is approximately 6 hours of speech.
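The two approximate figures above do not multiply out exactly (100 files at 3 minutes each is 5 hours), so it can be worth measuring the total directly. The following is a minimal sketch, assuming the audio files are PCM WAV files readable by Python's standard `wave` module; the function names `wav_seconds` and `total_hours` are hypothetical and not part of the dataset.

```python
import wave
from pathlib import Path


def wav_seconds(path) -> float:
    """Duration of one WAV file in seconds (frame count / sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(audio_root) -> float:
    """Sum the durations of every .wav file found under audio_root."""
    seconds = sum(wav_seconds(p) for p in Path(audio_root).rglob("*.wav"))
    return seconds / 3600
```

Calling `total_hours` on the audio folder gives a figure to compare against the stated total.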
Structure
The dataset contains two main folders: one for audio and one for transcription.
Within each main folder, the data is further organized by speaker number.
The speaker-numbered folder in the audio directory corresponds directly to the same speaker-numbered folder in the transcription directory.
The dataset is organized speaker-wise.
It contains 5 speakers in total.
The speaker distribution is 3 male and 2 female.
The dataset includes 100 audio files.
The dataset includes 100 transcription files in .srt format.
Each audio file has a corresponding .srt transcription file.
Each .srt file contains timestamped and time-aligned transcription segments.
Audio and transcription files are grouped by speaker, such as Speaker 1 audio and Speaker 1 transcription.
The average duration of each audio file is approximately 3 minutes.
The total audio duration is approximately 6 hours.
The structure supports speech-text alignment, speaker-based analysis, and corpus organization.
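Because the audio and transcription trees mirror each other, files can be paired by their path relative to each main folder. The sketch below assumes the two main folders are named `audio` and `transcription` and that paired files share the same base name inside the same speaker folder; these names, and the function `pair_files`, are illustrative assumptions, since the card does not state the exact folder or file names.

```python
from pathlib import Path


def pair_files(root, audio_dir="audio", transcript_dir="transcription"):
    """Pair .wav and .srt files that share a speaker folder and base name.

    Returns (pairs, unmatched): pairs is a list of (wav_path, srt_path),
    unmatched lists the relative keys present in only one of the two trees.
    """
    root = Path(root)
    audio_root = root / audio_dir
    srt_root = root / transcript_dir
    # Key each file by its path relative to its main folder, minus the extension,
    # e.g. "speaker_1/rec_01" (a hypothetical name).
    audio = {p.relative_to(audio_root).with_suffix(""): p
             for p in audio_root.rglob("*.wav")}
    subs = {p.relative_to(srt_root).with_suffix(""): p
            for p in srt_root.rglob("*.srt")}
    pairs = [(audio[k], subs[k]) for k in sorted(audio.keys() & subs.keys())]
    unmatched = sorted(audio.keys() ^ subs.keys())
    return pairs, unmatched
```

An empty `unmatched` list and 100 entries in `pairs` would match the counts described above.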
Domain
The dataset contains spoken Malayalam and is useful for speech and language technology tasks such as automatic speech recognition, forced alignment, speech segmentation, subtitle alignment, and Malayalam corpus development.
Recommended Processing
Standardize audio files into a consistent computational format such as WAV, with uniform sampling rate, bit depth, and channel settings.
Match each audio file with its corresponding .srt transcription file.
Extract timestamps and transcription segments from .srt files into structured formats such as CSV or JSON.
Normalize Malayalam Unicode text to ensure consistent character representation.
Clean transcription text by removing formatting inconsistencies, extra whitespace, and subtitle artifacts where needed.
Segment the data into utterance-level units using the .srt timestamps.
Preserve speaker-wise folder structure for speaker-based modeling and analysis.
Create metadata tables linking file name, speaker ID, gender, duration, timestamps, and transcription text.
Validate file pairings and timestamp consistency before model training or corpus analysis.
Prepare the processed data for downstream tasks such as ASR, forced alignment, speech segmentation, subtitle alignment, and corpus indexing.
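Two of the steps above, text normalization and timestamp validation, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a prescribed pipeline: NFC is assumed as the target normalization form, "subtitle artifacts" is assumed to mean simple markup tags such as `<i>`, and the names `clean_text` and `check_timing` are hypothetical.

```python
import re
import unicodedata

# Simple subtitle markup such as <i>...</i> or <b>...</b> (an assumed artifact type).
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def clean_text(text: str) -> str:
    """NFC-normalize, drop markup tags, and collapse runs of whitespace.

    Zero-width joiners are deliberately left alone, because they can be
    meaningful in Malayalam orthography.
    """
    text = unicodedata.normalize("NFC", text)
    text = TAG.sub("", text)
    return " ".join(text.split())


def check_timing(cues):
    """Check (start_ms, end_ms) pairs given in file order.

    Returns a list of human-readable problems; an empty list means every cue
    ends after it starts and no cue begins before an earlier one has ended.
    """
    problems = []
    latest_end = 0
    for number, (start, end) in enumerate(cues, start=1):
        if end <= start:
            problems.append(f"cue {number}: end is not after start")
        if start < latest_end:
            problems.append(f"cue {number}: overlaps previous cue")
        latest_end = max(latest_end, end)
    return problems
```

Running `check_timing` on every file before training gives an early warning about cues that would produce empty or overlapping utterance segments.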
Sample
1
00:00:00,016 --> 00:00:01,420
[silence]
2
00:00:01,445 --> 00:00:05,533
മനുഷ്യജീവിതത്തിൻറെ ഏറ്റവും വലിയ ശക്തിയും അടിത്തറയും കുടുംബമാണ്
3
00:00:05,558 --> 00:00:10,558
ഒരു കുഞ്ഞ് ഈ ലോകത്ത് ജനിക്കുന്ന നിമിഷം മുതൽ അതിന് സുരക്ഷയും സ്നേഹവും നൽകുന്നത് കുടുംബമാണ്.
4
00:00:10,583 --> 00:00:15,071
നമ്മുടെ ആദ്യ പാഠശാലയും ആദ്യ ഗുരുക്കന്മാരും നമ്മുടെ മാതാപിതാക്കളാണ്.
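The sample above follows the usual SRT cue layout: a counter line, a timing line, then the text. A minimal sketch of reading such cues into structured records follows. It assumes every timing line is preceded by a counter line, as in the sample, and it finds cues by their timing lines so that it works whether or not blank lines separate them. `Segment` and `parse_srt` are hypothetical names, not part of the dataset.

```python
import re
from dataclasses import dataclass

# One SRT timing line, e.g. "00:00:01,445 --> 00:00:05,533".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Segment:
    index: int
    start_ms: int
    end_ms: int
    text: str


def _to_ms(hours, minutes, seconds, millis) -> int:
    """Convert the four timestamp fields to integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content: str) -> list[Segment]:
    """Parse SRT text into segments, ignoring blank lines."""
    lines = [ln.strip() for ln in content.splitlines() if ln.strip()]
    timing_at = [i for i, ln in enumerate(lines) if TIMING.fullmatch(ln)]
    segments = []
    for n, pos in enumerate(timing_at):
        fields = TIMING.fullmatch(lines[pos]).groups()
        # Text runs up to the counter line that precedes the next timing line.
        stop = timing_at[n + 1] - 1 if n + 1 < len(timing_at) else len(lines)
        has_counter = pos > 0 and lines[pos - 1].isdecimal()
        index = int(lines[pos - 1]) if has_counter else n + 1
        text = " ".join(lines[pos + 1:stop])
        segments.append(Segment(index, _to_ms(*fields[:4]), _to_ms(*fields[4:]), text))
    return segments


# The first two cues of the sample above.
SAMPLE = """1
00:00:00,016 --> 00:00:01,420
[silence]

2
00:00:01,445 --> 00:00:05,533
മനുഷ്യജീവിതത്തിൻറെ ഏറ്റവും വലിയ ശക്തിയും അടിത്തറയും കുടുംബമാണ്
"""

segments = parse_srt(SAMPLE)
```

Keeping the times as integer milliseconds avoids floating-point rounding when cues are later used to cut audio. The resulting records can be written to CSV or JSON with the standard `csv` or `json` modules.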