Tamil Time-Aligned Speech Dataset

License:

CC-BY-NC-SA-4.0

Steward:

MirasAI

Task: ASR

Release Date: 4/6/2026

Format: OGG, SRT

Size: 37.11 MB


Description

The Tamil Time-Aligned Speech Dataset is a curated 5-hour speech corpus consisting of Tamil audio recordings paired with precise time-aligned transcriptions. The dataset is designed to support a wide range of speech and language technology tasks, including automatic speech recognition, forced alignment, speech segmentation, subtitle generation, and timestamp-aware linguistic analysis. By preserving the correspondence between spoken audio and textual content at the segment level, the dataset enables detailed study of pronunciation, timing, and spoken language structure. It is a useful resource for researchers, developers, and institutions working on Tamil speech technologies and low-resource language processing.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

No speaker identification, surveillance, or harmful use permitted.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in this dataset.

Processes

Intended Use

This dataset is intended for use in creating automatic speech recognition systems.

Metadata

Language

Tamil is a major Dravidian language spoken primarily in Tamil Nadu (India) and Sri Lanka, and by Tamil-speaking communities around the world. It has a long literary history and a rich linguistic tradition.

Script

அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன ஜ ஷ ஸ ஹ க்ஷ ஶ ா ி ீ ு ூ ெ ே ை ொ ோ ௌ ் ஂ ஃ

Data Structure

  • Main folders: Audio/ and Time-Aligned_Transcripts/

  • Speaker-wise organization: each main folder contains Speaker_1/ and Speaker_2/

  • Audio folder: stores speech recordings for each speaker

  • Transcript folder: stores corresponding time-aligned transcript files for each speaker

  • Parallel structure: transcript files follow the same speaker-based organization as the audio files
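Because the two folder trees mirror each other, each recording can be matched to its transcript by walking the speaker folders. The sketch below is a minimal illustration, not part of the dataset's tooling: `pair_files` is a hypothetical helper, the `.ogg` and `.srt` extensions are taken from the Format field, and the assumption that an audio file and its transcript share the same base file name is not confirmed by the source.

```python
from pathlib import Path


def pair_files(root):
    """Pair each audio file with the transcript in the matching speaker folder.

    Assumption (not confirmed by the dataset card): an audio file and its
    transcript share the same base name, e.g. clip.ogg and clip.srt.
    """
    root = Path(root)
    audio_root = root / "Audio"
    transcript_root = root / "Time-Aligned_Transcripts"

    pairs = []
    # Speaker_1/ and Speaker_2/ exist under both main folders.
    for audio_path in sorted(audio_root.glob("Speaker_*/*.ogg")):
        speaker = audio_path.parent.name
        srt_path = transcript_root / speaker / (audio_path.stem + ".srt")
        # Skip recordings whose transcript is missing instead of failing.
        if srt_path.exists():
            pairs.append((speaker, audio_path, srt_path))
    return pairs
```

Calling `pair_files("path/to/dataset")` returns a list of `(speaker, audio_path, transcript_path)` tuples; recordings without a same-named transcript are skipped silently, so comparing the result length against the number of audio files is a quick consistency check.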

Speaker Information

  • Speaker 1: Age: 30, Gender: Male, Region: Neelagiri

  • Speaker 2: Age: 30, Gender: Female, Region: Chengalpattu

Sample

1
00:00:00,023 --> 00:00:04,020
நவீன வாழ்க்கையில தொழில்நுட்பத்தோட பங்கு

2
00:00:04,057 --> 00:00:05,855
அனைவருக்கும்

3
00:00:05,880 --> 00:00:10,620
நம்மளோட வந்து நவீன வாழ்க்கையில தொழில் நுட்பத்தோட பங்கு வந்து என்னனா

4
00:00:10,644 --> 00:00:15,153
இந்த உலகத்தில வந்து தொழில் நுட்பம் இல்லாம சிந்திக்க முடியாது அளவுக்கு வளர்ந்துடுச்சு
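Each entry in the sample carries a cue number, a start time, an end time, and the transcribed text, which is the SubRip (SRT) cue structure. The following is a rough sketch of reading such cues into Python, assuming the transcript files use the `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing syntax shown in the sample and separate cues with blank lines. The names `parse_srt`, `to_ms`, and `SAMPLE` are illustrative and not part of the dataset.

```python
import re

# One cue: index, start time, end time, then text up to a blank line or end of input.
CUE = re.compile(
    r"(\d+)\s+"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s+-->\s+"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s+"
    r"(.*?)(?=\n\s*\n|\Z)",
    re.DOTALL,
)


def to_ms(hours, minutes, seconds, millis):
    """Convert the four timestamp fields to integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(text):
    """Return a list of cues with integer millisecond start and end times."""
    cues = []
    for match in CUE.finditer(text.strip()):
        fields = match.groups()
        cues.append({
            "index": int(fields[0]),
            "start_ms": to_ms(*fields[1:5]),
            "end_ms": to_ms(*fields[5:9]),
            # Collapse internal line breaks so multi-line cues become one string.
            "text": " ".join(fields[9].split()),
        })
    return cues


# The first two cues from the sample above.
SAMPLE = """1
00:00:00,023 --> 00:00:04,020
நவீன வாழ்க்கையில தொழில்நுட்பத்தோட பங்கு

2
00:00:04,057 --> 00:00:05,855
அனைவருக்கும்
"""

for cue in parse_srt(SAMPLE):
    print(cue["index"], cue["end_ms"] - cue["start_ms"], cue["text"])
```

Keeping times as integer milliseconds avoids floating-point rounding when computing segment durations or converting to sample offsets. When reading files from disk, opening them with `encoding="utf-8"` is a reasonable assumption for Tamil text, though the dataset card does not state the encoding.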