Kannada Time-Aligned Speech Corpus

License:

CC-BY-NC-SA-4.0

Steward:

MirasAI

Task: ASR

Release Date: 4/1/2026

Format: OGG, SRT

Size: 355.77 MB


Description

The Kannada Time-Aligned Speech Corpus is a 5-hour speech dataset of Kannada audio with time-aligned transcriptions. It is designed to support speech technology and research tasks such as automatic speech recognition, forced alignment, speech segmentation, pronunciation modeling, and spoken language analysis, and it can serve as a resource for developing and evaluating Kannada language technologies.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

Use is permitted with attribution for non-commercial purposes only, and any shared adaptations must be distributed under the same license terms.

Forbidden Usage

Forbidden uses include commercial use, redistribution without proper attribution, and sharing modified versions under a different license.

Processes

Intended Use

This dataset is intended for use in speech technology and language research, including automatic speech recognition, forced alignment, speech-text matching, and spoken Kannada language processing.

Metadata

Language

Kannada is a major Dravidian language primarily spoken in the Indian state of Karnataka and by Kannada-speaking communities in other parts of India and abroad. It has a long literary history, a rich written tradition, and its own script. Kannada is widely used in education, media, administration, literature, and everyday communication, making it one of the most important languages of South India.

Data Structure

The dataset is organized into two main folders:

  • Audio/ — contains the Kannada speech recordings

  • Transcription/ — contains the corresponding text transcriptions for each audio file

Each transcription file corresponds to an audio file, making the dataset easy to use for speech processing, alignment, and transcription-based tasks.
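
The pairing described above can be sketched in Python as follows. This is a minimal sketch, not the steward's own tooling. It assumes that audio files end in .ogg and transcriptions in .srt (taken from the Format field), and that each pair shares a filename stem, which the card does not state explicitly. The function name pair_files and the root path are hypothetical.

```python
from pathlib import Path


def pair_files(root):
    """Pair each audio file with the transcription that shares its stem.

    Assumption: Audio/<name>.ogg matches Transcription/<name>.srt.
    Returns (pairs, unmatched), where unmatched lists stems that appear
    in only one of the two folders.
    """
    root = Path(root)
    audio = {p.stem: p for p in (root / "Audio").glob("*.ogg")}
    subs = {p.stem: p for p in (root / "Transcription").glob("*.srt")}

    # Stems present in both folders form usable pairs.
    pairs = [(audio[s], subs[s]) for s in sorted(audio.keys() & subs.keys())]
    # Stems present in only one folder need manual inspection.
    unmatched = sorted(audio.keys() ^ subs.keys())
    return pairs, unmatched
```

Calling pair_files on the folder that contains Audio/ and Transcription/ returns the matched pairs plus any stems that lack a counterpart, which is a quick way to confirm the one-to-one correspondence before training or alignment.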

Speaker Information

The dataset includes recordings from two native Kannada speakers:

  • Speaker 1: Male, 32 years old

  • Speaker 2: Female, 39 years old

With one male and one female speaker, the corpus offers basic but limited speaker diversity in gender and age.

Recommended Processing

  • Verify audio quality

  • Normalize transcription text

  • Match audio and transcription filenames

  • Check alignment consistency

  • Remove noisy or corrupted files

  • Standardize formats and metadata
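
Two of the steps above, text normalization and alignment checks, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: Unicode NFC with whitespace collapsing is one reasonable normalization choice rather than a requirement of the corpus, and cues are assumed to be already available as (start_ms, end_ms, text) tuples. The function names normalize_text and check_cues are hypothetical.

```python
import unicodedata


def normalize_text(text):
    """Apply Unicode NFC and collapse runs of whitespace (assumed policy)."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def check_cues(cues):
    """Return a list of problems found in (start_ms, end_ms, text) cues."""
    problems = []
    prev_end = 0
    for i, (start, end, text) in enumerate(cues, 1):
        if end <= start:
            problems.append(f"cue {i}: end is not after start")
        if start < prev_end:
            problems.append(f"cue {i}: overlaps previous cue")
        if not text.strip():
            problems.append(f"cue {i}: empty text")
        # Track the latest end time seen so far for the overlap check.
        prev_end = max(prev_end, end)
    return problems
```

An empty list from check_cues means the cues are ordered, non-overlapping, and non-empty. It does not confirm that the text matches the audio, which still needs listening checks or a forced aligner.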

Sample

1
00:00:00,001 --> 00:00:02,956
ನಾನು ಇಂದು ಶಿಕ್ಷಣದ ಬಗ್ಗೆ ಮಾತನಾಡಲ್ಲ ಶಿಕ್ಷಣದ

2
00:00:02,980 --> 00:00:04,783
ಮಹತ್ವದ ಬಗ್ಗೆ

3
00:00:04,807 --> 00:00:06,031
ಮಾತನಾಡಲು ಹೊರಟಿದ್ದೇನೆ.
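
A sample in this layout can be read with a few lines of Python. The sketch below assumes the files follow the usual SRT pattern shown above (index line, timestamp line, one or more text lines, blank line between cues) and are UTF-8 encoded. The names to_ms and parse_srt are hypothetical helpers, not part of the dataset.

```python
import re

# Matches timestamps such as 00:00:02,956 (a period is also accepted).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(stamp):
    """Convert an SRT timestamp to integer milliseconds."""
    h, m, s, ms = map(int, TIMESTAMP.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Parse SRT text into a list of (start_ms, end_ms, text) tuples."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Skip blocks that lack a timestamp line in second position.
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = (to_ms(part) for part in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:]).strip()))
    return cues
```

Applied to the sample above, parse_srt yields three cues, the first spanning 1 ms to 2956 ms. Reading files with encoding="utf-8-sig" avoids problems if a byte order mark is present.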