ReRooted: Speech Corpus of Testimonials from Armenian Refugees and Immigrants

License:

GPL-3.0

Steward:

Rerooted Archive

Task: ASR

Release Date: 11/18/2025

Format: WAV, TEXTGRID

Size: 3.25 GB


Description

ReRooted is an open-access online corpus of interviews with Armenian refugees and immigrants, hosted on YouTube. The full corpus currently holds over 80 hours of recordings, with Armenian subtitles from Amara. This dataset is a 10-hour subset that we have curated, with corrected and finely annotated transcripts. It is intended for use in NLP research, and we hope to keep updating it with more curated transcripts.

Specifics

Licensing

GNU General Public License v3.0 or later (GPL-3.0-or-later)

https://spdx.org/licenses/GPL-3.0-or-later.html

Considerations

Forbidden Usage

The use of this data to train, generate, or create synthetic voice clones or deepfake video/audio likenesses of any person featured is explicitly prohibited.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Contents

Speaker                 Duration (h:mm:ss)
Ani Avakian             0:42:56
Ara Boudakian           1:04:00
Dikran Sazian           1:26:00
Garbis Arabatlian       0:34:25
Hagop Kereshian         0:30:51
Kevork Mouradian        0:47:50
Pardy Minassian         0:54:00
Terez Barsoum           0:22:18
Zarouhi Hamalian        0:43:29
Talar Berberian         0:56:48
Talene                  0:38:52
Vartanoush Shitilian    0:33:02
Anjel Iranian           0:45:35
Shushanik Nargozian     0:20:03
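
As a quick sanity check on the "10 hours" figure, the per-speaker durations listed above can be summed. The snippet below is only an illustrative sketch: the durations are copied by hand from the list, and the helper names `to_seconds` and `to_hms` are made up for this example.

```python
# Durations copied from the speaker list above (h:mm:ss).
DURATIONS = [
    "0:42:56", "1:04:00", "1:26:00", "0:34:25", "0:30:51",
    "0:47:50", "0:54:00", "0:22:18", "0:43:29", "0:56:48",
    "0:38:52", "0:33:02", "0:45:35", "0:20:03",
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Convert a number of seconds back into h:mm:ss."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_seconds = sum(to_seconds(d) for d in DURATIONS)
print(to_hms(total_seconds))  # 10:20:09
```

The 14 recordings add up to a little over 10 hours, which matches the size stated in the description.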

For the first pass, the original SRT transcripts were automatically converted into TextGrids using SrtToTextgrid. The TextGrids were then manually cleaned up to add missing words and to re-align the utterance boundaries. Cleaning up 1 hour of video took around 10 hours. Priority was given to the Interviewee's speech; the Interviewer's speech was corrected afterwards.
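
To give a feel for what the automatic first pass involves, here is a minimal sketch of reading SRT cues into (start, end, text) triples, which is the information an interval tier needs. This is not the SrtToTextgrid tool and makes no claim about how it works; the function name `parse_srt` and the sample cues are invented for illustration.

```python
import re

# Matches an SRT timing line such as "00:00:04,360 --> 00:00:07,053".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(srt_text: str) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) for each cue in an SRT string."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the subtitle text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((start, end, text))
                break
    return cues


# Made-up sample cues (placeholder text, not taken from the corpus).
SAMPLE_SRT = """1
00:00:04,360 --> 00:00:07,053
first cue

2
00:00:10,970 --> 00:00:12,741
second cue,
continued
"""

for start, end, text in parse_srt(SAMPLE_SRT):
    print(f"{start:.3f}\t{end:.3f}\t{text}")
```

Subtitle cues usually leave gaps and rarely line up exactly with speech, which is why boundaries produced this way still need the manual re-alignment described above.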

This dataset contains the cleaned-up TextGrids and the accompanying WAV sound files.

Sample TextGrid

The following excerpt is taken from the AniAvakian.TextGrid file. It shows the first two intervals from each of the file's 3 tiers (Interviewee, Interviewer and Notes):

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 2576.435 
tiers? <exists> 
size = 3 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "Interviewee" 
        xmin = 0 
        xmax = 2576.435 
        intervals: size = 1490 
        intervals [1]:
            xmin = 0 
            xmax = 10.970132923614393 
            text = "" 
        intervals [2]:
            xmin = 10.970132923614393 
            xmax = 12.741053044197592 
            text = "Օքէյ, նախահայրերս,  " 

    [... snip ...]

    item [2]:
        class = "IntervalTier" 
        name = "Interviewer" 
        xmin = 0 
        xmax = 2576.435 
        intervals: size = 338 
        intervals [1]:
            xmin = 0 
            xmax = 4.360865870301772 
            text = "" 
        intervals [2]:
            xmin = 4.360865870301772 
            xmax = 7.053409517047024 
            text = "Ուրեմ---, հիմ--- կը սկսինք, եթէ ըմմ " 

    [... snip ...]

    item [3]:
        class = "IntervalTier" 
        name = "Notes" 
        xmin = 0 
        xmax = 2576.435 
        intervals: size = 214 
        intervals [1]:
            xmin = 0 
            xmax = 10.970132923614393 
            text = "" 
        intervals [2]:
            xmin = 10.970132923614393 
            xmax = 13.531393243288772 
            text = "interruption" 
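
For readers who want to load these files without extra dependencies, the sketch below reads interval tiers from a TextGrid in the long text format shown above, using regular expressions. It assumes the file has already been decoded into a Python string and that every tier is an IntervalTier, as in the excerpt; the name `read_interval_tiers` is hypothetical, and a dedicated TextGrid library may be a better choice for production use.

```python
import re

# Start of a tier: item [n], its class, and its name.
TIER_HEADER = re.compile(
    r'item \[\d+\]:\s*class = "IntervalTier"\s*name = "([^"]*)"'
)
# One interval: xmin, xmax, and a quoted text (Praat doubles any inner quotes).
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def read_interval_tiers(textgrid: str) -> dict[str, list[tuple[float, float, str]]]:
    """Map each interval tier name to its (xmin, xmax, text) intervals."""
    tiers = {}
    headers = list(TIER_HEADER.finditer(textgrid))
    for index, header in enumerate(headers):
        # A tier's body runs from its header to the next tier header (or end of file).
        end = headers[index + 1].start() if index + 1 < len(headers) else len(textgrid)
        body = textgrid[header.end():end]
        tiers[header.group(1)] = [
            (float(lo), float(hi), text.replace('""', '"'))
            for lo, hi, text in INTERVAL.findall(body)
        ]
    return tiers


# Trimmed-down sample based on the excerpt above (two tiers, two intervals each).
SAMPLE = '''
    item [1]:
        class = "IntervalTier"
        name = "Interviewee"
        xmin = 0
        xmax = 2576.435
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 10.970132923614393
            text = ""
        intervals [2]:
            xmin = 10.970132923614393
            xmax = 12.741053044197592
            text = "Օքէյ, նախահայրերս,  "
    item [2]:
        class = "IntervalTier"
        name = "Notes"
        xmin = 0
        xmax = 2576.435
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 10.970132923614393
            text = ""
        intervals [2]:
            xmin = 10.970132923614393
            xmax = 13.531393243288772
            text = "interruption"
'''

tiers = read_interval_tiers(SAMPLE)
for name, intervals in tiers.items():
    # Skip empty intervals, which mark silence or the other speaker's turn.
    spoken = [(lo, hi, text.strip()) for lo, hi, text in intervals if text.strip()]
    print(name, spoken)
```

For ASR training, the non-empty intervals of the Interviewee and Interviewer tiers give utterance-level (start, end, transcript) triples that can be used to cut the matching WAV file into segments.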

Metadata

The metadata.tsv file contains interviewee and interview details, as well as links to the original rerooted.org pages, YouTube videos, SRT subtitles, and WAV audio files. Below is a sample entry from that file:

(Columns are shown as rows for readability.)

Column                         Value
Interviewee name (English)     Ani Avakian
Interviewee name (Armenian)    Անի Ավագեան
Rerooted link                  link
Gender                         F
Current location               Yerevan, Armenia
City of birth                  Aleppo, Syria
Ancestral home                 Marash
Age                            51
Date of interview              16-Aug-17
Subtitle version (English)     Rev 336 (2020)
Subtitle version (Armenian)    Rev 67 (2020)
Length                         0:42:56
AMARA link (subtitles)         link
YouTube link                   link
Audio file link                link
path_to_textgrid               TextGrids/AniAvakian.TextGrid
path_to_audio                  Audios/AniAvakian.wav
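
The path_to_textgrid and path_to_audio columns make it straightforward to pair each transcript with its recording. The following is a minimal sketch using Python's standard csv module. It assumes the header names in metadata.tsv match the sample entry above exactly and that the file is UTF-8 encoded; `load_pairs` is a hypothetical helper name, and the in-memory sample holds only a few of the real columns.

```python
import csv
import io

# One header row and one data row, using only the columns needed here.
# The real metadata.tsv has more columns (see the sample entry above).
SAMPLE_TSV = (
    "Interviewee name (English)\tLength\tpath_to_textgrid\tpath_to_audio\n"
    "Ani Avakian\t0:42:56\tTextGrids/AniAvakian.TextGrid\tAudios/AniAvakian.wav\n"
)


def load_pairs(tsv_file) -> list[dict[str, str]]:
    """Return one dict per interview with its name, TextGrid path and audio path."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [
        {
            "name": row["Interviewee name (English)"],
            "textgrid": row["path_to_textgrid"],
            "audio": row["path_to_audio"],
        }
        for row in reader
    ]


pairs = load_pairs(io.StringIO(SAMPLE_TSV))
print(pairs)
# With the real file (encoding is an assumption):
# with open("metadata.tsv", encoding="utf-8", newline="") as handle:
#     pairs = load_pairs(handle)
```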

Contributing

Please contact us if you are interested in the remaining 70 hours of the original ReRooted corpus — especially if you speak Armenian and wish to collaborate on preparing the TextGrid files.