ReRooted: Speech Corpus of Testimonials from Armenian Refugees and Immigrants
License:
GPL-3.0
Steward:
Rerooted Archive
Task: ASR
Release Date: 11/18/2025
Format: WAV, TEXTGRID
Size: 3.25 GB
Description
ReRooted is an open-access online corpus of interviews with Armenian refugees and immigrants, hosted on YouTube. The full corpus currently holds over 80 hours of recordings, accompanied by Armenian subtitles from Amara. This dataset is a 10-hour subset that we have curated with corrected and finely annotated transcripts. It is intended for use in NLP research, and we hope to keep extending it with more curated transcripts.
Specifics
Licensing
GNU General Public License v3.0 or later (GPL-3.0-or-later)
https://spdx.org/licenses/GPL-3.0-or-later.html
Considerations
Forbidden Usage
The use of this data to train, generate, or create synthetic voice clones or deepfake video/audio likenesses of any person featured is explicitly prohibited.
Processes
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
Metadata
Contents
| Speaker | Duration (hh:mm:ss) |
|---|---|
| Ani Avakian | 0:42:56 |
| Ara Boudakian | 1:04:00 |
| Dikran Sazian | 1:26:00 |
| Garbis Arabatlian | 0:34:25 |
| Hagop Kereshian | 0:30:51 |
| Kevork Mouradian | 0:47:50 |
| Pardy Minassian | 0:54:00 |
| Terez Barsoum | 0:22:18 |
| Zarouhi Hamalian | 0:43:29 |
| Talar Berberian | 0:56:48 |
| Talene | 0:38:52 |
| Vartanoush Shitilian | 0:33:02 |
| Anjel Iranian | 0:45:35 |
| Shushanik Nargozian | 0:20:03 |
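As a quick sanity check on the "10 hours" figure, the per-speaker durations can be summed. The sketch below copies the values from the table above; the variable and function names are illustrative only.

```python
from datetime import timedelta

# Durations copied from the Contents table (h:mm:ss), in table order.
DURATIONS = [
    "0:42:56", "1:04:00", "1:26:00", "0:34:25", "0:30:51",
    "0:47:50", "0:54:00", "0:22:18", "0:43:29", "0:56:48",
    "0:38:52", "0:33:02", "0:45:35", "0:20:03",
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = sum(to_seconds(d) for d in DURATIONS)
print(timedelta(seconds=total_seconds))  # 10:20:09
```

The 14 recordings therefore add up to roughly 10 hours and 20 minutes of audio.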
For the first pass, the original SRT transcripts were automatically converted into TextGrids using SrtToTextgrid. The TextGrids were then manually cleaned up to restore missing words and to realign the utterance boundaries. Cleaning up 1 hour of video took around 10 hours. Priority was given to editing the Interviewee's speech first, after which the Interviewer's speech was fixed.
This dataset contains the cleaned-up TextGrids and the accompanying WAV sound files.
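To illustrate what the automatic first pass involves, here is a minimal sketch of turning SRT cues into contiguous tier intervals. It is not the SrtToTextgrid implementation: the function names are hypothetical, and it assumes plain SRT timing lines without trailing position settings. Gaps between cues are filled with empty intervals, since a TextGrid interval tier covers the whole recording.

```python
import re

# Matches SRT timestamps such as 00:01:02,345 (some files use '.' instead of ',').
STAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def stamp_to_seconds(stamp):
    match = STAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text):
    """Return a list of (start, end, caption) cues, with times in seconds."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                caption = " ".join(lines[index + 1:]).strip()
                cues.append((stamp_to_seconds(start), stamp_to_seconds(end), caption))
                break
    return cues


def fill_gaps(cues, xmax):
    """Make cues contiguous from 0 to xmax by inserting empty intervals."""
    intervals, cursor = [], 0.0
    for start, end, caption in sorted(cues):
        start = max(start, cursor)  # clip cues that overlap the previous one
        if end <= start:
            continue
        if start > cursor:
            intervals.append((cursor, start, ""))
        intervals.append((start, end, caption))
        cursor = end
    if cursor < xmax:
        intervals.append((cursor, xmax, ""))
    return intervals


demo_srt = """1
00:00:01,000 --> 00:00:02,500
hello

2
00:00:04,000 --> 00:00:05,000
world
"""
intervals = fill_gaps(parse_srt(demo_srt), xmax=6.0)
```

Subtitle timings produced this way are only approximate, which is why the manual realignment pass described above was needed.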
Sample TextGrid
The following excerpt is taken from the AniAvakian.TextGrid file. It shows the first two intervals from each of the file's 3 tiers: Interviewee, Interviewer, and Notes.
File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 2576.435
tiers? <exists>
size = 3
item []:
item [1]:
class = "IntervalTier"
name = "Interviewee"
xmin = 0
xmax = 2576.435
intervals: size = 1490
intervals [1]:
xmin = 0
xmax = 10.970132923614393
text = ""
intervals [2]:
xmin = 10.970132923614393
xmax = 12.741053044197592
text = "Օքէյ, նախահայրերս, "
[... snip ...]
item [2]:
class = "IntervalTier"
name = "Interviewer"
xmin = 0
xmax = 2576.435
intervals: size = 338
intervals [1]:
xmin = 0
xmax = 4.360865870301772
text = ""
intervals [2]:
xmin = 4.360865870301772
xmax = 7.053409517047024
text = "Ուրեմ---, հիմ--- կը սկսինք, եթէ ըմմ "
[... snip ...]
item [3]:
class = "IntervalTier"
name = "Notes"
xmin = 0
xmax = 2576.435
intervals: size = 214
intervals [1]:
xmin = 0
xmax = 10.970132923614393
text = ""
intervals [2]:
xmin = 10.970132923614393
xmax = 13.531393243288772
text = "interruption"
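For readers who want to load these files without Praat, the following is a minimal sketch of a reader for the long TextGrid format shown above. It is a simplified illustration, not a full parser: it assumes interval tiers only and that each `text` value sits on a single line. The function name is hypothetical, and the file encoding is left to the caller because it is not stated here.

```python
import re


def parse_textgrid(text):
    """Return {tier name: [(xmin, xmax, text), ...]} from long-format TextGrid text."""
    tiers = {}
    current = None   # interval list of the tier being read
    interval = None  # fields of the interval being read
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r'name = "(.*)"$', line)
        if match:
            current = tiers.setdefault(match.group(1), [])
            interval = None
            continue
        if re.match(r"intervals \[\d+\]:?$", line):
            interval = {}
            continue
        if current is None or interval is None:
            continue  # header lines and tier-level xmin/xmax are skipped
        match = re.match(r"(xmin|xmax) = (\S+)$", line)
        if match:
            interval[match.group(1)] = float(match.group(2))
            continue
        match = re.match(r'text = "(.*)"$', line)
        if match:
            label = match.group(1).replace('""', '"')  # Praat doubles inner quotes
            current.append((interval["xmin"], interval["xmax"], label))
            interval = None
    return tiers


# Shortened copy of the excerpt above (Interviewee tier only).
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 2576.435
item [1]:
class = "IntervalTier"
name = "Interviewee"
xmin = 0
xmax = 2576.435
intervals: size = 1490
intervals [1]:
xmin = 0
xmax = 10.970132923614393
text = ""
intervals [2]:
xmin = 10.970132923614393
xmax = 12.741053044197592
text = "Օքէյ, նախահայրերս, "
'''

tiers = parse_textgrid(SAMPLE)
# Keep only intervals that carry speech, dropping the empty padding intervals.
speech = [item for item in tiers["Interviewee"] if item[2].strip()]
```

For ASR training, each non-empty interval can then be paired with the matching time span of the WAV file.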
Metadata
The metadata.tsv file contains interviewee and interview details, as well as links to the original rerooted.org pages, YouTube videos, SRT subtitles, and WAV audio files. Below is a sample entry from that file:
| Interviewee name (English) | Interviewee name (Armenian) | Rerooted link | Gender | Current location | City of birth | Ancestral home | Age | Date of interview | Subtitle version (English) | Subtitle version (Armenian) | Length | AMARA link (subtitles) | YouTube link | Audio file link | path_to_textgrid | path_to_audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ani Avakian | Անի Ավագեան | link | F | Yerevan, Armenia | Aleppo, Syria | Marash | 51 | 16-Aug-17 | Rev 336 (2020) | Rev 67 (2020) | 0:42:56 | link | link | link | TextGrids/AniAvakian.TextGrid | Audios/AniAvakian.wav |
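The `path_to_textgrid` and `path_to_audio` columns make it straightforward to pair each transcript with its audio. The sketch below assumes that metadata.tsv is tab-separated with a header row matching the column names above. The `read_metadata` helper and the `root` parameter are hypothetical, and the demo uses an in-memory row with only a subset of the columns.

```python
import csv
import io
from pathlib import Path


def read_metadata(handle, root="."):
    """Read metadata rows and resolve TextGrid/WAV paths relative to `root`."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "speaker": row["Interviewee name (English)"],
            "length": row["Length"],
            "textgrid": Path(root) / row["path_to_textgrid"],
            "audio": Path(root) / row["path_to_audio"],
        })
    return rows


# In-memory stand-in for metadata.tsv, built from the sample entry above.
demo = io.StringIO(
    "Interviewee name (English)\tLength\tpath_to_textgrid\tpath_to_audio\n"
    "Ani Avakian\t0:42:56\tTextGrids/AniAvakian.TextGrid\tAudios/AniAvakian.wav\n"
)
pairs = read_metadata(demo)
```

With the real file, the same function could be called on a handle opened with `open("metadata.tsv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.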
Contributing
Please contact us if you are interested in the remaining 70 hours of the original ReRooted corpus — especially if you speak Armenian and wish to collaborate on preparing the TextGrid files.
