ReRooted: Speech Corpus of Testimonials from Armenian Refugees and Immigrants

Specifics

Licensing

GNU General Public License v3.0 or later (GPL-3.0-or-later)

https://spdx.org/licenses/GPL-3.0-or-later.html

Considerations

Forbidden Usage

The use of this data to train, generate, or create synthetic voice clones or deepfake video/audio likenesses of any person featured is explicitly prohibited.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata


ReRooted is an open-access online corpus of interviews with Armenian refugees and immigrants, hosted on YouTube. The online corpus currently has over 80 hours of recordings, alongside Armenian subtitles from Amara. The current dataset is a subset of about 10 hours that we have curated with corrected and finely annotated transcripts. The dataset is intended for NLP research, and we hope to keep updating it with more curated transcripts.

Speaker	Duration (hh:mm:ss)
Ani Avakian	0:42:56
Ara Boudakian	1:04:00
Dikran Sazian	1:26:00
Garbis Arabatlian	0:34:25
Hagop Kereshian	0:30:51
Kevork Mouradian	0:47:50
Pardy Minassian	0:54:00
Terez Barsoum	0:22:18
Zarouhi Hamalian	0:43:29
Talar Berberian	0:56:48
Talene	0:38:52
Vartanoush Shitilian	0:33:02
Anjel Iranian	0:45:35
Shushanik Nargozian	0:20:03
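
As a quick sanity check on the 10-hour figure, the durations in the table above can be summed. The snippet below is a minimal sketch; the helper name `to_seconds` is ours and is not part of the dataset.

```python
# Durations copied from the speaker table above (h:mm:ss).
durations = [
    "0:42:56", "1:04:00", "1:26:00", "0:34:25", "0:30:51",
    "0:47:50", "0:54:00", "0:22:18", "0:43:29", "0:56:48",
    "0:38:52", "0:33:02", "0:45:35", "0:20:03",
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = sum(to_seconds(d) for d in durations)
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
total_hms = f"{hours}:{minutes:02d}:{seconds:02d}"
print(total_hms)  # 10:20:09
```

The 14 recordings therefore add up to a little over 10 hours of audio.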

In the first pass, the original SRT transcripts were automatically converted into TextGrids (using SrtToTextgrid). The TextGrids were then manually cleaned up to add missing words and to realign the utterance boundaries. Cleaning up 1 hour of video took around 10 hours. Priority was given to the Interviewee's speech; the Interviewer's speech was corrected afterwards.
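
To illustrate what the automatic conversion step involves, here is a minimal Python sketch that turns SRT cues into gap-free intervals of the kind a TextGrid tier requires. This is not the SrtToTextgrid tool itself: the function name `srt_to_intervals` and the way overlaps are handled are our own assumptions, chosen only for illustration.

```python
import re

# Matches an SRT timing line such as "00:00:10,970 --> 00:00:12,741".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def srt_to_intervals(srt_text, total_duration):
    """Return (xmin, xmax, text) tuples covering 0..total_duration with no gaps."""
    cues = []
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the subtitle text.
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break

    intervals, cursor = [], 0.0
    for start, end, text in sorted(cues):
        if end <= cursor:
            continue  # assumption: drop cues fully covered by an earlier cue
        if start > cursor:
            intervals.append((cursor, start, ""))  # empty interval for silence
        intervals.append((max(start, cursor), end, text))
        cursor = end
    if cursor < total_duration:
        intervals.append((cursor, total_duration, ""))  # trailing silence
    return intervals
```

Subtitle timings are usually coarse, which is why the resulting boundaries still needed the manual realignment described above.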

This dataset contains the cleaned-up TextGrids and the accompanying WAV sound files.

Sample TextGrid

The following excerpt is taken from the AniAvakian.TextGrid file. It shows the first two intervals from each of the file's three tiers (Interviewee, Interviewer, and Notes):

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 2576.435 
tiers? <exists> 
size = 3 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "Interviewee" 
        xmin = 0 
        xmax = 2576.435 
        intervals: size = 1490 
        intervals [1]:
            xmin = 0 
            xmax = 10.970132923614393 
            text = "" 
        intervals [2]:
            xmin = 10.970132923614393 
            xmax = 12.741053044197592 
            text = "Օքէյ, նախահայրերս,  " 

    [... snip ...]

    item [2]:
        class = "IntervalTier" 
        name = "Interviewer" 
        xmin = 0 
        xmax = 2576.435 
        intervals: size = 338 
        intervals [1]:
            xmin = 0 
            xmax = 4.360865870301772 
            text = "" 
        intervals [2]:
            xmin = 4.360865870301772 
            xmax = 7.053409517047024 
            text = "Ուրեմ---, հիմ--- կը սկսինք, եթէ ըմմ " 

    [... snip ...]

    item [3]:
        class = "IntervalTier" 
        name = "Notes" 
        xmin = 0 
        xmax = 2576.435 
        intervals: size = 214 
        intervals [1]:
            xmin = 0 
            xmax = 10.970132923614393 
            text = "" 
        intervals [2]:
            xmin = 10.970132923614393 
            xmax = 13.531393243288772 
            text = "interruption"
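
For readers who want to load these files without a dedicated TextGrid library, the following is a minimal sketch of a reader for the long text format shown above, using only the Python standard library. The function name `read_tiers` is hypothetical, and the sketch assumes every tier is an IntervalTier and that the file is read as UTF-8 text.

```python
import re

# One interval: its index line, xmin, xmax and quoted text.
# Inside a TextGrid string, a literal double quote is written as "".
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)
TIER_NAME = re.compile(r'name = "([^"]*)"')


def read_tiers(textgrid_text):
    """Map each tier name to a list of (xmin, xmax, text) tuples."""
    tiers = {}
    # Each tier starts with a line like "item [1]:"; the header comes first.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        name = TIER_NAME.search(chunk).group(1)
        tiers[name] = [
            (float(xmin), float(xmax), text.replace('""', '"'))
            for xmin, xmax, text in INTERVAL.findall(chunk)
        ]
    return tiers


# Usage (the path is taken from the metadata sample; adjust the encoding
# if your copy of the file is not UTF-8):
# with open("TextGrids/AniAvakian.TextGrid", encoding="utf-8") as f:
#     tiers = read_tiers(f.read())
# speech = [iv for iv in tiers["Interviewee"] if iv[2].strip()]
```

Filtering out intervals whose text is empty, as in the last commented line, leaves only the transcribed utterances.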

Metadata

The metadata.tsv file contains interviewee and interview details, as well as links to the original rerooted.org pages, YouTube videos, SRT subtitles, and WAV audio files. Below is a sample entry from that file:

Interviewee name (English)	Interviewee name (Armenian)	Rerooted link	Gender	Current location	City of birth	Ancestral home	Age	Date of interview	Subtitle version (English)	Subtitle version (Armenian)	Length	AMARA link (subtitles)	YouTube link	Audio file link	path_to_textgrid	path_to_audio
Ani Avakian	Անի Ավագեան	link	F	Yerevan, Armenia	Aleppo, Syria	Marash	51	16-Aug-17	Rev 336 (2020)	Rev 67 (2020)	0:42:56	link	link	link	TextGrids/AniAvakian.TextGrid	Audios/AniAvakian.wav
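
A minimal sketch of reading this file with Python's standard `csv` module follows. It assumes the file is tab-separated with the header row shown above and is encoded as UTF-8; the function name `load_metadata` is hypothetical.

```python
import csv


def load_metadata(tsv_lines):
    """Read tab-separated metadata rows into a list of dicts keyed by column name.

    `tsv_lines` can be an open text file or any iterable of lines.
    """
    return list(csv.DictReader(tsv_lines, delimiter="\t"))


# Usage (assumes metadata.tsv is in the current directory):
# with open("metadata.tsv", encoding="utf-8", newline="") as f:
#     for row in load_metadata(f):
#         print(row["Interviewee name (English)"],
#               row["path_to_textgrid"], row["path_to_audio"])
```

The `path_to_textgrid` and `path_to_audio` columns pair each transcript with its recording, so a loop like the one in the usage comment is enough to enumerate the corpus.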

Contributing

Please contact us if you are interested in the remaining 70 hours of the original ReRooted corpus, especially if you speak Armenian and wish to collaborate on preparing the TextGrid files.
