Finnish Public Domain 20th Century Literature Text Corpus

License:

CC0-1.0

Steward:

Taruen

Task: NLP

Release Date: 2/27/2026

Format: TXT

Size: 205.76 MB

Description

This corpus contains a curated collection of public domain literature from Finland by authors who died between 1901 and 1955. It captures the literary landscape of early 20th-century Finland and includes texts in both of the country's official languages: Finnish (fi) and Swedish (sv). The texts were programmatically extracted from Project Lönnrot, a volunteer-driven digital library. To keep the language relevant for modern NLP tasks, the extraction pipeline kept only works published in 1901 or later. The language code for each text was detected automatically using CLD algorithms. The corpus comprises approximately 69.2 million words across multiple plain text files. Each file begins with structured YAML front matter containing metadata (title, author, language, year, source URL, license), followed by the original project's boilerplate preamble enclosed in delimiter tags, and finally the literary text itself. All included works are fully in the public domain under Finnish and EU copyright law.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None

Forbidden Usage

None

Metadata

Finland Public Domain 20th Century Literature Text Corpus

Overview

This text corpus contains a collection of 20th-century literature from Finland. All included works are in the public domain. It includes works in both Finnish (fi) and Swedish (sv).

Statistics

Total Word Count: ~69,163,126
- Finnish (fi): ~68,770,899 words
- Swedish (sv): ~392,227 words
Languages: Finnish (fi), Swedish (sv)
Format: Multiple plain text files with YAML front matter

(Note: The full list of included works is available in the README.md file bundled inside the archive.)

Data Format and Metadata

The files are provided in plain text format. Each text is preceded by a YAML front matter block containing the metadata fields defined below, followed by the original project's boilerplate preamble enclosed in *** PROJECT LÖNNROT PREAMBLE START *** and *** PROJECT LÖNNROT PREAMBLE END *** tags. The literary text itself comes after the closing tag. An example front matter block:

---
title: "Iisakki vähäpuheinen"
author: "Haanpää, Pentti (1905-1955)"
lang: "fi"
year: "1953"
source: "https://lonnrot.net/kirjat/3717.zip"
license: "Public Domain"
---

Field Definitions:

title: The title of the literary work.
author: The author of the work including lifespan.
lang: Language code dynamically detected using CLD algorithms ('fi' or 'sv').
year: The year of initial publication extracted from the text.
source: The exact URL to the source material on Project Lönnrot.
license: The copyright status of the text itself.
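
A minimal sketch of how one corpus file might be split into its three parts (front matter, preamble, literary text). It assumes the front matter consists only of flat `key: "value"` lines as in the example above, and that the preamble tags appear exactly as written; the function name `parse_corpus_file` is hypothetical and not part of the dataset.

```python
PREAMBLE_START = "*** PROJECT LÖNNROT PREAMBLE START ***"
PREAMBLE_END = "*** PROJECT LÖNNROT PREAMBLE END ***"


def parse_corpus_file(raw: str):
    """Split one corpus file into (metadata dict, preamble, literary text)."""
    lines = raw.lstrip("\ufeff").splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with YAML front matter")

    # Find the closing '---' of the front matter block.
    end = None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            end = i
            break
    if end is None:
        raise ValueError("front matter block is not terminated")

    # Assumption: every field is a flat `key: "value"` pair on one line.
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip('"')

    rest = "\n".join(lines[end + 1:])

    # Cut out the boilerplate preamble if both tags are present.
    preamble = ""
    start = rest.find(PREAMBLE_START)
    stop = rest.find(PREAMBLE_END)
    if start != -1 and stop > start:
        preamble = rest[start + len(PREAMBLE_START):stop].strip()
        rest = rest[:start] + rest[stop + len(PREAMBLE_END):]

    return meta, preamble, rest.strip()
```

Calling `parse_corpus_file(open(path, encoding="utf-8").read())` would return the metadata as a plain dictionary of strings. Titles containing escaped quotes or multi-line values are not handled here; a full YAML parser would be the more robust choice for those.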

Processing Methodology

Source: Texts were fetched directly from the digital library Project Lönnrot (lonnrot.net).
Filtering: The script kept only works whose author died between 1901 and 1955 and whose publication year is 1901 or later.
Language Detection: The ISO language code for each text was dynamically identified via Python's langdetect library.
Orthography: The texts are preserved in their original transcribed orthography.
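
The filtering rule above can be sketched as a predicate over the front matter fields. This is an illustration under stated assumptions, not the original script: it assumes the author field always carries a lifespan in the form `(YYYY-YYYY)` as in the example, and the names `parse_lifespan` and `passes_filter` are hypothetical.

```python
import re

# Assumption: lifespans are written as "(1905-1955)" at the end of the author field.
LIFESPAN_RE = re.compile(r"\((\d{4})-(\d{4})\)")


def parse_lifespan(author: str):
    """Return (birth_year, death_year) or None if no lifespan is found."""
    match = LIFESPAN_RE.search(author)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def passes_filter(meta: dict, death_range=(1901, 1955), min_pub_year=1901) -> bool:
    """Apply the stated rule: author died 1901-1955 and work published 1901 or later."""
    lifespan = parse_lifespan(meta.get("author", ""))
    year = meta.get("year", "")
    if lifespan is None or not year.isdigit():
        return False  # missing or unparseable metadata: exclude conservatively
    death_year = lifespan[1]
    return (death_range[0] <= death_year <= death_range[1]
            and int(year) >= min_pub_year)
```

Records with a missing or differently formatted lifespan are rejected rather than guessed at, which errs on the side of excluding works whose copyright status cannot be confirmed.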

Copyright and License

The literary works themselves are in the public domain. Under Finnish and EU copyright law, economic copyrights expire 70 years after the end of the year of the author's death. As of 2026, works by authors who died in 1955 or earlier are in the public domain.
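
The term arithmetic can be written out as a short check. This sketch covers only the general 70-year rule stated above; the function names are hypothetical, and special cases such as joint or anonymous works are not modeled.

```python
TERM_YEARS = 70  # protection lasts 70 years after the end of the author's death year


def public_domain_from(death_year: int) -> int:
    """First calendar year in which the author's works are in the public domain."""
    # Protection runs through 31 December of death_year + 70,
    # so the works are free from 1 January of the following year.
    return death_year + TERM_YEARS + 1


def is_public_domain(death_year: int, as_of_year: int) -> bool:
    return as_of_year >= public_domain_from(death_year)
```

For an author who died in 1955, protection runs through the end of 2025, so `public_domain_from(1955)` gives 2026, which matches the cutoff used for this corpus.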

This dataset is released for the Mozilla Data Collective to aid in the development of free/libre/open-source language technologies.

Acknowledgements

The digitization, proofreading, and initial formatting of these texts were performed by the incredible volunteers of Project Lönnrot (Projekti Lönnrot). Project Lönnrot is a community-driven initiative dedicated to making Finnish and Swedish public domain literature freely accessible to everyone.

If you find this text corpus useful, please acknowledge their effort and consider supporting them or volunteering to digitize more works:

Project Lönnrot Website: https://lonnrot.net/