Polish Public Domain 20th Century Literature Text Corpus

License:

CC0-1.0

Steward:

Taruen

Task: NLP

Release Date: 2/24/2026

Format: TXT

Size: 10.86 MB


Description

This corpus contains 54 texts of Polish prose (volumes of multi-volume works are counted separately): major novels, multi-volume historical epics, and documentary prose first published between 1873 and 1946. It covers major works by Władysław Reymont, Stefan Żeromski, Henryk Sienkiewicz, Bolesław Prus, Józef Ignacy Kraszewski, Eliza Orzeszkowa, Tadeusz Dołęga-Mostowicz, and Zofia Nałkowska. All texts use modern Polish orthography (post-1936 standard), which keeps spelling consistent and makes the corpus suitable for training contemporary language models. The corpus comprises approximately 4.2 million words across multiple plain text files; each file begins with structured YAML front matter containing metadata (title, author, language, year, source URL, license). All included works are in the public domain under Polish law.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None

Forbidden Usage

None

Metadata

Overview

This text corpus contains 54 texts of Polish prose (novels, multi-volume epics, and documentary prose) first published between 1873 and 1946. All included works are in the public domain in Poland.

Statistics

  • Total Word Count: ~4,220,714

  • Language: Polish (pl)

  • Format: Multiple plain text files with YAML front matter
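A figure like the total above can be approximated by counting whitespace-separated tokens in each file after skipping its front matter. The following is a minimal sketch, not the method used to produce the published count, which the card does not describe. The function names and the directory argument are hypothetical.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """Count whitespace-separated tokens, skipping a leading '---' front matter block."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        # Drop everything up to and including the closing '---' line, if present.
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                lines = lines[i + 1:]
                break
    return sum(len(line.split()) for line in lines)


def corpus_word_count(directory: str) -> int:
    """Sum word counts over all .txt files in a directory (hypothetical layout)."""
    return sum(
        count_words(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.txt"))
    )
```

A different tokenization (for example, splitting on punctuation) would give a somewhat different total, so a small mismatch with the published number should be expected.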

Included Works

  • Chłopi, Część pierwsza - Jesień (Władysław Reymont, 1904)

  • Chłopi, Część druga - Zima (Władysław Reymont, 1904)

  • Chłopi, Część trzecia - Wiosna (Władysław Reymont, 1906)

  • Chłopi, Część czwarta - Lato (Władysław Reymont, 1909)

  • Ziemia Obiecana, Tom 1 (Władysław Reymont, 1899)

  • Ziemia Obiecana, Tom 2 (Władysław Reymont, 1899)

  • Komediantka (Władysław Reymont, 1896)

  • Fermenty, Tom 1 (Władysław Reymont, 1897)

  • Fermenty, Tom 2 (Władysław Reymont, 1897)

  • Wampir (Władysław Reymont, 1911)

  • Bunt (Władysław Reymont, 1924)

  • Popioły, Tom 1 (Stefan Żeromski, 1904)

  • Popioły, Tom 2 (Stefan Żeromski, 1904)

  • Popioły, Tom 3 (Stefan Żeromski, 1904)

  • Przedwiośnie (Stefan Żeromski, 1924)

  • Ludzie bezdomni, Tom 1 (Stefan Żeromski, 1899)

  • Ludzie bezdomni, Tom 2 (Stefan Żeromski, 1899)

  • Syzyfowe prace (Stefan Żeromski, 1897)

  • Wierna rzeka (Stefan Żeromski, 1912)

  • Dzieje grzechu (Stefan Żeromski, 1908)

  • W pustyni i w puszczy (Henryk Sienkiewicz, 1911)

  • Ogniem i mieczem, Tom 1 (Henryk Sienkiewicz, 1884)

  • Ogniem i mieczem, Tom 2 (Henryk Sienkiewicz, 1884)

  • Potop, Tom 1 (Henryk Sienkiewicz, 1886)

  • Potop, Tom 2 (Henryk Sienkiewicz, 1886)

  • Potop, Tom 3 (Henryk Sienkiewicz, 1886)

  • Pan Wołodyjowski (Henryk Sienkiewicz, 1888)

  • Quo vadis (Henryk Sienkiewicz, 1896)

  • Krzyżacy, Tom 1 (Henryk Sienkiewicz, 1900)

  • Krzyżacy, Tom 2 (Henryk Sienkiewicz, 1900)

  • Rodzina Połanieckich (Henryk Sienkiewicz, 1894)

  • Bez dogmatu (Henryk Sienkiewicz, 1891)

  • Faraon, Tom 1 (Bolesław Prus, 1895)

  • Faraon, Tom 2 (Bolesław Prus, 1895)

  • Faraon, Tom 3 (Bolesław Prus, 1895)

  • Lalka, Tom 1 (Bolesław Prus, 1890)

  • Lalka, Tom 2 (Bolesław Prus, 1890)

  • Emancypantki, Tom 1 (Bolesław Prus, 1894)

  • Emancypantki, Tom 2 (Bolesław Prus, 1894)

  • Placówka (Bolesław Prus, 1886)

  • Zemsta (Bolesław Prus, 1908)

  • Stara baśń, Tom 1 (Józef Ignacy Kraszewski, 1876)

  • Stara baśń, Tom 2 (Józef Ignacy Kraszewski, 1876)

  • Stara baśń, Tom 3 (Józef Ignacy Kraszewski, 1876)

  • Nad Niemnem, Tom 1 (Eliza Orzeszkowa, 1888)

  • Nad Niemnem, Tom 2 (Eliza Orzeszkowa, 1888)

  • Nad Niemnem, Tom 3 (Eliza Orzeszkowa, 1888)

  • Cham (Eliza Orzeszkowa, 1888)

  • Marta (Eliza Orzeszkowa, 1873)

  • Kariera Nikodema Dyzmy (Tadeusz Dołęga-Mostowicz, 1932)

  • Znachor (Tadeusz Dołęga-Mostowicz, 1937)

  • Profesor Wilczur (Tadeusz Dołęga-Mostowicz, 1939)

  • Granica (Zofia Nałkowska, 1935)

  • Medaliony (Zofia Nałkowska, 1946)

Data Format and Metadata

The files are provided in plain text format. Each text is prepended with a YAML front matter block containing relevant metadata:

---
title: "Chłopi, Część pierwsza - Jesień"
author: "Władysław Reymont"
lang: "pl"
year: "1904"
source: "https://wolnelektury.pl/katalog/lektura/chlopi-czesc-pierwsza-jesien"
license: "Public Domain"
---

Field Definitions:

  • title: The title of the literary work.

  • author: The author of the work.

  • lang: Language code ('pl' for Polish).

  • year: The year of initial publication.

  • source: The exact URL to the source material on Wolne Lektury.

  • license: The copyright status of the text itself.
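To show how the front matter can be separated from the literary text, here is a minimal sketch using only the Python standard library. It assumes every field is a single `key: "value"` line, as in the sample above; a full YAML parser would be the safer choice if the files contain anything more complex. The function name `split_front_matter` and the placeholder body text are illustrative, not part of the dataset.

```python
def split_front_matter(text: str):
    """Return (metadata, body) for a text that starts with a '---' delimited block.

    If no complete front matter block is found, return ({}, text) unchanged.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the literary text.
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # closing delimiter missing


sample = '''---
title: "Chłopi, Część pierwsza - Jesień"
author: "Władysław Reymont"
lang: "pl"
year: "1904"
source: "https://wolnelektury.pl/katalog/lektura/chlopi-czesc-pierwsza-jesien"
license: "Public Domain"
---
(placeholder body text)
'''

meta, body = split_front_matter(sample)
print(meta["author"], meta["year"])
```

Note that `year` is stored as a quoted string in the front matter, so it needs an explicit `int(meta["year"])` before any numeric comparison.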

Processing Methodology

  1. Source: Texts were fetched directly from the digital library Wolne Lektury (wolnelektury.pl).

  2. Cleaning: A cleaning script removed the publisher's legal footer and copyright notice from the end of each file, leaving only the public domain literary text.

  3. Orthography: The texts utilize modern Polish orthography (post-1936 standard).
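The cleaning step can be illustrated with a short sketch. The script used for this corpus is not included in the card, so the code below is only an assumption about how such a step might look: it treats the footer as everything from the last line consisting of a separator marker onward. The marker value `"-----"` and the function name `strip_footer` are hypothetical and should be checked against the actual source files.

```python
def strip_footer(text: str, marker: str = "-----") -> str:
    """Remove everything from the last line equal to `marker` to the end of the text.

    `marker` is a hypothetical separator; if it is not found, the text is returned unchanged.
    """
    lines = text.splitlines()
    # Scan from the end so that separators inside the literary text are left alone.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == marker:
            return "\n".join(lines[:i]).rstrip() + "\n"
    return text
```

Scanning from the end rather than the start is a deliberate choice here: it keeps the function from truncating a work that happens to use the same separator between chapters.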

Copyright and License

The literary works themselves are in the public domain. Under Polish law, economic copyrights expire 70 years after the end of the year of the author's death, so as of 2026, works by authors who died in 1955 or earlier are in the public domain. The included authors and their years of death:

  • Władysław Reymont (d. 1925)

  • Stefan Żeromski (d. 1925)

  • Henryk Sienkiewicz (d. 1916)

  • Bolesław Prus (d. 1912)

  • Józef Ignacy Kraszewski (d. 1887)

  • Eliza Orzeszkowa (d. 1910)

  • Tadeusz Dołęga-Mostowicz (d. 1939)

  • Zofia Nałkowska (d. 1954)
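The 70-year rule can be checked with simple arithmetic: protection runs through the end of the year of death plus 70, so a work enters the public domain on 1 January of the year after that. The sketch below applies this to the death years listed above; the function name `public_domain_from` is illustrative.

```python
def public_domain_from(death_year: int, term_years: int = 70) -> int:
    """First calendar year in which the author's works are in the public domain."""
    return death_year + term_years + 1


# Death years as listed in this dataset card.
death_years = {
    "Władysław Reymont": 1925,
    "Stefan Żeromski": 1925,
    "Henryk Sienkiewicz": 1916,
    "Bolesław Prus": 1912,
    "Józef Ignacy Kraszewski": 1887,
    "Eliza Orzeszkowa": 1910,
    "Tadeusz Dołęga-Mostowicz": 1939,
    "Zofia Nałkowska": 1954,
}

for author, died in death_years.items():
    print(f"{author}: public domain from {public_domain_from(died)}")

# Every author in the corpus clears the threshold as of 2026.
all_public_domain = all(public_domain_from(y) <= 2026 for y in death_years.values())
```

The latest case is Zofia Nałkowska: 1954 + 70 + 1 = 2025, so her works entered the public domain on 1 January 2025.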

This dataset is released for the Mozilla Data Collective to aid in the development of free/libre/open-source language technologies.

Support Wolne Lektury

The digitization and proofreading of these texts were performed by the Modern Poland Foundation. If you find this text corpus useful, please consider supporting their mission to keep literature accessible to all.