Greek PhD Theses Corpus v1.0

License: CC-BY-NC-SA-4.0

Steward: EELLAK - GreekFOSS

Task: NLP

Release Date: 1/27/2026

Format: JSONL

Size: 7.02 GB

Description

The Greek PhD Theses Corpus is a large-scale, AI-ready text dataset consisting of 55,423 Greek doctoral dissertations produced between 1975 and 2025. It represents the most comprehensive and technically homogenized collection of Greek PhD-level academic writing assembled to date. The corpus combines full dissertation texts with rich, structured metadata, processed through a modern, GPU-accelerated pipeline that includes advanced OCR, markdown normalization, and extensive quality assurance.
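As a rough illustration of how the JSONL release can be consumed, the sketch below streams records line by line. The file name and the field names used here ("title", "year", "text") are assumptions and should be checked against the published schema.

```python
import json

def iter_theses(path: str):
    """Yield one dissertation record (a dict) per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Inspect the first record; the path and field names are assumptions.
for record in iter_theses("greek_phd_theses.jsonl"):
    print(record.get("title"), record.get("year"), len(record.get("text", "")))
    break
```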

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints: Mandatory License Adherence

Forbidden Usage: -

Processes

Intended Use

1. Large Language Model (LLM) Training & Fine-Tuning
2. Retrieval-Augmented Generation (RAG) Systems (see the sketch after this list)
3. Scientometrics and Research Trend Analysis
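As a minimal sketch of use case 2, the snippet below splits each dissertation's text into overlapping chunks, the usual preprocessing step before embedding and indexing in a RAG system. The chunk size, overlap, file name, and field names are illustrative assumptions, not part of the dataset specification.

```python
import json

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a long text into overlapping character windows for retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

with open("greek_phd_theses.jsonl", encoding="utf-8") as f:  # path is an assumption
    for line in f:
        record = json.loads(line)
        chunks = chunk_text(record.get("text", ""))
        # Each chunk would next be embedded and stored alongside the thesis metadata.
        print(f"{record.get('id')}: {len(chunks)} chunks")
        break
```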

Metadata

The dataset is primarily sourced from OpenArchives.gr, enriched with additional smaller datasets from didaktorika.gr. OpenArchives.gr is a large repository of Greek scientific and academic texts originating from universities, research centers, and libraries in Greece and Cyprus. Didaktorika.gr is the National Archive of Doctoral Dissertations, which digitally aggregates doctoral theses from all Greek universities across all research fields.

The resulting dataset comprises 55,423 doctoral dissertations, covering the period 1975–2025, and includes 26 structured metadata fields.
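For example, a quick scientometric pass over the metadata might count dissertations per year. The "year" field name below is an assumption and should be mapped to the corresponding one of the 26 released fields.

```python
import json
from collections import Counter

per_year = Counter()
with open("greek_phd_theses.jsonl", encoding="utf-8") as f:  # path is an assumption
    for line in f:
        year = json.loads(line).get("year")
        if year:
            per_year[int(year)] += 1

# Print a simple theses-per-year table covering 1975-2025.
for year in sorted(per_year):
    print(year, per_year[year])
```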

Following data collection, the PDF files underwent three main stages of processing:

Text Extraction from PDFs

Depending on the nature of each PDF file, two different approaches were applied: direct extraction of embedded text from text-based PDFs, or Optical Character Recognition (OCR) for image-based PDFs. Initial OCR processing was performed using Tesseract; however, the release of DeepSeek OCR resulted in a substantial improvement in accuracy, particularly for scientific symbols, polytonic Greek script, and complex document layouts. Consequently, all PDF files were reprocessed from scratch using DeepSeek OCR.
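A sketch of that triage step is shown below, using pypdf as a stand-in heuristic (the dataset documentation does not specify which tool performed this check): PDFs that yield too little embedded text are routed to OCR.

```python
from pypdf import PdfReader  # pypdf is an assumption; the pipeline's own check is not specified

def has_embedded_text(path: str, min_chars: int = 100) -> bool:
    """Heuristic: treat a PDF as text-based if its pages yield enough characters."""
    reader = PdfReader(path)
    extracted = "".join((page.extract_text() or "") for page in reader.pages)
    return len(extracted.strip()) >= min_chars

route = "direct text extraction" if has_embedded_text("thesis.pdf") else "OCR (DeepSeek OCR)"
print(route)
```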

PDF-to-Markdown Conversion with Docling

For files containing embedded text, Docling was used to perform the conversion to Markdown. Docling provides consistent Markdown output, reduces noise, preserves basic document structure (e.g., headings and subsections), and delivers relatively homogeneous results across large-scale batch processing. In cases where OCR was required, the Markdown conversion was performed in the second stage of the pipeline.
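A minimal version of that conversion, using Docling's Python API, might look roughly like this; batching, error handling, and output management are omitted, and the file names are placeholders.

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("thesis.pdf")           # parse a text-based PDF
markdown = result.document.export_to_markdown()    # homogeneous Markdown output

with open("thesis.md", "w", encoding="utf-8") as out:
    out.write(markdown)
```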

OCR, Cleaning, and Normalization via GlossAPI

The final processing stages were implemented through a customized GlossAPI pipeline. This pipeline was specifically adapted to support parallel execution in a GPU environment, significantly reducing overall processing time. Four NVIDIA A10G GPUs were utilized for this purpose. All processing infrastructure (excluding web scraping) was deployed on AWS, using a g5.12xlarge instance, which simultaneously supported OCR, Markdown conversion, and data cleaning operations.
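The customized GlossAPI pipeline itself is not reproduced here, but the parallelization pattern it implies, sharding the document set across the four A10G GPUs and pinning one worker process to each, can be sketched generically as follows; the directory path and the per-file processing body are placeholders, not the GlossAPI interface.

```python
import os
from multiprocessing import Process

def process_shard(gpu_id: int, pdf_paths: list[str]) -> None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this worker to one A10G
    for path in pdf_paths:
        ...  # placeholder: run OCR -> Markdown -> cleaning for this file

if __name__ == "__main__":
    pdfs = [os.path.join("pdfs", name) for name in sorted(os.listdir("pdfs"))]
    shards = [pdfs[i::4] for i in range(4)]  # round-robin split across the 4 GPUs
    workers = [Process(target=process_shard, args=(g, s)) for g, s in enumerate(shards)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```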