Greek PhD Theses Corpus v1.0
License: CC-BY-NC-SA-4.0
Steward: EELLAK - GreekFOSS
Task: NLP
Release Date: 1/27/2026
Format: JSONL
Size: 7.02 GB
Description
The Greek PhD Theses Corpus is a large-scale, AI-ready text dataset consisting of 55,423 Greek doctoral dissertations produced between 1975 and 2025. It represents the most comprehensive and technically homogenized collection of Greek PhD-level academic writing assembled to date. The corpus combines full dissertation texts with rich, structured metadata, processed through a modern, GPU-accelerated pipeline that includes advanced OCR, markdown normalization, and extensive quality assurance.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
Mandatory License Adherence
Forbidden Usage
-
Processes
Intended Use
1. Large Language Model (LLM) Training & Fine-Tuning
2. Retrieval-Augmented Generation (RAG) Systems
3. Scientometrics and Research Trend Analysis
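As a rough illustration of the first two uses, the sketch below streams records from the JSONL release for fine-tuning data preparation or RAG ingestion. The file name and the "text" field are assumptions made for the example; the actual 26-field schema is documented with the release.

```python
# Minimal sketch: stream dissertation records from the JSONL release.
# The file name and the "text" field name are assumptions, not the
# release's documented schema.
import json

def iter_theses(path: str = "greek_phd_theses.jsonl"):
    """Yield one dissertation record per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

if __name__ == "__main__":
    for record in iter_theses():
        full_text = record.get("text", "")  # hypothetical full-text field
        # ...tokenize for fine-tuning, or chunk and embed for a RAG index...
        print(len(full_text))
        break
```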
Metadata
The dataset is primarily sourced from OpenArchives.gr, enriched with additional smaller datasets from didaktorika.gr. OpenArchives.gr is a large repository of Greek scientific and academic texts originating from universities, research centers, and libraries in Greece and Cyprus. Didaktorika.gr is the National Archive of Doctoral Dissertations, which digitally aggregates doctoral theses from all Greek universities across all research fields.
The resulting dataset comprises 55,423 doctoral dissertations, covering the period 1975–2025, and includes 26 structured metadata fields.
Following data collection, the PDF files underwent three main stages of processing:
Text Extraction from PDFs
Depending on the nature of each PDF file, two different approaches were applied: direct extraction of embedded text from text-based PDFs, or Optical Character Recognition (OCR) for image-based PDFs. Initial OCR processing was performed using Tesseract; however, the release of DeepSeek OCR resulted in a substantial improvement in accuracy, particularly for scientific symbols, polytonic Greek script, and complex document layouts. Consequently, all PDF files were reprocessed from scratch using DeepSeek OCR.
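The routing between the two approaches can be pictured as a simple heuristic. The snippet below is an illustration only: it assumes pypdf for the embedded-text check and a character-count threshold, neither of which is specified by the pipeline description, and it does not reproduce the Tesseract or DeepSeek OCR calls themselves.

```python
# Illustrative routing heuristic: PDFs with enough extractable text go to
# direct extraction, image-only PDFs go to OCR. pypdf and the 200-character
# threshold are assumptions, not the corpus's actual tooling.
from pypdf import PdfReader

def has_embedded_text(path: str, min_chars: int = 200, sample_pages: int = 5) -> bool:
    """Sample the first few pages and count extractable characters."""
    reader = PdfReader(path)
    chars = 0
    for page in reader.pages[:sample_pages]:
        chars += len((page.extract_text() or "").strip())
        if chars >= min_chars:
            return True
    return False

def route(path: str) -> str:
    """Decide which branch of the pipeline a PDF should take."""
    return "direct-extraction" if has_embedded_text(path) else "ocr"
```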
PDF-to-Markdown Conversion with Docling
For files containing embedded text, Docling was used to perform the conversion to Markdown. Docling provides consistent Markdown output, reduces noise, preserves basic document structure (e.g., headings and subsections), and delivers relatively homogeneous results across large-scale batch processing. In cases where OCR was required, the Markdown conversion was performed in the second stage of the pipeline.
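For reference, basic Docling usage for a single file looks roughly like the sketch below; the batch settings, converter options, and error handling used for the corpus are not specified here, and the file names are placeholders.

```python
# Minimal single-file sketch of Docling's PDF-to-Markdown conversion.
# File names are placeholders; corpus-specific converter options are omitted.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("thesis.pdf")          # one text-based PDF
markdown = result.document.export_to_markdown()   # structured Markdown output

with open("thesis.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```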
OCR, Cleaning, and Normalization via GlossAPI
The final processing stages were implemented through a customized GlossAPI pipeline. This pipeline was specifically adapted to support parallel execution in a GPU environment, significantly reducing overall processing time. Four NVIDIA A10G GPUs were utilized for this purpose. All processing infrastructure (excluding web scraping) was deployed on AWS, using a g5.12xlarge instance, which simultaneously supported OCR, Markdown conversion, and data cleaning operations.
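The GPU parallelism can be pictured as a fan-out of files across the four devices. The sketch below is a generic multiprocessing illustration, not GlossAPI's actual interface; process_pdf() is a hypothetical stand-in for the per-file OCR, conversion, and cleaning work.

```python
# Generic illustration of spreading PDF jobs across four GPUs.
# This is NOT GlossAPI's interface; process_pdf() is a hypothetical stand-in.
import os
from multiprocessing import Process, Queue
from pathlib import Path

NUM_GPUS = 4  # four NVIDIA A10G GPUs on the g5.12xlarge instance

def process_pdf(path):
    # ...run OCR, Markdown conversion, and cleaning here (hypothetical)...
    print("finished", path.name)

def worker(gpu_id, queue):
    # Pin this worker process to a single GPU before any GPU work starts.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    while True:
        path = queue.get()
        if path is None:          # sentinel: no more work
            break
        process_pdf(path)

if __name__ == "__main__":
    queue = Queue()
    for pdf in sorted(Path("pdfs").glob("*.pdf")):
        queue.put(pdf)
    for _ in range(NUM_GPUS):
        queue.put(None)           # one sentinel per worker

    workers = [Process(target=worker, args=(gpu, queue)) for gpu in range(NUM_GPUS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```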