Greek PhD Theses Corpus v1.0

License: CC-BY-NC-SA-4.0

Steward: EELLAK - GreekFOSS

Task: NLP

Release Date: 1/27/2026

Format: JSONL

Size: 7.02 GB

Description

The Greek PhD Theses Corpus is a large-scale, AI-ready text dataset consisting of 55,423 Greek doctoral dissertations produced between 1975 and 2025. It represents the most comprehensive and technically homogenized collection of Greek PhD-level academic writing assembled to date. The corpus combines full dissertation texts with rich, structured metadata, processed through a modern, GPU-accelerated pipeline that includes advanced OCR, markdown normalization, and extensive quality assurance.
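As a rough illustration of how the JSONL release can be consumed, the sketch below streams records line by line. The file name and the field names used here ("title", "year", "text") are assumptions and should be checked against the published schema.

```python
import json

def iter_theses(path: str):
    """Yield one dissertation record (a dict) per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Inspect the first record; the path and field names are assumptions.
for record in iter_theses("greek_phd_theses.jsonl"):
    print(record.get("title"), record.get("year"), len(record.get("text", "")))
    break
```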

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints: Mandatory License Adherence

Forbidden Usage: -

Processes

Intended Use

1. Large Language Model (LLM) Training & Fine-Tuning
2. Retrieval-Augmented Generation (RAG) Systems (see the sketch after this list)
3. Scientometrics and Research Trend Analysis
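As a minimal sketch of use case 2, the snippet below splits each dissertation's text into overlapping chunks, the usual preprocessing step before embedding and indexing in a RAG system. The chunk size, overlap, file name, and field names are illustrative assumptions, not part of the dataset specification.

```python
import json

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a long text into overlapping character windows for retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

with open("greek_phd_theses.jsonl", encoding="utf-8") as f:  # path is an assumption
    for line in f:
        record = json.loads(line)
        chunks = chunk_text(record.get("text", ""))
        # Each chunk would next be embedded and stored alongside the thesis metadata.
        print(f"{record.get('id')}: {len(chunks)} chunks")
        break
```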

Metadata

The dataset is primarily sourced from OpenArchives.gr, enriched with additional smaller datasets from didaktorika.gr. OpenArchives.gr is a large repository of Greek scientific and academic texts originating from universities, research centers, and libraries in Greece and Cyprus. Didaktorika.gr is the National Archive of Doctoral Dissertations, which digitally aggregates doctoral theses from all Greek universities across all research fields.

The resulting dataset comprises 55,423 doctoral dissertations, covering the period 1975–2025, and includes 26 structured metadata fields.
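For example, a quick scientometric pass over the metadata might count dissertations per year. The "year" field name below is an assumption and should be mapped to the corresponding one of the 26 released fields.

```python
import json
from collections import Counter

per_year = Counter()
with open("greek_phd_theses.jsonl", encoding="utf-8") as f:  # path is an assumption
    for line in f:
        year = json.loads(line).get("year")
        if year:
            per_year[int(year)] += 1

# Print a simple theses-per-year table covering 1975-2025.
for year in sorted(per_year):
    print(year, per_year[year])
```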

Following data collection, the PDF files underwent three main stages of processing:

Text Extraction from PDFs

Depending on the nature of each PDF file, two different approaches were applied: direct extraction of embedded text from text-based PDFs, or Optical Character Recognition (OCR) for image-based PDFs. Initial OCR processing was performed using Tesseract; however, the release of DeepSeek OCR resulted in a substantial improvement in accuracy, particularly for scientific symbols, polytonic Greek script, and complex document layouts. Consequently, all PDF files were reprocessed from scratch using DeepSeek OCR.
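A sketch of that triage step is shown below, using pypdf as a stand-in heuristic (the dataset documentation does not specify which tool performed this check): PDFs that yield too little embedded text are routed to OCR.

```python
from pypdf import PdfReader  # pypdf is an assumption; the pipeline's own check is not specified

def has_embedded_text(path: str, min_chars: int = 100) -> bool:
    """Heuristic: treat a PDF as text-based if its pages yield enough characters."""
    reader = PdfReader(path)
    extracted = "".join((page.extract_text() or "") for page in reader.pages)
    return len(extracted.strip()) >= min_chars

route = "direct text extraction" if has_embedded_text("thesis.pdf") else "OCR (DeepSeek OCR)"
print(route)
```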

PDF-to-Markdown Conversion with Docling

For files containing embedded text, Docling was used to perform the conversion to Markdown. Docling provides consistent Markdown output, reduces noise, preserves basic document structure (e.g., headings and subsections), and delivers relatively homogeneous results across large-scale batch processing. In cases where OCR was required, the Markdown conversion was performed in the second stage of the pipeline.
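A minimal version of that conversion, using Docling's Python API, might look roughly like this; batching, error handling, and output management are omitted, and the file names are placeholders.

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("thesis.pdf")           # parse a text-based PDF
markdown = result.document.export_to_markdown()    # homogeneous Markdown output

with open("thesis.md", "w", encoding="utf-8") as out:
    out.write(markdown)
```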

OCR, Cleaning, and Normalization via GlossAPI

The final processing stages were implemented through a customized GlossAPI pipeline. This pipeline was specifically adapted to support parallel execution in a GPU environment, significantly reducing overall processing time. Four NVIDIA A10G GPUs were utilized for this purpose. All processing infrastructure (excluding web scraping) was deployed on AWS, using a g5.12xlarge instance, which simultaneously supported OCR, Markdown conversion, and data cleaning operations.
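The customized GlossAPI pipeline itself is not reproduced here, but the parallelization pattern it implies, sharding the document set across the four A10G GPUs and pinning one worker process to each, can be sketched generically as follows; the directory path and the per-file processing body are placeholders, not the GlossAPI interface.

```python
import os
from multiprocessing import Process

def process_shard(gpu_id: int, pdf_paths: list[str]) -> None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this worker to one A10G
    for path in pdf_paths:
        ...  # placeholder: run OCR -> Markdown -> cleaning for this file

if __name__ == "__main__":
    pdfs = [os.path.join("pdfs", name) for name in sorted(os.listdir("pdfs"))]
    shards = [pdfs[i::4] for i in range(4)]  # round-robin split across the 4 GPUs
    workers = [Process(target=process_shard, args=(g, s)) for g, s in enumerate(shards)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```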