
OCR4all-libraries – Introduction

Motivation and General Idea

  • Availability of Solutions: Numerous high-performance open-source solutions for Automatic Text Recognition (ATR) are already available, with new releases emerging continuously.
  • Diverse Use Cases: The highly heterogeneous nature of use cases necessitates the targeted deployment of specialized ATR solutions.
  • Requirement: User-friendly frameworks are needed that make it possible to combine and apply both existing and future ATR solutions in a flexible, integrable, and sustainable manner.
  • Objective: Our goal is to empower users to perform ATR independently, achieving high-quality results.
  • Foundation: This framework is built upon freely available tools, enhanced by our in-house developments.

OCR-D and OCR4all

  • OCR-D Initiative: The DFG-funded OCR-D initiative is dedicated to facilitating the mass full-text transformation of historical prints published in the German-speaking world.
  • Focus Areas: OCR-D emphasizes interoperability and connectivity, ensuring a high degree of flexibility and sustainability in its solutions.
  • Integrated Solutions: The initiative combines multiple ATR solutions within a unified framework, enabling precise adaptation to specific materials and use cases.
  • Open Source Commitment: All results from the OCR-D project are released as completely open-source.
  • OCR4all-Libraries Project: The DFG-funded OCR4all-libraries project has two primary goals:
    • Providing a user-friendly interface for OCR-D solutions via OCR4all, enabling independent use by non-technical users.
    • Enhancing the ATR output within OCR4all to offer added value to even the most technically experienced users.

System Architecture

  • Modularity and Interoperability: The framework is designed with a strong focus on modularity and interoperability, ensuring seamless integration and adaptability.
  • Distributed Infrastructure: The architecture features a distributed infrastructure, with a clear separation between the backend and frontend components.
    • Backend: Built with Java and Spring Boot.
    • Frontend: Developed using the Vue.js ecosystem.
  • Component Communication: Components communicate via a REST API, enabling efficient interaction between different parts of the system.
  • Integration of Third-Party Solutions: Service Provider Interfaces (SPIs) allow for the integration of third-party solutions, such as ATR processors (an interface sketch follows this list).
  • Containerized Setup: The containerized architecture ensures easy distribution and deployment of all integrated components with minimal barriers.
  • Data Sovereignty: Users retain full control over their data, with no data leaving the instance without explicit user or administrator consent.
  • Reproducibility: Every step in the process is fully reproducible. A "transcript of records" feature stores detailed information about the processors and parameters used, ensuring transparency and repeatability.
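
The SPI concept can be pictured as a small contract that every pluggable processor fulfils so that the backend can discover and run it. The following Java interface is only an illustrative sketch; the names, signatures, and discovery mechanism are assumptions and do not reproduce the actual OCR4all SPI:

    // Hypothetical processor contract; illustrative only, not the real OCR4all SPI.
    // Implementations could be discovered at runtime via java.util.ServiceLoader,
    // the standard Java SPI mechanism.
    import java.nio.file.Path;
    import java.util.Map;

    public interface AtrProcessor {

        // Unique identifier shown in the workflow editor.
        String id();

        // Parameter descriptions from which a settings form can be generated automatically.
        Map<String, String> parameterDescriptions();

        // Run the processor on an input directory and write results to an output directory.
        void process(Path input, Path output, Map<String, Object> parameters) throws Exception;
    }

Keeping such a contract small is what lets the frontend and the REST layer stay agnostic of individual processors: they only ever see identifiers and parameter descriptions, never processor-specific code.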

Modules

Data Management and Processing

  • Separation of Functions: Data management and processing are strictly separated to ensure efficient handling and security.
  • Data Sharing: Data can be shared with different users or user groups as needed.

Processors and NodeFlow

  • Wide Array of Processors: A diverse range of ATR processors is available, including OCR-D and external options.
  • Ease of Integration: New processors can be easily implemented via a well-defined interface, with the user interface generated automatically (see the implementation sketch after this list).
  • NodeFlow: The graphical editor NodeFlow simplifies the creation of workflows, making it convenient for users to design and customize processing sequences.
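
To make the idea of integrating a new processor more concrete, here is a deliberately trivial implementation of the hypothetical AtrProcessor contract sketched above; the class, its parameter handling, and its registration are illustrative assumptions, not OCR4all code:

    // Illustrative only: a trivial "processor" that copies matching files,
    // implementing the hypothetical AtrProcessor contract; the real OCR4all SPI differs.
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;

    public class CopyProcessor implements AtrProcessor {

        @Override
        public String id() {
            return "example-copy";
        }

        @Override
        public Map<String, String> parameterDescriptions() {
            // A settings form could be generated automatically from this map.
            return Map.of("suffix", "Only copy files ending with this suffix, e.g. \".xml\"");
        }

        @Override
        public void process(Path input, Path output, Map<String, Object> parameters) throws Exception {
            String suffix = (String) parameters.getOrDefault("suffix", ".xml");
            Files.createDirectories(output);
            try (var files = Files.list(input)) {
                for (Path page : files.filter(p -> p.toString().endsWith(suffix)).toList()) {
                    Files.copy(page, output.resolve(page.getFileName()));
                }
            }
        }
    }

A real ATR processor would perform binarization, segmentation, or recognition inside process(), but the integration surface, one class against one interface plus a parameter description, stays the same.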

LAREX

  • Result Correction and Training Data Creation: LAREX allows for the correction of all ATR workflow results and the creation of training data.
  • Visual Workflow Identification: As a visual explanation component, LAREX helps users identify the most suitable workflows.

Datasets, Training, and Evaluation

  • Dataset Creation: Datasets can be created, with optional tagging and import functionality.
  • Dataset Enrichment: Datasets can be enriched with training data generated and tagged within the application, even across various projects and sources.
  • Model Training: Train models on selected datasets or subsets thereof, with options for in-app usage or exporting both models and associated training data.
  • Model Evaluation: Evaluate both trained and imported models using curated datasets to ensure quality and accuracy.
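
Evaluation of ATR models is commonly reported as the character error rate (CER): the edit distance between recognized text and ground truth, normalized by the ground-truth length. The metric OCR4all actually reports is not specified here, so the following Java snippet is only a minimal, generic CER sketch:

    // Minimal character error rate (CER) sketch: Levenshtein distance between
    // ground truth and recognized text, normalized by the ground-truth length.
    public final class Cer {

        public static double cer(String groundTruth, String recognized) {
            int[][] d = new int[groundTruth.length() + 1][recognized.length() + 1];
            for (int i = 0; i <= groundTruth.length(); i++) d[i][0] = i;
            for (int j = 0; j <= recognized.length(); j++) d[0][j] = j;
            for (int i = 1; i <= groundTruth.length(); i++) {
                for (int j = 1; j <= recognized.length(); j++) {
                    int cost = groundTruth.charAt(i - 1) == recognized.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                }
            }
            return (double) d[groundTruth.length()][recognized.length()] / groundTruth.length();
        }

        public static void main(String[] args) {
            // One substituted character in a seven-character ground truth: CER = 1/7 ≈ 0.14.
            System.out.println(cer("OCR4all", "0CR4all"));
        }
    }

Tracking such a value per model and dataset across successive training runs is what makes the comparison of trained and imported models meaningful.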

Working with OCR4all-libraries

Example Use Cases and Application Scenarios

Fully Automatic Mass Full-Text Digitization

  • Objective: Maximize throughput with minimal manual effort.
  • Users: Libraries and archives processing large volumes of scanned materials.
  • Approach: Use the pro mode (NodeFlow, LAREX, and datasets) to identify the most suitable workflow.

Flawless Transcription of Source Material

  • Objective: Achieve maximum quality, accepting significant manual effort.
  • Users: Humanities researchers preparing text for a digital edition.
  • Approach: Utilize the base mode for iterative transcription with continually improving accuracy.

Building Corpora for Quantitative Applications

  • Objective: Maximize quality while minimizing manual effort.
  • Users: Researchers constructing corpora for training and evaluating quantitative methods.
  • Approach: Manage data and consistently retrain source-specific or mixed models using datasets and tagging functionalities.