OCR4all-libraries – Introduction
Motivation and General Idea
- Availability of Solutions: Numerous high-performance open-source solutions for automatic text recognition (ATR) are already available, with new releases emerging continuously.
- Diverse Use Cases: The highly heterogeneous nature of use cases necessitates the targeted deployment of specialized ATR solutions.
- Requirement: There is a need for user-friendly frameworks that facilitate the flexible, integrable, and sustainable combination and application of both existing and future ATR solutions.
- Objective: Our goal is to empower users to perform ATR independently and achieve high-quality results.
- Foundation: This framework is built upon freely available tools, enhanced by our in-house developments.
OCR-D and OCR4all
- OCR-D Initiative: The DFG-funded OCR-D initiative is dedicated to facilitating the mass full-text transformation of historical prints published in the German-speaking world.
- Focus Areas: OCR-D emphasizes interoperability and connectivity, ensuring a high degree of flexibility and sustainability in its solutions.
- Integrated Solutions: The initiative combines multiple ATR solutions within a unified framework, enabling precise adaptation to specific materials and use cases.
- Open Source Commitment: All results from the OCR-D project are released as open source.
- OCR4all-Libraries Project: The DFG-funded OCR4all-libraries project has two primary goals:
  - Providing a user-friendly interface for OCR-D solutions via OCR4all, enabling independent use by non-technical users.
  - Enhancing the ATR output within OCR4all to offer added value to even the most technically experienced users.
System Architecture
- Modularity and Interoperability: The framework is designed with a strong focus on modularity and interoperability, ensuring seamless integration and adaptability.
- Distributed Infrastructure: The architecture features a distributed infrastructure, with a clear separation between the backend and frontend components.
- Backend: Built with Java and Spring Boot.
- Frontend: Developed using the Vue.js ecosystem.
- Component Communication: Components communicate via a REST API, enabling efficient interaction between different parts of the system.
- Integration of Third-Party Solutions: Service Provider Interfaces (SPIs) allow for the integration of third-party solutions, such as ATR processors.
- Containerized Setup: The containerized architecture ensures easy distribution and deployment of all integrated components with minimal barriers.
- Data Sovereignty: Users retain full control over their data, with no data leaving the instance without explicit user or administrator consent.
- Reproducibility: Every step in the process is fully reproducible. A "transcript of records" feature stores detailed information about the processors and parameters used, ensuring transparency and repeatability.
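To illustrate how third-party solutions could plug into the backend, the sketch below models an ATR processor interface in plain Java. The names (`AtrProcessor`, `UppercaseDemoProcessor`) are hypothetical, not the project's actual API; in a real Java SPI setup, implementations would be discovered via `java.util.ServiceLoader` and a `META-INF/services` registration, which the comment notes.

```java
import java.util.List;
import java.util.Map;

// Hypothetical processor contract; the real OCR4all SPI will differ in detail.
interface AtrProcessor {
    String id();
    Map<String, String> defaultParameters();
    String process(String pageContent, Map<String, String> parameters);
}

// Toy implementation standing in for a real recognition engine.
class UppercaseDemoProcessor implements AtrProcessor {
    public String id() { return "demo-uppercase"; }
    public Map<String, String> defaultParameters() { return Map.of("level", "line"); }
    public String process(String pageContent, Map<String, String> parameters) {
        return pageContent.toUpperCase(); // stand-in for actual text recognition
    }
}

public class SpiSketch {
    public static void main(String[] args) {
        // With a true SPI, this list would come from:
        //   ServiceLoader.load(AtrProcessor.class)
        // backed by a META-INF/services/AtrProcessor file in each plugin jar.
        List<AtrProcessor> processors = List.of(new UppercaseDemoProcessor());
        for (AtrProcessor p : processors) {
            System.out.println(p.id() + " -> " + p.process("abc", p.defaultParameters()));
        }
    }
}
```

The value of this pattern is that the core framework only depends on the interface; each containerized processor ships its own implementation and is picked up at runtime without recompiling the host.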
Modules
Data Management and Processing
- Separation of Functions: Data management and processing are strictly separated to ensure efficient handling and security.
- Data Sharing: Data can be shared with different users or user groups as needed.
Processors and NodeFlow
- Wide Array of Processors: A diverse range of ATR processors is available, including OCR-D and external options.
- Ease of Integration: New processors can be easily implemented via a well-defined interface, with the user interface generated automatically.
- NodeFlow: The graphical editor NodeFlow simplifies the creation of workflows, making it convenient for users to design and customize processing sequences.
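One way to auto-generate a user interface from a processor definition is to have each processor expose machine-readable parameter metadata. The sketch below is an assumption about how such metadata might look, not the actual OCR4all interface; a frontend like NodeFlow could map each descriptor to a form widget by its declared type.

```java
import java.util.List;

// Hypothetical parameter metadata from which a frontend form could be generated.
record ParameterSpec(String name, String type, String defaultValue, String description) {}

// Hypothetical processor descriptor bundling an id with its parameters.
record ProcessorDescriptor(String id, List<ParameterSpec> parameters) {}

public class DescriptorDemo {
    public static void main(String[] args) {
        ProcessorDescriptor d = new ProcessorDescriptor(
            "binarize-demo",
            List.of(new ParameterSpec("threshold", "number", "0.5",
                                      "Binarization threshold between 0 and 1")));
        // A UI generator would render "number" as a slider or numeric input,
        // "boolean" as a checkbox, and so on.
        for (ParameterSpec p : d.parameters()) {
            System.out.println(p.name() + " (" + p.type() + "), default " + p.defaultValue());
        }
    }
}
```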
LAREX
- Result Correction and Training Data Creation: LAREX allows for the correction of all ATR workflow results and the creation of training data.
- Visual Workflow Identification: As a visual explanation component, LAREX helps users identify the most suitable workflows.
Datasets, Training, and Evaluation
- Dataset Creation: Datasets can be created using tagging and import functionalities.
- Dataset Enrichment: Datasets can be enriched with training data generated and tagged within the application, even across various projects and sources.
- Model Training: Train models on selected datasets or subsets thereof, with options for in-app usage or exporting both models and associated training data.
- Model Evaluation: Evaluate both trained and imported models using curated datasets to ensure quality and accuracy.
Working with OCR4all-libraries
Example Use Cases and Application Scenarios
Fully Automatic Mass Full-Text Digitization
- Objective: Maximize throughput with minimal manual effort.
- Users: Libraries and archives processing large volumes of scanned materials.
- Approach: Use the pro mode (NodeFlow, LAREX, and datasets) to identify the most suitable workflow.
Flawless Transcription of Source Material
- Objective: Achieve maximum quality, accepting significant manual effort.
- Users: Humanities researchers preparing text for a digital edition.
- Approach: Utilize the base mode for iterative transcription with continually improving accuracy.
Building Corpora for Quantitative Applications
- Objective: Maximize quality while minimizing manual effort.
- Users: Researchers constructing corpora for training and evaluating quantitative methods.
- Approach: Manage data and consistently retrain source-specific or mixed models using datasets and tagging functionalities.