What is OCR4all?

OCR4all combines various open-source solutions to provide a fully automated workflow for text recognition of historical printed (OCR) and handwritten (HTR) material. At nearly any stage of the workflow, the user can interact with the results to minimize consequential errors and optimize the end result.

Thanks to its comprehensible and intuitive handling, OCR4all explicitly addresses the needs of non-technical users.

With the conclusion of the second stage of the BMBF-funded joint project Kallimachos, the software is now being established at the Center for Philology and Digitality of the University of Würzburg, which opens the program up to the widest possible user group.

Workflow

The workflow starts with the Preprocessing of the relevant image files. Layout segmentation follows, consisting of Region Segmentation (carried out with LAREX) and Line Segmentation. Next is the Text Recognition, which is carried out with Calamari. The final stage is the correction of the recognized texts, the so-called Ground Truth Production. This Ground Truth is then the foundation for creating work-specific OCR models in a training module. OCR4all thus covers a full-featured OCR workflow.
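The order of the stages described above can be written down as a small sketch. This is only an illustration of the sequence. The names `Stage`, `WORKFLOW`, `next_stage` and `describe_workflow` are hypothetical and are not part of OCR4all, LAREX or Calamari.

```python
from enum import Enum


class Stage(Enum):
    """Workflow stages in the order given in the text (names are illustrative)."""
    PREPROCESSING = "Preprocessing"
    REGION_SEGMENTATION = "Region Segmentation (LAREX)"
    LINE_SEGMENTATION = "Line Segmentation"
    TEXT_RECOGNITION = "Text Recognition (Calamari)"
    GROUND_TRUTH_PRODUCTION = "Ground Truth Production"
    TRAINING = "Training"


# Enum members iterate in definition order, which matches the workflow order.
WORKFLOW = list(Stage)


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    i = WORKFLOW.index(stage)
    return WORKFLOW[i + 1] if i + 1 < len(WORKFLOW) else None


def describe_workflow():
    """Join the stage labels into a single readable line."""
    return " -> ".join(s.value for s in WORKFLOW)


if __name__ == "__main__":
    print(describe_workflow())
```

The sketch is linear for simplicity. In practice the user can step in and correct results at nearly any stage, and the Ground Truth produced at the end feeds back into training work-specific models.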

In particular, the ability to create and train work-specific text recognition models allows OCR4all to achieve high-quality results when digitizing nearly all kinds of printed documents.

Cooperation with OCR-D

In the summer of 2020, a cooperation was arranged between OCR4all and OCR-D, the coordinated funding initiative for the further development of Optical Character Recognition processes.

The main goal of the DFG-funded OCR-D project was the conceptual and technical preparation for the mass digitization of printed texts published in German-speaking areas from the 16th to the 18th century (VD16, VD17, VD18).

For this purpose, automatic full-text recognition is divided, analogously to the OCR4all approach, into individual process steps that can be reproduced in the open-source OCR-D software. The aim is to create optimized workflows for the old prints to be processed and thus to generate full texts suitable for scholarly use.

The cooperation aims at a continuous exchange of information, mainly about interfaces, scalable software implementations, and the creation and provision of Ground Truth (GT), as well as about upcoming developments in the OCR field. Furthermore, it strives for a technical convergence of the two projects. To this end, OCR4all will implement the OCR-D specifications in its OCR solution and provide interfaces for OCR-D tools. With OCR4all's internal use of OCR-D solutions, OCR4all users will benefit from the extended selection of tools and the associated possibilities, whereas OCR-D will gain a broader scope and, through simplified access, will also reach new user groups inside and outside VD mass digitization.

Reporting (assortment)

Cite

If you are using OCR4all please cite the corresponding paper:

Reul, C., Christ, D., Hartelt, A., Balbach, N., Wehner, M., Springmann, U., Wick, C., Grundig, C., Büttner, A.,
Puppe, F.: OCR4all — An open-source tool providing a (semi-) automatic OCR workflow for historical printings,
Applied Sciences 9(22) (2019)

Funding