How it works

Workflow

There are five (and a half) commands:

Each command may or may not require some preparation (e.g. cropping images or annotating lines). See the documentation for further details.

Output flow

Ocrocis itself is silent by default. All native Ocropus output is printed to standard output for convenient logging (e.g. with > logfile.txt).

Use --verbose to print verbose information to standard error (2>).
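The split between the two streams can be illustrated with a minimal Python sketch. It does not call Ocrocis: the child process below is a hypothetical stand-in that writes one line to standard output and one to standard error, and the helper name run_and_split is an assumption made for this example.

```python
import subprocess
import sys

# Stand-in child process (NOT Ocrocis): one line on stdout, one on stderr.
CHILD = (
    "import sys; "
    "print('native output'); "
    "print('verbose detail', file=sys.stderr)"
)

def run_and_split(cmd):
    """Run cmd and return its standard output and standard error separately."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout, result.stderr

out, err = run_and_split([sys.executable, "-c", CHILD])
print("stdout:", out.strip())  # what a '> logfile.txt' redirect would capture
print("stderr:", err.strip())  # what a '2> errors.txt' redirect would capture
```

Because the streams are independent, a log file can collect the regular output while verbose messages stay on the terminal or go to a second file.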

Project architecture

Data is structured

Ocrocis works on a fixed directory structure that allows it to efficiently manage the data generated in the training process:

Data is linked

Ocrocis uses hardlinks to consolidate data between the book repository, annotation sets, training sets and test sets. Data is always linked to its source directory, never copied. This avoids redundant disk usage.

There are two kinds of source directories:

  1. book: Original pages and line images, linked to by annotation, training and test
  2. annotation (for each iteration): Original ground truths, linked to by training and test

The link structure is tree-like, analogous to the directory structure, not chain-like. Each link points to the original data.
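As a rough illustration of this link tree, the following Python sketch hard-links one line image from a book directory into an annotation and a training directory. It is not Ocrocis code: the helper names link_into and build_demo, the file name 010001.bin.png, and the temporary root directory are assumptions made for the example.

```python
import os
import tempfile

def link_into(source_path, target_dir):
    """Hard-link source_path into target_dir under the same file name."""
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, os.path.basename(source_path))
    os.link(source_path, target_path)  # a second name for the same data, no copy
    return target_path

def build_demo(root):
    """Create one fake line image in book and link it into two other sets."""
    book_page = os.path.join(root, "book", "0001")
    os.makedirs(book_page)
    line_image = os.path.join(book_page, "010001.bin.png")  # hypothetical name
    with open(line_image, "wb") as f:
        f.write(b"fake image bytes")
    # Both links are made from the book file itself (tree-like, not chain-like).
    annotation = link_into(
        line_image, os.path.join(root, "iterations", "01", "annotation", "0001")
    )
    training = link_into(line_image, os.path.join(root, "training", "0001"))
    return line_image, annotation, training

root = tempfile.mkdtemp()
line_image, annotation, training = build_demo(root)
print(os.path.samefile(line_image, training))  # both names refer to one file
print(os.stat(line_image).st_nlink)            # number of names for that file
```

All three paths refer to a single file on disk, so the image occupies space only once regardless of how many sets use it.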

Data is unique

Note: this is only valid on Linux systems. On OS X, data has to be copied because Docker restricts hardlinking outside the Docker container.

Changes to files (including links) anywhere in the link tree are reflected in the source directory. This is especially useful if you wish to modify the images (e.g. by cropping) at some iteration in the training process. The changes are reflected everywhere in the project.
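The behaviour can be checked with a small Python sketch, given here under stated assumptions: the file names and the helpers edit_in_place and edit_by_replacing are hypothetical and not part of Ocrocis. It also shows a general property of hardlinks worth keeping in mind: only in-place edits propagate, whereas a tool that writes a new file and renames it over the old one detaches that name from the link tree.

```python
import os
import tempfile

def edit_in_place(path, text):
    """Overwrite the file's content while keeping the same underlying file."""
    with open(path, "w") as f:
        f.write(text)

def edit_by_replacing(path, text):
    """Write a new file and rename it over path; path then names a new file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(text)
    os.replace(tmp_path, path)

def read(path):
    with open(path) as f:
        return f.read()

root = tempfile.mkdtemp()
source = os.path.join(root, "source.gt.txt")  # hypothetical ground truth file
link = os.path.join(root, "link.gt.txt")
edit_in_place(source, "first transcription")  # creates the file
os.link(source, link)

# An in-place edit through the link is visible through the source name.
edit_in_place(link, "corrected transcription")
print(read(source))

# Replacing the file breaks the link: the two names now refer to different files.
edit_by_replacing(link, "replaced transcription")
print(os.path.samefile(source, link))
print(read(source))
```

Whether a given image editor or text editor modifies files in place or replaces them depends on the tool, so it is worth verifying before relying on propagation.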

Data is safe

The link structure also allows you to safely delete any link in the tree. The original data is kept, as long as there is still at least one link somewhere in the project. (Note that technically, the original data in book are also just links.)
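A minimal Python sketch of this guarantee, using hypothetical file names rather than a real Ocrocis project: the data stays readable after one of its two names is deleted, and the link count reported by the file system drops from two to one.

```python
import os
import tempfile

def link_count(path):
    """Return how many names (hardlinks) currently refer to the file."""
    return os.stat(path).st_nlink

root = tempfile.mkdtemp()
book_copy = os.path.join(root, "book_line.png")          # hypothetical name
training_copy = os.path.join(root, "training_line.png")  # hypothetical name
with open(book_copy, "wb") as f:
    f.write(b"line image bytes")
os.link(book_copy, training_copy)

count_before = link_count(training_copy)
os.remove(book_copy)  # delete the name in "book"; the data itself survives
count_after = link_count(training_copy)

with open(training_copy, "rb") as f:
    surviving = f.read()

print(count_before, count_after)
print(surviving)
```

The file system frees the data only once the last remaining name has been removed.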

This tree shows a part of an example project with two pages used for training and testing.

home
  ├── book
  │    ├── [original page images]
  │    ├── 0001
  │    │    └── [original line images]
  │    ├── 0002
  │    │    └── [original line images]
  │    └── ...
  │
  ├── iterations
  │    ├── 01
  │    │    ├── annotation
  │    │    │    ├── 0001
  │    │    │    │    ├── [links to line images in book]
  │    │    │    │    └── [original ground truths]
  │    │    │    ├── 0002
  │    │    │    │    ├── [links to line images in book]
  │    │    │    │    └── [original ground truths]
  │    │    │    └── ...
  │    │    ├── models
  │    │    │    └── [model files]
  │    │    └── index.html
  │    ├── 02
  │    │    └── ...
  │    └── ...
  │
  ├── training
  │    ├── 0001
  │    │    ├── [links to line images in book]
  │    │    └── [links to ground truths in annotation]
  │    └── ...
  │
  └── test
       ├── 0002
       │    ├── [links to line images in book]
       │    └── [links to ground truths in annotation]
       └── ...