Ocrocis - how it works

Workflow

There are five (and a half) commands:

Project initialization:
1. ocrocis convert - convert a PDF file or a set of images to binary PNG images
2. ocrocis burst - segment each binary page image into a directory of line images

Iteration:
1. ocrocis next - initialize the next project iteration
2. ocrocis train - prepare and run training
3. ocrocis predict --errors - run predictions and evaluations on annotated data

Model application:
1. ocrocis predict --dir [directory] - run predictions on unannotated data

Some commands require preparation before they are run (e.g. cropping images or annotating lines). See the documentation for further details.

Output flow

Ocrocis itself is silent by default. All native Ocropus output is printed to standard output for convenient logging (e.g. with > logfile.txt).

Use --verbose to print verbose information to standard error (2>).
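
If you drive Ocrocis from a script instead of a shell, the same separation of the two streams can be sketched in Python. This is only an illustration: run_logged is a hypothetical helper, not part of Ocrocis, and demo_cmd is a stand-in that prints one line to each stream so that the example runs without Ocrocis installed.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(cmd, stdout_path, stderr_path):
    """Run cmd, sending standard output and standard error to separate files."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        return subprocess.run(cmd, stdout=out, stderr=err).returncode


# Stand-in for an Ocrocis call (assumption): one line on each stream.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('ocropus output'); print('verbose info', file=sys.stderr)",
]

log_dir = tempfile.mkdtemp()
logfile = os.path.join(log_dir, "logfile.txt")   # plays the role of > logfile.txt
errfile = os.path.join(log_dir, "verbose.txt")   # plays the role of 2> verbose.txt

returncode = run_logged(demo_cmd, logfile, errfile)

with open(logfile) as f:
    logged = f.read().strip()
with open(errfile) as f:
    verbose = f.read().strip()

print(returncode, logged, verbose)
```

To log a real run, replace demo_cmd with the Ocrocis command line you would otherwise type in the shell.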

Project architecture

Data is structured

Ocrocis works on a fixed directory structure that allows it to efficiently manage the data generated in the training process:

- book repository with pages and line images
- iterations (for each iteration: annotation set, annotation HTML, models)
- training set
- test set

Data is linked

Ocrocis uses hardlinks to consolidate data between the book repository, annotation sets, training sets and test sets. Data is always linked to its source directory, never copied. This avoids redundant disk usage.

There are two kinds of source directories:

book: Original pages and line images, linked to by annotation, training and test
annotation (for each iteration): Original ground truths, linked to by training and test

The link structure is tree-like, analogous to the directory structure, not chain-like. Each link points to the original data.
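
The following Python sketch illustrates what linking instead of copying means on the file system. It does not reproduce Ocrocis internals: the temporary project root, the file name line01.png and the helper link_into are assumptions made for the example. It relies only on standard hardlink behaviour, where every link is another name for the same data.

```python
import os
import tempfile

project = tempfile.mkdtemp()  # throwaway project root (assumption)


def link_into(source, relpath):
    """Create a hardlink to source at project/relpath and return its path."""
    target = os.path.join(project, relpath)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    os.link(source, target)
    return target


# "Original" line image in the book repository (hypothetical file name).
os.makedirs(os.path.join(project, "book", "0001"))
line = os.path.join(project, "book", "0001", "line01.png")
with open(line, "wb") as f:
    f.write(b"image bytes")

# Both links are made from the book source, so there is no chain of links.
annotated = link_into(line, "iterations/01/annotation/0001/line01.png")
training = link_into(line, "training/0001/line01.png")

# All three names refer to the same data on disk, stored only once.
all_same = os.path.samefile(line, annotated) and os.path.samefile(line, training)
link_count = os.stat(line).st_nlink

print(all_same, link_count)
```

The link count reported for the file is the number of names that currently refer to its data.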

Data is unique

Note: this applies only to Linux systems. On OS X, data has to be copied because Docker restricts hardlinking outside the Docker container.

Changes to files (including links) anywhere in the link tree are reflected in the source directory. This is especially useful if you wish to modify the images (e.g. by cropping) at some iteration in the training process. The changes are reflected everywhere in the project.
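
A small Python sketch of this behaviour, with hypothetical file names and plain text standing in for image data. It also shows a general property of hardlinks that is an addition to the text above, not a statement about Ocrocis: a tool that saves by writing a new file and renaming it over the old name detaches that name from the link tree, so in that case the change is not propagated.

```python
import os
import tempfile

root = tempfile.mkdtemp()
source = os.path.join(root, "book_line.txt")      # hypothetical source file
link = os.path.join(root, "training_line.txt")    # hypothetical link to it

with open(source, "w") as f:
    f.write("uncropped")
os.link(source, link)

# In-place change through the link: opening with "w" truncates the same file.
with open(link, "w") as f:
    f.write("cropped")
with open(source) as f:
    seen_in_source = f.read()

# Replacing the file instead (write a new file, rename it over the link).
tmp = link + ".tmp"
with open(tmp, "w") as f:
    f.write("replaced")
os.replace(tmp, link)

still_linked = os.path.samefile(source, link)
with open(source) as f:
    source_after_replace = f.read()

print(seen_in_source, still_linked, source_after_replace)
```

If an image editor you use saves by replacement, it may be worth checking afterwards that the edited file still shares its data with the other directories.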

Data is safe

The link structure also allows you to safely delete any link in the tree. The original data is kept as long as there is still at least one link to it somewhere in the project. (Note that technically, the original data in book is also just a set of links.)
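
As a minimal illustration in Python, with hypothetical file names: deleting one name, even the one in book, leaves the data reachable through any remaining link.

```python
import os
import tempfile

root = tempfile.mkdtemp()
book_copy = os.path.join(root, "book_line.txt")          # hypothetical names
training_copy = os.path.join(root, "training_line.txt")

with open(book_copy, "w") as f:
    f.write("line image data")
os.link(book_copy, training_copy)

links_before = os.stat(training_copy).st_nlink

# Delete the name in "book"; the data itself is not removed.
os.remove(book_copy)

links_after = os.stat(training_copy).st_nlink
with open(training_copy) as f:
    surviving_data = f.read()

print(links_before, links_after, surviving_data)
```

The data is only freed by the file system once the last remaining link is deleted.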

This tree shows a part of an example project with two pages used for training and testing.

home
  ├── book
  │    ├── [original page images]
  │    ├── 0001
  │    │    └── [original line images]
  │    ├── 0002
  │    │    └── [original line images]
  │    └── ...
  │
  ├── iterations
  │    ├── 01
  │    │    ├── annotation
  │    │    │    ├── 0001
  │    │    │    │    ├── [links to line images in book]
  │    │    │    │    └── [original ground truths]
  │    │    │    ├── 0002
  │    │    │    │    ├── [links to line images in book]
  │    │    │    │    └── [original ground truths]
  │    │    │    └── ...
  │    │    ├── models
  │    │    │    └── [model files]
  │    │    └── index.html
  │    ├── 02
  │    │    └── ...
  │    └── ...
  │
  ├── training
  │    ├── 0001
  │    │    ├── [links to line images in book]
  │    │    └── [links to ground truths in annotation]
  │    └── ...
  │
  └── test
       ├── 0002
       │    ├── [links to line images in book]
       │    └── [links to ground truths in annotation]
       └── ...
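
To inspect which entries of such a tree share their data, files can be grouped by inode. The sketch below is not an Ocrocis feature: group_by_inode is a hypothetical helper, and the miniature project it builds (including the file names line.png and line.gt.txt) is an assumption that only mimics the layout shown above.

```python
import os
import tempfile
from collections import defaultdict


def group_by_inode(root):
    """Map each (device, inode) pair under root to the relative paths sharing it."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            groups[(st.st_dev, st.st_ino)].append(os.path.relpath(path, root))
    return {key: sorted(paths) for key, paths in groups.items()}


# Miniature stand-in project with one line image and one ground truth.
root = tempfile.mkdtemp()
for d in ("book/0001", "iterations/01/annotation/0001", "training/0001"):
    os.makedirs(os.path.join(root, d))

image = os.path.join(root, "book/0001/line.png")
with open(image, "wb") as f:
    f.write(b"image bytes")
os.link(image, os.path.join(root, "iterations/01/annotation/0001/line.png"))
os.link(image, os.path.join(root, "training/0001/line.png"))

truth = os.path.join(root, "iterations/01/annotation/0001/line.gt.txt")
with open(truth, "w") as f:
    f.write("ground truth text")
os.link(truth, os.path.join(root, "training/0001/line.gt.txt"))

groups = group_by_inode(root)
for paths in sorted(groups.values()):
    print(paths)
```

Each printed group lists the paths that are links to one and the same piece of data. A group with a single path would be a file that is not linked anywhere else in the project.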