This walkthrough shows how to train models from the demo PDF.
Please refer to the --help page of each command for reference and further options, such as using a specific model or several CPUs.
Note on pass-through options: Each command takes one or more pass-through options as strings. These strings are passed unchanged to the underlying calls of convert or Ocropy. For help on these options, please refer to the original documentation of those tools.
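Because each pass-through option is handed over as a single string, it has to be quoted on the command line. The following sketch illustrates the difference the quotes make to the shell. It uses a hypothetical helper named count_args, which is not part of Ocrocis and only reports how many arguments it received:

```shell
# Hypothetical helper (not part of Ocrocis): print the number of
# arguments the shell hands to a command.
count_args() {
    echo "$#"
}

# Quoted: "-density 300" arrives as ONE argument after --convert.
count_args --convert "-density 300"    # prints 2

# Unquoted: the shell splits the option string into two words.
count_args --convert -density 300      # prints 3
```

In the quoted form, the whole string "-density 300" stays together as the value of --convert, which is what the commands in this walkthrough rely on.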
First, let's bootstrap the very first models. Create a directory named demo and enter it:
mkdir demo && cd demo
Download the demo PDF to the current project directory. This way you have access to it, even if you use Docker.
If you run Ocrocis via Docker, the current directory is mounted into the Docker container. This means you only have access to files in or below the current directory, not above.
wget http://cistern.cis.lmu.de/ocrocis/demo.pdf
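If you are unsure whether a file will be visible from inside the Docker container, you can check that it lies in or below the current directory. The function below is a minimal sketch of such a check. The name is_below_cwd is made up for this example and is not an Ocrocis command:

```shell
# Hypothetical helper (not part of Ocrocis): succeed if the given file
# lies in or below the current directory, fail otherwise.
is_below_cwd() {
    # Resolve the file's directory to a physical path; fail if it does not exist.
    _dir=$(cd "$(dirname "$1")" 2>/dev/null && pwd -P) || return 1
    _here=$(pwd -P)
    # The trailing slash ensures /a/demo2 is not mistaken for a child of /a/demo.
    case "$_dir/" in
        "$_here"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_below_cwd demo.pdf && echo "visible in the container"` should print the message once the download above has finished, whereas a path such as ../demo.pdf fails the check.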
Step 1: Convert the demo PDF to binary page images. This populates the book repository with binary page images:
ocrocis convert --verbose --pdf demo.pdf --convert "-density 300" --ocropus-nlbin "-nochecks"
Convert might throw some warnings here, depending on your version. Most often you can just ignore them.
Step 2: Burst each page image into a directory of line images:
ocrocis burst --verbose
Step 3: Initialize the next iteration with a subset of the page numbers in the book repository. This creates the first annotation set inside the iterations/01/annotation directory and populates it with links to the original line images for pages 1 and 2:
ocrocis next --verbose 1 2
From these, an annotation HTML is created as iterations/01/annotation/Correction.html. Open it with your favourite web browser, either by double-clicking on it within a file manager or via the command line, like so:
firefox iterations/01/annotation/Correction.html
Safari on Mac OS X cannot save modified HTML files. It is recommended to use Firefox or Chrome.
Annotate the ground truths inside the text fields, one for each line. Save the HTML file (with sources) as the same file via your browser's Save as... feature.
Step 4: Run the training on a subset of the annotation set, in this case just page 1:
ocrocis train --verbose --ntrain 2000 --savefreq 1000 1
This command extracts the ground truths from the annotation HTML and saves them alongside the line images in the annotation set. It then creates the training directory and populates it with links to line images and ground truths in the annotation set. It then starts training on the whole training set.
This command automatically reuses the latest model of the previous iteration as a starting point, if it exists.
Step 5: After training, you can expand the global test set and run an evaluation, e.g. of the latest model:
ocrocis predict --verbose
This command automatically uses the remaining pages from the annotation set (here page 2). Just like train, it creates links in the global test set, pointing to line images and ground truths in the annotation set.
Note that the test results will be quite bad due to the tiny demo training set.
If you want to continue training within the same iteration, just run train again, this time with the last model as starting point:
ocrocis train --verbose --ntrain 2000 --savefreq 1000 --model iterations/01/models/model-00002000.pyrnn.gz 1
This does not make much sense for the demo PDF, but is quite useful for real-life data.
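Typing the model path by hand becomes tedious once several models have been saved. The sketch below picks the newest model file from a models directory. It assumes that all model files follow the naming seen above (model-00002000.pyrnn.gz, with zero-padded step counts), so that the alphabetically last name is also the latest model. The function name latest_model is made up for this example and is not part of Ocrocis:

```shell
# Hypothetical helper (not part of Ocrocis): print the model file with the
# highest step count in the given directory, or fail if there is none.
# Assumes zero-padded names such as model-00002000.pyrnn.gz.
latest_model() {
    _latest=""
    # Glob results are sorted, so the last existing match wins.
    for _f in "$1"/model-*.pyrnn.gz; do
        [ -e "$_f" ] && _latest="$_f"
    done
    [ -n "$_latest" ] && printf '%s\n' "$_latest"
}
```

With this helper, the call above could be written as `ocrocis train --verbose --ntrain 2000 --savefreq 1000 --model "$(latest_model iterations/01/models)" 1`.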
Note that each time you call train or predict, the annotation HTML will be parsed for any changes and the ground truths will be updated. In other words, you may change the annotation HTML whenever you want and be sure that the changes are used in all subsequent runs.
Step 5.x and beyond: Repeat steps 3 to 5 for several iterations, until the resulting models reach the desired quality. Each iteration will reuse the latest model from the previous iteration.
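The repetition can be pictured as a small plan per iteration. The sketch below only prints the commands from steps 3 to 5 instead of running them, because the annotation step in between has to be done by hand in the browser. The function name plan_iteration and the page numbers 3 and 4 are placeholders chosen for this example. The training options are simply the ones used earlier in this walkthrough:

```shell
# Hypothetical helper (not part of Ocrocis): print, but do not run,
# the commands for one iteration.
#   $1: pages to add to the annotation set
#   $2: subset of those pages to train on
plan_iteration() {
    echo "ocrocis next --verbose $1"
    echo "# annotate Correction.html in the browser and save it, then:"
    echo "ocrocis train --verbose --ntrain 2000 --savefreq 1000 $2"
    echo "ocrocis predict --verbose"
}

# First iteration as in this walkthrough, then a second one with
# placeholder page numbers.
plan_iteration "1 2" "1"
plan_iteration "3 4" "3"
```

Printing the plan first makes it easy to review which pages go into annotation and training before any model is trained.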
Step 6: If you want to apply a model on the whole book repository, use the --book switch. This command uses the latest model of the current project:
ocrocis predict --book --verbose
You will see that your current model performs well on the trained lines and poorly on the untrained test lines. To improve this model, you would need more training data.
For further information on model application, see the section below.