Research Plan

Professionals


Rare language or dialect, handwriting legible by specialists only: an OCR/HTR model customized on your documents is capable of providing a qualitative transcription.

The Research Plan allows you to bring your own data to train a model to transcribe a very complex corpus, for the maximum price of 1€ per page.

The Research Plan in 4 steps

1. We define together the requirements of your project
Depending on the specificities of your documents, we estimate the expected recognition quality and the data required by the project.

1. You provide the transcribed data
If the data is to be created : we define together the data necessary for your model, we provide access to our semi-automated annotation tool online Calfa Vision and assistance provided by an annotation project manager.

3. We train a dedicated model specialized on your data
Depending on the results achieved, you can add a second set of data to your project for fine-tuning.

4. Your whole corpus is processed with the final model
You receive the whole transcription of your corpus.

Use Case

...
Text recognition of Arabic manuscripts from the 15th century

About 6,000 pages of a collection of Arabic manuscripts from the 15th century have been processed for automated transcription, using an HTR model customized with the data created by the research laboratory during a collaborative data creation workshop.

Client : CNRS (Groupement d'Intérêt Scientifique Moyen-Orient et mondes musulmans)

Features of the Research Plan

Languages: all
Layout: all
The Research Plan includes: assistance provided by an annotation project manager during data creation (if the step is needed), data reception and preparation, training of a customized model, text recognition of the corpus, data formatting and results delivery
Format de sortie : raw text (TXT, XML), text with coordinates in the page (XML)

Prices

Research Plan: 3,500€ including the processing of 3,500 pages (1€/page)
Additional pages: 0.40€/page rate

FAQ

The Research Plan is primarily intended for homogeneous corpora, regardless of language or complexity. For example, an entire corpus by the same author or from several similar hands and with a relatively consistent layout.

Before subscribing to this plan, we can offer an assessment of your corpus to check the feasibility of a specialized model training and whether it can yield good results.

The data required is a localized transcription of the text in XML format, that is the transcription of the text associated with localization polygons. This data can be generated using annotation tools such as Transkribus, Trancrire or Calfa Vision.

If data creation is required, the Research Plan includes support from one of our annotation project managers for this step.

You must provide the text equivalent of an average of 30 pages. More could be necessary for complex corpora and less for simpler ones.

Upon assessing your corpus, we will also provide you with an estimate of the volume of data required.

You can also choose to delegate us the task of data creation as part of a custom project.

The accuracy rate is not guaranteed as part of the Research Plan, as it depends entirely on the data provided by you. However, upon assessing your corpus, we will provide an estimate of the achievable quality rate, which generally ranges from 95% to 99%.

Yes, for example, by using the Calfa Vision tool, which allows you to add multiple users to the same project and organize collaborative annotation.

The developed model remains at your disposal without a time limit, even after the processing of your corpus is complete, as does any remaining balance of the 3,500 pre-paid pages. You can purchase additional pages at any time to process more

Please note that as part of the Research Plan, the decicated model created is not open source and cannot be downloaded.

To learn more

You have a question regarding the Research Plan or the feasibility of OCR/HTR processing on your corpus. Contact us.