How can AI help researchers transcribe their manuscripts (HTR)?

Calfa Vision & Maghrebi Arabic (BULAC, 2021)

For several years, we have witnessed the massive digitization of handwritten collections, archives and documents. Digitizing resources and storing them in the cloud contributes to the preservation and long-term archiving of data. Documents can therefore be accessed remotely, but their real accessibility is not guaranteed. Without reliable, high-quality character recognition on these documents, keyword search, named entity recognition and automatic document classification are impossible, and searching has to be done manually.

What does an OCR / HTR do?

An OCR (Optical Character Recognition) or HTR (Handwritten Text Recognition) system is software that automatically analyzes a scan to extract the text of interest. An effective pipeline consists of 3 steps: (1) layout analysis (identifying text regions, finding lines of text, attributing semantic tags to the lines and the regions), (2) simultaneous character and word recognition and (3) post-processing with adapted language models to ensure the reliability and quality of the prediction.

Recognition process (ms W541, f. 113v)
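
The three steps above can be sketched as a small Python skeleton. This is only an illustration of how the stages fit together, not Calfa's implementation: the names `Region`, `Line`, `run_pipeline` and the three `demo_*` functions are hypothetical, and the demo stages are trivial stand-ins for real layout analysis, recognition and language-model post-processing.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class Line:
    box: Box
    text: str = ""  # filled in by steps 2 and 3


@dataclass
class Region:
    box: Box
    tag: str  # semantic tag, e.g. "main-text" or "margin"
    lines: List[Line] = field(default_factory=list)


def run_pipeline(
    image: Any,
    analyze_layout: Callable[[Any], List[Region]],
    recognize_line: Callable[[Any, Box], str],
    post_process: Callable[[str], str],
) -> List[Region]:
    """Chain the three steps: layout analysis, recognition, post-processing."""
    regions = analyze_layout(image)                # step 1
    for region in regions:
        for line in region.lines:
            raw = recognize_line(image, line.box)  # step 2
            line.text = post_process(raw)          # step 3
    return regions


# Hypothetical stand-ins so that the skeleton can be run end to end.
def demo_layout(image: Any) -> List[Region]:
    lines = [Line(box=(0, 0, 100, 20)), Line(box=(0, 20, 100, 20))]
    return [Region(box=(0, 0, 100, 40), tag="main-text", lines=lines)]


def demo_recognizer(image: Any, box: Box) -> str:
    return f"  line at y={box[1]} "


def demo_post_process(text: str) -> str:
    return text.strip()


if __name__ == "__main__":
    for region in run_pipeline(None, demo_layout, demo_recognizer, demo_post_process):
        for line in region.lines:
            print(region.tag, line.box, repr(line.text))
```

Passing the stages as functions keeps the skeleton independent of any particular engine: each stage can be swapped for a real model without touching the loop.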

There are several open-source and proprietary solutions for extracting text from documents, especially printed documents. Tesseract is the best-known free, open-source solution and is available for most languages, while ABBYY is the market leader for print recognition.

Each document has its own specificities: particular layout, different states of conservation, specific handwriting, press format, etc. There is not always an OCR or HTR system that suits your needs, nor a software solution versatile enough to cover them. Moreover, text recognition for languages with non-Latin scripts such as Armenian, Syriac, Arabic, Chinese, Georgian, etc., is still in its infancy. The proposed architectures are rarely adapted to their specific features (ligatures, abbreviations, text direction, etc.). These languages belong to the family of so-called digitally under-resourced languages.

So how should you proceed?

Today it is possible to train neural networks to analyze a very specific layout or to process a very particular set of documents. However, to be efficient and robust, these neural networks need to be trained on large datasets. It is therefore necessary to annotate, often manually, documents similar to those we wish to recognize (this is called "ground truth creation").

Manually annotating documents, choosing a neural architecture suited to your needs, and monitoring and evaluating the training of a neural network to create a relevant model are costly and time-consuming activities. They often require investment and machine learning experience, which makes them ill-suited to the massive and rapid processing of documents, especially when the results are not satisfactory.
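
To make "evaluating" concrete: a common way to compare a model's prediction against the ground truth is the character error rate (CER), the edit distance between the two strings divided by the length of the reference. The article does not say which metric Calfa uses, so the sketch below is an assumption for illustration, and the function names are hypothetical.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, prediction: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    # Normalize both strings so that equivalent Unicode encodings of the same
    # character (precomposed vs. combining marks) are not counted as errors.
    reference = unicodedata.normalize("NFC", reference)
    prediction = unicodedata.normalize("NFC", prediction)
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, prediction) / len(reference)


if __name__ == "__main__":
    print(character_error_rate("manuscript", "manuscrit"))
```

Tracking a value like this on a held-out set of annotated lines is one simple way to see whether additional ground truth is still improving a model.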

Since 2014, Calfa has acquired expertise in tailor-made text recognition for oriental languages (printed and handwritten) and in the processing of under-resourced languages. Notably, we have developed versatile layout analysis and robust text recognition models for oriental languages. Calfa also supports research in Document Analysis: we are releasing our annotation platform, Calfa Vision, to support you in your annotation projects and help you quickly create quality ground truth that is compatible with most modern neural architectures.

Calfa Vision automated interface

How does it work?

Calfa Vision is a free, web-based assisted annotation tool that includes several models for automatically understanding a printed or handwritten document. The models can be fine-tuned in real time to match your needs very quickly. The basic steps consist of annotating documents (1) at the region and line level, and (2) at the transcription level.

In summary, Calfa Vision:

  • is free. All you need to do is create an account and start an annotation project by adding images to annotate;
  • incorporates generic layout analysis models capable of processing a large variety of documents, which automatically pre-annotate your data so that you save time and can focus on the transcription work;
  • is online and ready to use: you don't need to download or install anything on your laptop or anywhere else, as everything is processed on our secure and responsive servers;
  • is collaborative: you can add collaborators to a project and work together on the annotation of your documents;
  • does not require prior knowledge of Machine Learning: the platform learns from your corrections to specialize quickly on your documents and propose models adapted to them;
  • is particularly suitable for processing old manuscripts with non-Latin scripts, but it can quickly specialize in processing very different documents such as press and other printed documents, thanks to the fine-tuning of models pre-trained by Calfa;
  • is interoperable: annotations can be exported in various formats describing the layout and transcription of documents (PAGE XML, JSON, txt, etc.).
No matter what happens, you remain the owner of your data. Calfa does not monitor or check on your projects and does not keep any data on its servers when you delete your projects.
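
As an illustration of the interoperability point, exported annotations can be read back with standard tools. The sketch below parses a small hand-written sample in the PAGE XML style using only the Python standard library. The sample content, the file name `folio_001.jpg`, the namespace version and the function name `read_lines` are assumptions made for this example; the exact structure of Calfa Vision exports may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hand-written sample in the PAGE XML style (not an actual Calfa Vision export).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="folio_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="100,100 1900,100 1900,250 100,250"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,250 1900,250 1900,400 100,400"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_lines(page_xml: str) -> List[Tuple[str, str, str]]:
    """Return (region id, line id, transcription) for every text line."""
    root = ET.fromstring(page_xml)
    # Assumes a namespaced root tag of the form "{namespace}PcGts"; reading the
    # namespace from the file avoids hard-coding one schema version.
    namespace = root.tag.partition("}")[0].lstrip("{")
    ns = {"pc": namespace}
    results = []
    for region in root.iterfind(".//pc:TextRegion", ns):
        for line in region.iterfind("pc:TextLine", ns):
            node = line.find("pc:TextEquiv/pc:Unicode", ns)
            text = node.text if node is not None and node.text else ""
            results.append((region.get("id", ""), line.get("id", ""), text))
    return results


if __name__ == "__main__":
    for region_id, line_id, text in read_lines(SAMPLE):
        print(region_id, line_id, text)
```

The same loop could collect the `Coords` polygons alongside the text when the layout is needed, for example to build training data for another recognition engine.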

Impact of quick fine-tuning on Calfa Vision for a dedicated project

Does Calfa Vision include an OCR / HTR?

Yes, we provide our partners and customers with an integrated OCR / HTR to speed up the transcription work in the languages already processed by Calfa (other languages on request). This OCR / HTR can also be fine-tuned to closely match your script or the state of preservation of your documents, using a very small dataset to quickly obtain a competitive model. Contact us to assess your OCR / HTR needs together.

To learn more about Calfa Vision and its applications, see some of our recent talks:

Calfa will be present at ICDAR 2021 (presentation of Calfa Vision at the main conference on September 9th).

2021/06 - Intelligence artificielle et khaṭṭ maghribī. Résultats d'un hackathon (Inalco, Bulac) (French)

2021/01 - Digital Perspectives for Corpus Processing of Texts written in Armenian (Oxford Centre for Byzantine Research) (English)

2020/12 - HTR/OCR pour graphies non latines : approches et bonnes pratiques (GIS MOMM - CNRS) (French)

Calfa Team