Arabic OCR and HTR

What are OCR and HTR?

Optical Character Recognition (OCR) converts scanned documents, PDFs, or images into editable and searchable data.
Handwritten Text Recognition (HTR) extends OCR to handwriting: it converts handwritten text into digital text.

Why use text recognition?

Automatic text recognition achieves high accuracy on Arabic documents (the projects presented below report rates between 96% and 99.5%). This enables advanced document exploitation, metadata extraction, keyword searching, and the creation of searchable corpora. The recognized text can also be structured automatically into databases, which facilitates further analysis and integration with other systems.
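Keyword searching over recognized Arabic text usually benefits from a normalization step, because the same word may appear with or without diacritics and with different alef forms. The following is a minimal sketch using only the Python standard library. The function names, the sample pages, and the choice of normalization rules (stripping diacritics and tatweel, unifying alef variants and alef maqsura) are assumptions made for illustration, not the method used by any tool named on this page.

```python
import re
import unicodedata

# Arabic short-vowel marks U+064B to U+0652, superscript alef U+0670, tatweel U+0640.
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Illustrative folding rules: hamza-carrying alefs to bare alef, alef maqsura to ya.
_FOLD = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ى": "ي"})


def normalize(text: str) -> str:
    """Return a search-friendly form of an Arabic string."""
    text = unicodedata.normalize("NFC", text)
    text = _MARKS.sub("", text)
    return text.translate(_FOLD)


def search(pages: dict, keyword: str) -> list:
    """Return the ids of the pages whose normalized text contains the keyword."""
    needle = normalize(keyword)
    return [page_id for page_id, text in pages.items() if needle in normalize(text)]


# Hypothetical recognized pages, keyed by page id.
pages = {
    "p1": "قَالَ السُّلْطَانُ",
    "p2": "كتاب التاريخ",
    "p3": "ذهب الى المدينة",
}

print(search(pages, "السلطان"))  # vocalized page matches an unvocalized query
print(search(pages, "إلى"))      # alef and alef maqsura variants are folded
```

Folding is lossy by design: it trades orthographic precision for recall, so the original recognized text should be stored alongside the normalized form.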

Main recognition models for Arabic

Several models, free or premium, are available for different use cases.

Tesseract

Open and free
Designed for clean printed documents

Access the model
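As an illustration of how the Tesseract model can be run on a page image, here is a minimal sketch that calls the `tesseract` command-line tool from Python. It assumes that the `tesseract` executable and its Arabic language data (`ara`) are installed locally. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image: str, output_base: str, lang: str = "ara") -> list:
    """Build the CLI call `tesseract <image> <output_base> -l <lang>`.

    Tesseract writes the recognized text to <output_base>.txt.
    """
    return ["tesseract", image, output_base, "-l", lang]


def recognize(image: str, output_base: str, lang: str = "ara") -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8")


# Example call (assumes a local file named page_001.png):
# text = recognize("page_001.png", "page_001")
```

Keeping the command construction separate from its execution makes the call easy to inspect and to adapt, for example to another language code.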

Transkribus

Premium (open for research)
For 18th- to 20th-century manuscripts

Access the model

Calfa

Premium access
Generic model for Arabic scripts

Demo access

Some Applications

Explore some projects where we have applied Arabic OCR and HTR.

Printed documents: the Islamic History Open Data Platform

Led by Ghent University (ERC Grant), in collaboration with Calfa

Within the Mamlukisation of the Mamluk Sultanate II project led by Ghent University, we processed modern printed editions of Arabic historiographical texts by 15th-century authors. The generic Arabic model typically covers common fonts and layouts; on this corpus, recognition accuracy averaged 99.5%.
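The page does not state which metric lies behind the accuracy figures. A common convention in text recognition is character accuracy, defined as 1 minus the character error rate (CER), where the CER is the edit distance between the reference transcription and the model output divided by the reference length. The sketch below implements that convention under this assumption. The function names and the sample strings are illustrative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a reference character
                    current[j - 1] + 1,      # insert a hypothesis character
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, with the CER normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


# Hypothetical line: the last letter was misread (one error in 12 characters).
reference = "كتاب التاريخ"
hypothesis = "كتاب التاريح"
print(f"{character_accuracy(reference, hypothesis):.2%}")
```

Under this definition, an accuracy of 99.5% corresponds to roughly one character error every 200 characters.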

Handling script variety: the TariMa – Tarih al-Maghrib project

Collaboration between BULAC, Collex-Persée, DISTAM, IREMAM and Calfa

The project studies the history of the Arab Maghreb through a corpus of 10,000 pages of historical manuscripts, lithographs, and printed documents (17th to 20th centuries) from the BULAC collections. The corpus spans a wide variety of genres, scripts, and hands; the average recognition rate is 97.2%. The manuscripts are now searchable on BINA, the BULAC digital library.

Middle Arabic: 98% accuracy on the Sīra of Baybars

Led by ANR LiPoL, in collaboration with Ifpo and Calfa

The project processed the 35,000 pages of the Sīra of Baybars, a popular epic cycle from the Ottoman era written in Middle Arabic, which are now searchable online. The manuscripts were copied in the 20th century by several hands and mix classical scripts with very cursive ones, and image quality varies widely. The recognition rate ranges from 96% to 99%. The documents can be consulted on Gallica (French National Library).

Arabic Datasets

Several datasets have been released in recent years to support the training of new recognition models. They focus mainly on historical manuscripts and scripts.

RASAM 1 & 2

Maghrebi Arabic manuscripts
A project by DISTAM, BULAC and Calfa

Access the Dataset

TariMa

Maghrebi Arabic manuscripts and old prints
A Collex-Persée project, annotation by Calfa

Access the Dataset

Iskandar

Oriental scripts
A project by DISTAM, LiPoL and Calfa

Access the Dataset

Baybars

Middle Arabic, modern scripts
A project by LiPoL, Ifpo and Calfa

Access the Dataset

RASM

Oriental scripts, scientific manuscripts
A project by the British Library

Access the Dataset

OpenITI

Ottoman/Persian/Urdu fonts
A project by the OpenITI consortium

Access the Dataset
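The page does not specify the file format of these datasets. HTR ground truth is often distributed as PAGE XML, in which each `TextLine` element carries its transcription in a `TextEquiv/Unicode` child. The sketch below reads line transcriptions under that assumption. The sample document is a simplified, hypothetical fragment (real files also contain coordinates), and `read_lines` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical PAGE XML fragment (coordinates omitted).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_001.jpg">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>كتاب التاريخ</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>قال السلطان</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_lines(page_xml: str) -> list:
    """Return the line-level transcriptions of a PAGE XML document, in file order."""
    root = ET.fromstring(page_xml)
    lines = []
    # "{*}" matches any namespace, so the PAGE schema version does not matter
    # (this wildcard requires Python 3.8 or later).
    for text_line in root.iterfind(".//{*}TextLine"):
        unicode_el = text_line.find("./{*}TextEquiv/{*}Unicode")
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text.strip())
    return lines


for line in read_lines(SAMPLE):
    print(line)
```

Pairing each returned line with its image region gives the (image, text) examples that a recognition model is trained on.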

Bibliography

Vidal-Gorène, C., Salah, C., Lucas, N., Decours-Perez, A., & Perrier, A. (2024). Enhancing Arabic Maghribi Handwritten Text Recognition with RASAM 2: A Comprehensive Dataset and Benchmarking. In Computational Humanities Research (CHR), 3834, 200-216.
Lucas, N., Salah, C., & Vidal-Gorène, C. (2022). New Results for the Text Recognition of Arabic Maghribi Manuscripts – Managing an Under-resourced Script. arXiv preprint arXiv:2211.16147.
Lucas, N. (2022). OCR/HTR et graphie arabe. Les manuscrits arabes à l'heure de la reconnaissance automatique des écritures.
Vidal-Gorène, C., Lucas, N., Salah, C., Decours-Perez, A., & Dupin, B. (2021). RASAM – a dataset for the recognition and analysis of scripts in Arabic Maghrebi. In International Conference on Document Analysis and Recognition (pp. 265-281). Springer.
Vidal-Gorène, C., Dupin, B., Decours-Perez, A., & Riccioli, T. (2021). A modular and automated annotation platform for handwritings: evaluation on under-resourced languages. In Document Analysis and Recognition – ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part III (pp. 507-522). Springer.