
What is OCR and HTR?
Optical Character Recognition (OCR) converts scanned documents, PDFs, or images into editable and searchable data.
Handwritten Text Recognition (HTR) is an advanced form of OCR that converts handwritten text into digital format.
Why use text recognition?
Automatic Text Recognition techniques achieve high accuracy for Arabic documents, enabling advanced document exploitation, metadata extraction, keyword searching, and creating searchable corpora. Recognized information can be automatically structured into databases, facilitating further analysis and integration with other systems.
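As an illustration of the keyword searching mentioned above, here is a minimal sketch of a page-level index over recognized text. It is not the method used in the projects described on this page; the function names (`normalize`, `build_index`, `search`) and the sample pages are hypothetical. The sketch assumes that search should ignore Arabic vocalization marks and tatweel, so that a vocalized word in a transcription matches an unvocalized query.

```python
import unicodedata
from collections import defaultdict

TATWEEL = "\u0640"  # Arabic elongation character, irrelevant for matching


def normalize(token: str) -> str:
    """Drop combining marks (short vowels, shadda, sukun, ...) and tatweel."""
    return "".join(
        ch for ch in token
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def build_index(pages: dict[int, str]) -> dict[str, set[int]]:
    """Map each normalized token to the set of page numbers containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for page_number, text in pages.items():
        for token in text.split():  # whitespace tokenization only
            key = normalize(token)
            if key:
                index[key].add(page_number)
    return index


def search(index: dict[str, set[int]], keyword: str) -> list[int]:
    """Return the sorted page numbers whose text contains the keyword."""
    return sorted(index.get(normalize(keyword), set()))


if __name__ == "__main__":
    # Hypothetical recognized pages: page 1 is vocalized, page 2 is not.
    pages = {1: "كِتَابٌ قديم", 2: "هذا كتاب جديد", 3: "مخطوط مغربي"}
    index = build_index(pages)
    print(search(index, "كتاب"))
```

The sketch splits on whitespace only, so punctuation attached to a word is kept as part of the token; a real corpus would likely need a more careful tokenizer and further normalization choices (for example, alef and hamza variants).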
Main recognition models for Arabic
Several models, free or premium, are available for different use cases.
Some Applications
Explore some projects where we have applied Arabic OCR and HTR.

Printed documents: the Islamic History Open Data Platform
Led by Ghent University (ERC Grant), in collaboration with Calfa
Within the scope of the Mamlukisation of the Mamluk Sultanate II project led by Ghent University, we have processed modern printed editions of 15th-century authors (Arabic historiographical texts). The generic Arabic model, which covers common fonts and layouts, reached 99.5% accuracy on average.
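The page does not say how its accuracy figures are computed. A common convention for text recognition is character accuracy, defined as one minus the character error rate (CER), where CER is the edit distance between the reference transcription and the model output divided by the reference length. The sketch below assumes that convention; the function names are illustrative only.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, assuming accuracy is measured at character level."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)
```

Under this assumed definition, a 99.5% score corresponds to about 5 character errors per 1,000 reference characters.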
Handling script variety: the TariMa – Tarih al-Maghrib project
Collaboration between BULAC, Collex-Persée, DISTAM, IREMAM and Calfa
A project focused on the history of the Arab Maghreb, through the analysis of a corpus of 10,000 pages from various historical manuscripts, lithographs, and printed documents (17th-20th centuries) from the BULAC collections. The corpus compiles a wide variety of genres, scripts, and hands, with an average recognition rate of 97.2%. Manuscripts are now searchable on BINA, the BULAC digital library.


Middle Arabic: 98% accuracy on the Sīra of Baybars
Led by ANR LiPoL, in collaboration with IFPO and Calfa
Processing of the 35,000 pages of the Sīra of Baybars, a popular epic cycle from the Ottoman era in Middle Arabic, now searchable online. These manuscripts were copied in the 20th century by several hands and mix classic scripts with very cursive ones; image quality varies widely. Documents can be consulted on Gallica (French National Library), and the recognition rate varies between 96% and 99%.
Arabic Datasets
Several datasets have been released in recent years to support the training of new recognition models. They focus mainly on historical manuscripts and scripts.


TariMa
Maghrebi Arabic manuscripts and old prints
A Collex-Persée project, annotation by Calfa



RASM
Oriental scripts, scientific manuscripts
A project by the British Library

Bibliography
- Vidal-Gorène, C., Salah, C., Lucas, N., Decours-Perez, A., & Perrier, A. (2024). Enhancing Arabic Maghribi Handwritten Text Recognition with RASAM 2: A Comprehensive Dataset and Benchmarking. In Computational Humanities Research (CHR), 3834, 200-216.
- Lucas, N., Salah, C., & Vidal-Gorène, C. (2022). New Results for the Text Recognition of Arabic Maghribi Manuscripts – Managing an Under-resourced Script. arXiv preprint arXiv:2211.16147.
- Lucas, N. (2022). OCR/HTR et graphie arabe : Les manuscrits arabes à l'heure de la reconnaissance automatique des écritures.
- Vidal-Gorène, C., Lucas, N., Salah, C., Decours-Perez, A., & Dupin, B. (2021). RASAM – a dataset for the recognition and analysis of scripts in Arabic Maghrebi. In International Conference on Document Analysis and Recognition (pp. 265-281). Springer.
- Vidal-Gorène, C., Dupin, B., Decours-Perez, A., & Riccioli, T. (2021). A modular and automated annotation platform for handwritings: evaluation on under-resourced languages. In Document Analysis and Recognition – ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part III (pp. 507-522). Springer.