Automatic Text Recognition of Arabic scripts

Explore the latest in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for Arabic scripts

What is OCR and HTR?

Optical Character Recognition (OCR) converts scanned documents, PDFs, or images into editable and searchable data.
Handwritten Text Recognition (HTR) extends OCR to handwriting, converting handwritten text into machine-readable digital text.

Why use text recognition?

Automatic text recognition now reaches high accuracy on Arabic documents (from 96% to 99.5% in the projects presented on this page). This enables metadata extraction, keyword searching, and the creation of searchable corpora. Recognized text can also be structured automatically into databases for further analysis and integration with other systems.
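As an illustration of keyword searching over recognized text, here is a minimal sketch in Python. It assumes that search should ignore vocalization marks, tatweel, and hamza on alef, which is a common but not universal choice; the function names, the page identifiers, and the sample text are hypothetical and do not come from any of the tools listed on this page.

```python
import re

# Assumption: short vowels, tanwin, shadda, sukun (U+064B to U+0652)
# and the superscript alef (U+0670) are ignored when searching.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, purely typographic
# Assumption: alef with madda or hamza is folded to a bare alef.
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
})


def normalize(text: str) -> str:
    """Return a search-friendly form of an Arabic string."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(ALEF_VARIANTS)


def search(pages: dict[str, str], keyword: str) -> list[str]:
    """Return the identifiers of pages whose normalized text contains the keyword."""
    needle = normalize(keyword)
    return [page_id for page_id, text in pages.items() if needle in normalize(text)]


# Hypothetical recognized pages: the first is partly vocalized, the query is not.
pages = {
    "page_001": "سِيرَة الظاهر بيبرس",
    "page_002": "كتاب التاريخ",
}
print(search(pages, "سيرة"))
```

Normalizing both the query and the page text means a vocalized transcription and an unvocalized query can still match. A real corpus would usually build an index once rather than rescan every page per query.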

Main recognition models for Arabic

Several models, free or premium, are available for different use cases.

Tesseract

Open and free
Designed for clean printed documents

Access the model

Transkribus

Premium (open for research)
For 18th-20th century manuscripts

Access the model

Calfa

Premium access
Generic model for Arabic scripts

Demo access
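For the open model, a minimal sketch of calling the Tesseract command-line tool from Python is shown here. It assumes the `tesseract` binary and its Arabic language data (`ara`) are installed locally; the file name `page.png` and both helper functions are hypothetical. Transkribus and Calfa are accessed through their own platforms and are not covered by this sketch.

```python
import subprocess


def build_tesseract_command(image_path: str, lang: str = "ara", psm: int = 3) -> list[str]:
    """Build the command line that asks Tesseract to print recognized text to stdout."""
    return [
        "tesseract",
        str(image_path),
        "stdout",          # write the result to standard output instead of a file
        "-l", lang,        # language data to load; "ara" is Arabic
        "--psm", str(psm), # page segmentation mode; 3 is fully automatic
    ]


def recognize(image_path: str) -> str:
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


# Usage, assuming a scanned page exists at this hypothetical path:
# print(recognize("page.png"))
```

Keeping the command construction separate from the call makes it easy to inspect or log the exact invocation before running it on a large batch.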

Some Applications

Explore some projects where we have applied Arabic OCR and HTR.

Project 1
Printed documents: the Islamic History Open Data Platform

Led by Ghent University (ERC Grant), in collaboration with Calfa

As part of the Mamlukisation of the Mamluk Sultanate II project led by Ghent University, we processed modern printed editions of 15th-century Arabic historiographical texts. The generic Arabic model covers common fonts and layouts, and recognition was 99.5% accurate on average.

Learn More
Project 2
Handling script variety: the TariMa – Tarih al-Maghrib project

Collaboration between BULAC, Collex-Persée, DISTAM, IREMAM and Calfa

A project on the history of the Arab Maghreb, based on the analysis of a corpus of 10,000 pages of historical manuscripts, lithographs, and printed documents (17th-20th centuries) from the BULAC collections. The corpus spans a wide variety of genres, scripts, and hands; the average recognition rate is 97.2%. The manuscripts are now searchable on BINA, the BULAC digital library.

Learn More
Project 3
Middle Arabic: 98% accuracy on the Sīra of Baybars

Led by ANR LiPoL, in collaboration with Ifpo and Calfa

Processing of the 35,000 pages of the Sīra of Baybars, a popular epic cycle from the Ottoman era written in Middle Arabic, now searchable online. The manuscripts were copied in the 20th century by several hands, mix classic and very cursive scripts, and vary widely in image quality. The documents can be consulted on Gallica (French National Library); the recognition rate ranges from 96% to 99%.

Learn More
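The page does not say how the accuracy figures quoted for these projects were measured. A common convention in OCR and HTR evaluation is character accuracy, defined as 1 minus the character error rate (CER), where CER is the edit distance between the recognized text and a reference transcription divided by the length of the reference. The sketch below assumes that convention; the function names are illustrative and this is not presented as the evaluation method used by the projects.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, with the reference transcription as the denominator."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


# One wrong letter in a 12-character reference line.
reference = "كتاب التاريخ"
hypothesis = "كتاب التاربخ"
print(f"{character_accuracy(reference, hypothesis):.1%}")
```

Under this definition, a 99.5% rate corresponds to roughly one character error every 200 characters, and 97.2% to roughly one every 36. Note that the score can drop below zero when the hypothesis is much longer than the reference.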

Arabic Datasets

Several datasets have been released in recent years to support the training of new recognition models. They focus mainly on historical manuscripts and scripts.

Dataset 1
RASAM 1 & 2

Maghrebi Arabic manuscripts
A project by DISTAM, BULAC and Calfa

Access the Dataset
Dataset 2
TariMa

Maghrebi Arabic manuscripts and old prints
A Collex-Persée project, annotation by Calfa

Access the Dataset
Dataset 3
Iskandar

Oriental scripts
A project by DISTAM, LiPoL and Calfa

Access the Dataset
Dataset 4
Baybars

Middle-Arabic, modern scripts
A project by LiPoL, Ifpo and Calfa

Access the Dataset
Dataset 5
RASM

Oriental scripts, scientific manuscripts
A project by the British Library

Access the Dataset
Dataset 6
OpenITI

Ottoman/Persian/Urdu fonts
A project by the OpenITI consortium

Access the Dataset

Bibliography

  • Vidal-Gorène, C., Salah, C., Lucas, N., Decours-Perez, A., & Perrier, A. (2024). Enhancing Arabic Maghribi Handwritten Text Recognition with RASAM 2: A Comprehensive Dataset and Benchmarking. In Computational Humanities Research (CHR), 3834, 200-216.
  • Lucas, N., Salah, C., & Vidal-Gorène, C. (2022). New Results for the Text Recognition of Arabic Maghribi Manuscripts – Managing an Under-resourced Script. arXiv preprint arXiv:2211.16147.
  • Lucas, N. (2022). OCR/HTR et graphie arabe: Les manuscrits arabes à l'heure de la reconnaissance automatique des écritures.
  • Vidal-Gorène, C., Lucas, N., Salah, C., Decours-Perez, A., & Dupin, B. (2021). RASAM – a dataset for the recognition and analysis of scripts in Arabic Maghrebi. In International Conference on Document Analysis and Recognition (pp. 265-281). Springer.
  • Vidal-Gorène, C., Dupin, B., Decours-Perez, A., & Riccioli, T. (2021). A modular and automated annotation platform for handwritings: evaluation on under-resourced languages. In Document Analysis and Recognition – ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part III (pp. 507-522). Springer.