Middle Arabic: 98% of accuracy in automatic transcription on the sirā of Baybars

Between fall 2023 and summer 2024, we collaborated with ANR LiPoL and its partners (Bibliothèque nationale de France, Ifpo, MMSH, Orient & Méditerranée, and Université Saint-Joseph) to transcribe 35,000 handwritten pages of the Sīra of Baybars, a popular epic cycle from the Ottoman era in Middle Arabic.

The transcription was carried out using an HTR model specifically developed for this project, capable of detecting text in digitized pages despite variations in curvature, lighting, focus, cursive scripts and the veriety of hands.

Baybars manuscript Baybars manuscript
Two manuscripts of the sirā of Baybars

In October 2023, our team went to Beirut, hosted at IFPO, to lead a hackathon at the Institut des Lettres orientales of Université Saint-Joseph. We trained 16 people in annotation and transcription standards for Arabic HTR. These documents, transcribed manually, then fed our AI models. After validation by our experts, with recognition rates between 96% and 99%, the processing of the 35,000 pages was carried out. The results are available on the ANR LiPoL website, making the Sīra of Baybars now searchable.

Baybars manuscript
Automatic transcription of Ms3 of the sirā (96.60% accuracy), see the manuscript

To learn more about the results, visit the ANR LiPoL Blog (in French).

Open Access: The data produced during this hackathon are shared in XML format and documented on the ANR LiPoL GitLab. Since 2021, Calfa has been engaged in open science for Arabic HTR within the DISTAM consortium, sharing the main datasets for automatic transcription of handwritten Arabic: RASAM1 & 2, Tarima, Iskandar, and now Baybars. These datasets cover a wide range of Arabic scripts, languages, and document types, constituting the main basis for Arabic HTR.
To know more about these datasets: Distam blog (in French)

Calfa Team