Recognition of Arabic Maghrebi scripts (Hackathon results)

Arabic Maghrebi Manuscript (MS.ARA.609) © BULAC, Bina

Between January and April 2021, we organized a hackathon with the BULAC and the Research Consortium Middle-East and Muslim Worlds (GIS MOMM - CNRS) for the collaborative transcription of manuscripts in Arabic Maghrebi scripts*.

This scientific event had a threefold objective:

  • developing automatic text recognition for this type of manuscript and assessing its feasibility, given that these manuscripts present various difficulties in both layout and character shapes, on top of the classical challenges met when processing Arabic manuscripts;
  • building a database for research purposes in order to facilitate the dissemination of advanced HTR technologies, to promote automatic processing of Arabic Maghrebi handwritten collections, and to train researchers in digital humanities;
  • evaluating the use of the Calfa Vision platform for the processing of a new non-Latin script language.

The hackathon gathered 14 participants, who annotated a total of 300 images and transcribed more than 450,000 characters.

Manuscripts studied and data creation process

The manuscripts selected were ARA.609, ARA.1977 and ARA.417 from the BULAC, a library whose collection of Arabic manuscripts is the second largest in France, with more than 2,500 documents. The first two manuscripts are fully available on the BULAC digital library, Bina. These manuscripts cover a wide spectrum of common scripts, layouts and topics: ARA.609 is a treatise in verse on arithmetic, inheritance and wills (with numerous tables and fractions), whereas ARA.1977 consists of various historical and legal texts and ARA.417 presents the history of the Beys of Oran in the 13th century.

MS.ARA.1977 (p.42), MS.ARA.609 (p.124) and MS.ARA.417 (f. 12v) © BULAC, Bina

Arabic Maghrebi scripts raise several difficulties that exacerbate the issues encountered in digital humanities and in the training of intelligent systems. In particular, the manuscripts often display curved text in the margins (see image), as well as letters and words written above or below others. The lack of space between words and the variety of shapes for a single character make reading difficult, and deciphering a character often requires knowing the word beforehand. Moreover, diacritic signs, which are sometimes erroneous or offset, create strong ambiguities between letters that are already easily confused, and thus lead to erroneous word recognition. That is why the recognition of handwritten Arabic remains an open research field, in which numerous solutions are being tested.

In the context of the hackathon, we created and evaluated different types of models: layout analysis models (identification of text regions, tables, titles, etc.), models for identifying lines of text inside a document, and finally text recognition itself. Here, we favored a baseline approach (the baseline being the imaginary line on which the text is written) in order to correctly identify curved lines, and a word-level approach in order to limit the impact of ambiguous characters.
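To illustrate why a baseline copes with curved marginal text, here is a minimal sketch that treats a baseline as an ordered list of (x, y) points, i.e. a polyline. This representation, the `Point` alias and the `straightness` helper are assumptions made for illustration; they are not the Calfa Vision data format or its detection model.

```python
import math
from typing import List, Tuple

# Hypothetical representation: a baseline is an ordered list of (x, y) points.
Point = Tuple[float, float]


def polyline_length(points: List[Point]) -> float:
    """Sum of the lengths of the segments joining consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def straightness(points: List[Point]) -> float:
    """Ratio of end-to-end distance to path length, between 0 and 1.

    A perfectly straight baseline gives 1.0; the more the line bends,
    the smaller the ratio becomes.
    """
    if len(points) < 2:
        raise ValueError("a baseline needs at least two points")
    path = polyline_length(points)
    if path == 0:
        raise ValueError("all points of the baseline coincide")
    return math.dist(points[0], points[-1]) / path


# A straight line in the main text block versus a bent line in a margin.
main_text_line = [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)]
margin_line = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print(straightness(main_text_line))
print(straightness(margin_line))
```

A bounding rectangle around the bent line would also enclose neighbouring text, whereas the polyline follows the writing itself, which is the motivation for a baseline representation.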

During the eight sessions of the hackathon, the transcription work was carried out on our Calfa Vision platform, which integrates automatic analysis models and enables collaborative work. Notably, the transcribers were invited to proofread the automatic analysis generated on the documents. Their corrections were then immediately used to fine-tune the models, specializing them for Arabic Maghrebi scripts.

Results from handwritten Arabic HTR and from the crowdsourcing campaign

The hackathon and the developments carried out throughout the project made it possible to achieve an average character error rate (CER) below 5% (4.8%), a score equivalent to the one usually achieved for Latin scripts.
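For readers unfamiliar with the metric, CER is commonly computed as the edit (Levenshtein) distance between the predicted text and the reference transcription, divided by the length of the reference. The sketch below follows that common definition; the function names are illustrative, and any text normalization applied in the actual evaluation is not reproduced here.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("the reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character out of four gives a CER of 25%.
print(cer("abcd", "abxd"))  # 0.25
```

Python compares strings code point by code point, so in this sketch an Arabic diacritic encoded as a separate combining mark counts as a character of its own; a missing or misplaced diacritic therefore adds to the error count.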

Text Prediction vs Expected result (GT) - see our paper for more details

Detection of the main text and of the baselines gradually reached 97% accuracy (after 50 pages had been checked), which reduced the time spent on proofreading and transcription by more than 80% compared with transcription without automatic, specialized annotation.

The outcomes show that using a collaborative platform for automatic transcription like Calfa Vision, which integrates generic models that can be specialized on a specific dataset, is a relevant and appropriate strategy for creating data and for training effective HTR systems, in particular for under-resourced languages.

Calfa Vision Layout Analysis © Calfa

The results and the dataset will be presented at ICDAR 2021 in the context of the Arabic and Derived Script Analysis and Recognition (ASAR) workshop. An intermediate presentation of the results took place at the Rendez-vous de la Philologie Numérique organized by the BULAC; it is available on YouTube (in French only).

To learn more: Vidal-Gorène C., Lucas N., Salah C., Decours-Perez A., Dupin B. (2021) RASAM – A Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi. In: Barney Smith E.H., Pal U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12916. Springer, Cham. https://doi.org/10.1007/978-3-030-86198-8_19

RASAM Dataset: Access to the dataset

*This work was carried out with the financial support of the French Ministry of Higher Education, Research and Innovation. It is in line with the scientific focus on digital humanities defined by the Research Consortium Middle-East and Muslim Worlds (GIS MOMM). We would also like to thank all the transcribers and people who took part in the hackathon and ensured its successful completion.

Calfa Team