Models, datasets and publications for research

Open source models

Our publicly available open-source models

Models Type Team

hye-calfa-n

OCR model for Armenian, designed for damaged, historical, and noisy printed documents

Text recognition (OCR)

Calfa

pie-hye-ud

Lemmatization, POS tagging, morphological analysis, and named entity recognition for Eastern Armenian

Text analysis

Calfa, Dalih

pie-hyw-ud

Lemmatization, POS tagging, morphological analysis, and named entity recognition for Western Armenian

Text analysis

Calfa, Dalih

pie-xcl-ud

Lemmatization, POS tagging, morphological analysis, and named entity recognition for Classical Armenian

Text analysis

Calfa, Dalih

Applications

Free tools developed by Calfa for researchers


Calfa Vision: free annotation tool for documents and images

Online, collaborative and free. Create your project, invite collaborators, annotate your documents and export the data.


Stamp and illumination detection tool

Online, free, easy-to-use detector for illuminations, initials, and seals in manuscripts. Simply paste the IIIF manifest URL of a manuscript or upload your own images, and our models will do the rest.
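
For readers unfamiliar with IIIF, the sketch below shows what a manifest contains: a list of page images that a tool can fetch. It is only an illustration, assuming a IIIF Presentation API 2.x manifest layout; it is not the detector's own code, and the helper name and the example.org URLs are hypothetical placeholders.

```python
from typing import Any


def image_urls_from_manifest(manifest: dict[str, Any]) -> list[str]:
    """Collect image resource URLs from a IIIF Presentation 2.x manifest.

    Assumes the 2.x layout: sequences -> canvases -> images -> resource["@id"].
    Missing keys are skipped rather than raising an error.
    """
    urls: list[str] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                url = image.get("resource", {}).get("@id")
                if url:
                    urls.append(url)
    return urls


# Hand-written manifest fragment for illustration only (placeholder URLs).
sample_manifest = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://example.org/iiif/ms-1/manifest",
    "@type": "sc:Manifest",
    "sequences": [
        {
            "canvases": [
                {
                    "@id": "https://example.org/iiif/ms-1/canvas/f1",
                    "images": [
                        {"resource": {"@id": "https://example.org/iiif/ms-1/f1/full/full/0/default.jpg"}}
                    ],
                },
                {
                    "@id": "https://example.org/iiif/ms-1/canvas/f2",
                    "images": [
                        {"resource": {"@id": "https://example.org/iiif/ms-1/f2/full/full/0/default.jpg"}}
                    ],
                },
            ]
        }
    ],
}

if __name__ == "__main__":
    for url in image_urls_from_manifest(sample_manifest):
        print(url)
```

In practice the manifest would be downloaded as JSON from the URL that a library or archive publishes for the manuscript; manifests following Presentation API 3.0 use a different structure (items instead of sequences) and would need a separate code path.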

Datasets

Open access datasets published by Calfa and partners

Dataset Lang. Team

Datalab Dulaurier

HTR ground-truth for Armenian cursive archives (Dulaurier collection - BnF)

Armenian HTR

Calfa, BnF Datalab, GREgORI

Patrologia Graeca

OCR ground-truth for noisy and dense printed Greek historical documents

Greek OCR

Calfa, GREgORI

ChiKnowPo

HTR ground-truth for Chinese xylographic imperial editions

Chinese HTR

Calfa, Collex-Persée

RASAM 1 & 2

Recognition and Analysis of Scripts in Arabic Maghrebi

Arabic HTR

Calfa, DISTAM

TARIMA

Ground-truth of the TariMa project (HTR/OCR of Maghrebi Arabic documents).

Arabic HTR

Calfa, BULAC, Collex-Persée

Iskandar

Ground-truth produced during the Alexander Hackathon for the automatic transcription of manuscripts of the Alexander Romance in Middle Arabic.

Arabic HTR

Calfa, DISTAM, LiPoL

Baybars

Middle Arabic, modern scripts

Arabic HTR

LiPoL, Ifpo, Calfa

Publications

OCR/HTR and Computer Vision

Vidal-Gorène, Chahan and Decours-Perez, Aliénor and Kasparian, Anahide and Tanelian, Ani and Ohanian, Agnès, Armenian HTR: State of the art, transcription guidelines and good practices. 2025.

Vidal-Gorène, Chahan and Decours-Perez, Aliénor, Detecting and Deciphering Damaged Medieval Armenian Inscriptions Using YOLO and Vision Transformers. In Document Analysis and Recognition -- ICDAR 2024 Workshops, pp. 22--36, Cham, 2024. Springer Nature Switzerland.

Bizais-Lillig, Marie and Vidal-Gorène, Chahan and Dupin, Boris, Optimizing HTR and Reading Order Strategies for Chinese Imperial Editions with Few-Shot Learning. In Document Analysis and Recognition -- ICDAR 2024 Workshops, pp. 37--56, Cham, 2024. Springer Nature Switzerland.

Vidal-Gorène, Chahan and Dupin, Boris and Decours-Perez, Aliénor and Riccioli, Thomas, A modular and automated annotation platform for handwritings: evaluation on under-resourced languages. In International Conference on Document Analysis and Recognition, pp. 507--522, 2021.

Vidal-Gorène, Chahan and Lucas, Noëmie and Salah, Clément and Decours-Perez, Aliénor and Dupin, Boris, RASAM -- A Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi. In International Conference on Document Analysis and Recognition, pp. 265--281, 2021.

Vidal-Gorène, Chahan and Salah, Clément and Lucas, Noëmie and Decours-Perez, Aliénor and Perrier, Antoine, Enhancing Arabic Maghribi Handwritten Text Recognition with RASAM 2: A Comprehensive Dataset and Benchmarking. In Computational Humanities Research (CHR), Aarhus, Denmark, 2024.

Vidal-Gorène, Chahan and Lucas, Noëmie and Salah, Clément and Decours-Perez, Aliénor and Dupin, Boris, RASAM -- A Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi. In GitHub repository, 2021--2024. GitHub.

Kindt, Bastien and Vidal-Gorène, Chahan, From Manuscript to Tagged Corpora. An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East. In Armeniaca, pp. 73--96, 2022. Edizioni Ca' Foscari - Digital Publishing, Fondazione Università Ca' Foscari.

Vidal-Gorène, Chahan, La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées. In The Programming Historian en français, 2023. ProgHist Ltd.

Lucas, Noëmie and Salah, Clément and Vidal-Gorène, Chahan, New Results for the Text Recognition of Arabic Maghribi Manuscripts--Managing an Under-resourced Script. In arXiv preprint arXiv:2211.16147, 2022.

Vidal-Gorène, Chahan, OCR / HTR technologies and Armenian Heritage Preservation. In Banber Hayastani gradaranneri. Gitamet'odakan handes, pp. 61--65, 2023. National Library of Armenia.

Text Analysis

Vidal-Gorène, Chahan and Tomeh, Nadi and Khurshudyan, Victoria, Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pp. 438--449, Miami, USA, 2024. Association for Computational Linguistics.

Vidal-Gorène, Chahan and Khurshudyan, Victoria and Donabédian-Demopoulos, Anaïd, Recycling and Comparing Morphological Annotation Models for Armenian Diachronic-Variational Corpus Processing. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 90--101, Barcelona, Spain (Online), 2020. International Committee on Computational Linguistics (ICCL).

Vidal-Gorène, Chahan and Kindt, Bastien, Lemmatization and POS-tagging process by using joint learning approach. Experimental results on Classical Armenian, Old Georgian, and Syriac. In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pp. 22--27, Marseille, France, 2020. European Language Resources Association (ELRA).

Kindt, Bastien and Vidal-Gorène, Chahan and Delle Donne, Saulo, Analyse automatique du grec ancien par réseau de neurones. Évaluation sur le corpus De Thessalonica Capta. In Bulletin de l'Académie Belge pour l'Etude des Langues Anciennes et Orientales, pp. 537--562, 2022.

Lexicography

Vidal-Gorène, Chahan and Decours-Perez, Aliénor and Queuche, Baptiste and Ouzounian, Agnès and Riccioli, Thomas, Digitalization and Enrichment of the Nor Bargirk‘ Haykazean Lezui: Work in Progress for Armenian Lexicography. In Journal of the Society of Armenian Studies, pp. 224--244, 2020. Brepols.

Vidal-Gorène, Chahan and Decours-Perez, Aliénor, Languages Resources for Poorly Endowed Languages: The Case Study of Classical Armenian. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 3145--3152, Marseille, France, 2020. European Language Resources Association.
