Heidelberg Research Architecture Turkology Annual Online
Project
The "Turkologischer Anzeiger/Turkology Annual" (TA), founded by Andreas Tietze (†) and György Hazai (†), is an indispensable systematic bibliography for Turkology and Ottoman Studies. Experts from all over the world contribute to its compilation, which is funded by several institutions, including UNESCO. The volumes, edited by the Department of Oriental Studies of the University of Vienna, have until now appeared only in printed form.
With funding from the Cluster of Excellence "Asia and Europe in a Global Context", a digitization project was started. Its collaboration partners, all at Heidelberg University, were the Department of Languages and Cultures of the Near East - Islamic Studies, the Department of Computational Linguistics, and the Heidelberg Research Architecture at the Cluster of Excellence. The project received funding in 2009-10 and was active through 2013. It established the website “Turkology Annual Online” to provide an online, "re-published" version of the resource with expanded and efficient search functionality.
Lessons learned: The TA contains entries in over 20 different languages, including transcriptions of Arabic and languages using the Cyrillic alphabet, and individual records may contain passages in several different languages. We expected this to be a serious problem for Optical Character Recognition (OCR). However, syntax analysis (parsing) of the TA entries proved to be a much bigger challenge: entry types and data structures are often only implicitly marked, and some of them change from volume to volume. In addition, parsing had to cope with structural errors in the entries, mostly editorial mistakes that human readers would usually not even notice. Most of the project’s effort therefore went into tailoring the syntax analysis to be both comprehensive and robust.
Digital Output
The project digitized the first 26 printed volumes of Turkologischer Anzeiger (Turkology Annual), published between 1973 and 2012, and created an OCR-ed text version using ABBYY FineReader. Based on the OCR results, which were parsed into a database, the project developed a methodology to enhance the automatic extraction of structured data from this kind of structured multilingual text resource.
The approach was published in a peer-reviewed paper:
Heckmann, Dustin, Anette Frank, Matthias Arnold, Peter Gietz, and Christian Roth. "Citation Segmentation from Sparse & Noisy Data: A Joint Inference Approach with Markov Logic Networks." Digital Scholarship in the Humanities 31, no. 2 (2016, First published online 8 December 2014): 333-356. DOI: 10.1093/llc/fqu061.
After the project ended, maintaining the database became difficult, and the website eventually had to be shut down. Thanks to the personal engagement of former project member Dustin Heckmann, the data was not only secured but the parsing itself was also significantly improved. As a result, almost all of the 61639 records are individually accessible. From 2019 onward, the Heidelberg University Library worked to transform the project output into a bibliographic database based on the VuFind software, and in 2021 a full data set was published. Eventually, all data of the “Turkology Annual Online” was migrated to the Specialised Information Service Middle East-, North Africa- and Islamic Studies (FID NahOst) at Halle University.
Data set
The Turkology Annual Online project digitised the first 26 volumes of the Turkologischer Anzeiger / Turkology Annual (TA) and parsed the OCR'ed content into individual bibliographic records. Transforming the TA data into a digital format made many improvements to the data possible. Individual records now relate to their respective TA-specific subject heading, which the project also translated into English. A full list of cited publications is available, and related records, e.g. reviews of the main publications, are all digitally connected. In the end, 61540 of the 61639 total records were successfully and automatically extracted and are provided in JSON format.
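To put these figures in perspective, the following is a minimal Python sketch rather than the project's own code. The two record counts are taken from the text above. The sample records and their field names (`id`, `subject_heading`, `reviews_of`) are hypothetical and serve only to illustrate how linked records, such as reviews pointing to a main publication, could be grouped once the JSON has been loaded. The actual schema of the published data set may differ.

```python
import json

# Figures stated in the text above.
TOTAL_RECORDS = 61639
EXTRACTED_RECORDS = 61540

# Hypothetical sample records; the real data set may use other field names.
SAMPLE_JSON = """
[
  {"id": "r1", "subject_heading": "A", "reviews_of": null},
  {"id": "r2", "subject_heading": "A", "reviews_of": "r1"},
  {"id": "r3", "subject_heading": "B", "reviews_of": null}
]
"""


def extraction_rate(extracted: int, total: int) -> float:
    """Return the share of records that were extracted automatically."""
    return extracted / total


def reviews_by_target(records: list) -> dict:
    """Map each reviewed record id to the ids of the records reviewing it."""
    linked = {}
    for rec in records:
        target = rec.get("reviews_of")
        if target is not None:
            linked.setdefault(target, []).append(rec["id"])
    return linked


records = json.loads(SAMPLE_JSON)
rate = extraction_rate(EXTRACTED_RECORDS, TOTAL_RECORDS)
missing = TOTAL_RECORDS - EXTRACTED_RECORDS
links = reviews_by_target(records)

print(f"{rate:.2%} of records extracted, {missing} not extracted")
print(links)
```

With the counts given above, 99 records were not extracted automatically, which corresponds to an extraction rate of roughly 99.8%.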
The data set “Turkology Annual Online – Full bibliographic records” was published at heiDATA in 2021.
All TA data produced by the project was migrated to the Specialised Information Service Near East (FID Nah-Ost) where it can easily be searched online.