Heidelberg Research Architecture ECPO full text

The Project

While the database of Early Chinese Periodicals Online (ECPO) contains some full text passages, most of its content still consists of image scans with manually edited metadata. We at ECPO wish to change that and aim to make the material available as full text.

Our experiments showed that complex layouts, especially those of the newspapers (xiaobao, or entertainment newspapers), still pose a huge challenge for OCR systems (e.g. Abbyy, Ocropus, or Tesseract). To process the newspaper texts further, pages first need to be split into individual text segments. In an early pilot, we ran experiments with Pallas Ludens on the use of crowds for page segmentation and the grouping of segments. While crowds reliably created individual text segments, they failed to group the segments into meaningful semantic units (e.g. “article”), since that requires knowledge of Chinese.

To apply advanced machine learning and neural networks, we created two sets of ground truth. We used the first in the pre-processing of the pages, to identify and address the different segments of a page, such as headers or marginalia, images, advertisements, and text. We used the second to run OCR on the pre-processed text segments.

To carry out our experiments, we successfully applied for additional funding from the Research Council of Heidelberg University. We were also supported by the Konfuzius Institut an der Universität Heidelberg.

Digital Output

We produced two main sets of ground truth from the newspaper 晶報 Jing bao (The Crystal). Set 1 contains geometry data and identifies individual text segments, which can be labeled and thus combined into semantic units, like “article”. We developed a web-based annotation tool to help us process the pages and published its source code on GitHub.
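To illustrate how labeled segments can be combined into semantic units, the following is a minimal sketch in Python. The field names (`segment_id`, `bbox`, `label`, `group`), the bounding-box convention, and the sample values are all hypothetical and chosen for illustration only. They are not the actual schema of the ground truth or of the annotation tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    """One text segment on a page (hypothetical structure)."""
    segment_id: str
    bbox: Tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels
    label: str                       # e.g. "article" or "advertisement"
    group: str                       # assumed id shared by all segments of one unit


def group_segments(segments: List[Segment]) -> Dict[str, List[Segment]]:
    """Collect segments that share a group id into one semantic unit."""
    units = defaultdict(list)
    for segment in segments:
        units[segment.group].append(segment)
    return dict(units)


# Invented sample data: two segments of one article plus one advertisement.
page = [
    Segment("s1", (40, 60, 120, 800), "article", "unit-a"),
    Segment("s2", (170, 60, 120, 400), "article", "unit-a"),
    Segment("s3", (170, 480, 120, 380), "advertisement", "unit-b"),
]
units = group_segments(page)
```

Here `units["unit-a"]` holds both segments of the article, so downstream steps can treat them as one unit even though they occupy separate regions of the page.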

The second ground truth set contains an accurate set of full text data (Xie and Yip 2023). We took the sample data from the April issues of three years, namely 1939 (including advertisements), 1930, and 1920. We recorded all characters and the text’s reading direction, and linked the texts to their items in the ECPO database.

Based on the second ground truth set, we began text recognition on pre-processed text segments and achieved very good results. For texts printed in a regular grid, Henke (2021) achieved a character error rate below 3%. For segments in non-regular grids, as discussed in Henke and Arnold (2022), our experiments achieved a recognition rate of about 90% correct characters.
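For readers unfamiliar with the metric, a character error rate (CER) is commonly computed as the edit distance between the OCR output and the ground truth text, divided by the length of the ground truth. The sketch below shows this common definition in Python. It is an illustration only; whether the figures reported above were computed in exactly this way is an assumption, and the function names and sample strings are invented.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                         # deletion
                current[j - 1] + 1,                      # insertion
                previous[j - 1] + (ref_char != hyp_char) # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of ground truth characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: one wrong character out of five.
example_cer = character_error_rate("上海晶報社", "上海品報社")
```

Under this definition, a CER below 3% means fewer than three character edits per hundred ground truth characters. A rate of "correct characters" is often reported as roughly one minus the CER, although the two figures need not be computed identically.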

Data Sets

We published various data sets in the context of our full text experiments within ECPO. 

Our full text ground truth is available in the ECPO Data repository on GitHub. 

We also published our ECPO full text experiments, which are discussed in Henke (2021), on GitHub.

On GitHub you can also find the source code of the annotation tool (ECPO Annotator) as well as the segmentation pipeline (ECPO Segment), which is based on dhSegment (from EPF Lausanne).

Video Presentations

The presentation “Ground Truth, Neural Networks, OCR: Towards Full Text of Republican China Newspapers” was prepared for the Virtual Annual Conference of the Association for Asian Studies 2021 (AAS21). It introduces our approach towards full-text digitization of Republican China newspapers within the Early Chinese Periodicals Online project (ECPO). 

Since full pages cannot yet be automatically processed by OCR engines (dense document layout, special characters, imperfect visuals), we developed a workflow where individual processing steps can be adjusted to different publications. We created ground truth for annotations (labeled and grouped bounding boxes) and full text (blind double keying). We then trained a neural network to detect different visual features of the pages. This enables us to, for example, process “masthead” areas separately from “advertisement”. We OCR areas like “article” and build a dictionary to train the engine and improve text recognition. Our outcomes are quality ground truth data sets (texts and segmentation), trained models, and a full text data set for Republican China newspapers, searchable within ECPO.
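In blind double keying, two typists transcribe the same text independently, and the two versions are then compared so that every disagreement can be checked against the scan. The following Python sketch shows one possible way to list such disagreements using the standard library's `difflib`. The function name and the sample strings are invented for illustration, and this is not a description of the tooling actually used in the project.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def keying_disagreements(first: str, second: str) -> List[Tuple[str, int, str, str]]:
    """List the places where two independent transcriptions differ.

    Each entry is (kind, position in first, text in first, text in second),
    where kind is "replace", "delete", or "insert".
    """
    # autojunk=False disables difflib's heuristic for very frequent characters.
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, i1, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented example: the two typists disagree on the third character.
conflicts = keying_disagreements("上海晶報", "上海品報")
```

Passages where both versions agree are accepted as they are, and only the listed conflicts need to be resolved by a reviewer. This is why the method scales well to large amounts of text.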