As the number of digitized historical documents has increased rapidly during the last few decades, it is necessary to provide efficient methods of information retrieval and knowledge extraction to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. Nowadays, OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allow performing OCR on historical document images using only a small amount of real, manually annotated training data. The presented complete OCR system includes two main tasks: page layout analysis, including text block and line segmentation, and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in their respective fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim at determining the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems. To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a little annotated training data.

Digitization of historical documents is an important task for preserving our cultural heritage. During the last few decades, the amount of digitized archival material has increased rapidly, and therefore an efficient method to convert these document images into text form has become essential to allow information retrieval and knowledge extraction on such data. Nowadays, state-of-the-art methods are usually not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents, which are very expensive and time-consuming to acquire. Therefore, this paper introduces a set of methods to convert historical scans into their textual representation for efficient information retrieval, based on a minimal number of manually annotated documents. This problem includes two main tasks: page layout analysis (including text block and line segmentation) and optical character recognition (OCR). We address all these issues and propose several approaches to solve these tasks.

This research is carried out within the framework of the Modern Access to Historical Sources project, presented through the Porta fontium portal. One goal of this project is to enable intelligent full-text access to the printed historical documents from the Czech–Bavarian border region. Accordingly, our original data sources are scanned texts from German historical newspapers printed in Fraktur from the second half of the nineteenth century. All proposed methods are evaluated and compared on real data from the Porta fontium portal.
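
The two-stage structure described here (layout analysis first, then OCR on each segmented line) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `split_lines` and `run_pipeline` are hypothetical, a simple row-projection rule stands in for the fully convolutional segmentation network, and the `recognize` callable stands in for the recurrent OCR model. The input is assumed to be a binary mask in which 1 marks a text pixel.

```python
from typing import Callable, List, Tuple

# Assumption: a page is a binary mask, 1 = text pixel, 0 = background.
Mask = List[List[int]]


def split_lines(mask: Mask, min_pixels: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, of runs of text rows.

    Stand-in for the segmentation stage: a row counts as text when it
    holds at least `min_pixels` text pixels.
    """
    lines: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(mask):
        has_text = sum(row) >= min_pixels
        if has_text and start is None:
            start = y                      # a new text line begins
        elif not has_text and start is not None:
            lines.append((start, y))       # the current text line ends
            start = None
    if start is not None:                  # text runs to the bottom edge
        lines.append((start, len(mask)))
    return lines


def run_pipeline(mask: Mask, recognize: Callable[[Mask], str]) -> List[str]:
    """Segment the page into lines, then pass each line crop to the recognizer."""
    return [recognize(mask[top:bottom]) for top, bottom in split_lines(mask)]


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 1],
    ]
    # Dummy recognizer (assumption): reports the height of each line crop.
    print(run_pipeline(page, lambda crop: f"line of {len(crop)} row(s)"))
```

Keeping the recognizer as a plain callable means either stage can be swapped out independently, which mirrors how the text treats segmentation and OCR as separate tasks.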