Correcting OCR Recognition of the Historical Sources Texts Using Fuzzy Sets (on the Example of an Early 20th Century Newspaper)
- Authors: Galushko I.N.
- Issue: No 1 (2023)
- Pages: 102-113
- Section: Articles
- URL: https://journals.rcsi.science/2585-7797/article/view/367037
- DOI: https://doi.org/10.7256/2585-7797.2023.1.40387
- EDN: https://elibrary.ru/OCFBSP
- ID: 367037
Abstract
This article presents an attempt to apply NLP methods to improve text recognition for historical sources. Any researcher who uses scanned-text recognition tools will run into limits on the accuracy of the pipeline (the sequence of recognition operations). Even well-trained models can produce significant errors because of the poor condition of the surviving source: cuts, folds, blots, and erased letters all interfere with high-quality recognition. We propose using a predetermined set of words that mark the presence of the study topic, together with the fuzzy sets module from SpaCy, to restore words that were recognized with mistakes. To check the quality of the text recovery procedure, we took a sample of 50 issues of the newspaper and estimated the number of words that would be excluded from semantic analysis because of incorrect recognition. All metrics were also calculated using fuzzy set patterns. On average, approximately 119.6 words per issue (mean over the 50 issues) contain misprints caused by incorrect recognition. Using fuzzy set algorithms, we were able to restore these words and include them in the semantic analysis.
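The restoration idea described in the abstract can be illustrated with a minimal sketch. This is not the author's code and does not use the SpaCy module the abstract names: plain Levenshtein edit distance stands in for the fuzzy matching step. The function names `edit_distance` and `restore_tokens`, the `max_distance` threshold, and the English keyword list are all hypothetical and chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def restore_tokens(tokens, keywords, max_distance=1):
    """Replace a token with the nearest topic keyword when it lies within
    max_distance edits (assumed threshold). Returns the restored token
    list and the number of tokens that were changed."""
    keywords = sorted(keywords)  # fixed order so ties resolve deterministically
    restored, n_restored = [], 0
    for tok in tokens:
        low = tok.lower()
        if not keywords or low in keywords:
            restored.append(tok)  # already a clean keyword, or nothing to match
            continue
        best = min(keywords, key=lambda k: edit_distance(low, k))
        if edit_distance(low, best) <= max_distance:
            restored.append(best)
            n_restored += 1
        else:
            restored.append(tok)  # too far from every keyword: leave as is
    return restored, n_restored


# Hypothetical topic keywords and a hypothetical OCR output line.
topic_words = ["bank", "credit", "exchange"]
ocr_tokens = ["The", "bonk", "raised", "crecit", "rates"]
restored, n_restored = restore_tokens(ocr_tokens, topic_words)
print(restored, n_restored)
```

Counting `n_restored` per issue and averaging over a sample gives the same kind of estimate the abstract reports. A tight threshold keeps ordinary words from being pulled toward a keyword by accident, at the cost of missing heavily damaged words.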