Correcting OCR Recognition of the Historical Sources Texts Using Fuzzy Sets (on the Example of an Early 20th Century Newspaper)

Ilia Nikolaevich Galushko; Галушко Илья Николаевич

doi:10.7256/2585-7797.2023.1.40387

Authors: Galushko I.N.¹
Affiliations:
Issue: No 1 (2023)
Pages: 102-113
Section: Articles
URL: https://journals.rcsi.science/2585-7797/article/view/367037
DOI: https://doi.org/10.7256/2585-7797.2023.1.40387
EDN: https://elibrary.ru/OCFBSP
ID: 367037

Abstract

This article presents an attempt to apply NLP methods to improve text recognition of historical sources. Any researcher who uses tools for recognizing scanned text will run into limits on the accuracy of the pipeline (the sequence of recognition operations). Even well-trained models can produce significant errors because of the poor condition of the surviving source: cuts, folds, blots, and erased letters all interfere with high-quality recognition. Our proposal is to take a predetermined set of words that mark the presence of a topic under study and use the fuzzy sets module from spaCy to restore words that were recognized with errors. To check the quality of this text recovery procedure, we took a sample of 50 issues of the newspaper and estimated the number of words that would have been excluded from semantic analysis because of incorrect recognition. All metrics were also calculated using fuzzy set patterns. On average, approximately 119.6 words per issue (mean over the 50 issues) contained misprints caused by incorrect recognition. Using fuzzy set algorithms, we were able to restore these words and include them in the semantic analysis.
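The restoration idea described in the abstract can be illustrated with a minimal sketch. It is not the author's spaCy pipeline: it is a standard-library stand-in that matches each OCR token against a predetermined topic vocabulary by Levenshtein distance, counts how many tokens were repaired in each issue, and averages that count over the issues. The English vocabulary, the sample tokens, the function names, and the distance budget (at most 2 edits, and no more than one edit per four letters of the vocabulary word) are all assumptions made for illustration.

```python
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_keyword(token, vocabulary, max_distance=2):
    """Return the closest vocabulary word within its edit budget, else None.

    Assumption: the budget is min(max_distance, len(word) // 4), so short
    keywords tolerate fewer edits and common short words are not rewritten.
    """
    lowered = token.lower()
    best = None
    for word in sorted(vocabulary):  # sorted for deterministic tie-breaking
        distance = levenshtein(lowered, word)
        budget = min(max_distance, len(word) // 4)
        if distance <= budget and (best is None or distance < best[0]):
            best = (distance, word)
    return best[1] if best else None


def restore_issue(tokens, vocabulary):
    """Replace misrecognized keywords; return (tokens, number repaired)."""
    restored, repaired = [], 0
    for token in tokens:
        match = match_keyword(token, vocabulary)
        if match is not None and match != token.lower():
            restored.append(match)
            repaired += 1
        else:
            restored.append(token)
    return restored, repaired


# Hypothetical topic vocabulary and two toy "issues" of OCR output.
VOCABULARY = {"exchange", "dividend", "bank", "shares"}
ISSUES = [
    "the exchanqe reported a divldend and the bank shares".split(),
    "bank shares rose".split(),
]

results = [restore_issue(tokens, VOCABULARY) for tokens in ISSUES]
mean_repaired = mean(count for _, count in results)
```

Here `mean_repaired` plays the role of the per-issue average reported in the abstract: it is the mean number of topic words that an exact-match search would have missed and that fuzzy matching recovers. A length-scaled budget is one simple way to keep short function words such as "and" from being rewritten into short keywords such as "bank".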

Keywords

recognition of historical sources, OCR correction, fuzzy sets, NLP (natural language processing), text preprocessing, Birzhevye vedomosti, Levenshtein distance, content analysis, topic modeling, historical newspapers

About the authors

Ilia Nikolaevich Galushko

Email: i.galushko15@gmail.com
