<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article>
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">ARTIFICIAL INTELLIGENCE AND DECISION MAKING</journal-id><journal-title-group><journal-title xml:lang="en">ARTIFICIAL INTELLIGENCE AND DECISION MAKING</journal-title><trans-title-group xml:lang="ru"><trans-title>Искусственный интеллект и принятие решений</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2071-8594</issn></journal-meta><article-meta><article-id pub-id-type="publisher-id">269757</article-id><article-id pub-id-type="doi">10.14357/20718594230410</article-id><article-id pub-id-type="edn">BAKBAF</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>Analysis of Textual and Graphical Information</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Анализ текстовой и графической информации</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">Automatic Classification of Russian-Language Internet Texts by Genre</article-title><trans-title-group xml:lang="ru"><trans-title>Автоматическая классификация русскоязычных Интернет-текстов по жанрам</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="en"><surname>Lagutina</surname><given-names>Ksenia V.</given-names></name><name xml:lang="ru"><surname>Лагутина</surname><given-names>Ксения Владимировна</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Candidate of Technical Sciences, Senior Lecturer of the 
Department of Computing and Program Systems</p></bio><bio xml:lang="ru"><p>кандидат технических наук, старший преподаватель кафедры вычислительных и программных систем</p></bio><email>lagutinakv@mail.ru</email><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="en"><surname>Boychuk</surname><given-names>Elena I.</given-names></name><name xml:lang="ru"><surname>Бойчук</surname><given-names>Елена Игоревна</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Doctor of Philological Sciences, Professor of the Department of Romance Languages</p></bio><bio xml:lang="ru"><p>доктор филологических наук, профессор кафедры романских языков</p></bio><email>elena-boychouk@rambler.ru</email><xref ref-type="aff" rid="aff2"/></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="en"><surname>Lagutina</surname><given-names>Nadezhda S.</given-names></name><name xml:lang="ru"><surname>Лагутина</surname><given-names>Надежда Станиславовна</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Candidate of Physical and Mathematical Sciences, Associate Professor of the Department of Computing and Program Systems</p></bio><bio xml:lang="ru"><p>кандидат физико-математических наук, доцент кафедры вычислительных и программных систем</p></bio><email>lagutinans@rambler.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">P.G. Demidov Yaroslavl State University</institution></aff><aff><institution xml:lang="ru">Ярославский государственный университет им. П. Г. Демидова</institution></aff></aff-alternatives><aff-alternatives id="aff2"><aff><institution xml:lang="en">Yaroslavl State Pedagogical University named after K.D. 
Ushinsky</institution></aff><aff><institution xml:lang="ru">Ярославский государственный педагогический университет им. К.Д. Ушинского</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2023-12-15" publication-format="electronic"><day>15</day><month>12</month><year>2023</year></pub-date><issue>4</issue><issue-title xml:lang="en"/><issue-title xml:lang="ru"/><fpage>103</fpage><lpage>114</lpage><history><date date-type="received" iso-8601-date="2024-11-12"><day>12</day><month>11</month><year>2024</year></date><date date-type="accepted" iso-8601-date="2024-11-12"><day>12</day><month>11</month><year>2024</year></date></history><permissions><copyright-statement xml:lang="en">Copyright © 2023, ФИЦ ИУ РАН</copyright-statement><copyright-statement xml:lang="ru">Copyright © 2023,</copyright-statement><copyright-year>2023</copyright-year><copyright-holder xml:lang="en">ФИЦ ИУ РАН</copyright-holder></permissions><self-uri xlink:href="https://journals.rcsi.science/2071-8594/article/view/269757">https://journals.rcsi.science/2071-8594/article/view/269757</self-uri><abstract xml:lang="en"><p>This article examines the use of modern BERT-based language models and of models built on three types of linguistic text features for automatic genre identification, and compares these models from the standpoints of computational and classical linguistics. The authors collected their own corpus of Russian-language Internet texts in eight genres: VKontakte posts, comments, articles from the Habr portal, retail descriptions, news, scientific articles, advertisements, and movie reviews from the Kinopoisk website. Each text was represented as a vector of numerical features using each of the selected models: five BERT variants and linguistic features at the character, structure, and rhythm levels. Vectors based on linguistic features were also concatenated across two or three levels to obtain additional text models. 
Next, the vectors were classified into the eight genres using two neural network classifiers: a perceptron and an LSTM. The classification results showed that the BERT models achieved high-quality genre detection, with precision, recall, and F-measure reaching 91–99%. Combining linguistic features yielded an F-measure of about 90%. A linguistic analysis of the classification results and text models revealed the characteristic features of individual genres and possible reasons for both the high scores and the classification errors.</p></abstract><trans-abstract xml:lang="ru"><p>Статья посвящена применению современных языковых моделей на основе BERT и трех типов лингвистических характеристик текста для автоматического определения жанра, а также сравнительному анализу данных моделей с точки зрения компьютерной и классической лингвистики. Собран корпус из русскоязычных Интернет-текстов восьми жанров: посты ВКонтакте, комментарии, статьи с портала Хабр, описания компаний, новости, научные статьи, реклама, отзывы на фильмы с сайта Кинопоиск. Каждый текст представлен в виде вектора числовых характеристик с помощью каждой из выбранных моделей: пяти вариаций BERT и лингвистических характеристик уровней символов, структуры и ритма.</p></trans-abstract><kwd-group xml:lang="en"><kwd>stylometry</kwd><kwd>natural language processing</kwd><kwd>rhythm features</kwd><kwd>genres</kwd><kwd>text classification</kwd><kwd>BERT</kwd></kwd-group><kwd-group xml:lang="ru"><kwd>стилометрия</kwd><kwd>обработка естественного языка</kwd><kwd>ритмические характеристики</kwd><kwd>жанры</kwd><kwd>классификация текстов</kwd><kwd>BERT</kwd></kwd-group><funding-group><funding-statement xml:lang="en">This work was supported by the scholarship of the President of the Russian Federation for young scientists and postgraduate students carrying out promising research and development in priority areas of modernisation of the Russian economy: No. 
SP-2109.2021.5</funding-statement><funding-statement xml:lang="ru">Работа поддержана стипендией Президента Российской Федерации для молодых ученых и аспирантов, осуществляющих перспективные научные исследования и разработки по приоритетным направлениям модернизации российской экономики: № СП-2109.2021.5</funding-statement></funding-group></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><citation-alternatives><mixed-citation xml:lang="en">Bahtin, M. M., Kapanadze, L. A. Teoriya rechevyh zhanrov v kontekste lingvisticheskogo gradovedeniya [The theory of speech genres in the context of linguistic urban studies] // Sociolingvistika: yazykovoj oblik sovremennogo goroda 2-e izd., ispr. i dop. Uchebnik i praktikum dlya vuzov. [Sociolinguistics: the linguistic appearance of the modern city, 2nd ed., corr. and add. Textbook and workshop for universities]. Moscow: Izdatel'stvo Yurajt, 2022. P. 45–52.</mixed-citation><mixed-citation xml:lang="ru">Бахтин М. М., Капанадзе Л. А. Теория речевых жанров в контексте лингвистического градоведения //Социолингвистика: языковой облик современного города 2-е изд., испр. и доп. Учебник и практикум для вузов. М.: Издательство Юрайт, 2022. С. 45–52.</mixed-citation></citation-alternatives></ref><ref id="B2"><label>2.</label><citation-alternatives><mixed-citation xml:lang="en">Kuzman T., Rupnik P., Ljubešić N. The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild // Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022. P. 1584–1594.</mixed-citation><mixed-citation xml:lang="ru">Kuzman T., Rupnik P., Ljubešić N. The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild //Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022. P. 1584–1594.</mixed-citation></citation-alternatives></ref><ref id="B3"><label>3.</label><citation-alternatives><mixed-citation xml:lang="en">Galichkina E. N. 
Tipologiya rechevyh zhanrov setevoj komp'yuternoj kommunikacii [Typology of speech genres of network computer communication] // Izvestiya Volgogradskogo gosudarstvennogo pedagogicheskogo universiteta [News of the Volgograd State Pedagogical University]. 2019. No 2 (135). P. 97–100.</mixed-citation><mixed-citation xml:lang="ru">Галичкина Е. Н. Типология речевых жанров сетевой компьютерной коммуникации //Известия Волгоградского государственного педагогического университета. 2019. №. 2 (135). С. 97–100.</mixed-citation></citation-alternatives></ref><ref id="B4"><label>4.</label><citation-alternatives><mixed-citation xml:lang="en">Tarabarina Y. A. Rechevye zhanry kak professional'naya osnova soderzhaniya inoyazychnogo obucheniya studentov gradostroitel'nogo napravleniya podgotovki [Speech genres as a professional content element of foreign language teaching of urban planning students] // Sovremennoe pedagogicheskoe obrazovanie [Modern Pedagogical Education]. 2023. No 1. P. 176–183.</mixed-citation><mixed-citation xml:lang="ru">Тарабарина Ю. А. Речевые жанры как профессиональная основа содержания иноязычного обучения студентов градостроительного направления подготовки //Современное педагогическое образование. 2023. №. 1. С. 176–183.</mixed-citation></citation-alternatives></ref><ref id="B5"><label>5.</label><citation-alternatives><mixed-citation xml:lang="en">Kuznetsov A. V., Pisanova T. V. Klassifikaciya zhurnalistskih zhanrov v Ispanii i Rossii: nezavisimye puti k edinomu podhodu [Classification of journalistic genres in Russia and Spain: independent solutions and consistent approach] // Vestnik Moskovskogo gosudarstvennogo lingvisticheskogo universiteta. Gumanitarnye nauki [Vestnik of Moscow State Linguistic University. Humanities]. 2020. No 7 (836). P. 102–112.</mixed-citation><mixed-citation xml:lang="ru">Кузнецов А. В., Писанова Т. В. 
Классификация журналистских жанров в Испании и России: независимые пути к единому подходу //Вестник Московского государственного лингвистического университета. Гуманитарные науки. 2020. №. 7 (836). С. 102–112.</mixed-citation></citation-alternatives></ref><ref id="B6"><label>6.</label><citation-alternatives><mixed-citation xml:lang="en">Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P. &amp; He, L. A survey on text classification: From traditional to deep learning // ACM Transactions on Intelligent Systems and Technology (TIST). 2022. V. 13. No 2. P. 1–41.</mixed-citation><mixed-citation xml:lang="ru">Li Q., Peng H., Li J., Xia C., Yang R., Sun L., Yu P. &amp; He L. A survey on text classification: From traditional to deep learning // ACM Transactions on Intelligent Systems and Technology (TIST). 2022. V. 13. No 2. P. 1–41.</mixed-citation></citation-alternatives></ref><ref id="B7"><label>7.</label><citation-alternatives><mixed-citation xml:lang="en">Onan A. An ensemble scheme based on language function analysis and feature engineering for text genre classification // Journal of Information Science. 2018. V. 44. No 1. P. 28–47.</mixed-citation><mixed-citation xml:lang="ru">Onan A. An ensemble scheme based on language function analysis and feature engineering for text genre classification //Journal of Information Science. 2018. V. 44. No 1. P. 28–47.</mixed-citation></citation-alternatives></ref><ref id="B8"><label>8.</label><citation-alternatives><mixed-citation xml:lang="en">Al-Yahya M. Stylometric analysis of classical Arabic texts for genre detection // The Electronic Library. 2018. V. 36. No 5. P. 842–855.</mixed-citation><mixed-citation xml:lang="ru">Al-Yahya M. Stylometric analysis of classical Arabic texts for genre detection //The Electronic Library. 2018. V. 36. No 5. P. 
842–855.</mixed-citation></citation-alternatives></ref><ref id="B9"><label>9.</label><citation-alternatives><mixed-citation xml:lang="en">Batraeva I.A., Nartsev A.D., Lezgyan A.S. Ispol'zovanie analiza semanticheskoj blizosti slov pri reshenii zadachi opredeleniya zhanrovoj prinadlezhnosti tekstov metodami glubokogo obucheniya [Using the analysis of semantic proximity of words in solving the problem of determining the genre of texts within deep learning] // Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitel'naya tekhnika i informatika [Tomsk State University Journal of Control and Computer Science] 2020. No 50. P. 14–22.</mixed-citation><mixed-citation xml:lang="ru">Батраева И. А., Нарцев А. Д., Лезгян А. С. Использование анализа семантической близости слов при решении задачи определения жанровой принадлежности текстов методами глубокого обучения //Вестник Томского государственного университета. Управление, вычислительная техника и информатика. 2020. №. 50. С. 14–22.</mixed-citation></citation-alternatives></ref><ref id="B10"><label>10.</label><citation-alternatives><mixed-citation xml:lang="en">Le Mens, G., Kovács, B., Hannan, M. T., &amp; Pros, G. Using machine learning to uncover the semantics of concepts: how well do typicality measures extracted from a BERT text classifier match human judgments of genre typicality? // Sociological Science. 2023. V. 10. No 3. P.82–117.</mixed-citation><mixed-citation xml:lang="ru">Le Mens G., Kovács B., Hannan M. T., &amp; Pros G. Using machine learning to uncover the semantics of concepts: how well do typicality measures extracted from a BERT text classifier match human judgments of genre typicality? // Sociological Science. 2023. V. 10. No 3. P.82–117.</mixed-citation></citation-alternatives></ref><ref id="B11"><label>11.</label><mixed-citation>Sharoff S. Functional text dimensions for the annotation of web corpora // Corpora. 2018. V. 13. No 1. P. 
65–95.</mixed-citation></ref><ref id="B12"><label>12.</label><citation-alternatives><mixed-citation xml:lang="en">Madjarov, G., Vidulin, V., Dimitrovski, I., &amp; Kocev, D. Web genre classification with methods for structured output prediction // Information Sciences. 2019. V. 503. P. 551–573.</mixed-citation><mixed-citation xml:lang="ru">Madjarov G., Vidulin V., Dimitrovski I., &amp; Kocev D. Web genre classification with methods for structured output prediction // Information Sciences. 2019. V. 503. P. 551–573.</mixed-citation></citation-alternatives></ref><ref id="B13"><label>13.</label><citation-alternatives><mixed-citation xml:lang="en">Rasheed, A., Umar, A. I., Shirazi, S. H., Khan, Z., &amp; Shahzad, M. Cover-based multiple book genre recognition using an improved multimodal network // International Journal on Document Analysis and Recognition (IJDAR). 2023. V. 26. No 1. P. 65–88.</mixed-citation><mixed-citation xml:lang="ru">Rasheed A., Umar A. I., Shirazi S. H., Khan Z., &amp; Shahzad M. Cover-based multiple book genre recognition using an improved multimodal network //International Journal on Document Analysis and Recognition (IJDAR). 2023. V. 26. No 1. P. 65–88.</mixed-citation></citation-alternatives></ref><ref id="B14"><label>14.</label><mixed-citation>Da Cunha I., Montané M. A. A corpus-based analysis of textual genres in the administration domain // Discourse Studies. 2020. V. 22. No 1. P. 3–31.</mixed-citation></ref><ref id="B15"><label>15.</label><citation-alternatives><mixed-citation xml:lang="en">Gorbich L.G., Zhivoderov A.A. Ispol'zovanie statisticheskih indeksov dlya razlicheniya nauchnyh i nauchno-populyarnyh tekstov na primere trudov A. E. Fersmana [Using statistical indexes to distinguish between scientific and popular science texts on the example of the works of A.E. Fersman] // Programmnye produkty i sistemy [Software &amp; Systems]. 2020. V. 33. No 4. P. 720–725.</mixed-citation><mixed-citation xml:lang="ru">Горбич Л. Г., Живодеров А. 
А. Использование статистических индексов для различения научных и научно-популярных текстов на примере трудов А. Е. Ферсмана // Программные продукты и системы. 2020. Т. 33. №. 4. С. 720–725.</mixed-citation></citation-alternatives></ref><ref id="B16"><label>16.</label><mixed-citation>Corrêa Jr E. A., Marinho V. Q., Amancio D. R. Semantic flow in language networks discriminates texts by genre and publication date //Physica A: Statistical Mechanics and its Applications. 2020. V. 557. P. 124895.</mixed-citation></ref><ref id="B17"><label>17.</label><citation-alternatives><mixed-citation xml:lang="en">Lagutina, K., Poletaev, A., Lagutina, N., Boychuk, E., Paramonov, I. Automatic extraction of rhythm figures and analysis of their dynamics in prose of 19th-21st centuries. // Proceedings of the 26th Conference of Open Innovations Association FRUCT, Yaroslavl, Russia, 20-24 April 2020. IEEE. P. 247-255.</mixed-citation><mixed-citation xml:lang="ru">Lagutina K., Poletaev A., Lagutina N., Boychuk E., Paramonov I. Automatic extraction of rhythm figures and analysis of their dynamics in prose of 19th-21st centuries. // Proceedings of the 26th Conference of Open Innovations Association FRUCT, Yaroslavl, Russia, 20-24 April 2020. IEEE. P. 247-255.</mixed-citation></citation-alternatives></ref><ref id="B18"><label>18.</label><mixed-citation>Lagutina K., Lagutina N., Boychuk E., Larionov V., Paramonov I. Authorship verification of literary texts with rhythm features. // Proceedings of the 28th Conference of Open Innovations Association FRUCT, Moscow, Russia, 27-29 January 2021. IEEE. P. 240-251.</mixed-citation></ref><ref id="B19"><label>19.</label><mixed-citation>Kuratov, Y., Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. // Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2019”, Moscow, May 29–June 1, 2019. P. 
333-339.</mixed-citation></ref><ref id="B20"><label>20.</label><mixed-citation>Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. // Proceedings of NAACL-HLT, Minneapolis, Minnesota, June 2–7, 2019. P. 4171–4186.</mixed-citation></ref></ref-list></back></article>
