<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root>
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">ARTIFICIAL INTELLIGENCE AND DECISION MAKING</journal-id><journal-title-group><journal-title xml:lang="en">ARTIFICIAL INTELLIGENCE AND DECISION MAKING</journal-title><trans-title-group xml:lang="ru"><trans-title>Искусственный интеллект и принятие решений</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2071-8594</issn></journal-meta><article-meta><article-id pub-id-type="publisher-id">278297</article-id><article-id pub-id-type="doi">10.14357/20718594240407</article-id><article-id pub-id-type="edn">DDBAJC</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>Analysis of Textual and Graphical Information</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Анализ текстовой и графической информации</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">Methods for Rhetorical Structure Parsing in Russian</article-title><trans-title-group xml:lang="ru"><trans-title>Методы анализа риторических структур в текстах на русском языке</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="en"><surname>Chistova</surname><given-names>Elena V.</given-names></name><name xml:lang="ru"><surname>Чистова</surname><given-names>Елена Викторовна</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Junior Researcher</p></bio><bio xml:lang="ru"><p>Младший научный 
сотрудник</p></bio><email>chistova@isa.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences</institution></aff><aff><institution xml:lang="ru">Федеральный исследовательский центр «Информатика и управление» Российской академии наук</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2024-12-10" publication-format="electronic"><day>10</day><month>12</month><year>2024</year></pub-date><issue>4</issue><issue-title xml:lang="en"/><issue-title xml:lang="ru"/><fpage>79</fpage><lpage>92</lpage><history><date date-type="received" iso-8601-date="2025-01-28"><day>28</day><month>01</month><year>2025</year></date><date date-type="accepted" iso-8601-date="2025-01-28"><day>28</day><month>01</month><year>2025</year></date></history><permissions><copyright-statement xml:lang="en">Copyright ©</copyright-statement><copyright-statement xml:lang="ru">Copyright ©</copyright-statement></permissions><self-uri xlink:href="https://journals.rcsi.science/2071-8594/article/view/278297">https://journals.rcsi.science/2071-8594/article/view/278297</self-uri><abstract xml:lang="en"><p>The paper examines methods for discourse parsing of Russian within the framework of rhetorical structure theory. The development of a new corpus for full-text parsing of Russian-language texts of various genres is described. The applicability of various pre-trained encoder language models to rhetorical analysis is analyzed on two Russian-language corpora. We propose a method for training neural network models for rhetorical parsing on a mix of expert-annotated data. This approach allows the models to parse texts effectively regardless of variations in the rhetorical relation sets used in different corpora. 
It is evaluated on the two large multi-genre corpora of rhetorical annotation for the Russian language.</p></abstract><trans-abstract xml:lang="ru"><p>В работе анализируется опыт построения автоматических дискурсивных анализаторов для русского языка в рамках теории риторических структур (ТРС). Проводится анализ применимости различных предобученных кодирующих языковых моделей к риторическому анализу на основе двух русскоязычных корпусов. Предложен метод обучения нейросетевых моделей для автоматического анализа риторических структур на смешении любых данных экспертной ТРС-разметки, позволяющий не зависеть от различий между принятыми в них наборами риторических отношений. Метод оценен на материале двух больших мультижанровых корпусов риторической разметки для русского языка.</p></trans-abstract><kwd-group xml:lang="en"><kwd>discourse parsing</kwd><kwd>rhetorical structure theory</kwd><kwd>deep learning</kwd><kwd>Russian language</kwd></kwd-group><kwd-group xml:lang="ru"><kwd>дискурсивный анализ</kwd><kwd>теория риторических структур</kwd><kwd>глубокое обучение</kwd><kwd>русский язык</kwd></kwd-group><funding-group><funding-statement xml:lang="ru">Работа выполнена при поддержке Министерства науки и высшего образования РФ в рамках государственного задания на оказание государственных услуг в соответствии с дополнительным соглашением № 075-03-2024-490/2 (Молодежная лаборатория “Технологии анализа и контролируемого синтеза текстов”, НИР № 124042600053-7). Работа выполнялась с использованием инфраструктуры Центра коллективного пользования «Высокопроизводительные вычисления и большие данные» (ЦКП «Информатика») ФИЦ ИУ РАН (Москва).</funding-statement></funding-group></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><mixed-citation>Mikolov T. et al. Distributed representations of words and phrases and their compositionality // Advances in neural information processing systems. 
2013.</mixed-citation></ref><ref id="B2"><label>2.</label><mixed-citation>Kudo T., Richardson J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium. 2018. P. 66–71.</mixed-citation></ref><ref id="B3"><label>3.</label><mixed-citation>Devlin J. et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. V. 1 (Long and Short Papers). Minneapolis, Minnesota. 2019. P. 4171–4186.</mixed-citation></ref><ref id="B4"><label>4.</label><mixed-citation>Apidianaki M. From Word Types to Tokens and Back: A Survey of Approaches to Word Meaning Representation and Interpretation // Computational Linguistics. 2023. V. 49. No 2. P. 465–523.</mixed-citation></ref><ref id="B5"><label>5.</label><citation-alternatives><mixed-citation xml:lang="en">Mann W.C., Thompson S.A. Rhetorical structure theory: Toward a functional theory of text organization // Text-interdisciplinary Journal for the Study of Discourse. 1988. V. 8. No 3. P. 243–281.</mixed-citation><mixed-citation xml:lang="ru">Mann W.C., Thompson S.A. Rhetorical structure theory: Toward a functional theory of text organization // Text-interdisciplinary Journal for the Study of Discourse. 1988. V. 8. No 3. P. 243–281.</mixed-citation></citation-alternatives></ref><ref id="B6"><label>6.</label><mixed-citation>Dachkovsky S., Stamp R., Sandler W. Mapping the body to the discourse hierarchy in sign language emergence // Language and Cognition. 2023. V. 15. No 1. P. 53–85.</mixed-citation></ref><ref id="B7"><label>7.</label><citation-alternatives><mixed-citation xml:lang="en">Pisarevskaya D. et al. 
Towards building a discourse-annotated corpus of Russian // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2017. P. 201–212.</mixed-citation><mixed-citation xml:lang="ru">Pisarevskaya D. et al. Towards building a discourse-annotated corpus of Russian // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2017. P. 201–212.</mixed-citation></citation-alternatives></ref><ref id="B8"><label>8.</label><mixed-citation>Marcu D. The theory and practice of discourse parsing and summarization. MIT press, 2000.</mixed-citation></ref><ref id="B9"><label>9.</label><mixed-citation>Carlson L., Marcu D., Okurovsky M.E. Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory // Proceedings of the Second SIGdial Workshop on Discourse and Dialogue. 2001.</mixed-citation></ref><ref id="B10"><label>10.</label><mixed-citation>Marcus M.P., Santorini B., Marcinkiewicz M.A. Building a Large Annotated Corpus of English: The Penn Treebank // Computational Linguistics. 1993. V. 19. No 2. P. 313–330.</mixed-citation></ref><ref id="B11"><label>11.</label><mixed-citation>Li J., Xiao L. Neural-based RST Parsing And Analysis In Persuasive Discourse // Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). Online. 2021. P. 274–283.</mixed-citation></ref><ref id="B12"><label>12.</label><mixed-citation>Chistova E. et al. RST discourse parser for Russian: an experimental study of deep learning models // International Conference on Analysis of Images, Social Networks and Texts. Springer. 2020. P. 105–119.</mixed-citation></ref><ref id="B13"><label>13.</label><mixed-citation>Guz G., Huber P., Carenini G. Unleashing the Power of Neural Discourse Parsers - A Context and Structure Aware Approach Using Large Scale Pretraining // Proceedings of the 28th International Conference on Computational Linguistics. 
Barcelona, Spain (Online). 2020. P. 3794–3805.</mixed-citation></ref><ref id="B14"><label>14.</label><mixed-citation>Kobayashi N. et al. A Simple and Strong Baseline for End-to-End Neural RST-style Discourse Parsing // Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates. 2022. P. 6725–6737.</mixed-citation></ref><ref id="B15"><label>15.</label><mixed-citation>Sagae K. Analysis of Discourse Structure with Syntactic Dependencies and Data-Driven Shift-Reduce Parsing // Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09). Paris, France. 2009. P. 81–84.</mixed-citation></ref><ref id="B16"><label>16.</label><mixed-citation>Feng V.W., Hirst G. A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing // Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (V. 1: Long Papers). Baltimore, Maryland. 2014. P. 511–521.</mixed-citation></ref><ref id="B17"><label>17.</label><mixed-citation>Guz G., Carenini G. Coreference for Discourse Parsing: A Neural Approach // Proceedings of the First Workshop on Computational Approaches to Discourse. Online: Association for Computational Linguistics. 2020. P. 160–167.</mixed-citation></ref><ref id="B18"><label>18.</label><mixed-citation>Braud C., Plank B., Søgaard A. Multi-view and multi-task training of RST discourse parsers // Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee. 2016. P. 1903–1913.</mixed-citation></ref><ref id="B19"><label>19.</label><mixed-citation>Joty S. et al. Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis // Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (V. 1: Long Papers). Sofia, Bulgaria: Association for Computational Linguistics. 2013. P. 
486–496.</mixed-citation></ref><ref id="B20"><label>20.</label><mixed-citation>Joty S., Carenini G., Ng R.T. CODRA: A Novel Discriminative Framework for Rhetorical Analysis // Computational Linguistics. 2015. V. 41. No 3. P. 385–435.</mixed-citation></ref><ref id="B21"><label>21.</label><mixed-citation>Hernault H. et al. HILDA: A discourse parser using support vector machine classification // Dialogue &amp; Discourse. 2010. V. 1. No 3. P. 1–33.</mixed-citation></ref><ref id="B22"><label>22.</label><mixed-citation>Wang Y., Li S., Wang H. A Two-Stage Parsing Method for Text-Level Discourse Analysis // Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (V. 2: Short Papers). Vancouver, Canada. 2017. P. 184–188.</mixed-citation></ref><ref id="B23"><label>23.</label><mixed-citation>Li J., Li R., Hovy E. Recursive Deep Models for Discourse Parsing // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar. 2014. P. 2061–2069.</mixed-citation></ref><ref id="B24"><label>24.</label><citation-alternatives><mixed-citation xml:lang="en">Chistova E. et al. Towards the Data-driven System for Rhetorical Parsing of Russian Texts // Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019. Minneapolis, MN. 2019. P. 82–87.</mixed-citation><mixed-citation xml:lang="ru">Chistova E. et al. Towards the Data-driven System for Rhetorical Parsing of Russian Texts // Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019. Minneapolis, MN. 2019. P. 82–87.</mixed-citation></citation-alternatives></ref><ref id="B25"><label>25.</label><mixed-citation>Ji Y., Eisenstein J. Representation Learning for Text-level Discourse Parsing // Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (V. 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics. 2014. P. 
13–24.</mixed-citation></ref><ref id="B26"><label>26.</label><mixed-citation>Maekawa A. et al. Can we obtain significant success in RST discourse parsing by using Large Language Models? // Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (V. 1: Long Papers). Association for Computational Linguistics. 2024. P. 2803–2815.</mixed-citation></ref><ref id="B27"><label>27.</label><mixed-citation>Zhang L. et al. A Top-down Neural Architecture towards Text-level Parsing of Discourse Rhetorical Structure // Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online. 2020. P. 6386–6395.</mixed-citation></ref><ref id="B28"><label>28.</label><mixed-citation>Zhang L., Kong F., Zhou G. Adversarial Learning for Discourse Rhetorical Structure Parsing // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V. 1: Long Papers). Association for Computational Linguistics. 2021. P. 3946–3957.</mixed-citation></ref><ref id="B29"><label>29.</label><mixed-citation>Nguyen T.T. et al. RST Parsing from Scratch // Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online. 2021. P. 1613–1625.</mixed-citation></ref><ref id="B30"><label>30.</label><mixed-citation>Liu Z., Shi K., Chen N. DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing // Proceedings of the 2nd Workshop on Computational Approaches to Discourse. Punta Cana, Dominican Republic and Online. 2021. P. 154–164.</mixed-citation></ref><ref id="B31"><label>31.</label><citation-alternatives><mixed-citation xml:lang="en">Cardoso P.C.F., Maziero E.G. CSTNews — a discourse-annotated corpus for single and multi-document summarization of news texts in Brazilian Portuguese // 3rd RST Brazilian Meeting. 
2011.</mixed-citation><mixed-citation xml:lang="ru">Cardoso P.C.F., Maziero E.G. CSTNews — a discourse-annotated corpus for single and multi-document summarization of news texts in Brazilian Portuguese // 3rd RST Brazilian Meeting. 2011.</mixed-citation></citation-alternatives></ref><ref id="B32"><label>32.</label><citation-alternatives><mixed-citation xml:lang="en">Collovini S. et al. Summ-it: Um corpus anotado com informações discursivas visando à sumarização automática // Proceedings of TIL. 2007.</mixed-citation><mixed-citation xml:lang="ru">Collovini S. et al. Summ-it: Um corpus anotado com informações discursivas visando à sumarização automática // Proceedings of TIL. 2007.</mixed-citation></citation-alternatives></ref><ref id="B33"><label>33.</label><mixed-citation>Pardo T.A.S., Seno E.R.M. Rhetalho: um corpus de referência anotado retoricamente // Anais do V Encontro de Corpora. 2005. P. 24–25.</mixed-citation></ref><ref id="B34"><label>34.</label><mixed-citation>Pardo T.A.S., Nunes M.G.V. A construção de um corpus de textos científicos em português do Brasil e sua marcação retórica. 2003.</mixed-citation></ref><ref id="B35"><label>35.</label><mixed-citation>Stede M., Neumann A. Potsdam Commentary Corpus 2.0: Annotation for Discourse Research // LREC. 2014. P. 925–929.</mixed-citation></ref><ref id="B36"><label>36.</label><mixed-citation>Redeker G. et al. Multi-layer discourse annotation of a Dutch text corpus // LREC. 2012. V. 1. P. 2820–2825.</mixed-citation></ref><ref id="B37"><label>37.</label><mixed-citation>Iruskieta M. et al. The RST Basque TreeBank: an online search interface to check rhetorical relations // 4th workshop RST and discourse studies. 2013. P. 40–49.</mixed-citation></ref><ref id="B38"><label>38.</label><mixed-citation>Zeldes A. The GUM corpus: Creating multilayer resources in the classroom // Language Resources and Evaluation. 2017. V. 51. No 3. P. 
581–612.</mixed-citation></ref><ref id="B39"><label>39.</label><mixed-citation>Cao S. et al. The RST Spanish-Chinese Treebank // Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018). Santa Fe, New Mexico, USA. 2018. P. 156–166.</mixed-citation></ref><ref id="B40"><label>40.</label><mixed-citation>Li Y. et al. Building Chinese Discourse Corpus with Connective-driven Dependency Tree Structure // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar. 2014. P. 2105–2114.</mixed-citation></ref><ref id="B41"><label>41.</label><mixed-citation>Morey M., Muller P., Asher N. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT // Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark. 2017. P. 1319–1324.</mixed-citation></ref><ref id="B42"><label>42.</label><mixed-citation>Chistova E. et al. Classification models for RST discourse parsing of texts in Russian // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2019. P. 163–176.</mixed-citation></ref><ref id="B43"><label>43.</label><mixed-citation>Chistova E., Smirnov I. Discourse-aware text classification for argument mining // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2022. P. 93.</mixed-citation></ref><ref id="B44"><label>44.</label><mixed-citation>Wang Z., Hamza W., Florian R. Bilateral multi-perspective matching for natural language sentences // Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017. P. 4144–4150.</mixed-citation></ref><ref id="B45"><label>45.</label><mixed-citation>Zmitrovich D. et al. A family of pretrained transformer language models for Russian // arXiv preprint arXiv:2309.10931. 
2023.</mixed-citation></ref><ref id="B46"><label>46.</label><mixed-citation>Kuratov Y., Arkhipov M. Adaptation of deep bidirectional multilingual transformers for Russian language // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2019. P. 333–339.</mixed-citation></ref><ref id="B47"><label>47.</label><mixed-citation>Chistova E. Bilingual Rhetorical Structure Parsing with Large Parallel Annotations // Findings of the Association for Computational Linguistics ACL 2024. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics. 2024. P. 9689–9706.</mixed-citation></ref><ref id="B48"><label>48.</label><mixed-citation>Liu Y. et al. RoBERTa: A robustly optimized BERT pretraining approach // arXiv preprint arXiv:1907.11692. 2019.</mixed-citation></ref><ref id="B49"><label>49.</label><mixed-citation>Liu Z., Shi K., Chen N. Multilingual Neural RST Discourse Parsing // Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics. 2020. P. 6730–6738.</mixed-citation></ref><ref id="B50"><label>50.</label><mixed-citation>Costa-jussà M.R. et al. No language left behind: Scaling human-centered machine translation // arXiv preprint arXiv:2207.04672. 2022.</mixed-citation></ref></ref-list></back></article>
