<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root>
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">ARTIFICIAL INTELLIGENCE AND DECISION MAKING</journal-id><journal-title-group><journal-title xml:lang="en">ARTIFICIAL INTELLIGENCE AND DECISION MAKING</journal-title><trans-title-group xml:lang="ru"><trans-title>Искусственный интеллект и принятие решений</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2071-8594</issn></journal-meta><article-meta><article-id pub-id-type="publisher-id">278297</article-id><article-id pub-id-type="doi">10.14357/20718594240407</article-id><article-id pub-id-type="edn">DDBAJC</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>Analysis of Textual and Graphical Information</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Анализ текстовой и графической информации</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">Methods for Rhetorical Structure Parsing in Russian</article-title><trans-title-group xml:lang="ru"><trans-title>Методы анализа риторических структур в текстах на русском языке</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="en"><surname>Chistova</surname><given-names>Elena V.</given-names></name><name xml:lang="ru"><surname>Чистова</surname><given-names>Елена Викторовна</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Junior Researcher</p></bio><bio xml:lang="ru"><p>Младший научный 
сотрудник</p></bio><email>chistova@isa.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences</institution></aff><aff><institution xml:lang="ru">Федеральный исследовательский центр «Информатика и управление» Российской академии наук</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2024-12-10" publication-format="electronic"><day>10</day><month>12</month><year>2024</year></pub-date><issue>4</issue><issue-title xml:lang="en"/><issue-title xml:lang="ru"/><fpage>79</fpage><lpage>92</lpage><history><date date-type="received" iso-8601-date="2025-01-28"><day>28</day><month>01</month><year>2025</year></date><date date-type="accepted" iso-8601-date="2025-01-28"><day>28</day><month>01</month><year>2025</year></date></history><permissions><copyright-statement xml:lang="en">Copyright ©</copyright-statement><copyright-statement xml:lang="ru">Copyright ©</copyright-statement></permissions><self-uri xlink:href="https://journals.rcsi.science/2071-8594/article/view/278297">https://journals.rcsi.science/2071-8594/article/view/278297</self-uri><abstract xml:lang="en"><p>The paper examines methods for discourse parsing of Russian within the framework of rhetorical structure theory. The development of a new corpus for full-text parsing of Russian-language texts of various genres is described. The applicability of various pre-trained encoder language models to rhetorical analysis is analyzed on two Russian-language corpora. We propose a method for training neural network models for rhetorical parsing on a mix of expert-annotated data. This approach allows the models to parse texts effectively regardless of variations in the rhetorical relation sets used in different corpora. 
It is evaluated on the two large multi-genre corpora of rhetorical annotation for the Russian language.</p></abstract><trans-abstract xml:lang="ru"><p>В работе анализируется опыт построения автоматических дискурсивных анализаторов для русского языка в рамках теории риторических структур (ТРС). Проводится анализ применимости различных предобученных кодирующих языковых моделей к риторическому анализу на основе двух русскоязычных корпусов. Предложен метод обучения нейросетевых моделей для автоматического анализа риторических структур на смешении любых данных экспертной ТРС-разметки, позволяющий не зависеть от различий между принятыми в них наборами риторических отношений. Метод оценен на материале двух больших мультижанровых корпусов риторической разметки для русского языка.</p></trans-abstract><kwd-group xml:lang="en"><kwd>discourse parsing</kwd><kwd>rhetorical structure theory</kwd><kwd>deep learning</kwd><kwd>Russian language</kwd></kwd-group><kwd-group xml:lang="ru"><kwd>дискурсивный анализ</kwd><kwd>теория риторических структур</kwd><kwd>глубокое обучение</kwd><kwd>русский язык</kwd></kwd-group><funding-group><funding-statement xml:lang="ru">Работа выполнена при поддержке Министерства науки и высшего образования РФ в рамках государственного задания на оказание государственных услуг в соответствии с дополнительным соглашением № 075-03-2024-490/2 (Молодежная лаборатория “Технологии анализа и контролируемого синтеза текстов”, НИР № 124042600053-7). Работа выполнялась с использованием инфраструктуры Центра коллективного пользования «Высокопроизводительные вычисления и большие данные» (ЦКП «Информатика») ФИЦ ИУ РАН (Москва).</funding-statement></funding-group></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><mixed-citation>Mikolov T. et al. Distributed representations of words and phrases and their compositionality // Advances in neural information processing systems. 
2013.</mixed-citation></ref><ref id="B2"><label>2.</label><mixed-citation>Kudo T., Richardson J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium. 2018. P. 66–71.</mixed-citation></ref><ref id="B3"><label>3.</label><mixed-citation>Devlin J. et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. V. 1 (Long and Short Papers). Minneapolis, Minnesota. 2019. P. 4171–4186.</mixed-citation></ref><ref id="B4"><label>4.</label><mixed-citation>Apidianaki M. From Word Types to Tokens and Back: A Survey of Approaches to Word Meaning Representation and Interpretation // Computational Linguistics. 2023. V. 49. No 2. P. 465–523.</mixed-citation></ref><ref id="B5"><label>5.</label><citation-alternatives><mixed-citation xml:lang="en">Mann W.C., Thompson S.A. Rhetorical structure theory: Toward a functional theory of text organization // Text-interdisciplinary Journal for the Study of Discourse. 1988. V. 8. No 3. P. 243–281.</mixed-citation><mixed-citation xml:lang="ru">Mann W.C., Thompson S.A. Rhetorical structure theory: Toward a functional theory of text organization // Text-interdisciplinary Journal for the Study of Discourse. 1988. V. 8. No 3. P. 243–281.</mixed-citation></citation-alternatives></ref><ref id="B6"><label>6.</label><mixed-citation>Dachkovsky S., Stamp R., Sandler W. Mapping the body to the discourse hierarchy in sign language emergence // Language and Cognition. 2023. V. 15. No 1. P. 53–85.</mixed-citation></ref><ref id="B7"><label>7.</label><citation-alternatives><mixed-citation xml:lang="en">Pisarevskaya D. et al. 
Towards building a discourse-annotated corpus of Russian // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2017. P. 201–212.</mixed-citation><mixed-citation xml:lang="ru">Pisarevskaya D. et al. Towards building a discourse-annotated corpus of Russian // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2017. P. 201–212.</mixed-citation></citation-alternatives></ref><ref id="B8"><label>8.</label><mixed-citation>Marcu D. The theory and practice of discourse parsing and summarization. MIT press, 2000.</mixed-citation></ref><ref id="B9"><label>9.</label><mixed-citation>Carlson L., Marcu D., Okurovsky M.E. Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory // Proceedings of the Second SIGdial Workshop on Discourse and Dialogue. 2001.</mixed-citation></ref><ref id="B10"><label>10.</label><mixed-citation>Marcus M.P., Santorini B., Marcinkiewicz M.A. Building a Large Annotated Corpus of English: The Penn Treebank // Computational Linguistics. 1993. V. 19. No 2. P. 313–330.</mixed-citation></ref><ref id="B11"><label>11.</label><mixed-citation>Li J., Xiao L. Neural-based RST Parsing And Analysis In Persuasive Discourse // Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). Online. 2021. P. 274–283.</mixed-citation></ref><ref id="B12"><label>12.</label><mixed-citation>Chistova E. et al. RST discourse parser for Russian: an experimental study of deep learning models // International Conference on Analysis of Images, Social Networks and Texts. Springer. 2020. P. 105–119.</mixed-citation></ref><ref id="B13"><label>13.</label><mixed-citation>Guz G., Huber P., Carenini G. Unleashing the Power of Neural Discourse Parsers - A Context and Structure Aware Approach Using Large Scale Pretraining // Proceedings of the 28th International Conference on Computational Linguistics. 
Barcelona, Spain (Online). 2020. P. 3794–3805.</mixed-citation></ref><ref id="B14"><label>14.</label><mixed-citation>Kobayashi N. et al. A Simple and Strong Baseline for End-to-End Neural RST-style Discourse Parsing // Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates. 2022. P. 6725–6737.</mixed-citation></ref><ref id="B15"><label>15.</label><mixed-citation>Sagae K. Analysis of Discourse Structure with Syntactic Dependencies and Data-Driven Shift-Reduce Parsing // Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09). Paris, France. 2009. P. 81–84.</mixed-citation></ref><ref id="B16"><label>16.</label><mixed-citation>Feng V.W., Hirst G. A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing // Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (V. 1: Long Papers). Baltimore, Maryland. 2014. P. 511–521.</mixed-citation></ref><ref id="B17"><label>17.</label><mixed-citation>Guz G., Carenini G. Coreference for Discourse Parsing: A Neural Approach // Proceedings of the First Workshop on Computational Approaches to Discourse. Online: Association for Computational Linguistics. 2020. P. 160–167.</mixed-citation></ref><ref id="B18"><label>18.</label><mixed-citation>Braud C., Plank B., Søgaard A. Multi-view and multi-task training of RST discourse parsers // Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee. 2016. P. 1903–1913.</mixed-citation></ref><ref id="B19"><label>19.</label><mixed-citation>Joty S. et al. Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis // Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (V. 1: Long Papers). Sofia, Bulgaria: Association for Computational Linguistics. 2013. P. 
486–496.</mixed-citation></ref><ref id="B20"><label>20.</label><mixed-citation>Joty S., Carenini G., Ng R.T. CODRA: A Novel Discriminative Framework for Rhetorical Analysis // Computational Linguistics. 2015. V. 41. No 3. P. 385–435.</mixed-citation></ref><ref id="B21"><label>21.</label><mixed-citation>Hernault H. et al. HILDA: A discourse parser using support vector machine classification // Dialogue &amp; Discourse. 2010. V. 1. No 3. P. 1–33.</mixed-citation></ref><ref id="B22"><label>22.</label><mixed-citation>Wang Y., Li S., Wang H. A Two-Stage Parsing Method for Text-Level Discourse Analysis // Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (V. 2: Short Papers). Vancouver, Canada. 2017. P. 184–188.</mixed-citation></ref><ref id="B23"><label>23.</label><mixed-citation>Li J., Li R., Hovy E. Recursive Deep Models for Discourse Parsing // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar. 2014. P. 2061–2069.</mixed-citation></ref><ref id="B24"><label>24.</label><citation-alternatives><mixed-citation xml:lang="en">Chistova E. et al. Towards the Data-driven System for Rhetorical Parsing of Russian Texts // Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019. Minneapolis, MN. 2019. P. 82–87.</mixed-citation><mixed-citation xml:lang="ru">Chistova E. et al. Towards the Data-driven System for Rhetorical Parsing of Russian Texts // Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019. Minneapolis, MN. 2019. P. 82–87.</mixed-citation></citation-alternatives></ref><ref id="B25"><label>25.</label><mixed-citation>Ji Y., Eisenstein J. Representation Learning for Text-level Discourse Parsing // Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (V. 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics. 2014. P. 
13–24.</mixed-citation></ref><ref id="B26"><label>26.</label><mixed-citation>Maekawa A. et al. Can we obtain significant success in RST discourse parsing by using Large Language Models? // Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (V. 1: Long Papers). Association for Computational Linguistics. 2024. P. 2803–2815.</mixed-citation></ref><ref id="B27"><label>27.</label><mixed-citation>Zhang L. et al. A Top-down Neural Architecture towards Text-level Parsing of Discourse Rhetorical Structure // Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online. 2020. P. 6386–6395.</mixed-citation></ref><ref id="B28"><label>28.</label><mixed-citation>Zhang L., Kong F., Zhou G. Adversarial Learning for Discourse Rhetorical Structure Parsing // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V. 1: Long Papers). Association for Computational Linguistics. 2021. P. 3946–3957.</mixed-citation></ref><ref id="B29"><label>29.</label><mixed-citation>Nguyen T.T. et al. RST Parsing from Scratch // Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online. 2021. P. 1613–1625.</mixed-citation></ref><ref id="B30"><label>30.</label><mixed-citation>Liu Z., Shi K., Chen N. DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing // Proceedings of the 2nd Workshop on Computational Approaches to Discourse. Punta Cana, Dominican Republic and Online. 2021. P. 154–164.</mixed-citation></ref><ref id="B31"><label>31.</label><citation-alternatives><mixed-citation xml:lang="en">Cardoso P.C.F., Maziero E.G. CSTNews — a discourse-annotated corpus for single and multi-document summarization of news texts in Brazilian Portuguese // 3rd RST Brazilian Meeting. 
2011.</mixed-citation><mixed-citation xml:lang="ru">Cardoso P.C.F., Maziero E.G. CSTNews — a discourse-annotated corpus for single and multi-document summarization of news texts in Brazilian Portuguese // 3rd RST Brazilian Meeting. 2011.</mixed-citation></citation-alternatives></ref><ref id="B32"><label>32.</label><citation-alternatives><mixed-citation xml:lang="en">Collovini S. et al. Summ-it: Um corpus anotado com informações discursivas visando à sumarização automática // Proceedings of TIL. 2007.</mixed-citation><mixed-citation xml:lang="ru">Collovini S. et al. Summ-it: Um corpus anotado com informações discursivas visando à sumarização automática // Proceedings of TIL. 2007.</mixed-citation></citation-alternatives></ref><ref id="B33"><label>33.</label><mixed-citation>Pardo T.A.S., Seno E.R.M. Rhetalho: um corpus de referência anotado retoricamente // Anais do V Encontro de Corpora. 2005. P. 24–25.</mixed-citation></ref><ref id="B34"><label>34.</label><mixed-citation>Pardo T.A.S., Nunes M.G.V. A construção de um corpus de textos científicos em português do Brasil e sua marcação retórica. 2003.</mixed-citation></ref><ref id="B35"><label>35.</label><mixed-citation>Stede M., Neumann A. Potsdam Commentary Corpus 2.0: Annotation for Discourse Research // LREC. 2014. P. 925–929.</mixed-citation></ref><ref id="B36"><label>36.</label><mixed-citation>Redeker G. et al. Multi-layer discourse annotation of a Dutch text corpus // LREC. 2012. V. 1. P. 2820–2825.</mixed-citation></ref><ref id="B37"><label>37.</label><mixed-citation>Iruskieta M. et al. The RST Basque TreeBank: an online search interface to check rhetorical relations // 4th workshop RST and discourse studies. 2013. P. 40–49.</mixed-citation></ref><ref id="B38"><label>38.</label><mixed-citation>Zeldes A. The GUM corpus: Creating multilayer resources in the classroom // Language Resources and Evaluation. 2017. V. 51. No 3. P. 
581–612.</mixed-citation></ref><ref id="B39"><label>39.</label><mixed-citation>Cao S. et al. The RST Spanish-Chinese Treebank // Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018). Santa Fe, New Mexico, USA. 2018. P. 156–166.</mixed-citation></ref><ref id="B40"><label>40.</label><mixed-citation>Li Y. et al. Building Chinese Discourse Corpus with Connective-driven Dependency Tree Structure // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar. 2014. P. 2105–2114.</mixed-citation></ref><ref id="B41"><label>41.</label><mixed-citation>Morey M., Muller P., Asher N. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT // Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark. 2017. P. 1319–1324.</mixed-citation></ref><ref id="B42"><label>42.</label><mixed-citation>Chistova E. et al. Classification models for RST discourse parsing of texts in Russian // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2019. P. 163–176.</mixed-citation></ref><ref id="B43"><label>43.</label><mixed-citation>Chistova E., Smirnov I. Discourse-aware text classification for argument mining // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2022. P. 93.</mixed-citation></ref><ref id="B44"><label>44.</label><mixed-citation>Wang Z., Hamza W., Florian R. Bilateral multi-perspective matching for natural language sentences // Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017. P. 4144–4150.</mixed-citation></ref><ref id="B45"><label>45.</label><mixed-citation>Zmitrovich D. et al. A family of pretrained transformer language models for Russian // arXiv preprint arXiv:2309.10931. 
2023.</mixed-citation></ref><ref id="B46"><label>46.</label><mixed-citation>Kuratov Y., Arkhipov M. Adaptation of deep bidirectional multilingual transformers for Russian language // Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. 2019. P. 333–339.</mixed-citation></ref><ref id="B47"><label>47.</label><mixed-citation>Chistova E. Bilingual Rhetorical Structure Parsing with Large Parallel Annotations // Findings of the Association for Computational Linguistics ACL 2024. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics. 2024. P. 9689–9706.</mixed-citation></ref><ref id="B48"><label>48.</label><mixed-citation>Liu Y. et al. RoBERTa: A robustly optimized BERT pretraining approach // arXiv preprint arXiv:1907.11692. 2019.</mixed-citation></ref><ref id="B49"><label>49.</label><mixed-citation>Liu Z., Shi K., Chen N. Multilingual Neural RST Discourse Parsing // Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics. 2020. P. 6730–6738.</mixed-citation></ref><ref id="B50"><label>50.</label><mixed-citation>Costa-jussà M.R. et al. No language left behind: Scaling human-centered machine translation // arXiv preprint arXiv:2207.04672. 2022.</mixed-citation></ref></ref-list></back></article>
