Methods for Analyzing Rhetorical Structures in Russian-Language Texts
- Authors: Chistova E.V.
- Affiliation: Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences
- Issue: No. 4 (2024)
- Pages: 79-92
- Section: Analysis of Textual and Graphical Information
- URL: https://journals.rcsi.science/2071-8594/article/view/278297
- DOI: https://doi.org/10.14357/20718594240407
- EDN: https://elibrary.ru/DDBAJC
- ID: 278297
Abstract
This paper reviews the experience of building automatic discourse parsers for Russian within the framework of Rhetorical Structure Theory (RST). It examines how well various pretrained language model encoders suit rhetorical parsing, using two Russian-language corpora. It proposes a method for training neural rhetorical structure parsers on a mixture of any expert RST-annotated data, so that training does not depend on differences between the rhetorical relation inventories adopted in those datasets. The method is evaluated on two large multi-genre RST-annotated corpora for Russian.
About the authors
Elena Viktorovna Chistova
Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences
Corresponding author.
Email: chistova@isa.ru
Junior Researcher
Russia, Moscow
