Automatic Generation of Semantically Annotated Collocation Corpus
- Authors: Zaripova D.A., Lukashevich N.V.
- Issue: No 11 (2023)
- Pages: 113-125
- Section: Articles
- URL: https://journals.rcsi.science/2409-8698/article/view/380014
- DOI: https://doi.org/10.25136/2409-8698.2023.11.44007
- EDN: https://elibrary.ru/QRBQOI
- ID: 380014
Abstract
Word Sense Disambiguation (WSD) is a crucial initial step in automatic semantic analysis. It involves selecting the correct sense of an ambiguous word in a given context, which can be challenging even for human annotators. Supervised machine learning models require large sense-annotated datasets to be effective, yet manual sense labeling is costly, labor-intensive, and time-consuming. It is therefore important to develop and test automatic and semi-automatic methods of semantic annotation. Information about semantically related words, such as synonyms, hypernyms, and hyponyms, as well as the collocations in which a word appears, can be used for this purpose. In this article, we describe our approach to generating a semantically annotated collocation corpus for Russian, intended as a resource for improving the accuracy of WSD models for Russian, and outline the principles used to select collocations. To disambiguate words within collocations, we rely on semantically related words defined on the basis of RuWordNet; the same thesaurus serves as the source of sense inventories. The methods described in the paper yield an F1-score of 80% and allow approximately 23% of the collocations containing at least one ambiguous word to be added to the corpus. Automatically generated collocation corpora with semantic annotation can simplify the preparation of datasets for developing and testing WSD models and can also serve as a valuable source of information for knowledge-based WSD models.
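The abstract describes choosing senses for ambiguous words inside collocations on the basis of semantically related words (synonyms, hypernyms, hyponyms) drawn from RuWordNet. The sketch below illustrates one plausible reading of that idea under simplifying assumptions; the Sense class, the inventory dictionary, and the plain overlap score are illustrative stand-ins, not the authors' actual procedure or the RuWordNet API.

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    """One sense of a word: its synonyms plus lemmas of related synsets.

    In the paper the related words come from RuWordNet; here they are just sets.
    """
    synonyms: set = field(default_factory=set)
    hypernyms: set = field(default_factory=set)
    hyponyms: set = field(default_factory=set)

    def related_lemmas(self):
        # Pool all semantically related lemmas into a single set.
        return self.synonyms | self.hypernyms | self.hyponyms


def disambiguate_in_collocation(word, partner, inventory):
    """Pick the sense of `word` whose related lemmas overlap most with those
    of its collocation partner; return None when no sense is supported,
    in which case the collocation would stay unlabelled."""
    partner_related = set()
    for sense in inventory.get(partner, []):
        partner_related |= sense.related_lemmas()

    best_sense, best_score = None, 0
    for sense in inventory.get(word, []):
        score = len(sense.related_lemmas() & partner_related)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

The overlap count is only one possible scoring function; distributional similarity or thesaurus path measures could be substituted without changing the overall scheme.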
About the authors
Diana Aleksandrovna Zaripova
Email: diana.ser.sar96@gmail.com
ORCID iD: 0000-0003-1121-1420
Natal'ya Valentinovna Lukashevich
Email: louk_nat@mail.ru