Corpora and corpus-based studies of the languages of the Russian Federation
- Authors: Davidyuk T.I. [1, 2], Kibrik A.A. [1, 2], Mordashova D.D. [1]
- Affiliations:
  - [1] Institute of Linguistics of the Russian Academy of Sciences
  - [2] Lomonosov Moscow State University
- Issue: Vol 94, No 9 (2024)
- Pages: 804-813
- Section: From the Rostrum of the RAS Presidium
- URL: https://journals.rcsi.science/0869-5873/article/view/268314
- DOI: https://doi.org/10.31857/S0869587324090039
- EDN: https://elibrary.ru/FCFZQL
- ID: 268314
Abstract
The article describes corpus resources for the languages of Russia and their use in linguistic research. The country's linguistic diversity is substantial: 155 languages are currently identified as languages of Russia. Many of them are threatened with extinction, which makes corpus creation particularly relevant as a tool for language preservation. For this study we surveyed the staff of the Institute of Linguistics of the Russian Academy of Sciences and other colleagues, which allowed us to collect data on 73 corpus resources representing various languages and dialects of Russia. The sample covers both major languages and languages with relatively few speakers, including unwritten languages. The article examines the parameters along which corpora may differ and gives examples of research based on corpus materials. The final part of the article discusses the organizational aspects of creating and maintaining corpus resources. The results suggest that corpus resources not only play an important role in preserving the linguistic diversity of Russia but also serve as a valuable tool for a range of research tasks and for creating other language resources.
About the authors
T. I. Davidyuk
Institute of Linguistics of the Russian Academy of Sciences; Lomonosov Moscow State University
Author for correspondence.
Email: davidyuk@iling-ran.ru
Junior Researcher; Programmer, PhD Student
Russian Federation, Moscow; Moscow
A. A. Kibrik
Institute of Linguistics of the Russian Academy of Sciences; Lomonosov Moscow State University
Email: aakibrik@iling-ran.ru
Doctor of Philology, Director, Head of the Department of Typology and Areal Linguistics; Professor
Russian Federation, Moscow; Moscow
D. D. Mordashova
Institute of Linguistics of the Russian Academy of Sciences
Email: d.mordashova@iling-ran.ru
Junior Researcher
Russian Federation, Moscow
References
- Koryakov Yu.B., Davidyuk T.I., Haritonov V.S., Evstigneeva A.P., Syuryun A.A. A list of languages of Russia and their vitality statuses. Preprint. Moscow: Institute of Linguistics RAS, 2023. http://jazykirf.iling-ran.ru/(2023)_Spisok_jazykov_Rossii_Monograph.pdf (accessed 25.05.2024).
- The Routledge handbook of corpus linguistics / Ed. by A. O’Keeffe, M.J. McCarthy. Abingdon, New York: Routledge, 2021.
- Kibrik A.A. A program for the preservation and revitalization of the languages of Russia // Russian Journal of Linguistics. 2021, vol. 25, no. 2, pp. 507–527.
- Linguistic diversity of Russia and opportunities for its preservation / Ed. by E.Yu. Gruzdeva, A.A. Syuryun. Preprint. Moscow: Institute of Linguistics, Russian Academy of Sciences, 2023. https://iling-ran.ru/library/revitalization/gruzdeva_et_al_language_diversity_2023.pdf (accessed 25.05.2024).
- Gatbonton E., Pelczer I., Cook C., Venkatesh V., Nochasak C., Andersen H. A pedagogical corpus to support a language teaching curriculum to revitalize an endangered language: the case of Labrador Inuttitut // International Journal of Computer-Assisted Language Learning and Teaching. 2015, no. 5(4), pp. 16–36.
- Sichinava D.V. On parallel texts within the Russian National Corpus: new languages and new challenges // Proceedings of the V.V. Vinogradov Russian Language Institute. 2019, no. 21, pp. 41–60.
- Arkhangelsky T.A. The corpus platform Tsakorpus and the languages of Russia // Electronic Writing Systems of the Peoples of the Russian Federation – 2021 and IWCLUL 2021. Proceedings of the International Scientific and Practical Conference, Syktyvkar, September 23–24, 2021. Syktyvkar: Komi Republic Academy of Public Administration and Management, 2022. P. 23–24.
- Bright W. Contextualizing a grammar // Perspectives on grammar writing / Ed. by Th. Payne, D. Weber. Amsterdam: John Benjamins, 2007. P. 11–17.
- Mosel U. Corpus linguistic and documentary approaches in writing a grammar of a previously undescribed language // The Art and Practice of Grammar Writing (LD&C Special Publication 8) / Ed. by T. Nakayama, K. Rice. 2014. P. 135–157.
- Bachaeva S.E. Lexical collocations of adjectives denoting the small size (based on the materials of the National Corpus of the Kalmyk language) // DSPU Journal. 2016, vol. 10, no. 4, pp. 42–47.
- Khanina O.V. Advantages of digital technologies: a description of front vowels allophones, of a glottal stop, and of verbal object cross-reference in Enets // Ural-Altaic Studies. 2017, no. 3(26), pp. 186–207.
- Serdobolskaya N. A corpus analysis of differential object marking in Beserman Udmurt // Linguistica Uralica. 2020, vol. 56, no. 4, pp. 275–308.
- Russkih A.A., Oskolskaya S.A. Additive particle in Turkic Languages of the Volga-Kama Sprachbund // Oriental Studies. 2021, vol. 14, no. 6, pp. 1324–1352.
- Ganenkov D.S. A corpus-based study of infinitive constructions in Lezgian // Acta Linguistica Petropolitana. Transactions of the Institute for Linguistic Studies. 2016, vol. 12, part 1, pp. 310–322.
- Plungian V.A. The parallel corpus as a grammar database and the New Testament as a parallel corpus (Preface) // Acta Linguistica Petropolitana. Transactions of the Institute for Linguistic Studies. 2023, vol. 19, part 3, pp. 15–38.
- Burkova S.I., Filimonova E.V. Reduplication in Russian sign language // Russian Language and Linguistic Theory. 2014, no. 2(28), pp. 202–258.
- Burkova S.I. The ways of expressing nominal plurality in the Russian sign language // Siberian Journal of Philology. 2015, no. 2, pp. 174–184.
- Dybo A.V., Krylov Ph.S., Maltseva V.S., Sheimovich A.V. Segmental rules in the automatic parser for the Khakas corpus // Ural-Altaic Studies. 2019, no. 1(32), pp. 48–69.
- Dybo A.V., Maltseva V.S., Sultrekova E.V., Sheimovich A.V., Krylov Ph.S. The structure of the Khakas word form and restrictions on the compatibility of affixes in the automatic parser for the Khakas language // Ural-Altaic Studies. 2023, no. 2(49), pp. 42–75.
- Khusainov A.F., Suleymanov D.Sh. Overview of speech corpora and software for the Tatar speech synthesis // Speech Technology. 2020, no. 1, pp. 63–72.
- Sabantsev G.L., Chemyshev A.V. Yandex.Translate and the languages of Russia // Electronic Writing Systems of the Peoples of the Russian Federation – 2021 and IWCLUL 2021. Proceedings of the International Scientific and Practical Conference, Syktyvkar, September 23–24, 2021. Syktyvkar: Komi Republic Academy of Public Administration and Management, 2022. P. 178–181.
- Forker D., Gadzhimuradov G.A. Sanzhi tales and stories. With Sanzhi-Russian and Russian-Sanzhi dictionaries. Makhachkala: A4 Printing House, 2017.
- Tulumbaev V.Z. Corpus linguistics technologies in teaching Bashkir // Modern Problems and Prospects of Natural Sciences Development. Proceedings of a National Scientific and Practical Conference. Ufa, June 8–9, 2020. Ufa: Bashkir State Pedagogical University named after M. Akmulla, 2020. P. 309–312.
- Kibrik A.A., Maisak T.A. Discourse transcription rules for descriptive and documentary studies // Rhema. 2021, no. 2, pp. 23–45.
- Baranov A.N. Introduction to applied linguistics. Moscow: Editorial URSS, 2001.