Two-step semantic clustering of embeddings as an alternative to LDA for infometric analysis of industry news.
- Authors: Konnikov E.A., Kryzhko D.A.
- Issue: No 3 (2025)
- Pages: 10-19
- Section: Articles
- URL: https://journals.rcsi.science/2454-0714/article/view/359336
- DOI: https://doi.org/10.7256/2454-0714.2025.3.75348
- EDN: https://elibrary.ru/HSKPLS
- ID: 359336
Abstract
The subject of the research is the development and validation of an alternative approach to topic modeling of texts aimed at overcoming the limitations of classical Latent Dirichlet Allocation (LDA). The object of the study is short Russian-language news texts about nuclear energy, collected in the "AtomicNews" corpus. The authors examine in detail such aspects of the problem as the impact of sparsity on topic-modeling quality, topic interpretability, and the limitations of fixing the number of topics a priori. Special attention is paid to the geometric interpretation of text semantics, in particular the mapping of lexical units into the space of pre-trained embeddings and their subsequent clustering to form thematic profiles of documents. The research focuses on a comparative analysis of the new method and LDA using coherence, perplexity, and topic-diversity metrics. The proposed approach aims to create an interpretable, computationally lightweight, and noise-resistant model suitable for online monitoring of news flows. The research methodology is based on a two-stage semantic smoothing process: embedding representation of lemmas with Sentence-BERT and agglomerative cosine clustering, followed by the application of K-means to the thematic profiles of documents. The scientific novelty of the study lies in the development and empirical justification of a topic-modeling scheme that replaces probabilistic word generation with geometric smoothing of embeddings. The proposed approach abandons the bag-of-words assumption and a fixed number of topics, forming the thematic coordinates of documents from density clusters in semantic space. This improves topic interpretability, reduces sensitivity to text sparsity, and avoids the collapse of topic distributions in short messages. Experiments on the "AtomicNews" corpus showed a statistically significant improvement over classical LDA: a 5% reduction in perplexity, a 0.15-point increase in topic coherence, and higher topic diversity. The method is also computationally efficient: the entire procedure runs in seconds on a CPU, making it suitable for resource-constrained environments. Thus, the transition from probabilistic decomposition to geometric analysis of embeddings represents a promising direction in topic modeling of industry texts.
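The abstract describes a two-step pipeline: lemma embeddings from a Sentence-BERT model are grouped by agglomerative clustering with cosine distance, each document is then re-expressed as a thematic profile over the resulting semantic clusters, and K-means over those profiles produces the final topics. The Python sketch below illustrates one plausible reading of that pipeline; the specific Sentence-BERT checkpoint, the cosine distance threshold, and the number of K-means clusters are illustrative assumptions, not values reported by the authors.

```python
# A minimal sketch of the two-step scheme outlined in the abstract.
# Assumptions (not taken from the paper): the Sentence-BERT checkpoint,
# distance_threshold, and n_topics are placeholders for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering, KMeans


def topic_profiles(docs_lemmas, distance_threshold=0.4, n_topics=8):
    """docs_lemmas: list of documents, each given as a list of lemmas."""
    # Vocabulary of unique lemmas across the corpus.
    vocab = sorted({lemma for doc in docs_lemmas for lemma in doc})

    # Embed lemmas with a Sentence-BERT model (checkpoint is an assumption).
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb = model.encode(vocab, normalize_embeddings=True)

    # Step 1: agglomerative clustering with cosine distance groups
    # semantically close lemmas into density clusters ("semantic smoothing").
    # Note: scikit-learn >= 1.2 uses the `metric` argument (older: `affinity`).
    agg = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=distance_threshold,
    )
    lemma_cluster = dict(zip(vocab, agg.fit_predict(emb)))
    n_clusters = len(set(lemma_cluster.values()))

    # Each document becomes a thematic profile: normalized counts of its
    # lemmas over the semantic clusters.
    profiles = np.zeros((len(docs_lemmas), n_clusters))
    for i, doc in enumerate(docs_lemmas):
        for lemma in doc:
            profiles[i, lemma_cluster[lemma]] += 1
        if profiles[i].sum() > 0:
            profiles[i] /= profiles[i].sum()

    # Step 2: K-means over the document profiles yields the final topics.
    topics = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(profiles)
    return profiles, topics
```

A call such as `profiles, topics = topic_profiles(lemmatized_docs)` would return a document-by-cluster matrix and a hard topic label per document; in the paper, output of this kind is compared against LDA using coherence, perplexity, and topic-diversity metrics.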
About the authors
Evgenii Aleksandrovich Konnikov
Email: konnikov.evgeniy@gmail.com
Darya Aleksandrovna Kryzhko
Email: darya.kryz@yandex.ru