Two-step semantic clustering of embeddings as an alternative to LDA for infometric analysis of industry news.
- Authors: Konnikov E.A., Kryzhko D.A.
- Issue: No 3 (2025)
- Pages: 10-19
- Section: Articles
- URL: https://journals.rcsi.science/2454-0714/article/view/359336
- DOI: https://doi.org/10.7256/2454-0714.2025.3.75348
- EDN: https://elibrary.ru/HSKPLS
- ID: 359336
Abstract
The subject of the research is the development and validation of an alternative approach to topic modeling of texts aimed at overcoming the limitations of classical Latent Dirichlet Allocation (LDA). The object of the study is short Russian-language news texts about nuclear energy, collected in the "AtomicNews" corpus. The authors examine in detail such aspects of the problem as the impact of sparsity on topic-modeling quality, topic interpretability, and the limitations of fixing the number of topics a priori. Special attention is paid to the geometric interpretation of text semantics, in particular the mapping of lexical units into the space of pre-trained embeddings and their subsequent clustering to form thematic profiles of documents. The research focuses on a comparative analysis of the new method and LDA using coherence, perplexity, and topic-diversity metrics. The proposed approach aims to create an interpretable, computationally lightweight, and noise-resistant model suitable for online monitoring of news flows. The research methodology is based on a two-stage semantic smoothing process: embedding representation of lemmas with Sentence-BERT and agglomerative cosine clustering, followed by the application of K-means to the thematic profiles of documents. The scientific novelty of the study lies in the development and empirical justification of a topic-modeling scheme that replaces probabilistic word generation with geometric smoothing of embeddings. The proposed approach abandons the bag-of-words assumption and a fixed number of topics, forming the thematic coordinates of documents from density clusters in semantic space. This improves topic interpretability, reduces sensitivity to text sparsity, and avoids the collapse of topic distributions in short messages. Experiments on the "AtomicNews" corpus showed a statistically significant improvement over classical LDA: a 5% reduction in perplexity, a 0.15-point increase in topic coherence, and higher topic diversity. The method is also computationally efficient: the entire procedure runs in seconds on a CPU, making it suitable for resource-constrained environments. Thus, the transition from probabilistic decomposition to geometric analysis of embeddings represents a promising direction in topic modeling of industry texts.
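The abstract describes a two-step pipeline: lemma embeddings from a Sentence-BERT model are grouped by agglomerative clustering with cosine distance, each document is then re-expressed as a thematic profile over the resulting semantic clusters, and K-means over those profiles produces the final topics. The Python sketch below illustrates one plausible reading of that pipeline; the specific Sentence-BERT checkpoint, the cosine distance threshold, and the number of K-means clusters are illustrative assumptions, not values reported by the authors.

```python
# A minimal sketch of the two-step scheme outlined in the abstract.
# Assumptions (not taken from the paper): the Sentence-BERT checkpoint,
# distance_threshold, and n_topics are placeholders for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering, KMeans


def topic_profiles(docs_lemmas, distance_threshold=0.4, n_topics=8):
    """docs_lemmas: list of documents, each given as a list of lemmas."""
    # Vocabulary of unique lemmas across the corpus.
    vocab = sorted({lemma for doc in docs_lemmas for lemma in doc})

    # Embed lemmas with a Sentence-BERT model (checkpoint is an assumption).
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb = model.encode(vocab, normalize_embeddings=True)

    # Step 1: agglomerative clustering with cosine distance groups
    # semantically close lemmas into density clusters ("semantic smoothing").
    # Note: scikit-learn >= 1.2 uses the `metric` argument (older: `affinity`).
    agg = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=distance_threshold,
    )
    lemma_cluster = dict(zip(vocab, agg.fit_predict(emb)))
    n_clusters = len(set(lemma_cluster.values()))

    # Each document becomes a thematic profile: normalized counts of its
    # lemmas over the semantic clusters.
    profiles = np.zeros((len(docs_lemmas), n_clusters))
    for i, doc in enumerate(docs_lemmas):
        for lemma in doc:
            profiles[i, lemma_cluster[lemma]] += 1
        if profiles[i].sum() > 0:
            profiles[i] /= profiles[i].sum()

    # Step 2: K-means over the document profiles yields the final topics.
    topics = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(profiles)
    return profiles, topics
```

A call such as `profiles, topics = topic_profiles(lemmatized_docs)` would return a document-by-cluster matrix and a hard topic label per document; in the paper, output of this kind is compared against LDA using coherence, perplexity, and topic-diversity metrics.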
About the authors
Evgenii Aleksandrovich Konnikov
Email: konnikov.evgeniy@gmail.com
Darya Aleksandrovna Kryzhko
Email: darya.kryz@yandex.ru