Reducing the search space for optimal clustering parameters using a small amount of labeled data

Vitaly I. Yuferev; Юферев Виталий Иванович; Nikolai A. Razin; Разин Николай Алексеевич

doi:10.14357/20718594240109

Authors: Yuferev V.I.¹, Razin N.A.¹
Affiliations:
1. The Central Bank of the Russian Federation
Issue: No. 1 (2024)
Pages: 103-117
Section: Analysis of Textual and Graphical Information
URL: https://journals.rcsi.science/2071-8594/article/view/269794
DOI: https://doi.org/10.14357/20718594240109
EDN: https://elibrary.ru/WWPOLG
ID: 269794

Abstract

The paper presents a method for reducing the search space for optimal clustering parameters. The reduction is achieved by selecting the most appropriate data transformation methods and dissimilarity measures before the clustering itself is performed. To compare the candidate methods, the authors propose using the silhouette coefficient computed with the class labels of a small labeled data set in place of cluster labels. Results of an experimental evaluation of the approach on the clustering of news texts are presented.
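The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the two candidate dissimilarity measures, and the helper names (`labeled_silhouette`, `candidates`) are assumptions made for illustration, and the plain silhouette coefficient is used rather than any variant the full paper may define. Each candidate measure is scored by the silhouette coefficient computed on a handful of labeled points, with the class labels standing in for cluster labels; low-scoring candidates could then be excluded before any clustering is run.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_dissimilarity(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def labeled_silhouette(points, labels, dissimilarity):
    """Mean silhouette coefficient, treating class labels as cluster labels."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    if len(by_label) < 2:
        raise ValueError("at least two classes are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in by_label[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for a single-member class
            continue
        # a: mean dissimilarity to the other members of the same class
        a = sum(dissimilarity(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean dissimilarity to the members of any other class
        b = min(
            sum(dissimilarity(points[i], points[j]) for j in members) / len(members)
            for other, members in by_label.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labeled sample: direction encodes the class, magnitude varies.
points = [(1.0, 0.1), (5.0, 0.5), (10.0, 1.0), (0.1, 1.0), (0.5, 5.0), (1.0, 10.0)]
labels = ["economy", "economy", "economy", "sport", "sport", "sport"]

# Candidate dissimilarity measures to compare before clustering.
candidates = {"euclidean": euclidean, "cosine": cosine_dissimilarity}
scores = {
    name: labeled_silhouette(points, labels, measure)
    for name, measure in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

On this toy sample the cosine measure separates the two classes far better than the Euclidean distance, so a search over clustering parameters could be restricted to the cosine measure. Candidate data transformations could be compared the same way by applying each transformation to `points` before scoring.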

Keywords

clustering, parameter search, search space reduction, dissimilarity measures, machine learning

About the authors

Vitaly Yuferev

The Central Bank of the Russian Federation

Corresponding author.
Email: YuferevVI@cbr.ru

Consultant, Innovative Laboratory “Novosibirsk”, Department of Information Technologies

Russia, Moscow

Nikolai Razin

The Central Bank of the Russian Federation

Email: RazinNA@cbr.ru

Candidate of Physical and Mathematical Sciences, Head of the Center of Competence in Artificial Intelligence and Advanced Analytics, Data Management Department

Russia, Moscow

Supplementary files

1. JATS XML
2. Fig. 1. An iterative scheme for finding the best clustering process parameters
3. Fig. 2. The proposed scheme for finding the best parameters of the clustering process
4. Fig. 3. Distribution of news by the values of the "tag" field
5. Fig. 4. Distribution of news by their lengths in characters
6. Fig. 5. Dependence of AMI on the parameter "number of clusters" for clustering in different regions
7. Fig. 6. Dependence of SilCsoftmax(x) on the size of the data portion
8. Fig. 7. Dependence of the mean value and standard deviation of the Pearson correlation coefficient of the SilCsoftmax(x) estimate and the AMI estimate for "good" PR on the size of the dataset used to calculate SilCsoftmax(x)
9. Fig. 8. Dependence of the mean value and standard deviation of the Pearson correlation coefficient of the SilCsoftmax(x) estimate and the AMI estimate for "bad" PR on the size of the dataset used to calculate SilCsoftmax(x)

Note

* This article reflects the personal position of the authors. The content and results of this study should not be regarded, or quoted in any publication, as the official position of the Bank of Russia or as an indication of the official policy or decisions of the regulator. Any errors in this material are solely the authors'.
