Vietnamese Text Classification Algorithm using Long Short Term Memory and Word2Vec

H. N. Phat; N. T. M. Anh

doi:10.15622/ia.2020.19.6.5

Authors: Phat H.N.¹, Anh N.T.M.¹
Affiliations:
1. Hanoi University of Science and Technology (HUST)
Issue: Vol 19, No 6 (2020)
Pages: 1255-1279
Section: Artificial intelligence, knowledge and data engineering
URL: https://journals.rcsi.science/2713-3192/article/view/266292
DOI: https://doi.org/10.15622/ia.2020.19.6.5
ID: 266292

Abstract

In the context of the ongoing fourth industrial revolution and the rapid development of computer science, the amount of textual information has become huge. Before applying seemingly appropriate methodologies and techniques to process such data, its nature and characteristics should be thoroughly analyzed and understood. Automatic text processing incorporated into existing systems can facilitate many procedures. Text classification is one of the basic applications of natural language processing, covering tasks such as sentiment analysis and subject labeling. Recent advances in deep learning networks suggest that such methods are well suited to document classification; for instance, they have proved effective for classifying texts in English. Within the scope of our study, however, we found little research on documents in Vietnamese. Where deep learning models have been developed for document classification, they have demonstrated certain improvements for Vietnamese texts. Therefore, we propose using a long short-term memory (LSTM) network with Word2Vec to classify text, aiming to improve both performance and accuracy. Compared with other traditional methods, the developed approach demonstrated somewhat better results at classifying texts in Vietnamese. Evaluation on Vietnamese datasets shows an accuracy of over 90%, and the proposed approach looks promising for real applications.
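
As a rough illustration of the kind of pipeline the abstract describes (word vectors fed into an LSTM, followed by a class prediction), the following is a minimal, untrained forward-pass sketch in plain Python. It is not the authors' implementation: the class name `TinyLSTMClassifier`, the layer sizes, the example vocabulary, and the random vectors standing in for pretrained Word2Vec embeddings are all assumptions made for illustration, and the tokens are assumed to be already word-segmented.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMClassifier:
    """Hypothetical sketch: embedding lookup -> LSTM -> softmax (no training)."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, vocab, embed_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.vocab = {word: i for i, word in enumerate(vocab)}
        # Assumption: random vectors stand in for pretrained Word2Vec vectors.
        self.embeddings = rand_matrix(len(vocab), embed_dim, rng)
        self.hidden_dim = hidden_dim
        # One weight matrix and bias per gate, applied to [h_prev ; x_t].
        in_dim = hidden_dim + embed_dim
        self.W = {g: rand_matrix(hidden_dim, in_dim, rng) for g in self.GATES}
        self.b = {g: [0.0] * hidden_dim for g in self.GATES}
        self.W_out = rand_matrix(num_classes, hidden_dim, rng)

    def step(self, x, h, c):
        """One standard LSTM time step; returns the new (h, c)."""
        z = h + x  # list concatenation: [h_prev ; x_t]
        pre = {
            g: [a + b for a, b in zip(matvec(self.W[g], z), self.b[g])]
            for g in self.GATES
        }
        i_gate = [sigmoid(v) for v in pre["input"]]
        f_gate = [sigmoid(v) for v in pre["forget"]]
        o_gate = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        c_new = [f * c_old + i * g
                 for f, c_old, i, g in zip(f_gate, c, i_gate, cand)]
        h_new = [o * math.tanh(cv) for o, cv in zip(o_gate, c_new)]
        return h_new, c_new

    def predict_proba(self, tokens):
        """Run the LSTM over the tokens and return class probabilities."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for tok in tokens:
            idx = self.vocab.get(tok)
            if idx is None:
                continue  # out-of-vocabulary tokens are skipped
            h, c = self.step(self.embeddings[idx], h, c)
        logits = matvec(self.W_out, h)
        m = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Example vocabulary of word-segmented tokens (illustrative only).
vocab = ["bóng_đá", "đội_tuyển", "kinh_tế", "thị_trường"]
model = TinyLSTMClassifier(vocab, embed_dim=8, hidden_dim=6, num_classes=3)
probs = model.predict_proba(["đội_tuyển", "bóng_đá"])
print(probs)
```

Because the weights are random and untrained, the printed probabilities carry no meaning; the sketch only shows how a token sequence flows through the embedding lookup, the LSTM gates, and the final softmax. In practice the embedding table would be filled from a trained Word2Vec model and the weights learned from labeled documents.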

Keywords

Word2Vec, Text Classification, Natural Language Processing, Data Processing, Long Short-Term Memory

About the authors

H. N Phat

Hanoi University of Science and Technology (HUST)

Author for correspondence.
Email: phat.nguyenhuu@hust.edu.vn
Dai Co Viet str. 1

N. T.M Anh

Hanoi University of Science and Technology (HUST)

Email: anh.ntm165774@sis.hust.edu.vn
Dai Co Viet str. 1
