Efficient natural language classification algorithm for detecting duplicate unsupervised features
- Authors: Altaf S.¹, Iqbal S.², Soomro M.³
- Affiliations:
  1. AUT University
  2. Pakistan Space and Upper Atmosphere Research Commission (SUPARCO), Pakistan
  3. Manukau Institute of Technology
- Issue: Vol 20, No 3 (2021)
- Pages: 623-653
- Section: Artificial intelligence, knowledge and data engineering
- URL: https://journals.rcsi.science/2713-3192/article/view/266316
- DOI: https://doi.org/10.15622/ia.2021.3.5
- ID: 266316
Abstract
This paper focuses on capturing the meaning of Natural Language Understanding (NLU) text features in order to detect duplicate unsupervised features. The NLU features are compared with lexical approaches to determine the most suitable classification technique. A transfer-learning approach is used to train feature extraction on the Semantic Textual Similarity (STS) task. All features are evaluated on two datasets: Bosch bug reports and Wikipedia articles. This study structures recent research efforts by comparing NLU concepts for representing text semantics and applying them to information retrieval (IR). The main contribution of this paper is a comparative study of semantic similarity measurements. The experimental results show that Term Frequency–Inverse Document Frequency (TF-IDF) features perform well on both datasets given a reasonable vocabulary size, and that a Bidirectional Long Short-Term Memory (BiLSTM) network can learn sentence structure to improve classification.
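The lexical baseline the abstract refers to can be illustrated with a minimal TF-IDF similarity sketch for flagging duplicate reports. This is not the paper's implementation: the tokenization, the plain log(N/df) weighting, and the toy bug-report titles below are all assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse dict of TF-IDF weights."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}   # plain idf = log(N / df)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "bug report" titles: the first two describe the same defect.
reports = [
    "app crashes on startup after the update".split(),
    "application crashes on startup following the update".split(),
    "battery drains quickly overnight".split(),
]
vecs = tfidf_vectors(reports)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A pair whose similarity exceeds a chosen threshold would be flagged as a candidate duplicate; the paper's point is that such purely lexical features miss paraphrases (e.g. "app" vs "application" contribute nothing here), which is where learned NLU representations such as BiLSTM encodings help.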
About the authors
S. Altaf
AUT University
Author for correspondence.
Email: saud@uaar.edu.pk
Maine Murry Road 1
S. Iqbal
Pakistan Space and Upper Atmosphere Research Commission (SUPARCO), Pakistan
Email: sofiaiqbal.suparco@gmail.com
Sector-H, DHA Phase II
M. Soomro
Manukau Institute of Technology
Email: MWASEEM@manukau.ac.nz
Newbury Street
