An Unsupervised Learning Embedding Method Based on Semantic Hashing

Document Type: Research Paper


1 Faculty of Computer Engineering and Information Technology, Sadjad University, Mashhad, Iran.

2 Faculty of Electrical and Computer Engineering, Semnan University, Semnan, Iran.


Embedding learning is a central problem in Natural Language Processing (NLP) applications. Most existing methods measure the similarity between text chunks in context using pre-trained word embeddings. However, providing labeled data for model training is costly and time-consuming, so these methods suffer degraded performance when only limited training data are available. This paper presents an unsupervised sentence embedding method that integrates semantic hashing into Kernel Principal Component Analysis (KPCA) to construct lower-dimensional embeddings applicable to any domain. Experiments on benchmark datasets show that the generated embeddings are general-purpose and capture semantic meaning from both small and large corpora.
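The two building blocks the abstract names can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `simhash_signature` helper, the toy corpus, and all dimensions (50-d word vectors, 64-bit signatures, 8 output components) are assumptions. A SimHash-style locality-sensitive hash turns a bag of word vectors into a binary signature, and KPCA then compresses those signatures into dense, low-dimensional sentence embeddings.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def simhash_signature(token_vectors: np.ndarray,
                      n_bits: int = 64, seed: int = 0) -> np.ndarray:
    """SimHash-style signature (hypothetical helper): project the summed
    token vectors onto n_bits random hyperplanes and keep only the signs."""
    rng = np.random.default_rng(seed)  # fixed seed: same hyperplanes for every sentence
    planes = rng.standard_normal((token_vectors.shape[1], n_bits))
    projected = token_vectors.sum(axis=0) @ planes  # one value per hyperplane
    return (projected >= 0).astype(np.float64)      # binary 0/1 code of length n_bits

# Toy corpus: 12 "sentences", each a bag of 5 random 50-d word vectors.
rng = np.random.default_rng(42)
signatures = np.stack(
    [simhash_signature(rng.standard_normal((5, 50))) for _ in range(12)]
)

# KPCA maps the binary signatures into a dense lower-dimensional space,
# yielding one 8-d embedding per sentence.
kpca = KernelPCA(n_components=8, kernel="rbf")
embeddings = kpca.fit_transform(signatures)
print(embeddings.shape)  # (12, 8)
```

Because nearby bags of vectors tend to fall on the same side of most random hyperplanes, similar sentences receive similar signatures before the KPCA step, which is what lets the final embeddings be built without any labeled data.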

