DEEP LEARNING
Deep Learning for Text and Sequences
Introduction
• Many software artifacts are in the form of text or sequences.
• Recurrent Neural Networks (RNNs) are designed to deal with textual inputs and inputs in sequence.
  – SentiStrength [1] predicts positive or negative sentiment for informal English text.
  – The predicted sentiment has been further used to extract problematic API features [2].
  – A recent study [3] showed that SentiStrength achieved recall and precision lower than 40% on negative sentences.

[1] Thelwall et al. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology.
[2] Zhang et al. 2013. Extracting problematic API features from forum discussions. In 21st International Conference on Program Comprehension (ICPC).
[3] Lin et al. 2018. Sentiment Analysis for Software Engineering: How Far Can We Go? In Proceedings of the 40th International Conference on Software Engineering (ICSE).
Introduction
• Text sequence → word embeddings → RNN model → prediction (see the code sketch below)
• Word embeddings are a dominant factor in model accuracy in RNN applications [4, 5, 6].
• The same ML model using different word embeddings can have divergent accuracy, ranging from 62.95% to 88.90% [5].
• This motivates focusing on a type of bug in which problematic word embeddings lead to suboptimal model accuracy.

[4] Baroni et al. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL).
[5] Schnabel et al. 2015. Evaluation methods for unsupervised word embeddings. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
[6] Yu et al. 2017. Refining word embeddings for sentiment analysis. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
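To make the pipeline on this slide concrete, here is a minimal sketch in Keras. It is not code from the slides: VOCAB_SIZE, EMBED_DIM, and the SimpleRNN layer are illustrative assumptions (LSTM or GRU are common alternatives), and the Embedding layer is where pretrained word embeddings such as word2vec or GloVe would be loaded as initial weights.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # assumed vocabulary size (illustrative)
EMBED_DIM = 100      # assumed embedding dimension (illustrative)

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),  # a batch of token-id sequences
    # Maps each token id to a dense vector; pretrained embeddings
    # (e.g., word2vec or GloVe) can be loaded here as initial weights.
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # A plain recurrent layer; LSTM or GRU are common drop-in alternatives.
    layers.SimpleRNN(64),
    # Binary prediction head, e.g., positive vs. negative sentiment.
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy forward pass: 2 sequences of 20 token ids -> 2 sentiment scores.
dummy = np.random.randint(0, VOCAB_SIZE, size=(2, 20))
print(model.predict(dummy, verbose=0).shape)  # (2, 1)
```

Because the Embedding layer is just a lookup table of per-word vectors, swapping in a different set of pretrained embeddings changes the model's inputs without changing its architecture, which is how the same ML model can reach very different accuracy depending on the embeddings used.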
Introduction
• Text sequence → word embeddings → RNN model → prediction
• The quality of word embeddings can be measured using neighboring words (see the sketch below).

[Figure: the nearest words of a target word, measured by cosine similarity, under the original embeddings and under the regulated embeddings (produced by our tool)]
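A common way to inspect embedding quality, as the slide suggests, is to list a target word's nearest neighbors under cosine similarity. The helper below is a hypothetical sketch, not the tool referenced on the slide; `nearest_words` and the toy vectors are assumptions for illustration, and real embeddings would come from a trained model such as word2vec or GloVe.

```python
import numpy as np

def nearest_words(target, embeddings, k=5):
    """Return the k words whose embeddings are most cosine-similar to `target`.

    embeddings: dict mapping word -> 1-D numpy vector
    """
    t = embeddings[target]
    t = t / np.linalg.norm(t)
    scores = []
    for word, vec in embeddings.items():
        if word == target:
            continue
        sim = float(np.dot(t, vec / np.linalg.norm(vec)))  # cosine similarity
        scores.append((sim, word))
    return [word for _, word in sorted(scores, reverse=True)[:k]]

# Toy usage: with good embeddings, semantically similar words rank first.
emb = {
    "good": np.array([1.0, 0.2]),
    "great": np.array([0.9, 0.3]),
    "bad": np.array([-1.0, 0.1]),
}
print(nearest_words("good", emb, k=2))  # ['great', 'bad']
```

Comparing such neighbor lists before and after regulation is a direct, qualitative check: if a target word's nearest neighbors become more semantically coherent, the regulated embeddings are likely of higher quality.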