Long Short-Term Memory Models

Background Information

Preprocessing the Text Data

The abundance of text data provides many opportunities for training NLP models. However, the unstructured nature of text data requires preprocessing. Lowercasing, spelling correction, punctuation removal, and stop word removal are typical preprocessing steps, and they can be implemented easily in Python using the NumPy, pandas, TextBlob, or NLTK libraries. Tokenization, stemming, and lemmatization are then applied to convert the raw text into smaller units while removing redundancy. Tokenization splits the stream of text into words, stemming extracts the root of a word, and, similarly, lemmatization reduces a word to its base form. These steps can likewise be implemented with the TextBlob or NLTK libraries. The data used in this study are taken from the public IMDB dataset, which contains 50,000 binary-labeled reviews. To build the sentiment analysis (SA) models, 17,500 reviews are used for training, 7,500 for validation, and 25,000 for testing.
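As a concrete illustration, the sketch below strings these preprocessing steps together with NLTK. The sample review, the downloaded resource names, and the choice of the Porter stemmer and WordNet lemmatizer are assumptions for the sketch, not the exact pipeline used in the study.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(review):
    """Lowercase, tokenize, drop punctuation and stop words, then lemmatize."""
    tokens = word_tokenize(review.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

sample = "The actors were great, but the plot kept running in circles!"
print(preprocess(sample))                             # lemmatized tokens
print([stemmer.stem(t) for t in preprocess(sample)])  # stemmed variants
```

The same cleaning could be done with TextBlob; NLTK is shown here only because it exposes each step explicitly.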


Feature Engineering

After preprocessing, feature extraction is applied to transform words or characters into a format that computers can process. This step includes vectorization: each word is mapped to a vector for further processing, using feature engineering methods such as N-grams, count vectorization, and term frequency-inverse document frequency (TF-IDF). Word embedding methods were developed to capture the semantic relations between words, so that words with similar meanings receive related representations. The Word2vec and fastText frameworks train such embeddings, with Skip-Gram and Continuous Bag of Words (CBOW) being the commonly used models. This study uses pre-trained GloVe word embeddings to improve the performance of the sentiment analysis models.
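To illustrate how pre-trained GloVe vectors are typically wired into an LSTM pipeline, the sketch below builds an embedding matrix from a GloVe file using Keras utilities. The file path, 100-dimensional vectors, placeholder corpus, and sequence length are assumptions for the sketch, not details reported by the study.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

EMBEDDING_DIM = 100                # assumed GloVe vector dimensionality
GLOVE_PATH = "glove.6B.100d.txt"   # assumed local path to the GloVe file

# Fit a word index on the (preprocessed) training reviews.
train_texts = ["movie great acting", "plot dull boring"]  # placeholder corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=200)

# Read GloVe vectors into a dictionary: word -> 100-dim float vector.
glove = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        glove[word] = np.asarray(values, dtype="float32")

# Build the embedding matrix; out-of-vocabulary words remain zero rows.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

# embedding_matrix can now initialise a Keras Embedding layer feeding the LSTM.
```

Initializing the embedding layer this way lets the model start from semantically meaningful word vectors instead of random weights, which is the motivation for using GloVe here.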