
Predicting Rotten or Fresh Reviews with Naive Bayes Classifier


E-commerce is built on the strength of product reviews. Reviews are crucial for making informed choices, whether you're picking an Amazon product to buy or a new restaurant to try. For a person, though, reading through thousands of reviews can be overwhelming. Thankfully, machine learning algorithms can help us make sense of massive datasets and predict the sentiment of reviews. In this blog post, we will look at how to use the Naive Bayes Classifier to determine whether a movie review is fresh or rotten.



 


Naive Bayes Classifier


The Naive Bayes classifier is a popular machine learning algorithm for classification tasks. It is founded on Bayes' theorem, a cornerstone of probability theory. Bayes' theorem describes the probability of an event occurring based on evidence that is already known. In the context of machine learning, it is used to determine the probability that a data point belongs to a particular class, given its features.


The Naive Bayes classifier assumes that a data point's features are independent of one another, which makes computing the probabilities much simpler. Despite this simplification, Naive Bayes performs well on a variety of real-world classification tasks.
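Written out, Bayes' theorem for a class c given features x1, ..., xn, with the naive independence assumption applied to the likelihood term, looks like this:

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator is the same for every class, so the classifier simply picks the class that maximizes the numerator.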


Fig 1: Bayes' theorem (image: https://editor.analyticsvidhya.com/uploads/23385Capture6.PNG)

 


Data


We will use the Rotten Tomatoes movie reviews dataset. It contains critics' movie reviews, each labeled as either "fresh" or "rotten." Using this dataset, we aim to build a classifier that can determine whether a new review is fresh or rotten.


We will first combine the dataset and divide it into three sections: train, development, and test. The train set is used to train our model, the development set to fine-tune the hyperparameters, and the test set to measure the final accuracy of the model.
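As a rough sketch, such a split could be done with pandas and scikit-learn. The file name reviews.csv, the column names, and the 80/10/10 ratio are assumptions for illustration, not details from the original pipeline:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the combined Rotten Tomatoes reviews (file and column names are assumed)
df = pd.read_csv("reviews.csv")  # columns assumed: "review", "label" ("fresh"/"rotten")

# Carve out the test set first, then split the remainder into train and development.
train_dev, test = train_test_split(df, test_size=0.10, random_state=42)
train, dev = train_test_split(train_dev, test_size=0.111, random_state=42)  # ~10% of total

print(len(train), len(dev), len(test))
```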

The dataset can be accessed here.


My Colab and GitHub




 

Building a Vocabulary


Next, we will create a vocabulary list made up of every word in our dataset. To keep the vocabulary manageable, we will drop rare words that appear fewer than five times. We will also build a reverse index: a dictionary in which each word is a key and its value is the word's index in the vocabulary list.
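A minimal sketch of this vocabulary-building step, assuming tokenized_reviews is a list of token lists produced by the preprocessing described below:

```python
from collections import Counter

# tokenized_reviews: list of token lists, one per review (placeholder input)
word_counts = Counter(token for review in tokenized_reviews for token in review)

# Keep only words that appear at least five times, as described above.
vocabulary = [word for word, count in word_counts.items() if count >= 5]

# Reverse index: word -> its position in the vocabulary list.
word_to_index = {word: i for i, word in enumerate(vocabulary)}
```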




 

To preprocess the data before calculating the probabilities, we can apply the following steps:


1) Tokenization: Convert each review into a list of tokens or words.


2) Stop Word Removal: Remove stop words, which are common words that do not carry much meaning such as "the", "and", "a", etc.


3) Stemming or Lemmatization: Convert words to their base form, such as "running" to "run", using either stemming or lemmatization.


Here's a sketch of how to implement these steps using the nltk library:
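```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required nltk resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(review: str) -> list[str]:
    # 1) Tokenization: split the review into lowercase word tokens
    tokens = nltk.word_tokenize(review.lower())
    # 2) Stop word removal: drop common words and non-alphabetic tokens
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # 3) Base form: lemmatize each token (stemmer.stem(t) is the stemming alternative)
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The actors were running through a surprisingly moving story."))
```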




 

Calculating Probabilities


The Naive Bayes Classifier estimates the probability that a document belongs to a given class by multiplying the probabilities of each word in the document given that class. For each word in our vocabulary, we will compute two probabilities: the probability that the word appears at all, and its conditional probability given the sentiment.


The probability that a word appears is the number of documents containing that word divided by the total number of documents. The conditional probability given the sentiment is the number of positive (or negative) review documents containing the word divided by the total number of positive (or negative) review documents.
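A sketch of how these smoothed conditional probabilities could be computed, with Laplace smoothing (alpha = 1) and reusing the vocabulary from the earlier sketch; train_docs is assumed to be a list of (token list, label) pairs:

```python
from collections import defaultdict

alpha = 1  # Laplace smoothing constant

# Count, per class, how many documents contain each vocabulary word.
doc_counts = {"fresh": defaultdict(int), "rotten": defaultdict(int)}
class_totals = {"fresh": 0, "rotten": 0}

for tokens, label in train_docs:  # train_docs: list of (tokens, label) pairs (assumed)
    class_totals[label] += 1
    for word in set(tokens):  # set(): count each word once per document
        doc_counts[label][word] += 1

def cond_prob(word: str, label: str) -> float:
    # P(word | class) with Laplace smoothing over the vocabulary
    return (doc_counts[label][word] + alpha) / (class_totals[label] + alpha * len(vocabulary))
```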




 

Calculating Accuracy


We will use the development dataset to measure our model's accuracy. The smoothing parameter alpha = 1 gave the best accuracy.
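A sketch of the evaluation loop, classifying each development review with log-probabilities (to avoid numerical underflow from multiplying many small numbers) and comparing against the gold labels; dev_docs is assumed to have the same (tokens, label) format as train_docs:

```python
import math

def predict(tokens: list[str]) -> str:
    total = sum(class_totals.values())
    scores = {}
    for label in ("fresh", "rotten"):
        # Start from the log prior, then add log-likelihoods of each known word
        score = math.log(class_totals[label] / total)
        for word in tokens:
            if word in word_to_index:  # ignore out-of-vocabulary words
                score += math.log(cond_prob(word, label))
        scores[label] = score
    return max(scores, key=scores.get)

correct = sum(predict(tokens) == label for tokens, label in dev_docs)
print("dev accuracy:", correct / len(dev_docs))
```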




As you can see in the results above, precision is the metric that matters most here, and it comes in at about 60% on the development set.

 

Top Predictive Words


We will determine the top 10 words for each class (fresh and rotten). To do this, we compute the likelihood of each word given the class on the test dataset. Because data preprocessing was limited by resource and time constraints, the lists below show the most relevant top 10 words we obtained.
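One way such lists could be extracted is to rank the vocabulary by the smoothed conditional probability of each word given the class, reusing cond_prob from the sketch above:

```python
for label in ("fresh", "rotten"):
    top10 = sorted(vocabulary, key=lambda w: cond_prob(w, label), reverse=True)[:10]
    print(label, top10)
```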



 

Conclusion and Contribution


In conclusion, the Naive Bayes Classifier is a straightforward yet effective text classification technique. In this blog post, we built a Naive Bayes Classifier from scratch using the Rotten Tomatoes reviews dataset. We first preprocessed the data by tokenizing the reviews, removing stop words, and stemming or lemmatizing. We then built a vocabulary of frequently occurring words and, using Laplace smoothing, computed the conditional probability of each word given the review's class (fresh or rotten). We also used batch processing to speed up computation. We evaluated the classifier on a development set to examine the effect of smoothing and to find the top predictive words for each class. Finally, we computed the final accuracy on the test set using the best hyperparameters.

 

References


The implementation of the Naive Bayes Classifier in this assignment is based on the following sources:



