How to write a Naive Bayes classification algorithm in Ruby
Let's look at some real problems that can be solved with this algorithm:
- deciding whether a comment is positive or negative
- deciding whether an email is OK or spam
- sentiment analysis
- recommendation algorithms for products or movies
- fraud detection by determining the probability that a given transaction is a fraud based on transaction characteristics and payment history
How the algorithm works
The Naive Bayes algorithm uses Bayes' theorem to calculate the probability of a class given a set of features, treating the features as independent of one another (this is the "naive" part), and then chooses the class with the highest probability as the classification for the item. The algorithm can be used for various classification tasks, including spam filtering, sentiment analysis and text classification.
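Written out, the relationship looks roughly like this. The symbols are notation chosen for this sketch: C is a class, x_1 to x_n are the features of the item, and the product form relies on the independence assumption.

```latex
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}

P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)

\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The denominator is the same for every class, so it can be dropped when we only want to know which class wins.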
Here's a brief overview of the Bayes algorithm process:
- Collect a set of data that you want to classify. This data should be labelled, meaning that each piece of data has an associated category or class.
- Split the data into two sets: one for training and one for testing. The training set will be used to train the algorithm, while the test set will be used to evaluate the accuracy of the algorithm.
- Calculate the probability of each class based on the frequency of occurrence in the training set. This is known as the prior probability and is used in the Bayesian computation.
- Calculate the probability of each feature (e.g. a word in a document) given each class. This is the probability of observing a feature given a class and is calculated by dividing the number of times the feature appears in a class by the total number of features in that class.
- Use the prior probability and the feature probabilities (the likelihoods) to calculate the posterior probability of each class. The posterior probability represents the probability of a class given the features of a test item.
- Use the posterior probabilities to classify the test items by choosing the class with the highest probability.
- Evaluate the accuracy of the algorithm by comparing the predicted classes to the actual classes of the test items.
Note that this is a simplified overview of the Bayes algorithm; there are many variations and nuances depending on the specific use case and implementation.
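The splitting and evaluation steps from the overview can be sketched in a few lines of Ruby. This is only an illustration: `split_dataset` and `accuracy` are hypothetical helper names, the example titles are made up, and the "classifier" is a stand-in that simply looks for the word "free".

```ruby
# Hypothetical helpers for illustration; not part of the implementation below.

# Shuffle labelled examples and split them into training and test sets.
# `examples` is an array of [text, label] pairs; `ratio` is the training share.
def split_dataset(examples, ratio: 0.8, seed: 42)
  shuffled = examples.shuffle(random: Random.new(seed))
  cut = (shuffled.size * ratio).round
  [shuffled.first(cut), shuffled.drop(cut)]
end

# Fraction of test examples whose predicted label matches the actual label.
# The block receives the text and must return the predicted label.
def accuracy(test_set)
  return 0.0 if test_set.empty?

  correct = test_set.count { |text, label| yield(text) == label }
  correct.to_f / test_set.size
end

# Made-up labelled data.
examples = [
  ["win a free prize", :spam],
  ["meeting at noon", :ok],
  ["free money now", :spam],
  ["lunch tomorrow", :ok],
  ["project status update", :ok]
]

training_set, testing_set = split_dataset(examples, ratio: 0.6)

# Stand-in classifier: flags anything containing "free" as spam.
score = accuracy(testing_set) { |text| text.include?("free") ? :spam : :ok }
puts "train: #{training_set.size}, test: #{testing_set.size}, accuracy: #{score}"
```

In a real run the block passed to `accuracy` would call the trained classifier instead of the stand-in.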
Sample implementation in Ruby
Let's build a simple implementation of categorization analysis: for example, detecting an email's category (spam or OK) based on its title.
LineParser is a simple class that takes the text as input and returns the processed text: it removes every character that is neither alphanumeric nor whitespace, converts the text to lower case, and strips any leading or trailing whitespace.
DataLoader is a class that allows you to add and store data in a hash. The data consists of categories and words, and for each category the frequency of each word is stored. It also keeps a list of unique categories. It's easy to modify this class to retrieve and store the data in a database rather than in memory.
CategorizerAnalysis is a class that uses the sample data from the DataLoader class and the test line to perform a categorization analysis. The class performs a Naive Bayes analysis by calculating the probability of each category given the test data and normalizing the results. The class outputs a hash containing the probability of each category and the winning category, i.e. the category with the highest probability.
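Here is a minimal sketch of how three such classes could look. It is put together from the descriptions above only, so treat the details as assumptions: the method names `parse`, `add` and `analyse`, the `line_counts` counter used for the prior, the add-one (Laplace) smoothing, and the training titles are all made up for illustration.

```ruby
# Hypothetical sketch based on the class descriptions; names are assumptions.

class LineParser
  # Lowercase the text, drop characters that are neither alphanumeric nor
  # whitespace, and trim leading/trailing whitespace.
  def self.parse(text)
    text.downcase.gsub(/[^a-z0-9\s]/, "").strip
  end
end

class DataLoader
  attr_reader :data, :categories, :line_counts

  def initialize
    @data = Hash.new { |hash, key| hash[key] = Hash.new(0) } # category => { word => count }
    @categories = []
    @line_counts = Hash.new(0) # assumption: lines per category, used for the prior
  end

  def add(category, line)
    @categories << category unless @categories.include?(category)
    @line_counts[category] += 1
    LineParser.parse(line).split.each { |word| @data[category][word] += 1 }
  end
end

class CategorizerAnalysis
  def initialize(loader)
    @loader = loader
  end

  # Returns the normalized probability of each category and the winner.
  def analyse(line)
    raise ArgumentError, "no training data loaded" if @loader.categories.empty?

    words = LineParser.parse(line).split
    vocabulary_size = @loader.data.values.flat_map(&:keys).uniq.size
    scores = @loader.categories.to_h { |c| [c, score(c, words, vocabulary_size)] }
    total = scores.values.sum
    probabilities = scores.transform_values { |s| s / total }
    { probabilities: probabilities, winner: probabilities.max_by { |_, prob| prob }.first }
  end

  private

  # Prior times the product of smoothed word likelihoods for one category.
  def score(category, words, vocabulary_size)
    counts = @loader.data[category]
    total_words = counts.values.sum
    prior = @loader.line_counts[category].to_f / @loader.line_counts.values.sum
    words.reduce(prior) do |product, word|
      # Add-one smoothing (an assumption) keeps unseen words from zeroing the score.
      product * (counts[word] + 1.0) / (total_words + vocabulary_size)
    end
  end
end

# Made-up training titles.
loader = DataLoader.new
loader.add(:spam, "Win a FREE prize now!!!")
loader.add(:spam, "Free money, claim now")
loader.add(:ok, "Meeting notes for Monday")
loader.add(:ok, "Lunch on Monday?")

result = CategorizerAnalysis.new(loader).analyse("Claim your free prize")
puts result[:winner]
puts result[:probabilities]
```

Multiplying raw probabilities is fine for short titles. For longer texts, summing logarithms is the usual way to avoid numeric underflow.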
So, the algorithm is not difficult to understand. But let's see the results in practice. First, we need to provide some data to 'train' it.
Next, load the data into CategorizerAnalysis and run it!
Let's test the results for these sample email titles.
I was surprised by how accurately the titles were categorized and by how sensible the proportions between the categories were. The results look pretty cool given the simplicity of the algorithm. I highly recommend it for simple things.
Full gist can be found here.
Happy coding!