Sentiment classification using BigQuery ML
BigQuery has a nice feature for creating and executing machine learning models standard SQL commands. This allows you to bring the models to data instead of the other way around. BigQuery ML supports different models starting from linear and logistic regression to deep neural networks. You can also upload your previously trained Tensorflow model and use BigQuery ML for making predictions.
I wanted to test how you use it for sentiment classification. This post walks you through the (simple) process of loading the data into BigQuery and creating a classification model. Let’s get started!
I used the IMDB dataset from Kaggle which contains 50,000 movie reviews and the sentiment (positive/negative) for them. This dataset is quite small, only ~65 MB, but with Biquery, there is no limit on the size of the dataset since Bigquery can analyze petabytes of data quickly.
You can download the dataset from Kaggle and upload it to BigQuery. The easiest way is to upload the CSV to Cloud Storage and create a BigQuery table that queries the Cloud Storage bucket.
Once we have the data in BiqQuery, let’s look at the data. I’m using modulo and hash operators to split the data into training and test sets and the following query returns a few examples from the training set.
We need to convert the text into a numerical format to feed it to our ML model. This can be done, for example, by extracting the top n-grams from the texts and then calculating the TF-IDF scores for them to find the best words for training. Another approach would be to use pre-trained language models, such as gnews-swivel, and convert the texts into embeddings. This is a bit more complex with BigQuery ML but might also yield better results since the embeddings encapsulate a lot of information.
But let’s go with n-grams now. BigQuery has ML.NGRAMS function which can easily convert the given word arrays into n-grams. The query below will first split the text into words and ML.NGRAM function will do the rest. The range is set as [1, 3] so it will output unigrams, bigrams and trigrams. Hence the query will output the 10,000 most common unigrams, bigrams, and trigrams in the training data. I also noticed some of the texts contained HTML elements so I’m removing them as well.
Some of the top n-grams were:
The most frequent words are not good enough since they don’t reveal anything about the importance of the words. There are different ways to find the features that correlate with the target the most but we’ll just use words with the highest TF-IDF scores without exploring any other possibilities. TF-IDF is short for term frequency–inverse document frequency, which is a numerical statistic that is intended to reflect how important a word is to a document in a corpus.
There are plenty of libraries for calculating TF-IDF scores but it’s easy to do with BigQuery as well. The following query is split into subqueries that do the following:
- Calculate the term frequencies in each text for all extracted n-grams
- Calculate the inverted term frequencies in the whole corpus
- Find the terms with the highest TF-IDF scores.
Here are the results. These terms will be the features of the model.
There are many different models to choose from but we will pick one of the simplest ones, logistic regression. The terms we extracted earlier will be the only features of the model. When creating a model with string array inputs, Bigquery ML performs “multi-hot encoding transformation”:
For all non-numerical ARRAY columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the ARRAY.
Not sure exactly what happens behind the scenes but I guess it must create a one-hot encoded vector whose dimension is the total number of unique terms in the given array.
Creating the actual model with BigQuery is easy. Just add CREATE OR REPLACE MODEL <MODEL_NAME> … AS in front of your query which loads the data. But remember that BigQuery ML will treat all the columns as features so your query should not return anything else since otherwise it will be used for training the model.
Training duration depends on the amount of data you have but with a small dataset of 50k records, it didn’t take more than a couple of minutes. You can see the training progress directly from Bigquery UI:
We also instructed Bigquery to use 10% of the data for validation by setting the DATA_SPLIT_EVAL_FRACTION parameter to 0.1 earlier. BigQuery will then also show the basic validation metrics in the UI. You can see the loss, precision, recall, and confusion matrix, among others.
Only 80% of the data was used for training so we can use the rest for evaluating the model’s performance on data it hasn’t seen before. We can get the rest of the data by using the following WHERE clause:
WHERE MOD(ABS(FARM_FINGERPRINT(text)), 10) >= 8
The query for evaluating the model is the same as the query for training except the “CREATE OR REPLACE MODEL <MODEL_NAME>” is replaced by “SELECT * FROM ML.EVALUATE(MODEL <MODEL_NAME>”
The performance is even slightly better on test data so we could probably have trained it for a bit longer without risk of overfitting.
Quite cool, huh?