MULTI LABEL CLASSIFICATION WITH NLTK¶
In this tutorial, I will show you how to predict tags for a text. In this post, we will build a multi-label model thatās capable of detecting different types of toxicity in a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
The data set used can be downloaded at Kaggle . there is a disclaimer on the data set that the dataset contains texts that may be considered profane, vulgar, or offensive. this is applicable to this tutorial as well.
I would like to acknowledge the National Research University Higher School of Economics of Moscow for most of my codes borrowed from their github. I would also like to acknowledge Susan Li for her article from which I borrowed the codes used to do the analysis of the tags. Susan Li’s Article.
To solve this task you will use multilabel classification approach.
Libraries and file used¶
In this task I will need the following libraries:
- Metrics file used to plot the roc_au curves.
- Numpy ā a package for scientific computing.
- Pandas ā a library providing high-performance, easy-to-use data structures and data analysis tools for the Python
- scikit-learn ā a tool for data mining and data analysis.
- NLTK ā a platform to work with natural language.
Let’s first download the list of stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
The data set contains text data and corresponding tags. for easy operation on the data, let’s first load pandas and numpy that we will to structure our data and do operations on the data. let’s also deploy matplotlib that we will use for visualisation
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Let’s create a function and call it read_data that we can use to load data. I’ve already split the data into training data set and test data set that I then saved as Pickle file. you can see the whole process Here.
def read_data(filename):
data = pd.read_pickle(filename)
return data
train = read_data('data/train.pkl')
test = read_data('data/test.pkl')
Now that our data is deployed, let’s see some statics about our data.
train.describe()
Let’s now print a snapshot of our data to see the structure
train.head()
Now that we have a slight idea of what our data contains. let’s first split the training set into train and validation set. the validation set will help us to verify whether our model is learning before we can fit the model to the test set. this is crucial since we should not use the test set to adjust the learning parameters of our models. doing so will leak some features of the test set and our model will fit well the test set and might fail to generalise to unknow data. I’ve chosen to use a sample of 10% of the training set as the validation data set.
from sklearn.model_selection import train_test_split
train, validation = train_test_split(train, random_state=42, test_size=0.1, shuffle=True)
Now that we have split the train and validation sets, let’s explore our data and find out whether the split operation haven’t altered the distribution of our data. we will first examine whether the mean and standard deviation of our training data has remained in the same range.
train.describe()
We can see that the mean and std for all our class columns have remained in the same range. this a first good sign. However we need to explore deeper and try to understand whether the validation set represents the sample population of our training. to do that we will explore the mean and std and then we will check how each individual class is distributed in both the training set and the validation set
validation.describe()
We can see that our validation set is indeed doing a good job in representing our sample population. Good, but let’s see if it is really good as it shows. let’s now list how many comments are tagged by each class(tag). I will call this tags a category. the rule is that the ratio between the classes will remain almost the same on both data sets.
def count_labels_per_category(df):
df_toxic = df.drop(['id', 'comment_text'], axis=1)
counts = []
categories = list(df_toxic.columns.values)
for i in categories:
counts.append((i, df_toxic[i].sum()))
df_stats = pd.DataFrame(counts, columns=['category', 'number_of_comments'])
return df_stats
Let’s display the count of comments tagged with each category on the training set.
df_stats_train = count_labels_per_category(train)
df_stats_train
let’s do the same with the validation set
df_stats_valid = count_labels_per_category(validation)
df_stats_valid
You can see that the ration of comments attributed for each tag have remained the same.
Let’s now plot this data since it is easier to see via graphs how comments are assigned to tags than to just read numbers.
def plot_count_labels_per_category(df_stats):
df_stats.plot(x='category', y='number_of_comments', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of comments per category")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('category', fontsize=12)
plot_count_labels_per_category(df_stats_train)
plot_count_labels_per_category(df_stats_valid)
Although we now know how many comments are assigned to each tag, we still lack the big picture since as we know in any multi label problem, one text can be assigned to multiple tags. so we need to understand how many comments have zero tags (or safe comments), how many comments have only one tag, how many comments assigned just two tags, etc…
import seaborn as sns
rowsums = train.iloc[:,2:].sum(axis=1)
x=rowsums.value_counts()
plt.figure(figsize=(8,5))
ax = sns.barplot(x.index, x.values)
plt.title("Multiple categories per comment")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of categories', fontsize=12)
This indeed just revealed something not surprising but important to know. most of our comments are safe and so they don’t have a tag. no comment has all tags at a time. looking at the graph we can see that our data set is sparse. so we need to keep this in mind when we evaluate our model. I will talk about this in the section of model evaluation. to understand how sparse our model is, let’s calculate the percentage of comments that have no tag on them.
print('Percentage of comments that are not labelled:')
print(len(train[(train['toxic']==0) & (train['severe_toxic']==0) & (train['obscene']==0) & (train['threat']== 0) & (train['insult']==0) & (train['identity_hate']==0)]) / len(train))
We can see that ~ 90% of our data are safe comments. during the evaluation, we need to keep this in mind so that we don’t think that our model is doing a good job when in reality it is just assigning all comments to a safe tag.
So far we have been looking at the distribution of the tags. what about the composition of our comments data? as a matter of curiosity, let see how big are texts in general
lens = train.comment_text.str.len()
lens.hist(bins = np.arange(0,5000,50))
The majority of our texts have between 200 and 300 characters. we can see few texts that have more than 1500 characters.
Let’s now see if no comment is empty. in which we could delete it
print('Number of missing comments in comment text on Training set:')
train['comment_text'].isnull().sum()
print('Number of missing comments in comment text on Validation set:')
validation['comment_text'].isnull().sum()
Let’s split the input data from the ground truth labels.
X_train, y_train = train['comment_text'].values, train.iloc[:,2:].values
X_val, y_val = validation['comment_text'].values, validation.iloc[:,2:].values
X_test, y_test = test['comment_text'].values, test.iloc[:,2:].values
print('X_train shape ', X_train.shape)
print('y_train shape ', y_train.shape)
print('X_val shape ', X_val.shape)
print('y_val shape ', y_val.shape)
print('X_test shape ', X_test.shape)
print('y_test shape', y_test.shape)
y_train
As it can be seen, our labels are still not yet formatted. we need to list all tags of a comment is easy to understand way. for example: comment1 will have a new tag safe which means safe comment. and comment3 will have an array containing ‘severe toxic’,’obscene’,’identity_hate’
train.columns
classes = {0:'toxic',1:'severe_toxic',2:'obscene',3:'threat',4:'insult',5:'identity_hate'}
Create a function called convertClass that convert our labels as described above. for comments which are safe, add a new tag called safe
def convertClass(tags,classes):
result = []
for i,tag in enumerate(tags):
if tag > 0:
result.append(classes[i])
if len(result) == 0:
result.append('safe')
return result
Let’s run our function to all labels data
y_train = np.array([convertClass(tag,classes) for tag in y_train])
y_val = np.array([convertClass(tag,classes) for tag in y_val])
y_test = np.array([convertClass(tag,classes) for tag in y_test])
y_train
Our data is almost ready for training. but one more thing needs to be done. we need to clean the comments by removing wrong characters such as special characters. we also need to change the case of all characters that make the comment to avoid case insensitivity our models. we also need to remove stop words since those are words that are likely to be common to all comments. let’s print a sample of 3 comments to see what the comments are like
X_train[2:5]
Let’s create a function and call it text_prepare which will be passed a text and return a cleaned version of the text.
import re
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
NEW_LINE = re.compile('\n')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
def text_prepare(text):
"""
text: a string
return: modified initial string
"""
text = text.lower()# lowercase text
text = NEW_LINE.sub(' ',text) # replace NEW_LINE symbols in our texts by space
text = REPLACE_BY_SPACE_RE.sub(' ',text)# replace REPLACE_BY_SPACE_RE symbols by space in text
text = BAD_SYMBOLS_RE.sub('',text)# delete symbols which are in BAD_SYMBOLS_RE from text
text = ' '.join([word for word in text.split() if word not in STOPWORDS])
return text
Let’s test our function by using a small sample before we can apply the function to all texts.
sample_text = [ 'Possibly related 2 nearby IPs that have also spammed; see talk pages for details:',
'"\nWhen the category was created there was only one person in the category. I don\'t know anything about the American Protestant groups, so if you need to, you can create the category. Please use Category:Southern Baptist Convention when categorizing individuals who are Southern Baptist.Ć¢\x80\x94RyĆ
Ā«lóng (ç«\x9cé¾\x99) "',
'Giant Cunt==\n\nShe is not liberal but in fact a giant cunt. \n\n==',
"Sci-Fi Dine-In Theater Restaurant \n\nHello, Neelix. I'm not clear on why you are reverting my edits to this page. Please let me know! I'm attempting to adhere to wiki guidelines and am fairly certain that my edits are in lockstep with them. Thanks!\n\nJohn",
'(only elligable for new accounts)']
sample_text_clean = [text_prepare(x) for x in sample_text]
sample_text_clean
X_train = [text_prepare(x) for x in X_train]
X_val = [text_prepare(x) for x in X_val]
X_test = [text_prepare(x) for x in X_test]
X_train[2:5]
You can definitely do a better job than I did if you spend more time looking at the data and cleaning accordingly. I am satisfied with the little I did, that I do. so I will move to the next step.
Data transformation¶
In this section we will transform our input text into vectors of numbers that represent each word that makes the training corpus
Let’s first create two dictionaries one to hold all words and the number of times they have been used across the corpus. and another to hold all tags and the number of times they have been assigned a comment. these two will be used later when we want to check what the model is learning about some of the most important words in the corpus.
# Dictionary of all tags from train corpus with their counts.
tags_counts = {}
for tags in y_train:
for tag in tags:
if tag in tags_counts:
tags_counts[tag] += 1
else:
tags_counts[tag] = 1
# Dictionary of all words from train corpus with their counts.
words_counts = {}
for title in X_train:
for word in title.split():
if word in words_counts:
words_counts[word] += 1
else:
words_counts[word] = 1
most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:5]
print('Common tags',most_common_tags)
print('-----------------')
most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:5]
print('common words',most_common_words)
There are different ways we can represent each word in a numerical way. the most common are bag of words, TFIDF and Word Embeddings. The later has recently attracted attention of many especially when you have a big corpus of millions of words(including n-grams). I will use this technique in another post where I will cover as similar classification exercise using deep learning. for this exercise though, I will use TFIDF which is an improved version of bag of words which uses inversed logarithmic normalisation to penalise those words that are most frequent accross different input text of our corpus.
Let’s use sklearn library to generate tfidf vectors. I will also use n-grams of size 1 and two. you can experiment with size if you have enough RAM on your computer. Also keep in mind that increasing the size of n-grams without increasing the size of samples might generate a lot of features than the size of the sample. this can result into your model being high biased.
from sklearn.feature_extraction.text import TfidfVectorizer
Create a function that convert each set of our traing, validation and test corpus into a TFIDF matrix
def tfidf_features(X_train, X_val, X_test):
"""
X_train, X_val, X_test ā samples
return TF-IDF vectorized representation of each sample and vocabulary
"""
# Create TF-IDF vectorizer with a proper parameters choice
# Fit the vectorizer on the train set
# Transform the train, test, and val sets and return the result
tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2),token_pattern='(\S+)')
tfidf_vectorizer.fit(X_train)
X_train = tfidf_vectorizer.transform(X_train)
X_val = tfidf_vectorizer.transform(X_val)
X_test = tfidf_vectorizer.transform(X_test)
return X_train, X_val, X_test, tfidf_vectorizer.vocabulary_
Run the functiion agains the training set, validation and the test. check if most common words of our corpus are still there. this is crucial since the performance our model will depend on the TFIDF representation of the data sets. Implement function tfidf_features using class TfidfVectorizer from scikit-learn. Use train corpus to train a vectorizer. You can filter out too rare words (occur less than in 5 titles) and too frequent words (occur more than in 90% of the titles). Also, use bigrams along with unigrams in your vocabulary.
X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_val, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}
print('article' in tfidf_vocab)
print('page' in tfidf_vocab)
print('wikipedia' in tfidf_vocab)
print('please' in tfidf_vocab)
print('talk' in tfidf_vocab)
Check the size of our training set to find out how many features are we going to train.
print('X_train_tfidf shape ', X_train_tfidf.shape)
print('X_val_tfidf shape ', X_val_tfidf.shape)
print('X_test_tfidf shape ', X_test_tfidf.shape)
Training the classifier¶
Now that our different data are ready. let’s decide on the training technic. since this exercise requires to predict tags for each comment and since a comment can have zero, one or more than one tags, we will need to use a model that consider each input of tfidf vectors representing words of each comment; and each tag individually and evaluate the probability that the input can be assigned the tag(output 1) or otherwise (output 0). we will use the MultiLabelBinarizer of sklearn to convert each tag into a binary form.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=sorted(tags_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_val = mlb.fit_transform(y_val)
#y_test = mlb.fit_transform(y_test)
Now that the tags have been converted into binary representation, let’s build our classifier. we will use OneVsRestClassifier to train each binary tag individually and we will use basic classify called LogisticRegression as the upper layer. It is one of the simplest methods, but often it performs good enough in text classification tasks.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
def train_classifier(X_train, y_train):
"""
X_train, y_train ā training data
return: trained classifier
"""
# Create and fit LogisticRegression wraped into OneVsRestClassifier.
lr = LogisticRegression(C=4.0,penalty='l2') # use L2 to optimise
ovr = OneVsRestClassifier(lr)
ovr.fit(X_train, y_train)
return ovr
Let’s run the classifier function against our training set
classifier_tfidf = train_classifier(X_train_tfidf, y_train)
Let’s use the model to predict first the output of using our training set. also get the score to be used for evaluation
y_train_predicted_labels_tfidf = classifier_tfidf.predict(X_train_tfidf)
y_train_predicted_scores_tfidf = classifier_tfidf.decision_function(X_train_tfidf)
Let’s print a sample of 200 comments of our training set and compare what the model predicted vs the true labels
y_train_pred_inversed = mlb.inverse_transform(y_train_predicted_labels_tfidf)
y_train_inversed = mlb.inverse_transform(y_train)
for i,text in enumerate(X_train[20:26]):
print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
text,
','.join(y_train_inversed[20:26][i]),
','.join(y_train_pred_inversed[20:26][i])
))
Our model seems to be doing well on the training set. but that was expected. let’s now predict the validation set
y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_tfidf)
y_val_inversed = mlb.inverse_transform(y_val)
for i,text in enumerate(X_val[84:90]):
print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
text,
','.join(y_val_inversed[84:90][i]),
','.join(y_val_pred_inversed[84:90][i])
))
Not bad either. However our training data set is big and has a hundred of thousands of comments. the validation set is also big. we can ‘t evaluate each example by comparing labels. we need some numbers that can help us to understand how the model performed. we will use a combination of accuracy, f1 score, precision, recall and roc curve. you can read more about the evaluation methods by following the links.
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
Let’s create a function that prints the metrics that specified above
def print_evaluation_scores(y_val, predicted):
######################################
######### YOUR CODE HERE #############
######################################
print('Accuracy:',accuracy_score(y_val, predicted))
print('F1 Score:',f1_score(y_val, predicted, average='weighted'))
print('Precision:',average_precision_score(y_val, predicted,average='weighted'))
print('Recall:',recall_score(y_val, predicted,average='weighted'))
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)
We can see that the accuracy is quite impressive. but that was not a surprise and it does not mean the model is performing well. You remember when I discussed about sparse data? this is what I was referring too. since one class is 90% dominant, predicting everytime that class will give us a 90% accuracy. almost what we see here. does that mean we are not doing well since we are getting only 91% accuracy? not yet. let’s think about the precission and recall. for a quick reminder see this Wikipedia. It is clear that the precision and recall are not telling all as well. same as F1 score. so we still need to analyse the area under the curve.
from metrics import roc_auc
%matplotlib inline
n_classes = len(tags_counts)
roc_auc(y_val, y_val_predicted_scores_tfidf, n_classes)
Our ROC curve looks good. there is a good trade-off between sensitivity and specificity. that can be told by looking how closely the curve is approaching both the left side and the top side. our model is not bias nor it has a high variance. well done.
Now let’s apply the model on the test set. this will allow us to see whether our model is generalising
y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
y_test_predicted_scores_tfidf = classifier_tfidf.decision_function(X_test_tfidf)
y_test_pred_inversed = mlb.inverse_transform(y_test_predicted_labels_tfidf)
y_test_inversed = y_test#mlb.inverse_transform(y_test)
for i,text in enumerate(X_test[84:90]):
print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
text,
','.join(y_test_inversed[84:90][i]),
','.join(y_test_pred_inversed[84:90][i])
))
A quick look at the examples, shows that the model is doing well on the test set.
y_test_binary = mlb.fit_transform(y_test)
print_evaluation_scores(y_test_binary, y_test_predicted_labels_tfidf)
n_classes = len(tags_counts)
roc_auc(y_test_binary, y_test_predicted_scores_tfidf, n_classes)
The ROC curve also suggest that things are good.
Some things you can do¶
Pheeeew, we have now built our model and we are happy with the level of performance it shows on the comment data set. let’s now do something interesting. I found this a bit interesting and even funny. it is my favorite part for NLP models. let’s see what our models did actually learn. let’s ask our model to tell us the most positive words and the most negative words associated with each tag. sounds fun. Yuuuhuu.
def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
"""
classifier: trained classifier
tag: particular tag
tags_classes: a list of classes names from MultiLabelBinarizer
index_to_words: index_to_words transformation
all_words: all words in the dictionary
return nothing, just print top 5 positive and top 5 negative words for current tag
"""
print('Tag:\t{}'.format(tag))
# Extract an estimator from the classifier for the given tag.
# Extract feature coefficients from the estimator.
est = classifier.estimators_[tags_classes.index(tag)]
######################################
######### YOUR CODE HERE #############
######################################
top_positive_words = [index_to_words[index] for index in est.coef_.argsort().tolist()[0][-5:]]# top-5 words sorted by the coefficiens.
top_negative_words = [index_to_words[index] for index in est.coef_.argsort().tolist()[0][:5]]# bottom-5 words sorted by the coefficients.
print('Top positive words:\t{}'.format(', '.join(top_positive_words)))
print('Top negative words:\t{}\n'.format(', '.join(top_negative_words)))
DICT_SIZE = len(words_counts)
WORDS_TO_INDEX = {word[0]:i for i,word in enumerate(sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE])}
INDEX_TO_WORDS = {WORDS_TO_INDEX[i]:i for i in WORDS_TO_INDEX}####### YOUR CODE HERE #######
####### YOUR CODE HERE #######
ALL_WORDS = WORDS_TO_INDEX.keys()
print_words_for_tag(classifier_tfidf, 'safe', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'identity_hate', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'insult', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'obscene', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'severe_toxic', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'threat', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'toxic', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
Impressive , Hein!!! our model can learn words that are most positive for each category. a few take on though. first some classes are ambigious and negative words are sometimes words that have nothing to do with the tag. for instance if a comment has a word article then the model might this is not offensive even when it is. depending on other words that make the comment,then the model might classify as a toxic comment. it is good to see what the model learn since that way you can improve your model by providing more examples.
Save the model to the disk for later use¶
import pickle
# save the model to disk
filename = 'classify_wiki_comments.sav'
pickle.dump(classifier_tfidf, open(filename, 'wb'))
# in case you would like to reload the model from the disk. just do
loaded_model = pickle.load(open(filename, 'rb'))
Comments are closed, but trackbacks and pingbacks are open.