Tag Archives: kaggle

Is this a toxic comment?

False news and toxic comments on the web are no longer merely a nuisance: they can topple governments; or act as a catalyst for communal disharmony. This gives the recently concluded Toxic comment identification competition on Kaggle an added value.


The Kaggle Competition was organized by the Conversation AI team as part of its attempt at improving online conversations. Their current public models are available through Perspective API, but looking to explore better solutions through the Kaggle community.


In terms of the size, the dataset is relatively small with training set containing 134,384 records and test set 117,888.

Training and test sets both contain following fields.

  • ID – a random unique string
  • Comment – the text of the comment

In addition, the training set contains following six binary label fields. These labels are not mutually exclusive: a comment can be both “Toxic” and “Severe_toxic”.

  • Toxic
  • Severe_toxic
  • Obscene
  • Threat
  • Insult
  • Identity_hate

Here is the distribution of classes in the training set.

Distribution of classes

Distribution of classes

Number of tags applied per comment.

Number of tags applied per comment

Number of tags applied per comment

Most frequent toxic words per class, taken from jagangupta’s kernel.

Most frequent toxic words

Most frequent toxic words

If interested, Jagangupta’s brilliant kernel explores the dataset in depth and presents a detailed report.

Text pre-processing

Little text pre-processing helped improve the results somewhat in the range of ~0.0005 mean ROC AUC (more on evaluation later). Replacing IP addresses by a token, replacing abbreviations, emojis and expressions such as “gooood” with appropriate/normalized words helped. Removing stop-words seemed to be hindering sequence models than helping them, which means models have been able to use at least some of the words. I also kept few symbols such as “!” and “?” since both fastText and GloVe embeddings have representations for symbols.

def clean(comment):
    This function receives comments and returns a clean comment
    # normalize ip
    comment = re.sub("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"," ip ", comment)
    # replace words like Gooood with Good
    comment = re.sub(r'(\w)\1{2,}', r'\1\1', comment)
    # replace ! and ? with <space>! and <space>? so they can be kept as tokens by Keras
    comment = re.sub(r'(!|\?)', " \\1 ", comment)   
    #Split the sentences into words
    words=comment.split(' ')
    # normalize common abbreviations
    # replacements is a dictionary loaded from https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view 
    words=[replacements[word] if word in replacements else word for word in words]
    clean_sent=" ".join(words)

Another interesting point is the maximum number of features used: it stops having any noticeable effect after a certain threshold. So to save the computation cost and time, it’s worth setting a limit. In Keras tokenizer, this can be achieved by setting the num_words parameter, which limits the number of words used to a defined n most frequent words in the dataset. In this case, I settled for 100,000 as the maximum number of words used for models.

It is same with the length of features used to represent a sentence, and can be selected by looking at the following graph.

Number of words in a sentence

Number of words in a sentence. Source: https://www.kaggle.com/sbongo/for-beginners-tackling-toxic-using-keras

Sentences longer than a given threshold need to be truncated while shorter sentences need to be padded to fit the length. This is required before feeding a dataset to a sequence model because the model needs to have a defined number of units. In all experiments detailed, I used 200 as the sequence length as it didn’t have any noticeable difference using more features.

from keras.preprocessing import text, sequence
maxlen = 200 # length of the submitted sequence
EMBEDDING_FILE = './data/fasttext/crawl-300d-2M.vec'
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
x_train = train["comment_text"].fillna("fillna").values
x_test = test["comment_text"].fillna("fillna").values
# default filters parameter removes symbols ! and ? which we want to keep
tokenizer = text.Tokenizer(num_words=max_features, filters='"#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(list(x_train) + list(x_test))
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)
# pad sentences to meet the maximum length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
# build a mapping of word to its embeddings
def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.rstrip().rsplit(' ')) for o in open(EMBEDDING_FILE))
# build the embedding matrix 
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.zeros((nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector


The evaluation of models was based on the mean column-wise ROC AUC. That is, averaging the ROC AUC score of each column.

Though I tried a number of models and variations, they can be roughly summarized to five models.

  1. Logistic regression model combining unigram, bigram and character level features with td-idf weighting
  2. LSTM/GRU based model with GloVe and fastText embeddings
  3. LSTM/GRU + CNN model with GloVe and fastText embeddings
  4. A deep CNN model
  5. Sequence layer with attention

This is each model in detail.

  1. Logistic regression
  2. The final LR model was a combination of three sets of features extracted from two unigram, bigram td-idf weighted 10,000 most frequent words combined with a character level tokenizer of lengths 2 to 6 (highest 50,000 features). With a single cross validation split of 5% this model achieved a highest LB score of 0.9805.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from scipy.sparse import hstack
    class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    # unigram feature extractor
    unigram_vectorizer = TfidfVectorizer(
        sublinear_tf=True, strip_accents='unicode',analyzer='word', ngram_range=(1, 1), 
        use_idf=1, smooth_idf=True, stop_words='english', max_features=10000
    unigram_vectorizer.fit(all_text) # all_text is concat of training and test text
    train_unigram_features = unigram_vectorizer.transform(train_text)
    test_unigram_features = unigram_vectorizer.transform(test_text)
    bigram_vectorizer = TfidfVectorizer(
        sublinear_tf=False, strip_accents='unicode', analyzer='word', ngram_range=(2, 2),
        use_idf=1, smooth_idf=True, stop_words='english', max_features=10000
    train_bigram_features = bigram_vectorizer.transform(train_text)
    test_bigram_features = bigram_vectorizer.transform(test_text)
    char_vectorizer = TfidfVectorizer(
        sublinear_tf=True, strip_accents='unicode', analyzer='char',
        stop_words='english', ngram_range=(2, 6), max_features=50000
    train_char_features = char_vectorizer.transform(train_text)
    test_char_features = char_vectorizer.transform(test_text)
    train_features = hstack([train_unigram_features, train_bigram_features, train_char_features])
    test_features = hstack([test_unigram_features, test_bigram_features, test_char_features])
    for class_name in class_names:
        train_target = train[class_name]
        classifier = LogisticRegression(solver='sag')
        classifier.fit(train_features, train_target)
        submission[class_name] = classifier.predict_proba(test_features)[:, 1]
  3. A sequence model with bi-directional LSTM/GRU with embeddings
  4. Essentially these are Bi-directional LSTM or GRU models taking embeddings as input. Initially I tried with a many-to-one model (returning only the last state) for the sequence layer, but this performed poorly. Influenced from the community, sequence layer was changed to return all states of units, and these time-distributed signals were captured by two pooling layers (average and max), and then concatenated and used as input for the final layer: six densely connected activation units (sigmoid) representing six output labels. This performed much better, achieving around 0.9830 on LB easily.

    I tried few variations of the same model. Among them, a bi-directional LSTM/GRU layer connected to a time-distributed dense layer (figure a) and a multi-layer bi-directional LSTM/GRU model (figure b) are noteworthy. The multi-layer model seemed to be overfitting, and at best performed comparable to a single layer when regularized heavily with a high dropout rate. The single bidirectional LSTM layer connected to a time-distributed dense layer with a moderate dropout rate, however, performed a little better than a single bidirectional LSTM layer. So this last model was used for further model averaging.

    One observations from this experiment was that it didn’t have much difference in accuracy whether it was LSTM or GRU units were used for the sequence layer. However, different embeddings had a noticeable difference. I tried with fastText (crawl, 300d, 2M word vectors) and GloVe (Crawl, 300d, 2.2M vocab vectors), and fastText embeddings worked slightly better in this case (~0.0002-5 in mean AUC). I didn’t bother with training embeddings since it didn’t look like there was enough dataset to train. This lecture explains what might happen when trying to train pre-trained embeddings on a small dataset.

    For building models, Keras was used with default Tensorflow backend. As for the hardware, AWS P2 instances (Tesla K80 GPUs) were used.



    a) LSTM-Dense-Pooling

    Multi-layer Bi-GRU

    b) Multi-layer Bi-GRU


    def build_model(): # figure (a) as a Keras model
        inp = Input(shape=(maxlen, ))
        x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
        x = SpatialDropout1D(0.4)(x)
        x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.0, dropout=0.2))(x)
        x = TimeDistributed(Dense(100, activation = "relu"))(x) # time distributed  (sigmoid)
        x = Dropout(0.1)(x)
        # global pooling layer
        avg_pool = GlobalAveragePooling1D()(x)
        max_pool = GlobalMaxPooling1D()(x)
        conc = concatenate([avg_pool, max_pool])
        outp = Dense(6, activation="sigmoid")(conc)
        model = Model(inputs=inp, outputs=outp)
        return model
    model = build_model()
    opt = Nadam(lr=0.001)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
  5. LSTM/GRU + CNN model with GloVe and fastText embeddings
  6. This model is an attempt at combining sequence models with convolution neural networks (CNNs). At the beginning I tried a convolution layer passing signals to the sequence layer. But it seems swapping these layers — embeddings feeding to LSTM first and then using a CNN on each LSTM unit’s state — and pooling to an output layer brings better results. This study and kernel better explain this model.


    CNN-LSTM-pooling model

    def build_model(): # figure (a) as a Keras model
        input = Input(shape=(maxlen, ))
        x = Embedding(max_features, embed_size, weights=[embedding_matrix])(input)
        x = SpatialDropout1D(0.4)(x)
        x = Bidirectional(LSTM(80, return_sequences=True, recurrent_dropout=0.2, dropout=0.2))(x)
        x = Conv1D(filters=64, kernel_size=2, padding='valid', kernel_initializer="he_uniform")(x)
        x = Dropout(0.2)(x)
        avg_pool = GlobalAveragePooling1D()(x)
        max_pool = GlobalMaxPooling1D()(x)    
        conc = concatenate([avg_pool, max_pool])
        output = Dense(64, activation="relu")(conc)
        output = Dropout(0.1)(output)   
        output = Dense(6, activation="sigmoid")(output)        
        model = Model(inputs=input, outputs=output)    
        return model

    As a variation, I tried to emulate bigrams and trigrams using kernel sizes of 2 and 3 in the CNN layer and concatenating their outputs through pooling layers. But again, this seemed to overfit.

  7. A deep CNN model
  8. This model is based on this paper and this kernel. In summary, it consists of multiple layers of convolution and pooling layers with skip layer connections. Unfortunately, this didn’t perform up to the mark of above two models, both fastText and GloVe embeddings scoring 0.9834 and averaging to 0.9843 on LB. It could be either due to not being able to spend time on tuning it or overfitting again. Still this could work with more data, and indeed, it has according to stats reported in the paper.

  9. Sequence layer with attention
  10. This was a half hearted attempt. But still I wanted to try out how well attention works out. I used an attention layer and a secondary LSTM layer. Strangely, it couldn’t surpass the first two models without attention. Probably, it could have with more time spent on tuning it, but the training time needed was higher than other models.

    Attention with LSTM

    Attention with LSTM (Source: https://www.coursera.org/learn/nlp-sequence-models/)

Model averaging and ensemble

Due to the relatively small dataset size, there seemed to be a case of overfitting. So it’s not surprising that model averaging and regularization showed a strong positive effect on the prediction accuracy. It was done through various forms: one was through stratified 10-fold training, and this improved the performance noticeably, though obviously it took more time. When averaging folds, weighted average or ranked average performed slightly better than taking the mean of predictions.

Another interesting tidbit is how to use stratified-k fold with multi label classification, since popular stratified k-fold scikit-learn function only supports single label splits. Here, numpy.packbits can come in handy.

import numpy as np
from sklearn.model_selection import StratifiedKFold
pred = np.zeros((x_test.shape[0], 6))
y_packed = np.packbits(y_train, axis=1)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=32)
for i, (train_idx, valid_idx) in enumerate(kfold.split(x_train, y_packed)):
    print("Running fold {} / {}".format(i + 1, n_folds))
    print("Training / Valid set counts {} / {}".format(train_idx.shape, valid_idx.shape))
    # train the model

Model averaging was done again using different embeddings: here, the predictions produced by the same model using GloVe and fastText embeddings were averaged. This step improved the final accuracy of predictions significantly: the best single LSTM-CNN model achieved 0.9856 on LB after averaging predictions from two embeddings, where GloVe and fastText only got 0.9850 and 0.9854 respectively using the same model.

Towards the end, the competition turned into an ensemble madness. I would have liked to try some stacking, which in my opinion is the better way of combining models than by randomly conjured up coefficients.

Improvements and notes from the community

In my opinion, the best feature of Kaggle competitions is the collaborative learning experience. Here are some of the effective techniques that have been used by other teams.

  1. Augmenting the train/test dataset with translation
  2. Using translations is an interesting method of augmenting the dataset, and it has worked wonderfully without information leaking. The technique is quite simple: translate sentences to few nearby languages, and then translate them back to English. When you think about it, this makes sense since it can help reduce overfitting by adding more data with variations in the sentence structure. More details can be found from this kernel.

  3. Adding more embeddings
  4. It seems much of the complexity in this case comes from the embedding layer; and using more embeddings helps more than using different structures. So using all variations of GloVe, Word2vec, LexVec and fastText (e.g., Crawl, Twitter, Wikipedia) can help by averaging resulting predictions. More on various pre-trained embeddings can be found from here and here.

  5. Byte-pair encoding (BPE)
  6. Usually in any text processing work there is a large number of out of vocabulary words. In other words, these are the words that are not found in the embeddings. The standard way of handling such words is to use a token such as <unk> and moving on. Byte-pair-encoding has shown better results handling these kind of words by breaking the word into subwords — similar to phoneme in speech recognition — and using embeddings of these subwords to arrive at the full word. More on this can be found from here.

  7. Capsule networks
  8. Few teams had tried Geoffrey Hinton’s Capsule Networks, and they have reported to be overfitting. Anyway, it’s something worth trying out.

Useful resources:


Notes from Quora duplicate question pairs finding Kaggle competition

Quora duplicate question pairs Kaggle competition ended a few months ago, and it was a great opportunity for all NLP enthusiasts to try out all sorts of nerdy tools in their arsenals. This is just jotting down notes from that experience.


Quora has over 100 million users visiting every month, and needs to identify duplicate questions submitted — an incident that should be very common with such a large user base. One interesting characteristic that differentiates it from other NLP tasks is the limited amount of context available in the title; in most cases this would amount to a few words.

Exploratory data analysis

The dataset is simple as it can get: both training and test sets consist of two questions in consideration. Additionally, in the training set there are few extra columns: one denoting whether it’s a duplicate, and two more for unique IDs of each question.

qid1, qid2 – unique ids of each question (only available in the training set)
question1, question2 – the full text of each question
is_duplicate – the target variable; set to 1 if question1 and question2 essentially have the same meaning; 0 otherwise.

Some quick stats:

  • Training set size – 404,290
  • Test set size – 2,345,796
  • Total training vocabulary – 8,944,593
  • Avg. word count per question – 11.06

A quick EDA reveals some interesting insight to the dataset.

  • Classes are not balanced.


Training class balance

Training class balance

In the training/validation set, the duplicate percentage (label 1) is ~36.9%. Since the class balance can influence some classifiers, this fact becomes useful when training models later.


  • Normalized unigram word shared counts can be a good feature
Shared unigram counts

Shared unigram counts


Violin plot of shared word counts


When the shared word ratio (Jaccard similarity) is considered, this becomes even more prominent:

(1)   \begin{equation*} \frac{\textit{question1 words} \cap \textit{question2 words}}{ \textit{question1 words} \cup \textit{question2 words}} \end{equation*}

Violin plot of shared word ratio

Violin plot of shared word ratio


The correlation of shared unigram counts towards the class further indicates that other n-grams can also perhaps participate as features in our model.

Arguably the best perk of being part of a Kaggle competition is the incredible community. Here are some in-depth EDAs carried out by some of its members:


Statistical modelling

XGBoost is a gradient boosting framework that has become massively popular, especially in the Kaggle community. The popularity is not underserving as it has won many competitions in the past and known for its versatility. So as the primary model, XGBoost was used with following parameters, selected based on the performance of the validation set.

  objective = 'binary:logistic'
  eval_metric = 'logloss'
  eta = 0.11
  max_depth = 5

Before discussing features used, there’s one neat trick that I believe everyone who did well in the competition used. After the first few submissions of prediction results, it became apparent that there’s something wrong when you compare the results obtained against the validation set with the Kaggle leaderboard (LB). No matter how many folds were used for the validation set, the results obtained against the validation set didn’t reflect on the LB. This is due to the fact that the class balance between the training set and the test set was considerably different, and the cost function (logloss) being sensitive to the imbalance. Specifically, in the training set around 37% were positive labels while in the test set it was approximated to be around 16.5%. So some oversampling of the negatives in the training set was required to get a comparable result on the LB. More on oversampling can be found here and here.


From a bird’s eye view, features used can be categorised into three groups.

  1. Classical text mining features
  2. Embedded features
  3. Structural features

Following features can be categorised under classical text mining features.

  • Unigram word match count
  • Ratio of the shared count (against the total words in 2 questions)
  • Shared 2gram count
  • Ratio of sum of shared tf-idf score against the total weighted word score
  • Cosine distance
  • Jaccard similarity coefficient
  • Hamming distance
  • Word counts of q1, q2 and the difference (len(q1), len(q2), len(q1) – len(q2))
  • Caps count of q1, q2 and the difference
  • Character count of q1, q2 and difference
  • Average length of a word in q1 and q2
  • Q1 stopword ratio, Q2 stopword ratio, and the difference of ratios
  • Exactly same question?

Since a large portion of sentence pairs are questions, many duplicate questions are starting with the same question word (which, what, how .etc). So few more features were used to indicate whether this clause applies.

  • Q1 starts with ‘how’, Q2  starts with ‘how’ and both questions have ‘how‘ (3 separate features)
  • same for words ‘what‘, which, who, where, when, why

Some fuzzy features generated from the script here, which in turn used fuzzywuzzy package,

  • Fuzzy WRatio
  • Fuzzy partial ratio
  • Fuzzy partial token set ratio
  • Fuzzy partial token sort ratio
  • Fuzzy qratio
  • Fuzzy token set ratio
  • Fuzzy token sort ratio

As for the embedded features, Abhishek Thakur’s script did everything needed: it generates a word2vec representation of each word using a pre-trained word2vec model on Google News corpus using gensim package. It then generates a sentence representation by normalizing each word vector.

def sent2vec(s):
  words = str(s).lower().decode('utf-8')
  words = word_tokenize(words)
  words = [w for w in words if not w in stop_words]
  words = [w for w in words if w.isalpha()]
  M = []
  for w in words:
  M = np.array(M)
  v = M.sum(axis=0)
  return v / np.sqrt((v ** 2).sum())

Based on the vector representations of the sentences, following distance features were generated by the same script.

Combined with these calculated features, full 300 dimension word2vec representations of each sentence were used for the final model. The raw vector addition required a large expansion of the AWS server I was using, but in hindsight brought little improvement.

Structural features have caused much argument within the community. These features aren’t meaningful NLP features, but because of the way how the dataset was formed, it had given rise to some patterns within the dataset. It’s doubtful if these features will be much use in a real-word scenario, but within the context of the competition, they gave a clear boost. so I guess everyone used them disregarding whatever moral compunctions one might have had.

These features include,

  • Counting the number of questions shared between two sets formed by the two questions
  •   from collections import defaultdict
      q_dict = defaultdict(set)
      def build_intersects(row):
      def count_intersect(row):
      df_train.apply(build_intersects, axis=1, raw=True)
      df_train.apply(count_intersect, axis=1, raw=True)
  • Shard word counts, tf-idf and cosine scores within these sentence clusters
  • The page rank of each question (within the graph induced by questions as nodes and shared questions as edges)
  • Max k-cores of the above graph


The effect of features on the final result can be summerized by following few graphs.

  • With only classic NLP features (at XGBoost iteration 800):
    • Train-logloss:0.223248, eval-logloss:0.237988 (0.21861 on LB)

  • With both classic NLP + structural features (at XGBoost iteration 800):
    • Training accuracy: 0.929, Validation accuracy: 0.923
    • Train-logloss:0.17021,  eval-logloss:0.185971 (LB 0.16562)


  • With classic NLP + structural + embedded features (at XGBoost iteration 700):
    • Training accuracy: 0.938, Validation accuracy: 0.931
    • Train-logloss:0.149663, eval-logloss:0.1654 (LB 0.14754)

Rank wise this feature set and the model achieved a max 3% at one point, though it came down to 7% by the end due to my lethargic finish. But considering it was an individual effort against mostly other team works consisting several ensemble models, I guess it wasn’t bad. More than anything, it was great fun and a good opportunity to play with some of the best ML competitors in the Kaggle community/world and collaboratively learn from that community.

I’ve shared the repository of Jupyter notebooks used and can be found from here.

PS: Some of the stuff that I wanted to try out, but didn’t get to:

Tagged ,