Category Archives: Python

Notes from Quora duplicate question pairs finding Kaggle competition

The Quora duplicate question pairs Kaggle competition ended a few months ago, and it was a great opportunity for all NLP enthusiasts to try out all sorts of nerdy tools in their arsenals. This post is just me jotting down notes from that experience.


Quora has over 100 million users visiting every month, and needs to identify duplicate questions submitted — an incident that should be very common with such a large user base. One interesting characteristic that differentiates it from other NLP tasks is the limited amount of context available in the title; in most cases this would amount to a few words.

Exploratory data analysis

The dataset is as simple as it can get: both the training and test sets consist of the two questions under consideration. Additionally, the training set has a few extra columns: one denoting whether the pair is a duplicate, and two more for the unique IDs of each question.

qid1, qid2 – unique ids of each question (only available in the training set)
question1, question2 – the full text of each question
is_duplicate – the target variable; set to 1 if question1 and question2 essentially have the same meaning; 0 otherwise.

Some quick stats:

  • Training set size – 404,290
  • Test set size – 2,345,796
  • Total training vocabulary – 8,944,593
  • Avg. word count per question – 11.06

A quick EDA reveals some interesting insights into the dataset.

  • Classes are not balanced.


Training class balance


In the training/validation set, the duplicate percentage (label 1) is ~36.9%. Since the class balance can influence some classifiers, this fact becomes useful when training models later.


  • Normalized unigram word shared counts can be a good feature
Shared unigram counts



Violin plot of shared word counts


When the shared word ratio (Jaccard similarity) is considered, this becomes even more prominent:

\begin{equation*} \frac{\lvert \textit{question1 words} \cap \textit{question2 words} \rvert}{\lvert \textit{question1 words} \cup \textit{question2 words} \rvert} \end{equation*}

Violin plot of shared word ratio
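For anyone who wants to reproduce the two shared-word features, here is a minimal sketch using only the standard library. The tokenizer (lowercasing plus a simple regex) and the function names are my assumptions for illustration; the notebooks may have tokenized differently, for instance by dropping stopwords first.

```python
import re


def tokens(question):
    """Lowercase a question and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", str(question).lower()))


def shared_word_features(q1, q2):
    """Return (shared unigram count, Jaccard ratio) for a question pair."""
    w1, w2 = tokens(q1), tokens(q2)
    shared = w1 & w2
    union = w1 | w2
    # Guard against two empty questions to avoid dividing by zero.
    ratio = len(shared) / len(union) if union else 0.0
    return len(shared), ratio
```

Calling shared_word_features on each (question1, question2) row gives the raw count and the ratio plotted above.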



The correlation of shared unigram counts towards the class further indicates that other n-grams can also perhaps participate as features in our model.

Arguably the best perk of being part of a Kaggle competition is the incredible community. Here are some in-depth EDAs carried out by some of its members:


Statistical modelling

XGBoost is a gradient boosting framework that has become massively popular, especially in the Kaggle community. The popularity is not undeserved: it has won many competitions in the past and is known for its versatility. So XGBoost was used as the primary model with the following parameters, selected based on performance on the validation set.

  objective = 'binary:logistic'
  eval_metric = 'logloss'
  eta = 0.11
  max_depth = 5
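As a rough sketch, these parameters could be passed to XGBoost's native training API as below. The train helper, its argument names and the default of 800 rounds (taken from the iteration counts quoted later in this post) are my own framing, not the exact notebook code; dtrain and dval are assumed to be xgboost.DMatrix objects built from the feature matrix.

```python
# Parameters quoted in the post; everything else is left at XGBoost defaults.
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.11,
    "max_depth": 5,
}


def train(dtrain, dval, rounds=800):
    """Train a booster while reporting logloss on both sets."""
    import xgboost as xgb  # imported lazily so the dict above is usable on its own

    return xgb.train(
        params,
        dtrain,
        num_boost_round=rounds,
        evals=[(dtrain, "train"), (dval, "eval")],
    )
```

Naming the evaluation sets "train" and "eval" is what produces the train-logloss and eval-logloss columns in the training log.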

Before discussing the features used, there’s one neat trick that I believe everyone who did well in the competition used. After the first few submissions, it became apparent that something was wrong when comparing the validation results with the Kaggle leaderboard (LB). No matter how many folds were used for validation, the validation results were not reflected on the LB. This is because the class balance of the training set and the test set differed considerably, and the cost function (logloss) is sensitive to that imbalance. Specifically, around 37% of the training set had positive labels, while the figure for the test set was estimated to be around 16.5%. So some oversampling of the negatives in the training set was required to get a comparable result on the LB. More on oversampling can be found here and here.
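To make the arithmetic concrete: if positives are a fraction p of the training set and should end up as a fraction t, each negative has to appear about p(1 - t) / (t(1 - p)) times. With p = 0.369 and t = 0.165 that is roughly 2.96. Below is a minimal sketch of one way to do the resampling; the function names and the choice of random repetition are my assumptions, not necessarily what the linked scripts do.

```python
import random


def negative_oversampling_factor(pos_rate, target_pos_rate):
    """Average number of copies of each negative needed so that positives
    make up target_pos_rate of the resampled set."""
    return pos_rate * (1.0 - target_pos_rate) / (target_pos_rate * (1.0 - pos_rate))


def oversample_negatives(labels, target_pos_rate, seed=0):
    """Return row indices: every original row plus randomly repeated negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve pos / (pos + n_neg) = target for the number of negatives required.
    n_neg_needed = round(len(pos) * (1.0 - target_pos_rate) / target_pos_rate)
    extra = max(0, n_neg_needed - len(neg))
    return pos + neg + [rng.choice(neg) for _ in range(extra)]
```

The returned indices can be used to select rows of the training frame before building the model input, so that validation logloss is measured on a class balance closer to the test set.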


From a bird’s eye view, the features used can be categorised into three groups.

  1. Classical text mining features
  2. Embedded features
  3. Structural features

The following features can be categorised under classical text mining features.

  • Unigram word match count
  • Ratio of the shared count (against the total words in 2 questions)
  • Shared 2gram count
  • Ratio of sum of shared tf-idf score against the total weighted word score
  • Cosine distance
  • Jaccard similarity coefficient
  • Hamming distance
  • Word counts of q1, q2 and the difference (len(q1), len(q2), len(q1) – len(q2))
  • Caps count of q1, q2 and the difference
  • Character count of q1, q2 and difference
  • Average length of a word in q1 and q2
  • Q1 stopword ratio, Q2 stopword ratio, and the difference of ratios
  • Exactly same question?
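The length-style features in this list are cheap to compute. Here is a small illustrative sketch; the feature names, the whitespace tokenization and the case-insensitive "exact match" are assumptions of mine rather than the exact definitions used in the notebooks.

```python
def basic_pair_features(q1, q2):
    """Length-style features for one question pair (names are illustrative)."""
    w1, w2 = q1.split(), q2.split()

    def caps(text):
        # Number of upper-case characters in the raw text.
        return sum(1 for ch in text if ch.isupper())

    def avg_word_len(words):
        return sum(len(w) for w in words) / len(words) if words else 0.0

    return {
        "words_q1": len(w1),
        "words_q2": len(w2),
        "words_diff": len(w1) - len(w2),
        "caps_q1": caps(q1),
        "caps_q2": caps(q2),
        "caps_diff": caps(q1) - caps(q2),
        "chars_q1": len(q1),
        "chars_q2": len(q2),
        "chars_diff": len(q1) - len(q2),
        "avg_word_len_q1": avg_word_len(w1),
        "avg_word_len_q2": avg_word_len(w2),
        # Assumption: "exactly same" is compared case-insensitively.
        "exact_match": int(q1.strip().lower() == q2.strip().lower()),
    }
```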

Since a large portion of the sentence pairs are questions, many duplicate questions start with the same question word (which, what, how, etc.). So a few more features were used to indicate whether this applies.

  • Q1 starts with ‘how’, Q2 starts with ‘how’, and both questions have ‘how’ (3 separate features)
  • The same for the words ‘what’, ‘which’, ‘who’, ‘where’, ‘when’ and ‘why’
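A sketch of how these flags could be generated is below. I read "both questions have" as "both questions start with" the word; that reading, the regex tokenizer and the feature names are my assumptions.

```python
import re

QUESTION_WORDS = ("how", "what", "which", "who", "where", "when", "why")


def first_word(question):
    """First alphabetic token of a question, lowercased ('' if there is none)."""
    found = re.findall(r"[a-z]+", str(question).lower())
    return found[0] if found else ""


def question_word_features(q1, q2):
    """Three 0/1 flags per question word: q1 starts, q2 starts, both start."""
    f1, f2 = first_word(q1), first_word(q2)
    feats = {}
    for word in QUESTION_WORDS:
        s1, s2 = int(f1 == word), int(f2 == word)
        feats["q1_" + word] = s1
        feats["q2_" + word] = s2
        feats["both_" + word] = s1 & s2
    return feats
```

With seven question words this yields 21 binary columns per pair.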

Some fuzzy features were generated with the script here, which in turn uses the fuzzywuzzy package:

  • Fuzzy WRatio
  • Fuzzy partial ratio
  • Fuzzy partial token set ratio
  • Fuzzy partial token sort ratio
  • Fuzzy qratio
  • Fuzzy token set ratio
  • Fuzzy token sort ratio

As for the embedded features, Abhishek Thakur’s script did everything needed: it generates a word2vec representation of each word using a word2vec model pre-trained on the Google News corpus, loaded through the gensim package. It then generates a sentence representation by summing the word vectors and normalizing the result to unit length.

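import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def sent2vec(s):
  # `model` is assumed to be the pre-trained word2vec model loaded with gensim
  words = word_tokenize(str(s).lower())
  words = [w for w in words if w not in stop_words]
  words = [w for w in words if w.isalpha()]
  M = []
  for w in words:
    # The loop body was missing here; reconstructed as "keep in-vocabulary words"
    if w in model:
      M.append(model[w])
  M = np.array(M)
  v = M.sum(axis=0)  # sum the word vectors
  return v / np.sqrt((v ** 2).sum())  # scale to unit length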

Based on the vector representations of the sentences, the following distance features were generated by the same script.

Combined with these calculated features, the full 300-dimensional word2vec representations of each sentence were used for the final model. Adding the raw vectors required a large expansion of the AWS server I was using, but in hindsight brought little improvement.

Structural features have caused much argument within the community. These aren’t meaningful NLP features, but the way the dataset was formed gave rise to some patterns within it. It’s doubtful whether these features would be of much use in a real-world scenario, but within the context of the competition they gave a clear boost, so I guess everyone used them regardless of whatever moral compunctions one might have had.

These features include:

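  • Counting the number of questions shared between two sets formed by the two questions:

      from collections import defaultdict
      q_dict = defaultdict(set)  # question text -> set of questions it was paired with
      def build_intersects(row):
          # Body was missing here; reconstructed (assumed): link the two questions
          q_dict[row['question1']].add(row['question2'])
          q_dict[row['question2']].add(row['question1'])
      def count_intersect(row):
          # Body was missing here; reconstructed (assumed): size of the shared set
          return len(q_dict[row['question1']] & q_dict[row['question2']])
      # raw=True is dropped so that rows can be indexed by column name
      df_train.apply(build_intersects, axis=1)
      df_train.apply(count_intersect, axis=1)  # one count per row
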
  • Shared word counts, tf-idf and cosine scores within these sentence clusters
  • The page rank of each question (within the graph induced by questions as nodes and shared questions as edges)
  • Max k-cores of the above graph
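To illustrate the graph-based features without any extra dependencies, here is a small standard-library sketch that builds the question graph and computes each question's core number (the largest k for which it sits in a k-core) by repeatedly peeling off the lowest-degree node. The function names are mine, and this simple version is quadratic in the number of questions, so for the full dataset a graph library would be the practical choice.

```python
from collections import defaultdict


def build_question_graph(pairs):
    """Undirected graph: questions are nodes, each (q1, q2) pair is an edge."""
    graph = defaultdict(set)
    for q1, q2 in pairs:
        if q1 != q2:
            graph[q1].add(q2)
            graph[q2].add(q1)
    return graph


def core_numbers(graph):
    """Core number of each node: the largest k such that the node belongs to
    a subgraph in which every node has at least k neighbours."""
    degree = {node: len(nbrs) for node, nbrs in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in graph[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core
```

The same graph also gives the shared-question count from the first bullet: len(graph[q1] & graph[q2]).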


The effect of the features on the final result can be summarized as follows.

  • With only classic NLP features (at XGBoost iteration 800):
    • Train-logloss:0.223248, eval-logloss:0.237988 (0.21861 on LB)

  • With both classic NLP + structural features (at XGBoost iteration 800):
    • Training accuracy: 0.929, Validation accuracy: 0.923
    • Train-logloss:0.17021,  eval-logloss:0.185971 (LB 0.16562)


  • With classic NLP + structural + embedded features (at XGBoost iteration 700):
    • Training accuracy: 0.938, Validation accuracy: 0.931
    • Train-logloss:0.149663, eval-logloss:0.1654 (LB 0.14754)

Rank-wise, this feature set and model reached the top 3% at one point, though it came down to the top 7% by the end due to my lethargic finish. But considering it was an individual effort, mostly against teams running several ensemble models, I guess it wasn’t bad. More than anything, it was great fun and a good opportunity to play with some of the best ML competitors in the Kaggle community/world and to learn collaboratively from that community.

I’ve shared the repository of Jupyter notebooks I used; it can be found here.

PS: Some of the stuff that I wanted to try out, but didn’t get to:


Vista to Ubuntu (100%)

I had had enough of putting up with Vista’s crap. The last of my patience wore off when I had to wait about 5 seconds to switch from one MS Word document to another, just when I was racing against time to finish a project report. No, I’m not running on 256 MB of RAM; it’s 1 GB, and this kind of time wastage is totally unacceptable. You may ask why I put up with Vista in the first place. That’s thanks to HP’s decision to embrace Vista: my laptop came pre-installed with Vista, with no chance to downgrade because there are no drivers. So where to go now? Easy... Gutsy.

So I’m now another guy who left Windows permanently because of its own defects. That I played along with Vista for this long had nothing to do with Vista being better; it was because there were no substitutes for some applications I was used to in the Windows environment, mainly Macromedia (now Adobe) Dreamweaver and Fireworks. Now, before you bite my head off: yes, there are good web development editors on Linux such as NVu or KompoZer, but they will need a few more years of development to get on par with Dreamweaver (which has had a long time to develop into its current state), and I have to have something until then.

But my worries were groundless, as Wine now supports the Macromedia 8 series beautifully. Dreamweaver, Fireworks and Flash all work perfectly with Wine, from installation to execution. Another great thing about Gutsy is that support for my Broadcom network card is built in, and with some additional applications I can search for networks and connect to them like a charm. I’m also using AWN manager to manage the desktop (here is a great article on desktop styling, thanks to Lakshan), and now it looks like a hybrid between Leopard and Vista. So what else can I ask for?

Here is my application list in Ubuntu.

Web developments = Wine + DreamWeaver 8

Web images = Wine + Fireworks 8

Photo Editing = Gimp

Java editor = NetBeans for Linux

Python = Eclipse with PyDev plugin

IM client = Pidgin

Wifi manager = gtkwifi and wifi radar

Skype = Skype for Linux

Btw, my machine is an AMD 64 X2, so I had some issues and had to do some tweaks when installing some applications, but nothing I couldn’t handle with some effort. The bottom line is that I can work with all the comforts of Windows at half the memory usage.

Vista-Leopard Look

DreamWeaver in Ubuntu


End of a great three months

First, I’m sorry for not posting anything in a while. I’ve been caught up with my Summer of Code final work and university work, and I’ve hardly had any time left. The Summer of Code final evaluation began on 20th August, and I tried to finish all goals before the deadline.

By the midterm evaluation, I had built the groundwork for the text engine. It had a solid structure and the ability to render inline tags by then. After the midterm evaluation I pushed myself a bit harder than in the first half of SoC and could finish all the goals I had set with my mentor.

Now FoieGras can render both inline tags and block tags. Rendering block tags took some thinking, and it was very satisfying to see it accomplished at last. Next came rendering images, which wasn’t as hard as implementing block tags, so I could finish it quickly. Meanwhile I realised that the text editor should have a mechanism to validate and render tag attributes. So I implemented a structure for attribute validation and rendering, in which you can define which attributes are valid under a given tag and what the valid values for each attribute are (similar to a DTD). As the final goal I had planned to finish FoieGras table support, which is a fairly complex task since the FoieGras table structure has a considerable number of attribute variations. Because of this, I focused only on implementing basic table support within the SoC period, and could finish that. This is the current status of the FoieGras text engine.
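To give a feel for the DTD-like attribute rules, here is a tiny illustrative sketch. It is not the FoieGras code: the rules table, the tag and attribute names and the validate_attributes function are all hypothetical, made up only to show the idea of "valid attributes per tag, valid values per attribute".

```python
# Hypothetical rules table: tag -> attribute -> allowed values (None = any value).
ATTRIBUTE_RULES = {
    "link": {"href": None, "type": {"help", "web"}},
    "media": {"src": None, "type": {"image", "video"}},
}


def validate_attributes(tag, attrs):
    """Return a list of problems found in attrs for the given tag."""
    rules = ATTRIBUTE_RULES.get(tag)
    if rules is None:
        return ["unknown tag: %s" % tag]
    problems = []
    for name, value in attrs.items():
        if name not in rules:
            problems.append("attribute %r is not valid under <%s>" % (name, tag))
        elif rules[name] is not None and value not in rules[name]:
            problems.append("value %r is not valid for %s.%s" % (value, tag, name))
    return problems
```

An empty list means the tag's attributes passed; anything else can be shown to the user or used to skip rendering that attribute.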

Even though the GSoC period is over, it really is just the beginning of the FoieGras project. Since this project started from scratch, I focused only on the basic needs of the text editor that will be required for a first release. For this release the two parts of the project will be merged into a complete text editor. I’m itching for that first release to see what the response will be, since that’s what we will reap from the effort we have put into the product.

These three months have really had an effect on me. They helped me get more in touch with open source development and contribute to it directly through Gnome, which is the most popular Linux desktop environment according to a survey. It was a great pleasure to work with such a community, which consists of a large bunch of very enthusiastic people.

Throughout the Summer of Code period and even before that, my mentor Don (Don Scorgie) has been a great help to me, and for that I’m indebted to him. He is also the one who directed me to this project at the beginning, when I was looking for projects. My partner in FoieGras throughout the Summer of Code period, Phenatic, and his mentor Shaunm have also been very supportive. Phenatic has implemented a cool application shell for FoieGras, and as I’ve mentioned before, we have to combine the application shell and the text engine for the first release. As a final remark, I’d like to thank everybody who has lent a hand to this project and made it possible for me to produce something useful (hopefully) for the FOSS community.

Screenshots of the current FoieGras text rendering status are given below.

The final look of FoieGras text rendering at the end of SoC


FoieGras: first look at the text rendering

About a month into the GSoC project, I was able to finish the first significant improvement from my side. For a long time I was wandering here and there, testing one thing or another to see what worked best, and finally got a breakthrough.

For now the FoieGras (the code name of the editor we are developing) text engine has these capabilities. It can render some given tags, but only inline tags. It can hide and show tags, so if the user wants to see tags and work with them, it’s possible. You can also add more tags and configure the tag styles, which are rendered using a configuration file. I think it’ll be the main configuration file for the text rendering functions. After phenatic releases the first UI part, we’ll be able to map the menu events to these text rendering functions and present the first release of FoieGras. 🙂

To tell you how FoieGras is implemented: it’s written in Python, with PyGTK as the wrapper for GTK to do the UI work. At first I planned to implement the text renderer using some thread system, with a thread taking care of each tag. But I soon realised that it was a waste of time and resources, so I thought of a better way. Now the text rendering is done through the XMLparser that comes with Python, and so far it’s working nicely.
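The general idea of parser-driven rendering can be sketched with the standard library's xml.parsers.expat module. This is only an illustration under my own assumptions: the tag names in INLINE_TAGS and the parse_spans function are hypothetical, and the actual FoieGras branch may use a different parser module and data structures. The sketch walks the markup once and records, for each run of text, which inline tags enclose it, which is the information a text widget needs to apply styles.

```python
import xml.parsers.expat

INLINE_TAGS = {"em", "code"}  # hypothetical set of inline tags to style


def parse_spans(markup):
    """Return a list of (text, active_inline_tags) spans for the markup."""
    spans = []
    stack = []  # tags currently open, outermost first

    def start(name, attrs):
        stack.append(name)

    def end(name):
        stack.pop()

    def chars(data):
        active = tuple(tag for tag in stack if tag in INLINE_TAGS)
        spans.append((data, active))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each text run in a single callback
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(markup, True)
    return spans
```

Each span could then be inserted into the text buffer with the styles configured for its active tags, with no need for one thread per tag.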

Also my mentor (Dr. Don Scorgie; yeah, he is now a doctor 🙂. Congrats Don!) is helping me with technical details and brilliant ideas. FoieGras now has a repository in GNOME svn, and I’ve created a branch for the tag rendering implementation and copied my work there, so you can check it out. In other news, I’m planning to go to GUADEC this coming 14th July; I hope it’ll be fun and that I’ll be able to discuss FoieGras more with the team and also hack on FoieGras there.




SoC Preparations

My Hackergotchi

About 3 weeks after being accepted into my SoC project, I thought of putting up a note on what’s going on with the project.

Actually there’s a fair amount of work going on with SoC already. Even though we are not to start coding until May 28, there are things that have to be done before coding starts.

We have been given an extra 2 months ahead of coding to better prepare for the project and get to know the community. It’s really good to have some time to understand the ethics, ways and workings of the new environment that we will be working in for the next 3 months, and maybe for quite some time after that. Most of the organizations seem to be using this extra time wisely, getting to know their SoC students and making them comfortable with the community.

Talking about Gnome, it’s really great to work with them. Gnome got 29 projects selected for this year’s SoC and they are giving great support to their students. After getting selected, the first thing I did was chat with my mentor and the other guy who will be doing the other part of the project. (Since the scope of this project seemed too large for the SoC timeline, Gnome divided the workload in two, giving me most of the GUI and maybe some widget work, and giving Phenatic most of the inner workings such as subversion integration, functions to send patches, more widgets, etc.) So the 4 of us (me, my mentor, Phenatic and his mentor) had a sort of group discussion on IRC. Then I subscribed to the Gnome SoC and developer mailing lists. The gnome-soc mailing list is where all the Gnome SoC students can raise their problems and report progress on their projects, and it can also be used as a meeting place. After that I sent my blog and Hackergotchi to Gnome, and I hope they’ll be integrated into Planet Gnome soon. Yeah, above is my Hackergotchi (I know it looks funny). And if anyone is interested in the FoieGras project, you can reach us through IRC, on the #doc or #gnome channels.

I also requested a new SVN account in Gnome, because having a separate account will make my work easier when coding. There are some strict policies for getting a new account in Gnome; they seem to take the security of the accounts very seriously, and I agree with that. We also have to give Gnome a sort of report on what we have done every week, starting from this Monday. Yeah, Gnome is getting a head start on the SoC projects, and I think it’s a good thing: it keeps us in touch and lets us correct mistakes ASAP before it’s too late.

And finally, on a personal note, I’ve been doing some Python and working with PyGTK, which will be used for the GUI parts of the project. I also bought a wireless router so I can move completely to Ubuntu. (By the way, if you haven’t tested Feisty, it has some good features; give it a try.) Hmm, eventful days these are, with the start of the SoC projects nearing.