Category Archives: Fun

Notes from Quora duplicate question pairs finding Kaggle competition

The Quora duplicate question pairs Kaggle competition ended a few months ago, and it was a great opportunity for NLP enthusiasts to try out all sorts of nerdy tools in their arsenals. These are just some notes jotted down from that experience.


Quora has over 100 million users visiting every month, and needs to identify duplicate questions submitted — an incident that should be very common with such a large user base. One interesting characteristic that differentiates it from other NLP tasks is the limited amount of context available in the title; in most cases this would amount to a few words.

Exploratory data analysis

The dataset is as simple as it can get: both the training and test sets consist of the two questions under consideration. Additionally, the training set has a few extra columns: one denoting whether the pair is a duplicate, and two more for the unique IDs of each question.

qid1, qid2 – unique ids of each question (only available in the training set)
question1, question2 – the full text of each question
is_duplicate – the target variable; set to 1 if question1 and question2 essentially have the same meaning; 0 otherwise.

Some quick stats:

  • Training set size – 404,290
  • Test set size – 2,345,796
  • Total training vocabulary – 8,944,593
  • Avg. word count per question – 11.06

A quick EDA reveals some interesting insights into the dataset.

  • Classes are not balanced.


Training class balance


In the training/validation set, the duplicate percentage (label 1) is ~36.9%. Since the class balance can influence some classifiers, this fact becomes useful when training models later.


  • Normalized unigram word shared counts can be a good feature
Shared unigram counts



Violin plot of shared word counts


When the shared word ratio (Jaccard similarity) is considered, this becomes even more prominent:

\begin{equation*} \frac{\lvert \textit{question1 words} \cap \textit{question2 words} \rvert}{\lvert \textit{question1 words} \cup \textit{question2 words} \rvert} \tag{1} \end{equation*}
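A minimal Python sketch of this ratio, assuming plain lowercase whitespace tokenization. The post does not say how the words were split, so `tokenize` is a hypothetical helper rather than the tokenizer actually used:

```python
def tokenize(question):
    # Assumption: lowercase and split on whitespace; the original tokenizer is not specified.
    return set(str(question).lower().split())


def shared_word_ratio(question1, question2):
    """Jaccard similarity between the word sets of two questions."""
    words1, words2 = tokenize(question1), tokenize(question2)
    union = words1 | words2
    if not union:
        return 0.0  # both questions are empty: define the ratio as 0
    return len(words1 & words2) / len(union)


ratio = shared_word_ratio("How do I learn Python", "How can I learn Python fast")
print(ratio)
```

The two example questions share four of their seven distinct words, so the ratio is 4/7.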

Violin plot of shared word ratio



The correlation between shared unigram counts and the class label further suggests that other n-grams could also serve as features in our model.

Arguably the best perk of being part of a Kaggle competition is the incredible community. Here are some in-depth EDAs carried out by some of its members:


Statistical modelling

XGBoost is a gradient boosting framework that has become massively popular, especially in the Kaggle community. The popularity is not undeserved: it has won many competitions in the past and is known for its versatility. So XGBoost was used as the primary model with the following parameters, selected based on performance on the validation set.

  objective = 'binary:logistic'
  eval_metric = 'logloss'
  eta = 0.11
  max_depth = 5

Before discussing the features used, there’s one neat trick that I believe everyone who did well in the competition used. After the first few submissions, it became apparent that something was off when comparing the validation results with the Kaggle leaderboard (LB). No matter how many folds were used for validation, the validation results weren’t reflected on the LB. This is because the class balance of the training set was considerably different from that of the test set, and the cost function (logloss) is sensitive to this imbalance. Specifically, around 37% of the training set were positive labels, while for the test set the figure was estimated to be around 16.5%. So some oversampling of the negatives in the training set was required to get a comparable result on the LB. More on oversampling can be found here and here.
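To make the imbalance argument concrete, here is a small Python sketch. It is not the resampling code used in the competition. It only scores two constant predictions on a toy set with 16.5% positives, and works out how much the negatives would have to be scaled up to move a 37% positive share down to 16.5%. The function names are hypothetical.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary cross-entropy (logloss)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


def negative_scale_factor(train_pos_rate, test_pos_rate):
    """Factor by which the negatives must grow so that the positive
    share of the training set matches the test set."""
    # pos / (pos + f * neg) = target  =>  f = pos * (1 - target) / (target * neg)
    pos, neg = train_pos_rate, 1.0 - train_pos_rate
    return pos * (1.0 - test_pos_rate) / (test_pos_rate * neg)


# Toy test set with 16.5% positives, scored with two constant predictions.
labels = [1] * 165 + [0] * 835
loss_train_prior = log_loss(labels, [0.37] * len(labels))
loss_test_prior = log_loss(labels, [0.165] * len(labels))
factor = negative_scale_factor(0.37, 0.165)
print(loss_train_prior, loss_test_prior, factor)
```

Predicting the training prior everywhere is penalised more than predicting the test prior, and the factor says the negatives need to be roughly three times as numerous to close the gap.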


From a bird’s eye view, features used can be categorised into three groups.

  1. Classical text mining features
  2. Embedded features
  3. Structural features

The following features can be categorised as classical text mining features.

  • Unigram word match count
  • Ratio of the shared count (against the total words in 2 questions)
  • Shared 2gram count
  • Ratio of sum of shared tf-idf score against the total weighted word score
  • Cosine distance
  • Jaccard similarity coefficient
  • Hamming distance
  • Word counts of q1, q2 and the difference (len(q1), len(q2), len(q1) – len(q2))
  • Caps count of q1, q2 and the difference
  • Character count of q1, q2 and difference
  • Average length of a word in q1 and q2
  • Q1 stopword ratio, Q2 stopword ratio, and the difference of ratios
  • Exactly same question?

Since a large portion of the sentence pairs are questions, many duplicate questions start with the same question word (which, what, how, etc.). So a few more features were used to indicate whether this applies.

  • Q1 starts with ‘how’, Q2 starts with ‘how’, and both questions have ‘how’ (3 separate features)
  • Same for the words ‘what’, ‘which’, ‘who’, ‘where’, ‘when’ and ‘why’
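These indicator features could be generated roughly as follows. The sketch assumes that "both" means both questions start with the same word, which is one reading of the list above. The feature names are made up for illustration.

```python
QUESTION_WORDS = ("how", "what", "which", "who", "where", "when", "why")


def starts_with(question, word):
    # 1 if the first whitespace-separated token equals the word, else 0.
    tokens = str(question).lower().split()
    return int(bool(tokens) and tokens[0] == word)


def question_word_features(q1, q2):
    """Three indicator features per question word (21 in total)."""
    features = {}
    for word in QUESTION_WORDS:
        q1_flag = starts_with(q1, word)
        q2_flag = starts_with(q2, word)
        features["q1_" + word] = q1_flag
        features["q2_" + word] = q2_flag
        # Assumption: "both" is the product of the two start flags.
        features[word + "_both"] = q1_flag * q2_flag
    return features


flags = question_word_features("How do I cook rice?", "How to cook rice")
print(flags)
```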

Some fuzzy features were generated with the script here, which in turn uses the fuzzywuzzy package:

  • Fuzzy WRatio
  • Fuzzy partial ratio
  • Fuzzy partial token set ratio
  • Fuzzy partial token sort ratio
  • Fuzzy qratio
  • Fuzzy token set ratio
  • Fuzzy token sort ratio
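To give a feel for what a token sort ratio measures, here is an approximation built on the standard library's `difflib` instead of fuzzywuzzy itself. It is a stand-in for illustration only: the normalisation step and the function names are assumptions, and the scores will not necessarily match the package's output.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    # Lowercase and replace anything that is not a letter or digit with a space.
    return re.sub(r"[^a-z0-9]+", " ", str(text).lower()).strip()


def simple_ratio(s1, s2):
    # Similarity of two strings on a 0-100 scale.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def approx_token_sort_ratio(q1, q2):
    """Compare two questions after sorting their tokens alphabetically,
    so that word order no longer affects the score."""
    sorted1 = " ".join(sorted(normalise(q1).split()))
    sorted2 = " ".join(sorted(normalise(q2).split()))
    return simple_ratio(sorted1, sorted2)


score = approx_token_sort_ratio("How do I learn Python?", "python how do I learn")
print(score)
```

Because the tokens are sorted before comparison, the two reordered questions in the example score as identical.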

As for the embedded features, Abhishek Thakur’s script did everything needed: it looks up a word2vec representation of each word using a model pre-trained on the Google News corpus, loaded with the gensim package. It then builds a sentence representation by summing the word vectors and normalizing the result to unit length.

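import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
# `model` is assumed to be the pre-trained word2vec model loaded with gensim.

def sent2vec(s):
  words = str(s).lower()
  words = word_tokenize(words)
  words = [w for w in words if w not in stop_words]
  words = [w for w in words if w.isalpha()]
  M = []
  for w in words:
    if w in model:  # skip out-of-vocabulary words
      M.append(model[w])
  M = np.array(M)
  v = M.sum(axis=0)  # sum of the word vectors
  return v / np.sqrt((v ** 2).sum())  # normalize to unit length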

Based on the vector representations of the sentences, the following distance features were generated by the same script.

Combined with these calculated features, full 300 dimension word2vec representations of each sentence were used for the final model. The raw vector addition required a large expansion of the AWS server I was using, but in hindsight brought little improvement.

Structural features caused much argument within the community. These aren’t meaningful NLP features, but the way the dataset was formed gave rise to some patterns within it. It’s doubtful whether these features would be of much use in a real-world scenario, but within the context of the competition they gave a clear boost, so I guess everyone used them regardless of whatever moral compunctions one might have had.

These features include,

  • Counting the number of questions shared between two sets formed by the two questions
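      from collections import defaultdict
      # Maps each question to the set of questions it has been paired with.
      q_dict = defaultdict(set)
      def build_intersects(row):
          # Assumes row[0] and row[1] hold the two questions of a pair.
          q_dict[row[0]].add(row[1])
          q_dict[row[1]].add(row[0])
      def count_intersect(row):
          # Number of questions that were paired with both questions of this row.
          return len(q_dict[row[0]] & q_dict[row[1]])
      df_train.apply(build_intersects, axis=1, raw=True)
      df_train.apply(count_intersect, axis=1, raw=True)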
  • Shared word counts, tf-idf and cosine scores within these sentence clusters
  • The page rank of each question (within the graph induced by questions as nodes and shared questions as edges)
  • Max k-cores of the above graph
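As a rough illustration of the graph-based features, the sketch below builds the question graph and computes each question's core number (the largest k for which it belongs to a k-core) using the usual "peel off the lowest-degree node" procedure. It assumes an edge joins the two questions of every pair. The function names are hypothetical and the code is written for clarity, not for a graph of this dataset's size.

```python
from collections import defaultdict


def build_graph(pairs):
    """Undirected graph: questions are nodes, each pair adds an edge."""
    graph = defaultdict(set)
    for q1, q2 in pairs:
        if q1 != q2:  # ignore self-pairs
            graph[q1].add(q2)
            graph[q2].add(q1)
    return graph


def core_numbers(graph):
    """Core number of every node, found by repeatedly removing the
    node with the smallest remaining degree."""
    degree = {node: len(neighbours) for node, neighbours in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])  # core numbers never decrease during peeling
        core[node] = k
        remaining.remove(node)
        for neighbour in graph[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


# A triangle (a, b, c) with one extra question d attached to a.
toy_graph = build_graph([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])
toy_cores = core_numbers(toy_graph)
print(toy_cores)
```

In the toy graph the triangle forms a 2-core, while the dangling question only reaches core number 1.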


The effect of the features on the final result can be summarized as follows.

  • With only classic NLP features (at XGBoost iteration 800):
    • Train-logloss:0.223248, eval-logloss:0.237988 (0.21861 on LB)

  • With both classic NLP + structural features (at XGBoost iteration 800):
    • Training accuracy: 0.929, Validation accuracy: 0.923
    • Train-logloss:0.17021,  eval-logloss:0.185971 (LB 0.16562)


  • With classic NLP + structural + embedded features (at XGBoost iteration 700):
    • Training accuracy: 0.938, Validation accuracy: 0.931
    • Train-logloss:0.149663, eval-logloss:0.1654 (LB 0.14754)

Rank-wise, this feature set and model reached the top 3% at one point, though it had slipped to the top 7% by the end due to my lethargic finish. But considering it was an individual effort, mostly against teams using several ensembled models, I guess it wasn’t bad. More than anything, it was great fun and a good opportunity to play with, and learn from, some of the best ML competitors in the Kaggle community/world.

PS: Some of the stuff that I wanted to try out, but didn’t get to:


My best 7 free JS modal boxes

I have been working on a new iteration of HOT (follow @hotelotravel for more info) for the last few weeks and thought of swapping the existing JQuery UI Dialog box for something a bit fancier yet solid (on the other hand, I may just have wanted a break from the usual PHP stuff and a chance to play with JQuery again after some time). I had some popular JS modal box names in mind to test, such as Lightbox, Facebox and Thickbox, and found a few new names along the way. There are certainly many more modal boxes out there that I’ve missed, not to mention that my requirements will differ from yours, but here is my take on each modal box I tested, so it may help someone pick the right one at the right moment.

JQuery UI Dialog

JQuery UI in Hotelotravel


This is the dialog box that I’ve used in most cases, and of course it’s great. A few years back when I started working with it I noticed a few issues, such as when closing the dialog, but by now they have all been fixed. It’s also continuously maintained by the JQuery community, so you can be sure it is solid. What I like most about it is its simplicity, along with the customization power it offers through option settings and various callbacks when you need more action (you can define what to do when a drag starts, a drag stops, the box shows up, the box closes, etc.).

This is all you have to do to get a simple dialog box if you have a div with the id of “myDialog”.


Another perk of this modal box is its file size, which is quite small (about 10KB, and the minimized version is about 6KB). When building a complex site with numerous CSS and JS files, the size of each file becomes crucial for keeping the page weight small, which reduces load time and saves server bandwidth. The only apparent downside of this box is that it doesn’t come with any fancy preloaded stuff (themes, effects, preloaded images, etc.), but anyone interested (and with a bit of JS and CSS knowledge) can customize it.


Light box


This is focused mainly on presenting pictures and does a good job at it. If you are interested in creating something like a picture gallery without touching much JS, this could be the ideal JQuery plugin for you. It is also relatively small, at around 19KB, with the packed version at about 6KB. But since I was looking for something with more raw customization power, this wasn’t the choice for me.




Facebox is another popular choice for a modal box, and it deserves the name. It has a very small file size and simple code. It also comes with a default theme and can be a convenient choice for hasty tasks or people 😛 But the downside I noticed is that it gives the user very little customizing power through JQuery code (which is also the case with Light box, btw).

It’s simple to a fault: you just have to set the class name of the link that pops up the box to “facebox”. It could have had a few more options, such as one to define the base path of the project; I couldn’t find a way to set it for pre-loading images without hacking the code, nor a way to give the width and height manually or pass callbacks. Also, this project seems to have been abandoned for a while now, so if you are considering adopting this box for your whole site, check it thoroughly.


ThickBox 3.1


ThickBox is a really cool JS box. Its simplicity and extremely small file size make it very adorable. This modal box gives some customizing power but still focuses mainly on simplicity and on naming links with the class “thickbox”, which is its magic word. However, it offers more customizing power through JS than FaceBox or LightBox, so it’s more flexible. With some hacking you can also add your own callbacks and options as you like.




This is an all-purpose, very fancy-looking modal box built on the Prototype library and a companion library. It can host all kinds of media types, even Flash clips, which is really impressive. But the obvious downside is its file size of about 60KB, which is huge considering the modal box will be just a small part of the whole; for general use it’s not tolerable IMHO. It is not compressed or packed by default, so you can manually minimize it using something like YUI Compressor to bring it down to around 30KB, but even then it’s too large for me. Still, it is ideal for a site that has tons of ajax stuff and needs a very customizable modal box, with lots of options and callbacks, that can be used throughout the site.




This is the smallest JQuery modal box I’ve seen, and if you need to customize your modal box to the extreme I’d suggest this one. The downside is that you will have to write a lot of code to get anything done with it, but its extremely small size compensates for that.




This is also a very good-looking JQuery-based modal box with a reasonable file size (about 14KB for the packed version), and it gives considerable customizing power to the user. It also comes with a packaged theme and all that, so you can use it easily without much coding, which is a plus. In fact, I had some trouble deciding whether to use this one or JQuery UI for my task, but finally settled on JQuery UI because of my familiarity with it and its community backing. So I guess in my case JQuery UI is the rightful winner 🙂


Drupal with Dreamweaver

I had to migrate and set up my whole workspace on my old desktop machine last week after my HP notebook broke down. While setting it up, I thought this little trick could help someone who is fond of Dreamweaver and looking for an IDE to code Drupal.

As a side note, one could accuse me of promoting unethical software, and in fact I don’t refute it (even though it is not my intention). I also know there are many great open source editors for coding PHP, like Eclipse-PDT and even Gedit, which can be turned into a very helpful IDE. On the other hand, I’m kind of addicted to DW after long years of working with it, and if someone who already has DW is looking for a way to use it for coding Drupal, this could be helpful.

The most frustrating thing when coding Drupal modules or themes with DW is that it doesn’t recognize them as PHP scripts, so it doesn’t offer the validation or auto-completion features it usually does for PHP scripts. Fixing this takes a little configuration.

First find the DW installation directory (on Windows it’s most probably in ~/Program Files; on Linux+wine it should be in your virtual Windows environment) and go to Configuration->DocumentTypes. There should be an XML file called MMDocumentTypes that holds the configuration for which language should be used for a given file format.

Find the line that says,
‘<documenttype id="PHP_MySQL" servermodel="PHP MySQL"….’
and depending on whether you are using Windows or Mac, append ‘,module,install,theme,inc’ to the existing winfileextension or macfileextension value.

So for windows it should look like this.

<documenttype id="PHP_MySQL" servermodel="PHP MySQL" internaltype="Dynamic" winfileextension="php,php3,php4,php5,module,install,theme,inc" />
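If you would rather script the change than edit the file by hand, something along these lines could do it. This is only a sketch: `add_extensions` is a hypothetical helper, it works on a single tag held in a string (here the tag as quoted in this post), and reading and writing the actual MMDocumentTypes file is left out.

```python
import re

EXTRA_EXTENSIONS = ["module", "install", "theme", "inc"]


def add_extensions(tag, attribute, extras=EXTRA_EXTENSIONS):
    """Append any missing extensions to a comma-separated attribute value."""
    pattern = re.compile(r'(%s=")([^"]*)(")' % re.escape(attribute))

    def extend(match):
        current = match.group(2).split(",")
        # Only add extensions that are not already listed.
        merged = current + [ext for ext in extras if ext not in current]
        return match.group(1) + ",".join(merged) + match.group(3)

    return pattern.sub(extend, tag, count=1)


original_tag = (
    '<documenttype id="PHP_MySQL" servermodel="PHP MySQL" '
    'internaltype="Dynamic" winfileextension="php,php3,php4,php5" />'
)
patched_tag = add_extensions(original_tag, "winfileextension")
print(patched_tag)
```

Running the helper a second time leaves the tag unchanged, so it is safe to re-apply. On a Mac the same call with "macfileextension" would apply.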

Now restart DW and open your module or theme file. Ta da! It should now work like a normal PHP script.

Hope this helps someone looking for a way to make Dreamweaver a more Drupal-friendly place.


PHP + Large files

I have been working on the Hotelotravel project for the last few months, and as usual it has involved working with large database files, because all the hotels, locations and images from around the world add up to a lot of data. But if we want to do large file uploads or database updates with PHP, a few default settings need to be changed. I’m putting this here as a note to myself (I keep forgetting this) as well as for anyone who may find it useful, for example when importing a large backup file through phpMyAdmin.

In your php.ini check for these settings and change them as you need.

  • post_max_size (The maximum size of post data you can send in one submission)
  • upload_max_filesize (Maximum size of file that can be uploaded)
  • memory_limit (Maximum memory limit that can be allocated for a script execution)
  • max_execution_time (Maximum time limit for a script execution)
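The size settings interact: the PHP manual advises that post_max_size should be larger than upload_max_filesize, and that memory_limit should generally be larger than post_max_size. A quick sanity check could look like the sketch below. The helper names and the example values are made up, and only the K/M/G shorthand suffixes are handled.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def to_bytes(value):
    """Convert a php.ini shorthand size such as '64M' to bytes."""
    value = str(value).strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def check_upload_limits(settings):
    """Return warnings for size settings that would block a large upload."""
    warnings = []
    upload = to_bytes(settings["upload_max_filesize"])
    post = to_bytes(settings["post_max_size"])
    memory = to_bytes(settings["memory_limit"])
    if post < upload:
        warnings.append("post_max_size is smaller than upload_max_filesize")
    # memory_limit = -1 means "no limit" in PHP, so skip the check then.
    if memory != -1 and memory < post:
        warnings.append("memory_limit is smaller than post_max_size")
    return warnings


# Hypothetical values, not recommendations.
bad = {"upload_max_filesize": "64M", "post_max_size": "8M", "memory_limit": "128M"}
good = {"upload_max_filesize": "64M", "post_max_size": "80M", "memory_limit": "128M"}
bad_warnings = check_upload_limits(bad)
good_warnings = check_upload_limits(good)
print(bad_warnings, good_warnings)
```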

As a side note, if you are trying to import large files (backups, etc.) through phpMyAdmin and it refuses, you may need to edit its configuration file and change these settings to 0, which means no limit.

  • $cfg['ExecTimeLimit']
  • $cfg['MemoryLimit']

As a final note, these settings are there for a purpose. So my advice is to change them however you want in a development environment, but be very careful when setting them in a production environment, because a script that runs endlessly can cause your servers to waste bandwidth and even crash. So I guess this is my disclaimer 😉


Closing Tributes

It’s a bit late for a closing tribute on SoC, but something is better than nothing, right? So here we go.

Google Summer of Code 2008 was the second SoC I participated in; my first was last year with Gnome. This time the SoC was with Eclipse, and it was in one word ‘Awesome’.

In more detail, the summer with the Eclipse community was a great experience, especially thanks to my mentor David Carver, who supported me in every sticky situation in the process. Believe me when I say that without his support it would have been a nightmare for me (one bad spot with Eclipse is its lack of up-to-date documentation in certain areas). So my heartfelt thanks go to David and everyone in the Eclipse community who helped me have an easy learning curve.

You can check all 2008 SoC Eclipse projects here. More info on the project I was working on, the Eclipse XQuery editor, can be found here. The code can be checked out from the Eclipse incubator repo here. We have also set up a download site for the plugin. If you are interested, please get a copy for yourself, check it out and let me know what you think of it.

Lastly, I’m hoping to continue the project and already have a long todo list prepared. My thanks also go to all who voted me in as an official committer for the project. The paperwork is currently being processed, and hopefully I will soon get CVS access to the Eclipse repo as an official committer.

Btw, I got my SoC shirt and certificate last week, and they will be great mementos of this summer with Eclipse. For all this great stuff, a big thank-you goes to Google and especially its Open Source division.

GSoC 2008 mementos
