
Notes from Quora duplicate question pairs finding Kaggle competition

The Quora duplicate question pairs Kaggle competition ended a few months ago, and it was a great opportunity for NLP enthusiasts to try out all sorts of nerdy tools in their arsenals. These are just some notes jotted down from that experience.


Quora has over 100 million users visiting every month and needs to identify duplicate questions, which should be very common with such a large user base. One interesting characteristic that differentiates this from other NLP tasks is the limited amount of context available in the question title; in most cases it amounts to just a few words.

Exploratory data analysis

The dataset is as simple as it can get: both the training and test sets consist of the two questions under consideration. Additionally, the training set has a few extra columns: one denoting whether the pair is a duplicate, and two more for the unique IDs of each question.

qid1, qid2 – unique ids of each question (only available in the training set)
question1, question2 – the full text of each question
is_duplicate – the target variable; set to 1 if question1 and question2 essentially have the same meaning; 0 otherwise.

Some quick stats:

  • Training set size – 404,290
  • Test set size – 2,345,796
  • Total training vocabulary – 8,944,593
  • Avg. word count per question – 11.06

A quick EDA reveals some interesting insights into the dataset.

  • Classes are not balanced.


Training class balance


In the training/validation set, the duplicate percentage (label 1) is ~36.9%. Since the class balance can influence some classifiers, this fact becomes useful when training models later.


  • Normalized unigram word shared counts can be a good feature
Shared unigram counts



Violin plot of shared word counts


When the shared word ratio (Jaccard similarity) is considered, this becomes even more prominent:

\begin{equation*} \frac{|\textit{question1 words} \cap \textit{question2 words}|}{|\textit{question1 words} \cup \textit{question2 words}|} \end{equation*}

Violin plot of shared word ratio



The correlation between shared unigram counts and the class label further suggests that other n-grams may also be useful as features in our model.
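
To make the two quantities above concrete, here is a minimal sketch of the shared unigram count and the shared word ratio. The function name and the plain whitespace tokenisation are my own assumptions for illustration, not the exact preprocessing used in the competition.

```python
def shared_word_features(question1, question2):
    """Return (shared unigram count, Jaccard similarity) for two questions."""
    # Lowercased whitespace tokens; a real pipeline would likely also strip
    # punctuation and stop words (an assumption, not stated in the post).
    words1 = set(question1.lower().split())
    words2 = set(question2.lower().split())
    shared = words1 & words2
    union = words1 | words2
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


count, ratio = shared_word_features(
    "How do I learn Python quickly",
    "How can I learn Python fast",
)
print(count, ratio)
```

Both values are cheap to compute per pair, which matters with a test set of over two million rows.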

Arguably the best perk of being part of a Kaggle competition is the incredible community. Here are some in-depth EDAs carried out by some of its members:


Statistical modelling

XGBoost is a gradient boosting framework that has become massively popular, especially in the Kaggle community. The popularity is not undeserved: it has won many competitions in the past and is known for its versatility. So XGBoost was used as the primary model with the following parameters, selected based on performance on the validation set.

  objective = 'binary:logistic'
  eval_metric = 'logloss'
  eta = 0.11
  max_depth = 5
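
As a rough sketch, these parameters could be passed to XGBoost’s Python API like this. The function name, argument names and the 50-round reporting interval are assumptions made for illustration; the 800 rounds simply mirror the iteration count quoted with the results further down.

```python
# Parameters as listed in the post; everything else is left at XGBoost defaults.
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.11,
    "max_depth": 5,
}


def train_model(x_train, y_train, x_valid, y_valid, num_rounds=800):
    """Train a booster and report logloss on both sets every 50 rounds."""
    import xgboost as xgb  # imported lazily so the dict above is usable alone

    d_train = xgb.DMatrix(x_train, label=y_train)
    d_valid = xgb.DMatrix(x_valid, label=y_valid)
    watchlist = [(d_train, "train"), (d_valid, "eval")]
    return xgb.train(params, d_train, num_rounds, evals=watchlist, verbose_eval=50)
```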

Before discussing the features used, there’s one neat trick that I believe everyone who did well in the competition used. After the first few submissions, it became apparent that something was wrong when comparing the validation results with the Kaggle leaderboard (LB). No matter how many folds were used for validation, the validation scores weren’t reflected on the LB. This is because the class balance of the training set differed considerably from that of the test set, and the cost function (logloss) is sensitive to this imbalance. Specifically, around 37% of the training set had positive labels, while for the test set the figure was estimated to be around 16.5%. So some oversampling of the negatives in the training set was required to get a comparable result on the LB. More on oversampling can be found here and here.
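
A minimal sketch of the rebalancing idea, assuming plain Python lists rather than the data frames actually used: solve pos / (pos + neg) = target rate for the number of negatives needed, then repeat the negatives until that count is reached. The function and variable names are hypothetical.

```python
import random


def rebalance_negatives(rows, labels, target_pos_rate, seed=0):
    """Resample negatives so positives make up roughly target_pos_rate."""
    rng = random.Random(seed)
    positives = [r for r, y in zip(rows, labels) if y == 1]
    negatives = [r for r, y in zip(rows, labels) if y == 0]

    # Solve pos / (pos + neg_needed) = target_pos_rate for neg_needed.
    neg_needed = round(len(positives) * (1 - target_pos_rate) / target_pos_rate)

    # Whole copies of the negative set, plus a random partial copy.
    full_copies, remainder = divmod(neg_needed, len(negatives))
    resampled = negatives * full_copies + rng.sample(negatives, remainder)

    new_rows = positives + resampled
    new_labels = [1] * len(positives) + [0] * len(resampled)
    return new_rows, new_labels


# Toy data with the training-set balance from the post (about 37% positive).
rows = list(range(1000))
labels = [1] * 370 + [0] * 630
new_rows, new_labels = rebalance_negatives(rows, labels, target_pos_rate=0.165)
print(sum(new_labels) / len(new_labels))
```

This assumes both classes are present. The same function undersamples if the target rate is higher than the current one.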


From a bird’s eye view, the features used can be categorised into three groups.

  1. Classical text mining features
  2. Embedded features
  3. Structural features

The following features can be categorised as classical text mining features.

  • Unigram word match count
  • Ratio of the shared count (against the total words in 2 questions)
  • Shared 2-gram count
  • Ratio of sum of shared tf-idf score against the total weighted word score
  • Cosine distance
  • Jaccard similarity coefficient
  • Hamming distance
  • Word counts of q1, q2 and the difference (len(q1), len(q2), len(q1) – len(q2))
  • Caps count of q1, q2 and the difference
  • Character count of q1, q2 and difference
  • Average length of a word in q1 and q2
  • Q1 stopword ratio, Q2 stopword ratio, and the difference of ratios
  • Exactly same question?
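
A few of the simpler features from this list can be sketched as below. The feature names and the whitespace tokenisation are illustrative assumptions; the tf-idf, cosine and Hamming features are left out to keep the example short.

```python
def surface_features(q1, q2):
    """A handful of the length and casing features from the list above."""
    w1, w2 = q1.split(), q2.split()

    def avg_word_len(words):
        return sum(len(w) for w in words) / len(words) if words else 0.0

    def caps_count(text):
        return sum(1 for ch in text if ch.isupper())

    return {
        "word_count_q1": len(w1),
        "word_count_q2": len(w2),
        "word_count_diff": len(w1) - len(w2),
        "char_count_q1": len(q1),
        "char_count_q2": len(q2),
        "char_count_diff": len(q1) - len(q2),
        "caps_count_q1": caps_count(q1),
        "caps_count_q2": caps_count(q2),
        "caps_count_diff": caps_count(q1) - caps_count(q2),
        "avg_word_len_q1": avg_word_len(w1),
        "avg_word_len_q2": avg_word_len(w2),
        "exactly_same": int(q1 == q2),
    }


features = surface_features("What is AI?", "What is machine learning?")
```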

Since a large portion of the sentence pairs are questions, many duplicate questions start with the same question word (which, what, how, etc.). So a few more features were used to indicate whether this applies.

  • Q1 starts with ‘how’, Q2 starts with ‘how’, and both questions have ‘how’ (3 separate features)
  • The same for the words ‘what’, ‘which’, ‘who’, ‘where’, ‘when’ and ‘why’
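
These flags can be generated in one loop. In this sketch I assume that “both questions have” means both questions start with the word; the feature names are made up for illustration.

```python
QUESTION_WORDS = ("how", "what", "which", "who", "where", "when", "why")


def question_word_features(q1, q2):
    """Three binary flags per question word: q1 starts, q2 starts, both."""
    first1 = q1.lower().split()[:1]
    first2 = q2.lower().split()[:1]
    features = {}
    for word in QUESTION_WORDS:
        q1_starts = int(first1 == [word])
        q2_starts = int(first2 == [word])
        features[f"q1_{word}"] = q1_starts
        features[f"q2_{word}"] = q2_starts
        # Assumption: "both" means both questions start with this word.
        features[f"both_{word}"] = q1_starts & q2_starts
    return features


flags = question_word_features("How do magnets work?", "How does a magnet work?")
```

With seven question words this yields 21 binary columns per pair.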

Some fuzzy features were generated with the script here, which in turn uses the fuzzywuzzy package:

  • Fuzzy WRatio
  • Fuzzy partial ratio
  • Fuzzy partial token set ratio
  • Fuzzy partial token sort ratio
  • Fuzzy qratio
  • Fuzzy token set ratio
  • Fuzzy token sort ratio
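
To give a feel for what these scores measure, here is a simplified stand-in built on the standard library’s difflib. It is not the fuzzywuzzy implementation: the package’s scorers do more preprocessing and its token set ratio is more elaborate, so the numbers will differ. The function names below are my own.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Compare the strings after sorting their lowercased tokens."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


def token_set_ratio(a, b):
    """Compare the sorted sets of distinct tokens (a simplified variant)."""
    set_a = " ".join(sorted(set(a.lower().split())))
    set_b = " ".join(sorted(set(b.lower().split())))
    return simple_ratio(set_a, set_b)
```

The token-based variants are useful for question pairs because word order and repetition often differ between duplicates.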

As for the embedded features, Abhishek Thakur’s script did everything needed: it generates a word2vec representation of each word using a word2vec model pre-trained on the Google News corpus, loaded with the gensim package. It then generates a sentence representation by summing the word vectors and normalizing the result to unit length.

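import numpy as np
from nltk import word_tokenize

# `model` (the word2vec vectors) and `stop_words` are defined elsewhere in the script.
def sent2vec(s):
  words = str(s).lower()
  words = word_tokenize(words)
  words = [w for w in words if w not in stop_words]
  words = [w for w in words if w.isalpha()]
  M = []
  for w in words:
    try:
      M.append(model[w])  # look up the word vector
    except KeyError:
      continue  # skip out-of-vocabulary words
  M = np.array(M)
  v = M.sum(axis=0)  # sum the word vectors
  return v / np.sqrt((v ** 2).sum())  # normalize to unit length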

Based on the vector representations of the sentences, the same script also generated a set of distance features.
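
The individual distances are not named here, so the sketch below picks three common ones (cosine, Euclidean and Manhattan) purely as an assumption to show the shape of such features. It uses plain Python lists in place of the 300-dimension vectors.

```python
import math


def distance_features(u, v):
    """Cosine, Euclidean and Manhattan distances between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return {
        # Assumes neither vector is all zeros.
        "cosine": 1.0 - dot / (norm_u * norm_v),
        "euclidean": math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))),
        "manhattan": sum(abs(a - b) for a, b in zip(u, v)),
    }


d = distance_features([1.0, 0.0], [0.0, 1.0])
```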

Combined with these calculated features, the full 300-dimension word2vec representations of each sentence were used for the final model. Adding the raw vectors required a large expansion of the AWS server I was using, but in hindsight brought little improvement.

Structural features have caused much argument within the community. These features aren’t meaningful NLP features, but the way the dataset was formed gave rise to some patterns within it. It’s doubtful whether these features would be of much use in a real-world scenario, but within the context of the competition they gave a clear boost, so I guess everyone used them regardless of whatever moral compunctions they might have had.

These features include,

  • Counting the number of questions shared between two sets formed by the two questions
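      from collections import defaultdict
      q_dict = defaultdict(set)
      def build_intersects(row): ...  # body not shown; records each question's partner in q_dict
      def count_intersect(row): ...   # body not shown; counts partners common to both questions
      df_train.apply(build_intersects, axis=1, raw=True)
      df_train.apply(count_intersect, axis=1, raw=True)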
  • Shared word counts, tf-idf and cosine scores within these sentence clusters
  • The page rank of each question (within the graph induced by questions as nodes and shared questions as edges)
  • Max k-cores of the above graph
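
The first of these features can be sketched without pandas as follows. This is my own minimal version working on a list of question pairs, not the code used in the competition; the function name is hypothetical.

```python
from collections import defaultdict


def neighbour_intersection_counts(pairs):
    """For each (q1, q2) pair, count the questions both have been paired with."""
    neighbours = defaultdict(set)
    # First pass: record, for every question, the questions it appears with.
    for q1, q2 in pairs:
        neighbours[q1].add(q2)
        neighbours[q2].add(q1)
    # Second pass: size of the overlap between the two neighbour sets.
    return [len(neighbours[q1] & neighbours[q2]) for q1, q2 in pairs]


pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
counts = neighbour_intersection_counts(pairs)
print(counts)
```

The same neighbour sets describe the question graph, so the page rank and k-core features can be computed from them as well.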


The effect of the features on the final result can be summarized as follows.

  • With only classic NLP features (at XGBoost iteration 800):
    • Train-logloss:0.223248, eval-logloss:0.237988 (LB 0.21861)

  • With both classic NLP + structural features (at XGBoost iteration 800):
    • Training accuracy: 0.929, Validation accuracy: 0.923
    • Train-logloss:0.17021,  eval-logloss:0.185971 (LB 0.16562)


  • With classic NLP + structural + embedded features (at XGBoost iteration 700):
    • Training accuracy: 0.938, Validation accuracy: 0.931
    • Train-logloss:0.149663, eval-logloss:0.1654 (LB 0.14754)

Rank-wise, this feature set and model reached the top 3% at one point, though it slipped to the top 7% by the end due to my lethargic finish. But considering it was an individual effort against mostly teams running several ensemble models, I guess it wasn’t bad. More than anything, it was great fun and a good opportunity to compete with, and learn from, some of the best ML competitors in the Kaggle community/world.

PS: Some of the stuff that I wanted to try out, but didn’t get to:


The necessity of lifelong learning

Live as if you were to die tomorrow. Learn as if you were to live forever.
― Mahatma Gandhi

The term “lifelong learning” sounds nonsensical when you consider that learning from experience is an intrinsic function built into all humans and animals. But today, in the context of rapid advances in AI and automation, the term carries a different meaning. This is an attempt to discuss why it’s increasingly needed today, and to encourage everyone to start actively learning and expanding their horizons if they haven’t already.

The pace of technological advancement

The consensus is that what you learn today will be out of date within 5-10 years. By that argument alone, it’s a no-brainer that we should keep learning. The pace of advance is almost tangible in technical fields, and not taking the time to update yourself would be a critical career mistake. Since my experience is in computer science, this post focuses on CS, but I believe it holds true for most other areas as well.

I doubt there’s any other field that’s advancing as fast as CS at the moment (definitely subjective :)). Most of us working in the field acknowledge this fact and accept the challenge, and even call it an endearing quality. At any rate, a change of tools is expected every 5-10 years in CS, so this shouldn’t be anything new. However, just changing tools will not be enough if you want to get into emerging CS topics such as the Internet of Things (IoT), Software Defined Networking (SDN), deep learning, etc. Here online courses can help in two ways.

1. You will probably need more maths and/or computer science fundamentals such as operating systems, networks, algorithms, etc. This is where MOOCs, and especially Khan Academy, can be of great help: they let us revise old maths lectures and fundamentals.

2. Once in a while there are wonderful offerings on such emerging topics by pioneering researchers, and usually these courses are awesome.

Automation and consequences

Marc Andreessen famously wrote some time ago that software is eating the world; now it’s probably time to say specifically that artificial intelligence is eating the world, or at least it’s going to. With ever-increasing computational power and lifelong efforts by some great scientists, today we are seeing very exciting advances on a weekly basis. Even though it took self-driving cars and Watson to bring AI to the mainstream, AI has been here for almost as long as the computer itself. Since the coining of the term in 1956, it has gone through various stages of evolution. From the golden era of logic-based reasoning, to perceptrons and the subsequent AI winter, through to the advent of neural networks and the current deep learning frenzy: AI has indeed come a long way.

There’s no question that this wave of AI and automation is going to affect the way we work. The question is how much it’s going to change, and whether we really need to worry. After all, during the last century the world saw some major revolutions in the way humans work, so why should this be any different? With every major disruptive innovation, traditional jobs have expired and new jobs have been created.

One main difference I see with AI-based automation is that it’s not trying to emulate a single function, as has traditionally been the case. For example, the moves from horse-drawn carriages to automobiles, or from paper to digital media, revolutionized human civilization as we know it. But each of these cases was limited to one specific area. What’s happening today with AI is that it’s trying to emulate skills that have been intrinsically marked as human territory, and to do so with human-level precision: cognition and decision making key among them. With such faculties being outsourced to machines, there’s no telling how widespread the effect will be.

While machine learning researchers caution the world to brace for waves of mass unemployment, some are of the opinion that the effect will be similar to disruptions of the past. While I agree with the former school of thought, I doubt anyone has a good estimate. This is probably why the White House policy paper on AI discusses both overestimated and underestimated influences. Indeed some effects are quite unexpected. But looking at how things are going, we can already see that some industries, like transportation, are due for a rude disruption. Here is another estimate of which types of jobs are more prone to being taken over. It can be expected that single-skill jobs will continue to decay, while jobs that require social or maths skills will remain largely unaffected or see more demand.

In summary, I think we can all agree that this wave of AI is going to affect how we work, and as the wise say, it’s better to be safe than sorry. If you still think this is in the far future, it’s time to think again.

Technology domain is interconnected

Again, this is mostly with regard to computer science, but it may hold true in other fields as well. Today, to get some meaningful work done, you usually need to tread across at least a few disciplines. If you are a software engineer, it’s not enough to know the fundamentals and a few languages; depending on your flavour, you may also need systems, embedded systems, etc., or distributed systems, web security, big data and the like. If you are into data science (a cross-discipline to begin with), there’s no escaping from learning everything from statistics to CS and everything in between! Each of these fields is vast on its own and advances rapidly, just like most areas in CS. In that sense, the words “Try to learn something about everything and everything about something” are more apt today than at any other time.

With such a large scope to draw from and a rapidly advancing industry, I doubt any traditional college can satisfy the need, no matter how good the degree program is. Fortunately, today we don’t have to look beyond our browser to learn whatever topic we need, and the only question is whether we are ready to expand our horizons.

A modicum of balance to a knowledge driven world

With the ever-persistent brain drain from developing countries and today’s demand for knowledge-driven industries, most countries are at a severe disadvantage. With the imminent wave of automation, such an overwhelmingly biased world doesn’t look promising to begin with. Luckily, some very wise people, who also happen to be leading machine learning researchers, kicked off today’s online learning initiative in parallel with the rise of AI (this is not in any way discounting the wonderful service rendered by MIT OpenCourseWare prior to the arrival of MOOCs). So it’s not an exaggeration to call such learning initiatives great equalizers in education and a step towards improving the world’s future living standards. As with everything else today, some of them are becoming increasingly money-driven, but they have still started something that could change the world for the better.

What should we learn

A little humble bragging: I was an early adopter of MOOCs (as they were later called) in 2011, and finished both Prof. Andrew Ng’s first online machine learning course, which went on to become Coursera, and the first Intro to Artificial Intelligence course by Prof. Sebastian Thrun and Peter Norvig, which was the start of Udacity. Since then I have taken part in many courses but, as is the norm with MOOCs, in truth finished only a dozen or so. Anyway, I’d say I have about as good a rapport with MOOCs as you can get, and would like to share a few tips based solely on my subjective experience.

When it comes to learning, you can spend time on lots of very similar things but gain very little in return. In that sense, the classic “Teach Yourself Programming in Ten Years” by Peter Norvig is something everyone should read on what to learn.

Another lesson I learnt is that even though courses are free and limitless, your time is not. So even when a course is really interesting, I now take time to decide carefully whether it’ll help me expand my knowledge in something I really need. Also, rather than trying to keep up with a bunch of courses at once and not getting anything fully done, it is far better to restrict yourself to a few, depending on your schedule, and concentrate fully on them. Again, this is a no-brainer, but our impulse is to grab everything free.

Another recent development is that all the online services are introducing specializations and mini-degree programs. I have doubts whether this is the best way to go from a learner’s point of view. One of the advantages of online learning is that you are not restricted by any institutional rules in selecting what to learn and from where. But with these mini-degree programs, we are again bringing traditional restrictions into learning. Instead I’d prefer to select my own meal and, if the courses are really good, pay for them or audit until I’m convinced. But again, this is very much subjective.

In conclusion, learning is an intrinsic function built into everyone. But with this new world order, learning has turned into a fast lane, and if we don’t keep up with the speed, the world may move forward leaving us stranded.

Machine Learning in SaaS paradigm

Our ultimate objective is to make programs that learn from their experience as effectively as humans do. We shall…say that a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.

John McCarthy, “Programs with Common Sense”, 1958

Machine learning has been, and still is, a large research area in academic circles. But in the last decade or so it has made heavy inroads into the practical world of the tech industry, and today it’s no secret that most of the large players use numerous machine learning techniques to enhance various aspects of their workflows. In this post I’m hoping to look at a few ways a SaaS application (presumably run by a startup) can use machine learning to enhance its overall experience.

Personalise user experience

In your experience of using SaaS products, how many times have you found that your favourite item is at the bottom of the list and you have to scroll half a mile or navigate through several layers of menus? Usability usually favours the majority, and you may discover the bitterness of being stuck in the minority.

If you go into the same coffee shop every morning and buy the same drink, then if they are any good at their business they should know your preferences after a few days, and you shouldn’t have to go through the ordering routine every day. It may be just that the shop owner confirms “Same as usual?” and that’s it. So if your app is a bit more intelligent (good at its business), it could do the same and not irritate users by dropping their most used features to the bottom of the page and having them crawl over the page every time they use your app. However, to be on the safe side, just as with the coffee shop owner’s confirmation, you may need to give the user a setting to confirm whether they want the app to learn their behaviour and adapt.
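
The simplest version of this idea needs no machine learning at all, only usage counts. The sketch below is a hypothetical illustration: the function name, the menu items and the opt-in flag are all made up for the example.

```python
from collections import Counter


def personalised_order(menu_items, usage_log, enabled=True):
    """Order menu items by how often this user has used them."""
    if not enabled:  # the user opted out of adaptive ordering
        return list(menu_items)
    counts = Counter(usage_log)
    # sorted() is stable, so unused items keep their default relative order.
    return sorted(menu_items, key=lambda item: -counts[item])


menu = ["Dashboard", "Invoices", "Reports", "Settings"]
log = ["Reports", "Reports", "Invoices", "Reports"]
print(personalised_order(menu, log))
```

The `enabled` flag plays the role of the “Same as usual?” confirmation: the default layout is kept unless the user agrees to the adaptive one.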

Reduce your support requests

Say you have a hot product on your hands and it’s getting more and more traction. If you have experienced this situation, one thing you won’t have missed is the number of support requests sky-rocketing in parallel with the hotness of your product. Given that startups have limited manpower, there’s no need to emphasise the importance of your team’s man-hours, and the choice between spending them on easily avoidable support issues or on something more useful, like fixing bugs and adding new features. The textbook remedy in this situation would be to evaluate the usability of your app, which is certainly a good option, but it doesn’t hurt to make your app a bit more intelligent so it can identify obvious pitfalls.

If you notice someone in your neighbourhood wandering back and forth, looking up and down, wouldn’t you assume they are lost and offer your help? Taking a leaf out of this book, your app can do the same and be kind enough to identify a stranger wandering through your app and offer help. Not only will you be saving your team’s man-hours, but you will be saving the user’s precious time and, as an added bonus, impressing the user even more with your product.

Make search intelligent

Search is a window onto your application data, and improving the quality of search will directly influence the user experience. Rather than making the user guess which keywords their target content is indexed under, what if your application were good at identifying the user intent behind the search? That would certainly be the icing on top of your search functionality. To make it even better, mix in some fuzziness to auto-correct a search term when there’s an obvious error. Of course all this is easier said than done, and not every company is a Google. But you can take the initiative by analysing search terms to identify weak spots, addressing those first, and moving forward as a series of minor experimental optimisations.
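
As a small starting point for the fuzziness mentioned above, Python’s standard difflib can suggest the closest indexed term for a misspelt query. The function name, the sample terms and the 0.8 cutoff are assumptions for illustration; a production search would need something more robust.

```python
from difflib import get_close_matches


def suggest_term(query, indexed_terms, cutoff=0.8):
    """Return the closest indexed term for a likely misspelt query, else None."""
    if query in indexed_terms:
        return query
    matches = get_close_matches(query, indexed_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


terms = ["invoice", "payment", "subscription", "refund"]
print(suggest_term("invoise", terms))
```

Queries for which this returns None are good candidates for the weak spot analysis, since they show what users look for but cannot find.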

On another note, a good application-wide search will greatly help answer most questions your users may have. From experience, most support questions are recurring in nature, so if users can easily find answers in your community forum or support articles, it will help lighten your support inbox.

Finding more details about users

Well, this is more of a grey area. Gossiping about others’ juicy dirty secrets is usually frowned upon, but a little awareness of what’s going on around you can be useful and even healthy. Most large companies are already digging into your day-to-day buying patterns to better target you, but how deep you dig into user information (or whether you abstain from it) is certainly up to you. One way to look at this is how you treat advertisements: as long as they are relevant and useful in achieving your goal, you won’t mind them. But the second they fall below your requirements and become nagging, they are a nuisance and spam. Likewise, if you can give users a coupon they can’t ignore, it’s likely they won’t mind, and you can comfort your conscience by thinking you are doing a service rather than snooping around.

Gauging user reaction to new features

It’s normal practice for SaaS apps to use a simple voting system to find out which new features are most desired. Usually what happens is that app admins put up a set of features they feel are important and users vote on them. But by taking this one step further and crowdsourcing, you can learn what users really need and at the same time learn more about your users. Of course collaborative filtering is not a new technology, and most social sites use it to rate new items and to learn the preferences of new users. So even though you are not running a social network, you can still use it to learn attributes of your user base such as technical savviness, desire for automation, etc.


This only sums up some of the more obvious situations where machine learning techniques can play a part in improving a SaaS application. It certainly is an exciting field, one I’m trying to get a grasp of as a passive interest, and I’m hoping to carry out experiments to learn the applicability of various theories. It would be exciting to hear more ideas and how well they have worked, so please feel free to share them here.

Some pointers to get started/keep an eye on: