Wednesday, December 16, 2015

Rossmann Store Sales


The Rossmann Store Sales competition was held on Kaggle in Nov-Dec, 2015.

Objective
Rossmann operates over 3000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

The objective was to predict the sales of various Rossmann stores in Germany.

Data
Train data consisted of sales from over 1000 Rossmann stores, along with information related to promotions, competition, holidays, etc. up to July 2015.

Test data consisted of dates in August and September, 2015 for which we had to predict the sales.

Approach
This being a classic time-series sales forecasting problem, I explored two approaches: the standard route of building tree-based and linear models, and trying out time-series models like ARIMA.

It quickly became evident from cross-validation and validation results that ARIMA wasn't working; XGBoost was giving much better results.

There was a lot of external data shared and available, but none of those made a big improvement in the model. My final model didn't use any external data either.

Building models at a store level was not giving results as good as a single model built on all the data together, but it helped while blending models.

Model
I built multiple XGBoost models on different subsets of the entire data and averaged them. I merged these with store-level XGBoost, Random Forest and GBM models. The blending of models gave a huge improvement and ultimately led to the stability of the predictions.
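The blending step can be sketched as a weighted average of prediction vectors. This is a minimal sketch only: the model names, values and equal weights below are illustrative, not the actual competition models or weights.

```python
import numpy as np

# Hypothetical predictions from three models on the same validation rows;
# names and numbers are purely illustrative.
preds = {
    "xgb_all":    np.array([6200.0, 7100.0, 5800.0]),
    "xgb_subset": np.array([6350.0, 6900.0, 5950.0]),
    "store_rf":   np.array([6100.0, 7300.0, 5700.0]),
}

def blend(pred_dict, weights=None):
    """Weighted average of model predictions (equal weights by default)."""
    mat = np.vstack(list(pred_dict.values()))
    if weights is None:
        weights = np.full(mat.shape[0], 1.0 / mat.shape[0])
    return weights @ mat

blended = blend(preds)
```

A final multiplicative scaling (like the 0.98 factor mentioned below) would just be `blended * 0.98` applied to the blended vector.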

I finally tweaked the predictions with a multiplicative factor of 0.98 to get the best fit to the LB.

I usually share my code on GitHub, but this time I decided against it, since I haven't done anything extraordinary or special.

Results
My model gave an RMSPE of just below 0.10 on the public LB (rank 66) and an RMSPE of just below 0.11 on the private LB (in fact, I scored 0.10999!), which ranked me 14th out of 3303 teams.

A lucky jump thanks to choosing a stable model, resulting in my best individual performance on Kaggle to date, improving on my 14th rank out of 2256 teams in the TFI competition.

View Public LB
View Final Results

Views
It was a tricky contest, mainly due to the nature of the public and private LB split. It was overwhelming to see so much external data being shared and used. Maybe under other circumstances, this could have played a much more important role.

Congratulations to the winner, Gert, who performed fantastically: way ahead of the pack on the public LB with very few submissions, and stable enough to win on the private LB, again with a big lead.

So, I gained some good points from this contest, and moved to 111th in overall Kaggle rankings. My year-end goal was to be in Top-100. I'm close, and with the Walmart contest left, I might just make it.

Check out My Best Kaggle Performances

Monday, November 23, 2015

Black Friday Data Hack


AnalyticsVidhya organized a weekend hackathon called Black Friday Data Hack, which was held on 20th-22nd November, 2015.

Black Friday is actually the following weekend, but that's when we have to relax and enjoy :-)

The last hackathon was quite disappointing due to the randomness in the data and the evaluation metric. I was hoping this one would be better.
And it was. Much better.

Problem
The challenge was to predict the purchase amount of various products by users across categories given historic data of purchase amounts.

Data
In general, more data is always better. The train data had ~5.5 lakh observations and the test data had ~2.3 lakh observations. The data was very clean, and it feels wonderful to work on such datasets.

The data consisted of users' product purchases along with the purchase amounts. The products had data on three types of categories. The users had data on their age, gender, city, occupation, locality and marital status.

We were to build our models on the train data and score the test data which had pairs of user-product not present in the train data. The evaluation metric was RMSE, which also seemed a very appropriate choice for this problem.

Approach
I spent the first few hours just exploring the data, summarizing variables, plotting graphs, playing around with pivots and in parallel building base models (of course, XGBoost).

On the first day, I was able to go below 2500 with an optimized XGBoost model on raw features. It got me into the Top-3 and since then I've managed to maintain a position in the Top-5.

While checking the variable importance of my XGBoost, I found Product_ID was the most important variable and intuitively it made sense. So, I just submitted the average purchase amount of each product and voila! it scored 2682, which didn't seem like a very bad score. So, all those of you who couldn't cross 2682, here's a simple solution you missed.
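The product-average baseline is simple enough to sketch in a few lines. This is an illustrative stand-in (toy rows, made-up IDs and amounts), not the competition data.

```python
from collections import defaultdict

# Toy train rows: (user_id, product_id, purchase amount) -- illustrative only.
train = [
    ("u1", "p1", 8000), ("u2", "p1", 9000),
    ("u1", "p2", 1500), ("u3", "p2", 2500), ("u2", "p2", 2000),
]

# Average purchase amount per product
sums, counts = defaultdict(float), defaultdict(int)
for _, pid, amt in train:
    sums[pid] += amt
    counts[pid] += 1
product_mean = {pid: sums[pid] / counts[pid] for pid in sums}

# Overall mean as a fallback for products unseen in train
overall_mean = sum(amt for _, _, amt in train) / len(train)

def predict(product_id):
    """Score a test row with its product's train-set average purchase."""
    return product_mean.get(product_id, overall_mean)
```

Submitting `predict(product_id)` for every test row is exactly the "average purchase amount of each product" baseline described above.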

Usually ensembles win competitions, but since I couldn't get any model close to the performance of XGB, I decided to challenge myself to build a single powerful model. Which means: feature engineering.
These two days gave me some wonderful insights into how powerful feature engineering is. With some analysis, gut feel, trying and cross-validating, here is the final set of features I used:

Model
User_ID: Used as a raw feature

User_Count: Number of observations of the user

Gender: Converted to binary

Age: Converted to numeric

Marital Status: Used as raw feature

Occupation: Used as raw feature

City Category: One-hot encoded features

Stay In Current City: Converted to numeric

Product Category 1, 2, 3: Used as raw feature

Product_Count: Number of observations of the product

Product_Mean: Average purchase amount of product

User_High: Proportion of times the user purchases products at a higher amount than the average purchase amount of the product
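The count- and aggregate-style features above can be sketched with pandas `groupby().transform()`. The tiny DataFrame below is made up for illustration; the real competition columns and values differ, and this is not the author's actual code (that is on GitHub, linked below).

```python
import pandas as pd

# Toy purchase data; column names follow the feature list above.
df = pd.DataFrame({
    "User_ID":    ["u1", "u1", "u2", "u2", "u3"],
    "Product_ID": ["p1", "p2", "p1", "p2", "p2"],
    "Purchase":   [9000, 1800, 8000, 2200, 2000],
})

# User_Count / Product_Count: number of observations per user / product
df["User_Count"] = df.groupby("User_ID")["User_ID"].transform("count")
df["Product_Count"] = df.groupby("Product_ID")["Product_ID"].transform("count")

# Product_Mean: average purchase amount of the product
df["Product_Mean"] = df.groupby("Product_ID")["Purchase"].transform("mean")

# User_High: proportion of the user's purchases above the product's average
df["Above_Mean"] = (df["Purchase"] > df["Product_Mean"]).astype(int)
df["User_High"] = df.groupby("User_ID")["Above_Mean"].transform("mean")
```

Note that when scoring real test data, these aggregates would be computed on train only and merged onto the test rows, to avoid leakage.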

I built an XGBoost with these features, and the code is open-sourced on GitHub; the link is given below.

One very interesting feature I built was
F_Prop: Average purchase amount of product by female users / Average purchase amount of product by male users

This was among the top-3 important variables and gave a CV of ~2419, but the LB remained very similar at ~2430, so I wasn't sure about it. I decided to go without it.

GitHub
View GitHub Repository

Results
This model gave me CV score of ~ 2425 and public LB score of 2428. I was 4th on the public LB, with Jeeban, Nalin and Sudalai in the Top-3. And we finished in the same positions with my final rank being 4th in the private LB.

Views
This is one of the best datasets I've worked on in a while. The CV and LB scores were perfectly in sync, and it was very satisfying to build features and improve the CV as well as LB scores. I'm happy with my performance, as I managed to squeeze quite a lot out of the data with a single model.

I might have done better with an ensemble, but I just couldn't get anything to work well. And after a while, I was just too tired.

Overall, a great weekend, mostly spent on my laptop. For those of you who had memory issues, I worked on my 4GB MacBook Air throughout the weekend. Algorithms and models will advance and become optimized every day, but the power of building good features is still in the hands of Data Scientists like us.
Make the most of it until the machines come and take over ;-)

Thanks to all the folks at AnalyticsVidhya for organizing this hackathon. A big thumbs up from me.

Looking forward to the next Hackathon, and hope it gets better and more competitive.

External Links
View Other Players' Approaches on AnalyticsVidhya
View 3rd place solution code on GitHub by Sudalai Raj Kumar
View 5th place solution code on GitHub by Aayush Agrawal
 

Tuesday, October 6, 2015

World Sudoku Championship 2015


The 10th World Sudoku Championship was held on 11th-15th Oct, 2015 in Sofia, Bulgaria.

Championship Page
Download Instruction Booklet

The Indian Team was selected from the Indian Sudoku Championship and Times Sudoku Championship 2015, where Prasanna Seshadri, Rishi Puri, Kishore Kumar and I form the A-Team.

Prasanna Seshadri, Rishi Puri, Kishore Kumar and Me

"We are the best four players of the country, as we finished in the top-4 of ISC as well as TSC. It looks like a strong team with Prasanna and Rishi in good form, being finalists in the Sudoku GP and Kishore being consistent among the Indian circuit.

The Indian team stood 6th last year, which is our best performance ever, and I hope we can break into the top-5 this year. That's our goal and it's probably our best shot at it in the near future, since Rishi is not planning to continue as an active participant from next year. It will be hard to find a replacement for someone at Rishi's level, but we'll hope for the best.

On a personal level, I'm aiming for a Top-10 finish. It's been a disappointing couple of years, where I stood 14th last year and 16th the year before.
I haven't been in the best of form lately, with some career changes and travelling going on, but I'm hoping to give my best and make India proud!"


All of us travelled separately. Prasanna and Rishi had to reach a day early for the GP playoffs; I was flying from Mumbai, Rishi from Hyderabad and Kishore from Greece. Well, so much for 'best team'.

The instruction booklet just looked like a shadow of WSC 2014 in London: very similar structure, rounds and format. I was surprised, but then realized that the main authors of the WSC are Richard Stolk (Netherlands) and Yuhei Kusui (Japan), who might've been called in at the last moment to save the event.

Being a big fan of Stolk's sudokus, I was looking forward to the championship.


The rounds and points (My points vs Highest points) were as follows:

Round 1: Classics (265 vs 330)
Classics! This round went fairly smoothly; I solved in reverse, attacking the hard ones first, and it paid off. I scored 265 and thought that was a good start.

Round 2: Assorted (410 vs 485)
Assorted sudoku variants. That's when the aroma of Richard started. Beautiful sudokus, enjoyable round and I did well.

Round 3: Assorted (395 vs 700)
This was a bad round. I broke Inner Frame and Sum Frame, and was generally slow in a couple of other puzzles. It broke my flow and I dropped a few places.
I spent far too much time on Max Triplet (which was an excellent puzzle).

Round 4: Straight (265 vs 285)
It was expected to be a simple puzzle. It was nice, mostly got solved using row and column non-repetition. Traditionally, I've done well on such rounds (reminded me of the WSC 2012 where the Overlapping round got me into the playoffs), and I'm glad I could finish it fairly smoothly.

Round 5: Assorted (640 vs 750)
The big round! I've messed up the big round in the previous two WSCs and I really really wanted this time to be different. It was. I solved in a very nice flow, cracking one sudoku after another, without a glitch. I got a Classic wrong at the end, but still, it was a solid score, that boosted me up a few ranks.

Round 6: Assorted (375 vs 465)
The dreadful round of irregular variants. It was surprisingly good. I took the safe way, solving the easy and medium ones and leaving out the hard ones. Worked. And worked well.

So, that was the end of the Individual Rounds for Day-1. When the results came out, it was a pleasant surprise to see myself in a solid 5th position. I felt I was closing in on my dream to get into the top-5 this year.

Round 7 (Team): Relay
The team round was interesting. The sudokus were nice, and we were hoping to finish the round. Kishore and Rishi got stuck on the Irregulars. Rishi gave up on his, and Kishore didn't manage to finish his either. Prasanna had to guess on his last grid (since Rishi's Irregular relay didn't come through), which went wrong.
And to make it worse, I left two cells of Extra Region as pencilmarks, losing chunks of points :-(

The Great Indian Team Round Debacles continue... year after year.

Round 8: Zodiac (275 vs 625)
Ahh, I feel like kicking myself. This was the only big round on Day 2, and even with a mediocre performance I would've maintained my top-10 position. But it was not to be. I broke two Arrow sudokus during the round and was never able to recover from that. To make things disastrous, I swapped 6 cells in the highest pointer, Gemini, which screwed up my round completely, and I fell way below 10th.

Such a disappointment. We were on track to see two Indians in the playoffs for the first time, but I messed up. Thankfully, Prasanna maintained his calm and managed to be joint 8th before playoffs.

Round 9: Multi Sudoku (140 vs 170)
I was feeling so low... and fortunately this was a low scoring round. I felt like my hands were moving in slow motion during the solve, but it wasn't too bad at the end.

So, that completed the Individual rounds of WSC 2015. I finished 14th (same rank as last year), and certainly could've done better. Maybe next year.
But that also adds to me being in Top-20 in the last 6 WSCs. Only once in Top-10 :-(

Prasanna finished 8th and made it to the playoffs, so there was something to look forward to.

Round 10 (Team): X-Killer
This was a round that most teams were looking forward to. But the organizers cancelled it due to technical issues.
Disappointing, since we had practised this well. In fact, we hosted the practice set as a contest on LMI: X-Killer
Wonderful sudokus by Deb Mohanty.

Round 11 (Team): Fractal
With our first team round going bad and the second being cancelled, we had to do well on this one. It was a nice, simple linked multi-sudoku, where each of us started solving from one of the four corners and came to the centre. We finished fairly quickly, but... The Great Indian Team Round Debacle hit us again! We swapped two digits in Kishore's corner and lost 'a lot' of points, including all the bonus.


Playoffs
Nice to see Prasanna Seshadri in the playoffs; we had an Indian there for the first time since my playoffs in 2012. He surely is a crowd-entertainer, with a phenomenal performance in the first leg of the playoffs, where he finished first among the four, taking him into the second leg with a guaranteed 7th place.
The second leg was hard, with Bastien in 4th and having a time advantage. Bastien managed to win the leg to join Kota, Tiit and Jakub for the final leg.

Well, the same four finalists of WSC 2014 battled it out again in WSC 2015. Kota, having a big lead and time advantage, raced through the playoffs, winning his second WSC crown on the trot. Tiit came in second and Jakub third, all as expected. The playoffs weren't really exciting.
Congrats to Kota, Tiit, Jakub for the podium finishes.

Download Complete Results

Tiit Vunk (Estonia), Kota Morinishi (Japan), Jakub Ondrousek (Czech Republic)

So, Prasanna finishes 7th, improving on my best Indian rank of 8th at the WSC. Congrats to him; this made Rishi's (so-called) last WSC memorable. Rishi finished a disappointing 38th, and Kishore did well on his debut with 47th.

A-Team
7th - Prasanna Seshadri (2825)
14th - Rohan Rao (2765)
38th - Rishi Puri (1950)
47th - Kishore Kumar (1748)

B-Team and UN-Team
Rakesh Rai (1890)
Amit Sowani (1845)
Jaipal Reddy (1509)
Swaroop Guggilam (1287)
Gaurav Kumar Jain (1185)
Puneet Goenka (900)

Team India finished 9th. This is bad, considering we have been in Top-8 for the last four years. Something to dwell upon and improve.

Thanks to Richard Stolk, Yuhei Kusui, Deyan, Galya, all the other organizers and volunteers for conducting this WSC. Puzzles were fantastic, hall and seating was comfortable and overall a great experience.

Let's hope WSC 2016 in Slovakia proves to be bigger and better. And I really hope the Indian team breaks more records next year.

And guess what? India won the bid to host the World Championships in 2017! So, hoping to see you all in Bengaluru in two years' time!

Saturday, September 12, 2015

Carcinogenicity Prediction of Compounds


The Carcinogenicity Prediction competition was held on CrowdAnalytix in Jul-Sep, 2015.

Objective
Carcinogenicity (an agent or exposure that increases the incidence of cancer) is one of the most crucial aspects to evaluate drug safety.

The objective was to predict the amount of carcinogenicity in compounds, which is measured through TD50 (Tumorigenic Dose rate).

Data
The train data consisted of compounds with over 500 variables covering physical, chemical and medical features, along with their corresponding TD50 values. About 60% of the TD50 values were 0; the rest were non-zero, with a few outliers.

The test data consisted of compounds with these features for which we had to predict the TD50 value.

Approach
This was a weird contest. On exploring the data, within 3-4 days, I found a key insight, and that proved to be a game changer.

So, what was this golden insight? It was the evaluation metric: RMSE.

The target variable (TD50) had many zeros and the rest were positive continuous values. RMSE as a metric can very easily get skewed due to outliers.

The train data had two values above 20,000. Predicting them accurately (greater than 20,000) would reduce the RMSE by more than 50%. So, assuming there are these outliers in the test data too, I knew this would give the maximum boost in score.
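A quick back-of-the-envelope check shows why a couple of huge residuals dominate RMSE. The counts and error sizes below are illustrative assumptions, not the actual contest data.

```python
import math

# Assume ~1000 test rows with modest errors of 500 each, plus two
# outliers around 20,000-25,000 -- illustrative numbers only.
n_small, small_err = 1000, 500.0
outliers = [20000.0, 25000.0]

def rmse(squared_error_sum, n):
    return math.sqrt(squared_error_sum / n)

base = n_small * small_err ** 2
n = n_small + len(outliers)

# Outliers predicted as 0 (full residual) vs predicted exactly (no residual)
missed = rmse(base + sum(e ** 2 for e in outliers), n)
caught = rmse(base, n)
```

With these assumptions, `caught` is less than half of `missed`: nailing just the two outliers cuts the RMSE by more than 50%, which matches the drop described below.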

All the participants were lingering around scores in the 1700s... and most of the usual models were not performing better than the benchmark 'all zeros' submission! That was a proxy validation that there had to be outliers in the test set too.

I built a model to classify outliers. The train data had only two rows (the ones with TD50 > 20,000) with target value '1' and the rest as '0'. I scored the classifier on the test set, took the top-3 predicted rows and used 25,000 as their prediction. And BINGO! The 2nd one dropped my RMSE from the 1700s to ~900. Almost a 50% drop!
That's what you call a game-changer :-)
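The override step itself is tiny: rank test rows by the classifier's outlier score and replace the top few predictions with a large constant. The scores below are hypothetical, standing in for the classifier output.

```python
import numpy as np

# Hypothetical outlier-classifier scores for 10 test rows (illustrative).
scores = np.array([0.01, 0.02, 0.91, 0.03, 0.05, 0.87, 0.02, 0.64, 0.04, 0.01])

# Base predictions for the same rows (e.g. from a regressor); zeros here.
preds = np.zeros_like(scores)

# Override the 3 most outlier-like rows with a large constant, as in the post.
top3 = np.argsort(scores)[-3:]
preds[top3] = 25000.0
```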

There are pros and cons.
Pros are that it was definitely a 'smart trick', and not really a 'sophisticated model', which I accepted and mentioned on the forum too. It was a neat hack applied to a poor evaluation criterion.
Cons are, of course, that it doesn't lead to the best model. And worse, the result was technically determined by just one or a few rows, making the rest of the test set worthless.

Model
For the remaining observations, I used a two-step model approach.

I first built a binary classifier to predict zeros vs non-zeros, using a Random Forest.
I then built a regressor to predict the TD50 amount, applying it only to the observations classified as non-zero by the binary classifier. I used a Random Forest for this too.

For the binary classifier and regressor, I subsetted the train data by removing all rows where the TD50 values were > 1000 (considering them as outliers).
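A minimal sketch of this two-step setup, using scikit-learn's Random Forests on synthetic data. The features, sizes and TD50-like targets below are fabricated for illustration; only the classify-then-regress structure follows the post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data mimicking the TD50 shape: ~60% zeros, the rest positive.
X = rng.normal(size=(300, 5))
y = np.where(rng.random(300) < 0.6, 0.0, rng.gamma(2.0, 50.0, size=300))
X_test = rng.normal(size=(50, 5))

# Step 1: binary classifier for zero vs non-zero
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, (y > 0).astype(int))
is_nonzero = clf.predict(X_test).astype(bool)

# Step 2: regressor trained on the non-zero rows only
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X[y > 0], y[y > 0])

# Zero predictions everywhere, regressor output on rows flagged non-zero
preds = np.zeros(len(X_test))
if is_nonzero.any():
    preds[is_nonzero] = reg.predict(X_test[is_nonzero])
```

In the actual solution, the train rows with TD50 > 1000 would also be dropped before fitting both models, as described above.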

Results
I was 1st on the Public LB and 1st on the Private LB too.

This is my first Data Science contest where I stood 1st. Yay!
Not a really good one, but I'll take it :-)

Congrats to Sanket Janewoo and Prarthana Bhatt for 2nd and 3rd. Nice to see all Indians on the podium!

Views
The evaluation metric became the decider for this contest. A learning for me, that sometimes a simple approach can make a BIG DIFFERENCE.

Which makes it VERY IMPORTANT to explore the data, understand the objective and the evaluation, and always do some sanity checks before diving deep into models and analysis. I've learnt a lot of these things from top Kagglers, and I'm sharing one of them here today, hoping someone else learns from it and helps in the development, improvement and future of Data Science.

Data can do magical things sometimes :-)

Check out My Best CrowdAnalytix Performances

Saturday, September 5, 2015

Puzzle Ramayan 2016

The online rounds of Puzzle Ramayan 2015-2016 have ended! This is a national level event aimed at encouraging puzzle solvers of India to participate and compete with the top solvers to gain experience and improve competition in the years to come.

NOTE: This event serves as a qualifier to participate in the Indian Puzzle Championship 2016

The championship consisted of 8 online rounds (Sep 2015 to Mar 2016), from which the top solvers will be invited to participate in the national finals.
Championship Page

National Finals
The finals will be held on 17th July, 2016 in Chennai.

View Finals Page


Online Top-10
1. Rohan Rao - 597.3
2. Amit Sowani - 575.8
3. Swaroop Guggilam - 477.4
4. Rajesh Kumar - 458.3
5. Rakesh Rai - 411.2
6. Ashish Kumar - 374.2
7. Kishore Kumar - 372.1
8. Jayant Ameta - 344.2
9. Jaipal Reddy - 302.7
10. Devarajan D - 277.2

View Complete Results

P.S. Prasanna's name is removed from the list since he has a wild card for the WPC next year, being the best Indian performer at the WPC this year.


Round 8: Placement (26th - 28th Mar, 2016)
Author: Rajesh Kumar
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

Nice puzzles, but on the harder side. A little disappointed that I wasn't able to finish the set.
Horrible answer keys; I struggled a lot with them, as did a few other players.

Overall, a decent end to PR.

The Top-10 look more-or-less as expected, but it's really good to see Ashish and Kishore improving, and a great job by Devarajan for maintaining his top-10 position throughout the rounds. Looking forward to an interesting and fun-filled finals in Chennai in July.


Round 7: Loops (27th - 29th Feb, 2016)
Author: Prasanna Seshadri
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

Wow! Another wonderful round. I'm not very good at Loops, but I could solve this very smoothly. Puzzles were excellent, and a very well-balanced set and PR round. Probably the best so far.

I finished the set in 47mins, with Swaroop in 57mins and Amit in 65mins. Swaroop has now increased his lead in 3rd place over Rajesh and should be able to hold on to it, since the last round is authored by Rajesh.

Hope to end it well.


Round 6: Shading (23rd - 26th Jan, 2016)
Author: Swaroop Guggilam
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

Wonderful! What a perfectly balanced round this was. Kudos to Swaroop for authoring this set, my favourite round of PR so far. Prasanna finished the set in 44mins, I finished in 58mins and Amit in 65mins.

Lots of swaps in the points table after this round, also due to the rankings being updated after discarding the worst two scores. I regain the top spot over Amit. Rajesh is less than a point above Swaroop. A disappointing round by Rakesh allowed Kishore to move above him.

With the last two rounds to go, it will be an interesting finish, especially crucial for Swaroop, who needs to be in the Top-3 to be eligible for the NRI wildcard.


Round 5: Snake (26th - 28th Dec, 2015)
Author: Ashish Kumar
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

After a great Round 4, I had an absolutely disastrous Round 5. Snake is not a type I really enjoy, and it showed here. I scored a poor 57 points, compared to Amit's 88. Prasanna did well, finishing all the puzzles just within the 90 minutes.

Puzzles were top-notch quality from Ashish, but they were too hard for PR. I'm not surprised to see the participation low, but a little surprised by some regular names missing, including ones in the current Top-10.

Lots of changes in the top-10 after this round. Amit takes the top spot with a good lead, Swaroop moves to 3rd above Rajesh, and finally, Rakesh moves above Kishore.
It's getting interesting, and I hope the remaining 3 rounds are better, way better.


Round 4: Regions (28th - 30th Nov, 2015)
Author: Rakesh Rai
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

This was one of my better performances in a puzzle contest in recent times. The puzzles were to my liking: Yin Yang, Spiral Galaxies and Area Division are among my all-time favourites, and it was wonderful to solve this set. The puzzles were really fun, and it was a better set than the last 3 PR rounds.

I topped the round by finishing in 48mins and was ranked 11th internationally, which is my best rank after Twist way back in 2011. Amit did well by finishing in 56mins, Prasanna finished in 64mins and Swaroop in 83mins.

That put me on top in PR rankings and also got me my best LMI Rating in Puzzles! So, a pretty good weekend!


Round 3: Evergreens (31st Oct - 2nd Nov, 2015)
Author: Amit Sowani
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

That was hard! Especially for the type of rounds expected in PR. Well, it was time to improvise. Since this resulted in a very low-scoring round, the scoring system was changed to add a bit of normalization, so that such variability in test difficulty can be overcome to some extent.

Even though I topped the round (among Indians), it didn't feel like a smooth performance. Felt like I could've added some 8-10 points more to my score of 73.

Prasanna tested the puzzles, so you won't find his name on the scorepage. Congrats to Rajesh and Swaroop for their good performances.


Round 2: Number Placement (26th - 28th Sep, 2015)
Author: Deb Mohanty
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

I didn't do too well. I couldn't finish the round, and got stuck on too many puzzles during the test.
Puzzles were really nice. Much better than Deb's SM round :-)

Congrats to Prasanna who finished the set in 72 minutes and Amit who just managed to finish it before time. I scored 97.4 points.

This was supposed to be one of the rounds I was most comfortable with, and it bombed. I hope to cover up in the next few rounds. I also hope this is the worst performance, which will get discarded (along with R1, which I authored).


Round 1: Classics (5th - 7th Sep, 2015)
Author: Rohan Rao
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

Congrats to Prasanna, Swaroop and Amit for completing the set. Prasanna finished in 50mins which put him in 12th place worldwide. Swaroop and Amit were very close and finished just one second apart in the 77th minute.

It's nice to see my three team-mates, who will represent India at the WPC along with me, performing at the top among the Indians.

Congrats to Endo, Ulrich and Hideaki who take the top-3 international spots.

Overall, I'm glad the feedback was positive and most participants enjoyed the puzzles. There was some discussion around one puzzle, Hitori Blocks, being a tad harder than the rest for this set. I agree it was a bit of an outlier, but it didn't affect rankings and performances much. Most of the results were as expected.

Seems like a good start to PR... 57 Indians with non-zero scores and 304 participants in total. I hope these numbers increase in subsequent rounds. And I'll be participating in the coming rounds! :-)

Friday, September 4, 2015

Sudoku Mahabharat 2016


The online rounds of Sudoku Mahabharat 2015-2016 have ended! This is a national level event aimed at encouraging sudoku solvers of India to participate and compete with the top solvers to gain experience and improve competition in the years to come.

NOTE: This event serves as a qualifier to participate in the Indian Sudoku Championship 2016

The championship consisted of 8 online rounds (Aug 2015 to Mar 2016), from which the top solvers will be invited to participate in the national finals.

Championship Page

National Finals
The finals will be held on 17th July, 2016 in Chennai.

View Finals Page


Online Top-10
1. Rohan Rao - 600.0
2. Kishore Kumar - 529.6
3. Rakesh Rai - 517.8
4. Jayant Ameta - 468.3
5. Jaipal Reddy - 441.8
6. Amit Sowani - 435.1
7. Suvarna - 413.6
8. Gaurav Kumar Jain - 406.5
9. Rajesh Kumar - 397.2
10. Shaheer Rahman - 383.3

View Complete Results

P.S. Prasanna's name is removed from the list since he has a wild card for the WSC next year, being the best Indian performer at the WSC this year.


Round 8: Irregular (12th - 14th Mar, 2016)
Authors: Akash Doulani, Amit Sowani and Gaurav Kumar Jain
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

Smooth ending. Really good set of sudokus considering a majority of them were authored by first-time author Akash.

Overall, a good end to SM and I'm glad I was able to top every round from the eligible participants.

The Top-10 look more-or-less as I was expecting, except for 7th-placed Suvarna (I don't know who she/he is), but I'm looking forward to a grand national finals in Chennai in mid-July.


Round 7: Converse (12th - 15th Feb, 2016)
Authors: Harmeet Singh and Rakesh Rai
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

I started badly, struggling on the Average sudokus... Took 11 minutes for the 6x6 and 13 minutes for the 9x9. But from there, it went really smooth and was able to cover up some lost time. I finished the set in 67mins, ahead of Prasanna in 84mins.

Really good performance by Shaheer Rahman, who gets his first podium in SM, scoring 84 points.
We have a newcomer, 'Suvarna', in 2nd place with 88 points, who also enters the overall Top-10. Quite an unknown player, and it remains to be seen how she will perform at the national finals.

Well, since the final score is the best 6 out of 8 rounds and I have topped 6 rounds, I will have a perfect score of 600 irrespective of the outcome of the last round. I can't say I wasn't expecting this, with Rishi's and Prasanna's absence, but it's good to have achieved it.


Round 6: Twisted Classics (9th - 11th Jan, 2016)
Author: Rajesh Kumar
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

This went well. Had a couple of minor stumbles, but overall it was a smooth solve. The test was quite easy, especially compared to some of the previous rounds.

I completed the set in 41mins. Kishore finished in 58mins and Rakesh in 63mins.
I must mention a standout performance by Hemant Malani, one of our Sudoku Champs toppers, who finished the set in 80mins, ranking 7th among Indians, above some of the regular experienced folks. I hope to see more strong performances like this in the future.

It's nice to see many Indians completing this set and a better participation level.

Now with 6 rounds completed, the top-10 look more-or-less stable, with few changes expected after the last two rounds.


Round 5: Outside (14th - 16th Nov, 2015)
Author: Rishi Puri
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

I was on a good streak before this contest, with some of my best performances in LMI puzzle test and Fed Sudoku. Unfortunately, it didn't continue here. I started well, but then broke the 9x9 Skyscraper. It was hard to find the error, so I just erased and started again. Lost over 10mins here.

I made mistakes while solving a 9x9 Classic too. And the icing on the cake was when I submitted the first 6x6 Classic incorrectly. It drove me mad that I had to restart it twice to finally solve it correctly. I know I don't like 6x6s, but what was wrong with me?

Overall, the participation was low. I still managed to finish the set in 76mins and be the top Indian after Prasanna, who finished in a great time of 62mins. Rakesh and Kishore missed out on a couple of sudokus but were not far behind.

With that, the top-10 remain the same with a couple of swaps.


Round 4: Math (14th - 16th Nov, 2015)
Author: Rohan Rao
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

Congrats to Prasanna for a very good score and to Kishore for topping my round among the eligible Indians. Jaipal and Rajesh were 2nd and 3rd with their strong performances. Nice to see them back among the top.

I really enjoyed creating this set and personally liked it much better than most of the other contests I've authored on LMI.
There has been some feedback about the sudokus being hard. Yes, that was intended, and since the scoring system has changed to normalize the points, tests can have varying difficulty without much effect on the score distribution.

The GroupSum 6x6 was my favourite of the set while the GroupSum 9x9 and Equal Product 9x9 were the hardest puzzles.

I'm glad most solvers enjoyed the round, and it's delightful to see close to 350 participants worldwide and over 100 Indians.


Round 3: Odd-Even (24th - 26th Oct, 2015)
Authors: Ashish Kumar and Swaroop Guggilam
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

I finished the set in 46 minutes. It was one of my smoothest solves on LMI. I started slowly, taking 13 minutes for the Odd-Even Count variants, but after that it was brisk.
So far, 3/3. Next SM Round is authored by me.

Wonderful sudokus by Swaroop and Ashish. I must mention that I loved the Quadro 9x9. Fantastic sudoku by Swaroop. Odd or Even 9x9 and Odd Sum 9x9 were good too.

Congrats to Prasanna, Rakesh and Jayant for being the other top Indians. A below-par performance by Kishore, who finished 6th among Indians.


Round 2: Neighbours (12th - 14th Sep, 2015)
Authors: Aditi Seshadri and Prasanna Seshadri (P.S. - They are not related :-) )
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

I finished the set in 52 minutes, with a small mistake in the Touchy 6x6 that I was quickly able to correct. It was my best online contest this year.

Wonderful sudokus by Aditi and Prasanna. I loved the Touchy 9x9, Quadruple 9x9 and Repeated Neighbours 9x9. Touchy was just fantastic for rule usage whereas Quadruple had a slightly different rule which turned into a very nice variant.

The set was a little harder than SM1, but had a very interesting set of sudokus.

Congrats to Jayant for completing in 59 minutes. I think this is one of his strongest performances in a sudoku contest. I hope his form continues. Kishore completed in 73 minutes to take the 3rd spot among the Indians.


Round 1: Standard (22nd - 24th Aug, 2015)
Author: Deb Mohanty
Download Instruction Booklet
Download Puzzle Booklet

View Results
View Forum

I finished the set in 49 minutes. I stumbled slightly in a couple of grids, but overall it was a smooth solve.

Nice set of sudokus, but it didn't give that 'Deb' feeling :-)
A fun solve though.

Congrats to Kishore, who finished the set in 52 minutes, and a shocking performance by Prasanna, who finished in 64 minutes. He certainly messed up, but we all have our bad days. Rishi, the fourth member of our Indian team at the WSC, has quit competitive sudoku solving and tested the set instead. It's sad to hear, but I hope he continues to help LMI and the puzzle community in India.

Monday, August 17, 2015

Indian Puzzle Championship 2015


The Indian Puzzle Championship (IPC) 2015 was held on 9th August, 2015. It was open and free for all. The winners would be selected to represent India at the World Puzzle Championship (WPC) 2015 in October, 2015 in Sofia, Bulgaria.

View Championship Page

Download Instruction Booklet
Download Puzzle Booklet
Password will be released later

View Complete Results

I won! Yay!
I must say that this was possible only thanks to Prasanna organizing. He's been the best Indian on the puzzling circuit the last couple of years, and he continues to spearhead the Indian contingent in puzzles.
So, I stood 1st with 930 points, with Amit in 2nd place with 774 points and Swaroop in 3rd with 725 points. Just the team I predicted, and I'm glad we made it. Definitely the best team in India today.

This is my 4th IPC title (after 2010, 2011, 2012), and it feels good to be back on top, after missing out in 2014 and the terrible performance in 2013.

Old-school Rajesh Kumar was 4th with 659 points.
Impressive performance by Ashish Kumar who stood 5th with 611 points. Certainly a hot prospect to represent India in the near future.
The usual suspects Rakesh Rai, Jayant Ameta and Jaipal Reddy complete the top-8 with 582, 560 and 506 points respectively.

Let's hope we can rock at the WPC this year and get India a top-10 rank.

"With Prasanna organizing the IPC once again, it is down to the top-3 of IPC who will join Prasanna to form the Indian team for the WPC. I fancy my chances this year, since I've been in touch lately (at least in Sudoku, by winning TSC and coming 3rd in ISC) and I'm confident of making it to the team. I was runner-up last year to Amit Sowani, with Swaroop Guggilam in 3rd.

I think Prasanna, Amit, Swaroop and me would make a strong team and I hope we finish at the top :-)
Good Luck to all participants!"

Wednesday, August 5, 2015

10 Years with 9 Digits




The left one is a picture of me (2nd from right) winning my first sudoku competition (U-16 category) at the age of 14 in August, 2005, and the right one is a picture of me (2nd from left) winning the Times National Sudoku Championship in August, 2015.

It's been a decade.

I'm now a three-time National Sudoku Champion (2010, 2011, 2012), having won the Times Sudoku Championship twice (2012, 2015). I've been representing India at the World Sudoku Championship since 2009, and my ranks have been 25th, 15th, 12th, 8th, 16th, 14th.
I'll be representing India for the 7th straight year at the WSC 2015 in Bulgaria in mid-October.

In 2012, I became the first and only Indian to break into the top-10 in the World (8th rank).

So much has happened in this decade. It started off as a casual pastime... which turned into a damn serious hobby and gave me an opportunity to excel at something. I'm glad all the time and effort I put into this has resulted in success and happiness. Making my country proud and forming many new friendships with my foreign sudoku friends is an experience that will remain in my heart all my life.

Logic Masters India, the national organization that conducts the national events in India, has had immense trust and faith in me, and I've tried to do my best and help them out in whichever way possible. It's very creditable that Deb Mohanty and Amit Sowani have been able to run LMI all these years and support the team in various ways.
It's been so much fun competing with some of the most talented people in the country... Sumit Bothra, Ritesh Gupta and Gaurav Korde during the early years, and Rishi Puri and Prasanna Seshadri in recent years.
Kudos to you guys and all the folks who are part of LMI and the puzzling community in India!

I hope I can pursue this for at least a few more years. I feel old, but I'm not done yet :-)
My aim is to encourage new talent and potential top solvers to achieve even greater success in the years to come.

I thank Mom, Dad, Sis and family for their continuous support and help. I am extremely fortunate to have been exposed to a wide variety of activities of my interest and been given the freedom to pursue some of them as successful hobbies :-)
I also thank one more person; she's been there for me, inspiring and motivating me, witnessing and experiencing all these memories, and being an integral part of this incredible journey as a partner, mentor, friend and much more. I dedicate the TSC 2015 win to her, which coincidentally was held on Friendship Day! :-)

Sudoku is my passion and if there is one message I want to put out to the world, it is Pursue Your Passion (PYP). Don't let anything stop you from doing what you enjoy the most.

I hope some day someone can better my 8th rank at WSC, and hoist the Indian flag on the podium.

Friday, July 31, 2015

Exacerbation Prediction of COPD


The Exacerbation Prediction of COPD patients competition was held on CrowdAnalytix in May-Jul, 2015.
Seems like this was a sequel to the first Exacerbation Prediction competition in which I stood 2nd.

Objective
Smoking-related diseases like chronic obstructive pulmonary disease (COPD) are a severe global medical problem that has affected over 50 million people worldwide. As their condition worsens, a fraction of patients experience "exacerbations". An exacerbation is a sudden worsening of symptoms such as shortness of breath and increased airway inflammation, often requiring immediate medical treatment and emergency room visits.

The objective was to build a predictive model using medical data which predicts beforehand which patients will experience 'exacerbation' so that they can be provided appropriate medical treatment to prevent/control it.

Data
The train data consisted of 1935 patients and 62 variables related to medical and smoking history, demographics, lung functions, etc. along with the true labels of whether they experienced Exacerbation or not.
The test data consisted of 1324 patients for which we had to predict the probability of Exacerbation.

Approach
Being one of the toppers of the previous Exacerbation Prediction competition, I followed a similar approach. My approach was to build 3-4 models and ensemble.

Unfortunately, it was very hard since the CV and LB scores did not go hand-in-hand. I finally tried various subsets and combinations of XGBoost, RandomForest, Logistic Regression and k-NearestNeighbours.

Model
My best model on the public LB was a simple average of XGBoost and Logistic Regression, the exact same ensemble I used in the previous Exacerbation contest.
My best model on the private LB was Logistic Regression on the PCA-transformed variables (using the top-7 components).
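The private-LB model above can be sketched as a small pipeline. This is a minimal sketch, not the actual competition code: the data here is synthetic, and details like standardizing before PCA and the solver settings are my assumptions.

```python
# Sketch of the private-LB model: Logistic Regression on the top-7
# principal components. Data shapes mirror the post (1935 train
# patients, 1324 test patients, 62 variables) but values are random.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1935, 62))    # 1935 patients, 62 variables
y_train = rng.integers(0, 2, size=1935)  # 1 = experienced exacerbation
X_test = rng.normal(size=(1324, 62))

# Standardize before PCA so no single variable dominates the components
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=7),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of exacerbation
```

One nice property of wrapping the PCA inside the pipeline is that cross-validation refits the components on each fold, avoiding leakage from the held-out rows.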

Results
My public LB submission gave an AUC score of 0.767 (XGB + LR), putting me in 11th place, whereas my best private LB submission gave an AUC of 0.769 (LR), putting me in 4th place.

So, I stood 4th and won some more prize money! (Who wants a party?)
This also means I've been in the Top-5 in 3 of the 4 CrowdAnalytix competitions I've participated in.

Views
I think the evaluation system is absolutely useless. The winners were decided solely based on the best private LB score. Kaggle does the same, but forces players to choose two submissions for evaluation. Here, ALL private submissions were evaluated and the best one was chosen.

I see a lot of cons here:

1. Players can try out all sorts of models and submit, and the more submissions a player makes, the likelier one of them is to be among the top.
2. Players don't know which model will be the final best model. So, if they made 100 submissions, are they supposed to track all 100 of them and submit the one that CA chooses as best? Are you kidding me? I had a tough time identifying which of my models finally gave the best private LB score.
3. What sense does it make when one model fits best to public LB and another fits best to private LB?
4. Winners are more based on luck. Models are likely to be the luckiest fit to the private test set. I'm not sure how useful this would be to the client.

Kaggle has a much better, more robust and stable evaluation system, and I really hope CrowdAnalytix figures something out soon, else it's just going to be a series of lottery competitions.

Nonetheless, I'm happy with my performance. Another win up my sleeve, and I'm looking forward to adding more in the future!

Read a blog post about the 7th place solution by Triskelion on ML Wave.

Check out My Best CrowdAnalytix Performances

Saturday, July 11, 2015

Times Sudoku Championship 2015


The Times Sudoku Championship will be held in July/August. This championship will select four players who will be sponsored to represent India at the World Sudoku Championship 2015 (WSC).

Note: This will be the 'sponsored' team for the WSC. The main A-Team that will represent India for the WSC has been selected from the Indian Sudoku Championship 2015 held last month.

--- Rishi Puri, Prasanna Seshadri, Kishore Kumar and Me are in the A-Team ---

Read Rules and Regulations

The schedule is as follows:

Regional Round in Delhi (12th July, 2015)
Regional Round in Mumbai (12th July, 2015)
Regional Round in Chennai (19th July, 2015)
Regional Round in Bengaluru (26th July, 2015)

National Finals in Mumbai (August, 2015)

Top-3 players from each regional round will be selected for the National Finals. Last year's TSC winners (Prasanna, Rishi, Sumit and Me) get wild cards for the National Finals.


Mumbai Regionals (12th July, 2015)
Who better to take the Mumbai crown than Tejal Phatak! Congratulations for finally making it on the Mumbai podium! Congrats to Prabha Joshi and Jaykumar Patel for qualifying too. See you all during the finals!

Read the Mumbai Article

The regional rounds in Mumbai and Delhi will be held simultaneously on 12th July. I won the Mumbai regionals the last three years. Since I have a wild card this year, and so does Prasanna Seshadri, it's an open door for newcomers. I have no idea who could be in the top-3, unless someone is planning to travel from another city, maybe Tejal.


Delhi Regionals (12th July, 2015)
Delhi results are not surprising. Congrats to Akash Doulani, who has been in good touch lately, followed by Rajesh Aggarwal and Ritesh Gupta. Let's lock horns at the finals!

Read the Delhi Article

The regional rounds in Mumbai and Delhi will be held simultaneously on 12th July. Among the top known solvers of India, I expect Akash Doulani and Ritesh Gupta to qualify from Delhi (if they participate). Maybe Himani Shah, who's not been in the sudoku circuit recently or even Dileep Singh.


Chennai Regionals (19th July, 2015)
Chennai results were as expected too. Congrats to Rakesh Rai and Kishore Kumar for their consistent performances at the sudoku circuit, and Pranav Kamesh for coming in third. Looking forward to the finals!

Read the Chennai Article

I'm expecting and hoping Rakesh Rai and Kishore Kumar qualify from Chennai, they've been among the top solvers of India in recent times. There are some upcoming names from Chennai, so I wouldn't be surprised to see a new name in the top-3 this year.


Bengaluru Regionals (26th July, 2015)
The Bengaluru regionals are going to be a cracker! There are many potential players who could make the top-3 this year. Some usual suspects are Rajesh Kumar, Harmeet Singh, Kunal Verma, Jayant Ameta, Gaurav Kumar Jain, Zalak Ghetia... and some more whose names are not at the top of my head! I would've loved to be there and see you guys battle it out, but unfortunately, I'm in Mumbai over the weekend.
Good Luck and may the best-3 qualify!


Wednesday, June 24, 2015

Indian Sudoku Championship 2015


The Indian Sudoku Championship 2015 was held online on 28th June, 2015.


Information
View Championship Page

Download Instruction Booklet
Download Puzzle Booklet

View Forum
View Results

The top-4 from ISC will represent India at the World Sudoku Championship (WSC) 2015, which will be held in October. There will be a separate event Times Sudoku Championship (TSC) conducted later by TOI and LMI to sponsor four players for the WSC.

Yes, it's a little complicated (with a 2-experienced + 2-inexperienced rule being enforced in TSC this year), but I'll come to that once the TSC is announced. Right now, it's more about practising for this Sunday, enjoying the afternoon solving some wonderful sudokus (like it's always been during ISC) and crowning the national champion.


Personal
I've finished in the top-3 of the ISC for the last 5 years and I'm hoping to do the same this year too. It's going to be a tough and interesting contest, battling it out with Rishi and Prasanna, who have been in great form recently.

There are many other good contenders who have the potential to make it to the team this year. Let's hope we have the best team that will get India into the top-5 in the World!

Good Luck to all participants!


Results
View Complete Results

Congrats to Rishi Puri for retaining the National Title with an impressive performance, completing all sudokus in 97mins; to Prasanna Seshadri, who's been in top form, for completing all sudokus in 112mins; and to Kishore Kumar, who has narrowly missed the top-4 the last couple of years, for finally making it.

I finished all sudokus in 119mins, making me 3rd.

Really nice set of sudokus (as usual), and it was an enjoyable contest. Thanks to Deb and all the authors for organizing another successful ISC.

Though Kishore completed only 26 / 27 sudokus, it was good enough to get him into 4th place. I'm really happy he's on the team after narrowly missing out last year.

So, Rishi, Prasanna, Kishore and me. I think this is a strong team, definitely the best 4 sudoku solvers in India today and hope we can get India into the top-5 this year!


Practise Sudokus
I'll upload some practise sudokus as and when I get time. Feel free to comment if you want any particular sudoku type to practise, I'll try creating some.

Here are some sudokus that I created for practise.

Shape Sudoku (using colours instead of shapes):






Filler Sudoku:





Links to some more practise sudokus:

Triomino Sudoku (Ashish Kumar)
Triomino Sudoku (Prasanna Seshadri)
True Or Lie Sudoku (Rishi Puri)
Search-9 Sudoku (Rishi Puri)

LMI Forum (Links to practise puzzles are generally shared)

You always have Google for more :-)

Wednesday, May 6, 2015

TFI Restaurant Revenue Prediction


The TFI Restaurant Revenue Prediction competition was held on Kaggle in Mar-May, 2015.

Objective
New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

The objective was to predict the annual restaurant sales of a restaurant.

Data
The train data consisted of 137 rows with restaurant data pertaining to opening date, city, type and anonymous variables related to demographic, real estate and commercial data along with the corresponding annual revenue.

The test data consisted of 100000 rows of restaurant data for which we had to predict the annual revenue. Most of the test data was junk, a popular technique used to prevent hand-labelling in such competitions. The actual size of the test set is rumoured to be around 320.

Approach
What can you do when you have just 137 data points, some of which look like outliers? You have to make a choice and bank on some luck :-)

I chose to build a model that gives a relatively stable CV and a decent LB. I tried out a few models and found RandomForest giving decent, stable results, which, going by the forums, many other participants found too.

I tried some simple features, nothing too complex, since over-fitting is highly likely on such datasets.
I also shuffled the train data and built RF models on different subsets to reduce noise and the effect of outliers.

Also, training the model on the log-transformed 'revenue' worked much better than the raw 'revenue'.

Model
My final model was an average of many RFs built on different subsets of the data.

The 'days' variable (the number of days since the restaurant opened) was the most important variable. I also converted some of the anonymous variables into dummy variables, treating them as categorical. These two ideas gave the best improvements.
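The averaging scheme described above can be sketched as follows. This is only an illustration under assumptions: the data is synthetic, and the subset fraction, number of repeats and RF settings are made up rather than taken from the actual solution.

```python
# Sketch: several RandomForests trained on log-transformed revenue
# over random subsets of a tiny train set, then averaged.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(137, 10))           # 137 restaurants, toy features
y = np.exp(rng.normal(15, 1, size=137))  # positive synthetic revenue
X_test = rng.normal(size=(50, 10))

preds = []
for seed in range(10):
    # train each forest on a different 80% subset to dampen outliers
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    rf = RandomForestRegressor(n_estimators=200, random_state=seed)
    rf.fit(X[idx], np.log1p(y[idx]))             # fit on log revenue
    preds.append(np.expm1(rf.predict(X_test)))   # invert the transform

final = np.mean(preds, axis=0)
```

Averaging after inverting the log transform keeps each forest's prediction on the original revenue scale before blending.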

Github
View Github repository

Results
My model scored 16.2L on the public LB, which ranked 66th, and 17.6L on the private LB, which ranked 14th! There were 2256 teams in total.

This is my best individual performance on Kaggle! :-)
Not the best competition on Kaggle, but it was certainly the biggest in terms of teams. This also makes it the first competition to cross the 2000-team mark. Of course, the Otto competition is going to beat this soon.

View Final Results
View Public LB

Views
Working with a small data-set is always challenging in many aspects: choosing the model, training it appropriately, preventing over-fitting, etc.

I'm glad I stuck with RF and made it as stable as possible.

The BAYZ team built a 'perfect submission' which scored 0 on the public LB, putting them in 1st place. How? You can keep track of their forum post and learn how to become a master at over-fitting! Of course, their final private LB rank is way below, but I still think they came up with a winning model (to overfit, not to predict revenue!) and I'm looking forward to knowing how they cracked it.

So, this gets me to 105th in overall Kaggle rankings and among the Top-3 Indians.
Next target is Top-50 and then Top-Indian.

Check out My Best Kaggle Performances

Saturday, April 18, 2015

Women's Healthcare Prediction


The Women's Healthcare Prediction competition was held on DrivenData from Feb-2015 to Apr-2015.

Objective
The challenge was to predict which healthcare services (household, pregnancy, family, medical, etc.) women opted for. Essentially, it was a multi-label, multi-class classification problem.

Data
The train data consisted of ~14600 rows (one per woman) with various numeric and categorical variables, along with which of the 14 services each woman opted for. Each woman could've opted for more than one service.

The test data consisted of ~3600 rows (one per woman) for which we had to predict which services they opted for.

Approach
There were 1300+ variables, so my general approach was to do some form of FS along with an ensemble of classification models.

I started off with the usual suspects and found the tree-based models performing better than the linear models. None of the other models came even close to the accuracy received using XGBoost or RandomForest.

I tried multiple ways of doing feature selection and reducing the dimension, but they didn't improve the results significantly.

Once I exhausted all my ideas, I used a brute-force approach to optimize model performance, tweaking the parameters separately for each of the 14 individual labels.

Model
My final model was an ensemble of XGBoost and RandomForest with some standard data cleaning and FS. I optimized the parameters for each of the 14 labels, but that gave very minor improvement.
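A per-label ensemble for a 14-service multi-label task can be sketched like this. To keep the sketch self-contained I use scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost; the data, sizes and the simple 50/50 averaging are illustrative assumptions, not the actual tuned solution.

```python
# Sketch: one boosted model and one RandomForest per label, with their
# predicted probabilities averaged, column by column.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(7)
n_labels = 14
X = rng.normal(size=(500, 30))                 # toy train features
Y = rng.integers(0, 2, size=(500, n_labels))   # one 0/1 column per service
X_test = rng.normal(size=(100, 30))

probs = np.zeros((100, n_labels))
for j in range(n_labels):
    # fit both models independently on label j
    gbm = GradientBoostingClassifier(random_state=j).fit(X, Y[:, j])
    rf = RandomForestClassifier(n_estimators=100, random_state=j).fit(X, Y[:, j])
    # average the two probability estimates for this label
    probs[:, j] = 0.5 * (gbm.predict_proba(X_test)[:, 1]
                         + rf.predict_proba(X_test)[:, 1])
```

Treating each label as its own binary problem is what makes per-label parameter tuning (as described above) possible, since each of the 14 loops can take its own hyperparameters.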

Results
I stood 11th on the public LB out of 104 teams. Just missed the Top-10 and also the Top-10%!
My model achieved a log loss of 0.2588 while the topper scored 0.2539.

View Complete Results

Views
This is the first competition where I really struggled for a long time. I tried lots of ideas, but nothing seemed to work. Ensembles hardly gave any improvement and I was literally stuck during the last 2 weeks.

The public/private LB split seemed excellent with the ranks remaining almost the same. Even the CV and LB scores moved in the same direction.

Feels like I missed out here, but it only motivates me to come back harder next time. This was my first competition on DrivenData, and I'm hoping there are better ones to come soon!

Wednesday, April 1, 2015

Unlucky 13


I authored a Sudoku contest Unlucky 13 on LMI. It was held from 1st - 6th April, 2015 and consists of 13 sudokus to be solved in 65 minutes.

View Championship Page

Download Instruction Booklet
Download Puzzle Booklet
Password is LuckyYou

View Forum

View Results

"13 is my favourite number and I created this themed test in late-2014 during some easy days at work. Incidentally, this is also the 13th test I'm authoring at LMI. A lot of special moments and memories along the way... and I hope players enjoy this set and make it a success!"

Congrats to Jan Zverina, Hideaki Jo and Jakub Ondrousek for the top-3 overall players.
Congrats to Prakhar Gupta, Kishore Kumar and Rishi Puri for the top-3 Indian players.

"Good artists copy, great artists steal" - Pablo Picasso.

A few months back, a couple of my friends created some sudoku variants and asked me to test solve them. It was their first try at creating sudokus and they did quite a decent job, since they were all unique. The only problem was that all the variants could be solved like Classics, without having to use the variant rule, and I had a hearty laugh while solving them. For example, there was an Odd Even Sudoku with 42 givens... a Non-Consecutive Sudoku with 33 givens... etc. :-)

That's how the idea for this test was formed. If I enjoyed it so much, maybe other solvers would enjoy it too, in its own humorous way. It was subtle April 'fooling', unlike last year's total surprise (which was awesome in its own way). Thanks to Deb Mohanty and Prasanna Seshadri for test-solving and other inputs and 'contributions'. I'm happy many players were able to complete the test and get the bonus; it was intentionally kept long, with the difficulty set so that a large portion of solvers would finish.

Thanks for all the messages, and hope to see some more exciting Sudoku solving in the months to come! :-)

Friday, March 20, 2015

Hotel Demand Forecasting


The Hotel Demand Forecasting competition was held on CrowdAnalytix in Feb, 2015.

Objective
The objective was to build a forecasting model to predict hotel demand using historic inquiries.

Data
The train data consisted of historic inquiries (reservation, denial, regret) for five different hotels in 2011, 2012, 2013.

The test task was to predict the demand for the same hotels in 2014.

Approach
I found the historic demand to have strong weekly trends (as you would intuitively expect), and a naive submission using the previous year's demand for the same weekday (e.g. using the previous year's Friday demand to predict this year's Friday demand in the corresponding week) gave a very good score. So, I decided to play with and optimize historic averages. I ended up using two different versions of historic averages and made it to the top-5 without a sophisticated model. I wouldn't be surprised if the other toppers used similar ideas.

The first method was a weighted average of the previous three years' demand on the same weekday of the corresponding week.

The second method was aggregating the weekly demand of the previous three years in the corresponding week and splitting it based on the historic demand proportion for each weekday, calculated separately for each quarter.

Model
The final model was an average of the two historic averages, along with some smoothing. The smoothing was redistributing the predictions across three weeks (the week before, the current week and the week after) using a weighted average. This smoothing gave me the biggest jump in my LB score, which pushed me into the top-5.
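The weighted historic average and the three-week smoothing can be sketched on a toy daily-demand series. The year weights and the 0.25/0.5/0.25 smoothing weights here are illustrative assumptions, not the values actually used.

```python
# Sketch: (1) weighted average of the previous three years' demand on
# the same weekday, (2) smoothing by blending each week with its
# neighbouring weeks. Data is a random Poisson stand-in.
import numpy as np

rng = np.random.default_rng(1)
weeks = 52
# demand[year, week, weekday] for three historic years
demand = rng.poisson(100, size=(3, weeks, 7)).astype(float)

# (1) weighted average: more recent years weighted higher (assumed weights)
w = np.array([0.2, 0.3, 0.5])
pred = np.tensordot(w, demand, axes=(0, 0))   # shape (weeks, 7)

# (2) smoothing: redistribute each interior week's prediction across
# the week before, the current week and the week after
smooth = pred.copy()
smooth[1:-1] = 0.25 * pred[:-2] + 0.5 * pred[1:-1] + 0.25 * pred[2:]
```

The smoothing step is essentially a moving average over weeks; it forgives a prediction that lands one week early or late, which is plausibly why it helped the MAPE so much.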

Code
View on Github

Results
I stood 5th on the public LB out of 45 teams. My model achieved a MAPE of ~0.26, while the best was ~0.25.

The top-5 models were evaluated further and I still stood 5th :-|

Views
This is my second competition on CrowdAnalytix (after Exacerbation) and I'm glad I could finish in the top-5 in both of them. Though these are not as popular as the ones on Kaggle, I enjoyed exploring this forecasting model especially since it was one without any features.

Check out My Best CrowdAnalytix Performances

Tuesday, February 24, 2015

Avazu Click-Through Rate Prediction



The Avazu Click-Through Rate Prediction competition was held on Kaggle from Nov-2014 to Feb-2015.

Objective
Click-through rate is a very important measure of ad performance, and the challenge was to predict how likely an ad is to be clicked.

Data
Train data consisted of ~ 40 million ads (just 10 days of Avazu data!) along with a label indicating whether each was clicked or not. The variables described the website/app where the ad appeared, some features of the ad (like size, position, etc.), the demographics of the user to whom the ad was shown and some anonymous variables.
The test data consisted of ~ 6 million ads (11th day of Avazu data).

Approach/Model
This is the largest data set I've worked with to date, and 40 million rows meant memory issues right from the start.

Wait a minute. What about that awesome online-lr code? Of course... that's the same beauty I used for the Tradeshift competition, and it's the same one I used for this competition too. Well, isn't it just fabulous?

I started off playing with the parameters of the code, adding interaction variables and generating some features. Some of the anonymous variables were decoded (by some Kagglers) and I tried using them more smartly.
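For readers unfamiliar with the online-lr approach, here is a minimal sketch in the spirit of tinrtgu's script: logistic regression trained one row at a time, with features hashed into a fixed-size weight vector and a per-coordinate adaptive learning rate. Every name, field and number below is made up for illustration; the real script differs in details.

```python
# Sketch of hashing-trick online logistic regression.
import math

D = 2 ** 20          # size of the hashed weight vector
w = [0.0] * D        # weights
n = [0.0] * D        # accumulated squared gradients (for adaptivity)
alpha = 0.1          # base learning rate

def hash_features(row):
    # hash each "field=value" string into an index in [0, D)
    return [abs(hash(f"{k}={v}")) % D for k, v in row.items()]

def predict(idx):
    # sigmoid of the sum of weights at the hashed indices, clamped
    wTx = sum(w[i] for i in idx)
    return 1.0 / (1.0 + math.exp(-max(min(wTx, 35.0), -35.0)))

def update(idx, p, y):
    g = p - y        # gradient of logloss w.r.t. wTx
    for i in idx:
        n[i] += g * g
        w[i] -= alpha / (math.sqrt(n[i]) + 1.0) * g

# one pass over a toy click log (hypothetical fields)
rows = [({"site": "a", "pos": 1}, 1), ({"site": "b", "pos": 2}, 0)] * 50
for row, y in rows:
    idx = hash_features(row)
    update(idx, predict(idx), y)
```

The key point for the 40-million-row data set is that memory is fixed at O(D) no matter how many rows or distinct feature values stream past, which is what makes training feasible without much RAM.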

There were a massive number of participants, and after 2-3 weeks I was ranked in the top-20 out of 600-700 teams. I had a work assignment for which I travelled to the US and wasn't sure if I would have time to try out new ideas, so I decided not to pursue it further.

Not much to share here, and no particularly nice model ideas, but I still managed to secure 79th place out of a whopping 1604 teams, scoring 0.3908 / 0.3889 on the logloss metric.

Views
It was a challenge to work with this data, and without access to much RAM it was all the more tricky. Thanks again to pypy and tinrtgu for the online-lr code, and I'm glad I still made it into the top-10%.

Congrats to 4-Idiots, Owen and Random Walker for the top-3 spots. What can you say about Owen? Leading the overall Kaggle rankings with more than double the points of 2nd-place David Thaler. Some feat, that!

And for me, moved to 185th in overall rankings. The race is on to finish in the top-100 (or top-50) by end of this year.

Check out My Best Kaggle Performances

Saturday, January 10, 2015

Exacerbation Prediction


The Exacerbation Prediction competition was held on CrowdAnalytix in Nov-Dec, 2014.

Objective
Respiratory diseases (asthma, cystic fibrosis, smoking-related diseases, etc.) are among the leading causes of death globally. As a patient's condition deteriorates, they experience 'exacerbations': sudden worsenings of symptoms requiring immediate emergency and medical attention.

The objective was to build a predictive model using medical and genetic data which predicts beforehand which patients will experience 'exacerbation' so that they can be provided appropriate medical treatment to prevent/control it.

Data
The train data consisted of ~ 4000 patients and 1300 variables along with the true labels of whether they experienced Exacerbation or not.
The test data consisted of ~ 2000 patients for which we had to predict the probability of Exacerbation.

Approach
My main idea was to build 2-3 strong classifiers and then build an ensemble with them. With 1300 variables, variable selection / dimension reduction became a must.

I tried tree-based models like Random Forest, GBM, Extra Trees, XG-Boost, etc., regression based models like Logistic Regression, Ridge Regression, etc., and some others like k-NearestNeighbours, SVM, NaiveBayes, etc.

RF and XGB gave the best results while LR and k-NN were decent. I explored and optimized these. After some tuning, XGB and LR gave much better scores, and k-NN didn't add any improvement.

Model
My final model was a weighted average of XG-Boost and Logistic Regression.

The XG-Boost was built on 150 variables, which were selected based on the variable importance of some sample tree models.

The Logistic Regression was built on the top-50 Principal Components.
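The final blend described above can be sketched as follows. GradientBoosting stands in for XGBoost to keep the sketch scikit-learn-only, the data is synthetic, and the 0.6/0.4 weights and the feature-index list are illustrative assumptions rather than the actual tuned values.

```python
# Sketch: weighted average of a boosted model on 150 selected
# variables and Logistic Regression on the top-50 principal components.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 200))   # synthetic stand-in for the train data
y = rng.integers(0, 2, size=400)
X_test = rng.normal(size=(100, 200))

# indices that would come from tree-model variable importance
top150 = np.arange(150)
gbm = GradientBoostingClassifier(random_state=0).fit(X[:, top150], y)

# Logistic Regression on the top-50 principal components of all variables
lr = make_pipeline(StandardScaler(), PCA(n_components=50),
                   LogisticRegression(max_iter=1000)).fit(X, y)

# weighted average of the two probability estimates (assumed weights)
blend = (0.6 * gbm.predict_proba(X_test[:, top150])[:, 1]
         + 0.4 * lr.predict_proba(X_test)[:, 1])
```

Blending a tree model on raw selected variables with a linear model on components pairs two quite different views of the data, which is usually what makes such a weighted average more stable than either model alone.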

Results
I stood 4th on the Public LB out of 101 teams, 1st on the Private LB, and finally 2nd on the Private Evaluation. I'm not sharing the scores (since they are not public), but my models achieved AUC scores of ~ 0.845

So, I stood 2nd! This is the first time I've got a ranking with some prize money! Yay!

Views
When I started this competition, I was looking at all-numeric features of anonymized variables. I wasn't sure how much I could squeeze out of the data, but I put in a lot of time and effort and found some wonderful ideas in the process.

I think my model was a very robust and competitive one, and I was surprised it scored so consistently across multiple test sets.

Overall, it was fun. The Public LB evaluation on CrowdAnalytix is not absolutely ideal, since you can tune your model to overfit the LB. I still love Kaggle's method of evaluating winners.

Thanks to my family, friends and colleagues (especially my flat-mate and colleague Shashwat) for their help and support. This is a big achievement for me, and I'm hoping to perform better in the years to come, and hopefully call myself one of the best Data Scientists in India :-)

Check out My Best CrowdAnalytix Performances