Fun with Kaggle Leaderboards

Our team recently fought hard to earn our way into the top 10% of Kagglers in the AXA telematics challenge. I even wrote a post about it here where I talk about some of the feature generation I contributed. It was our first time in a real data science competition, and I’ve got to say, it was quite a rush. But, after it was all over, it was hard to describe the heat of the race to the top. Ranks on the public leaderboard and ROC curve scores did little to provide a realistic feel of the competitors around us.

I wanted to change that.

I was reading a blog post about plateaus in the 2013 Titanic Kaggle challenge. Trevor, the author, plotted the score vs. rank for all competitors on the public leaderboard. In this way, he was able to see plateaus where there were larger or smaller changes in rank for a given change in score.

Static Plot: Any given day

I wanted to test some web scraping skills using SelectorGadget and the XML and rvest packages, so the first approach to the problem was just get a quick look at the current leaderboard.

library(XML)
library(rvest)
library(ggplot2)

# AXA Standings
axaROC<-html("http://www.kaggle.com/c/axa-driver-telematics-analysis/leaderboard") %>%
  html_nodes("td:nth-child(4) , .abbr")
axaRank<-html("http://www.kaggle.com/c/axa-driver-telematics-analysis/leaderboard") %>%
  html_nodes(".leader-number")

In the above snippet, I used SelectorGadget to isolate the html nodes that stored the scores and ranks for all competitors. I can’t stress how valuable the extension is for scraping well-organized webpages.

The output of html_nodes() is unusable as-is as numeric data. With xmlValue, it was a relatively simple matter of trimming the fat around them to get at the scores and ranks themselves.

score<-as.numeric(unlist(lapply(axaROC,function(x) xmlValue(x,trim=T))))
rank<-as.numeric(unlist(lapply(axaRank,function(x) xmlValue(x,trim=T))))
rankROC<-data.frame(cbind(rank,score))

Final AXA Leaderboard

Finally, ggplot2 created this view of the public leaderboard on the final day. I also manually entered the benchmarks that the Kaggle Admins had put in place as well as 0.66 which was a popular post that provided R code to achieve a leaderboard score of 0.66 using simple logistic regression.

AXA Static

Score vs. Rank for the AXA Telematics Challenge. Our team, Vivi’s Angels, is highlighted in red.

The ggplot2 code was straightforward, but I include it here in case you are interested.

 

ggplot(data=rankROC)+geom_point(aes(x=rank,y=score),size=1)+
  geom_point(aes(x=139,y=0.90926),color="red",size=3)+
  annotate("text",x=350,y=0.92,label="Vivi's Angels")+
  scale_x_reverse()+
  geom_hline(yintercept=0.5,color="dark blue")+
  annotate("text",x=500,y=0.485,label="All 1's Benchmark")+
  geom_hline(yintercept=0.53154, color="blue")+
  annotate("text",x=500,y=0.545,label="Trip Length Benchmark")+
  geom_hline(yintercept=0.66,color="light blue")+
  annotate("text",x=500,y=0.675,label="Speed Quantiles")+
  xlab("Private Leaderboard Rank")+
  ylab("Private Leaderboard Score")+
  ggtitle("AXA Telematics: Final Standings")+
  theme_bw()

This gave some sense of climbing a mountain. We were able to quickly see not only that we had traversed a steep climb from some of the benchmarks, but also that ahead of us in the top 5 and 10% was a very steep climb.

Animated Plot: What about during the competition?

We had a sense of where we ended up, but I wanted to see the changing landscape and our path to the final standings. Well, this was easy enough to plot in R after downloading the AXA public leaderboard history as a .csv, which Kaggle conveniently provides at the bottom of each competition’s public leaderboard page. This file contained all timestamped and identified updates made to the leaderboard since the beginning of the competition. Juicy data!

The animation package let me send it a function that produced a series of plots and then stitch those plots together as an animated gif. This was a technique a classmate of mine covered at the NYCDSA Bootcamp, so I was eager to try it myself.

The first step was to point to the downloaded .csv file in the working directory. It would have been nice to do this automatically, however, this required cookies which would be different for all users. I thought it wasn’t really a huge trade-off in usability to have the user download and unzip the csv themselves.

require(lubridate)
require(animation)
require(grid)
# Info about the competition
### download the zip file and place it in the working directory
### go to the public leaderboard page and download the file at the end
plFile<-"axa-PL.csv"
animFile<-"axa-vis.gif"
PLHist<-read.csv(plFile)
PLHist$SubmissionDate<-ymd_hms(PLHist$SubmissionDate)

# What is your user or team name?
teamName<-"Vivi's Angels"
teamID<-PLHist[which(PLHist$TeamName==teamName),1][1]

# Start of the competition
start<-min(PLHist$SubmissionDate)+days(1)
hour(start)<-0
minute(start)<-0
second(start)<-0

# End of the competition
latest<-max(PLHist$SubmissionDate)+days(1)
hour(latest)<-0
minute(latest)<-0
second(latest)<-0

Next, I had to create the function that would generate each plot, for any given day. Here, genPlot takes several inputs:
daysIn describing the days from the beginning of the competition (the day for which a plot will be generated)
yMin and yMax which define the range of scores to show on the generated plot
higherBetter which determines which direction to plot the axes. For most competitions, higher is better
PLHist is the public leaderboard dataframe loaded above
startDate which defines “t=0” for the plots of interest.

genPlot=function(daysIn,yMin=0,yMax=1,higherBetter=T,leaderboard=PLHist,startDate=start){
  thisTime<-startDate+days(daysIn)
  thisPL<-leaderboard %>%
    filter(SubmissionDate<thisTime) %>%
    group_by(TeamId) %>%
    summarise(bestScore=ifelse(higherBetter,max(Score),min(Score)))

  if(higherBetter){
    thisPL<-arrange(thisPL,desc(bestScore))
  } else{
    thisPL<-arrange(thisPL,desc(bestScore))
  }

  # Add the day's ranks
  thisPL<-thisPL %>%
    group_by(1:n())
  names(thisPL)[3]<-"rank"

If a team of interest is specified outside of the function (teamID), this part of genPlot will get the coordinates (score and rank) for the team, if it has made a submission. It will also set up a timeStamp to float in the upper left corner of the plot and a team stamp to identify the team name, if it is present (this uses the annotate function in the grid package).

  teamScore<-ifelse(teamID %in% thisPL$TeamId,as.numeric(thisPL[which(thisPL$TeamId==teamID),2]),NA)
  teamRank<-ifelse(teamID %in% thisPL$TeamId,as.numeric(thisPL[which(thisPL$TeamId==teamID),3]),NA)

  timeStamp <- grobTree(textGrob(as.character(thisTime), x=0.1,  y=0.95, hjust=0,
                                 gp=gpar(col="red", fontsize=18, fontface="bold")))
  teamStamp <- grobTree(textGrob(teamName, x=0.1,  y=0.9, hjust=0,
                                 gp=gpar(col="blue", fontsize=18, fontface="bold")))

The last part of the function sets up the output plot. If a team of interest has been supplied, it will add blue crosshairs to signify the team’s location and it will annotate the plot with the teamStamp.

  p<-ggplot(data=thisPL)+
    geom_point(aes(y=bestScore,x=rank),size=2)+
    ylim(c(yMin,yMax))+
    scale_x_reverse()+
    #   xlim(c(totalTeams,0))+
    xlab("Rank")+
    ylab("Score")+
    annotation_custom(timeStamp)+
    theme_bw()

  if(is.na(teamRank)) print(p)
  if(!is.na(teamRank)){
    print(p+geom_hline(yintercept=teamScore,color="blue",alpha=0.7)+
            geom_vline(xintercept=teamRank,color="blue",alpha=0.7)+
            annotation_custom(teamStamp)
            )
  }
}
}

Cool! Now we have a code that will plot the team’s position on any given day, specified by days from the start date. Now we want to animate this to show progress. Luckily, the animate package makes this really simple. I defined the pl.animate function which, in addition to passing backdoor arguments to genPlot, allows the user to specify the animation interval (animInt) and the start day of the animation. Finally, the saveGIF() function calls the pl.animate function, specifying an interval (in seconds) for each frame to appear. I thought 4 frames per second was just the right speed for a ~90 day competition.

pl.animate <- function(animInt=1,dayzero=start,...) {
  lapply(seq(0,as.numeric(latest-dayzero),animInt), function(i) {
    genPlot(daysIn=i,startDate=dayzero,...)
  })
}

saveGIF(pl.animate(yMin=0.4,yMax=1), interval = .25, movie.name=animFile)

The Climb

Animated Public Leaderboard for the AXA Telematics Competition. Go, team, go! Click on the image to (re)load the animated GIF as WordPress tends to bug out.

And so there it is, if you click on the image above, you can observe the changing layout of scores and ranks throughout 90 days of competition and follow the crosshair of Vivi’s angels as we climb up the hill.

Conclusion

I’ll definitely be using this before entering and during competitions to see how things are changing. Even in the static image, plateaus other than the benchmark may provide some clue as to a useful forum post or idea someone has discovered. The benefit of the animated plot shows just how quickly teams make it into the top scoring tiers.

Finally, I invite you to fork this code from my github page and use it for your own team or to see any leaderboards you may be interested in.

Leave a Reply

Your email address will not be published. Required fields are marked *