Kaggle Forum Machine Learning Cloud


I had been wanting to explore some text scraping and manipulation in R, so I decided to try my hand at a word cloud. I know they’re not especially quantitative, but people like them; they’re cute.

With Kaggle on the brain, I had been reading a lot of the forums. After each competition, the top three teams are (usually) interviewed by the Kaggle folks and are asked things like:

  • “What was your background prior to entering the Challenge?”
  • “What was your most important insight into the dataset?”
  • “Which tools did you use?”
  • “What preprocessing and supervised learning methods did you use?”

Those last two were what I was after. My goal was to get a word cloud of the tools and methods being employed by the top performers on Kaggle.

So I got to work gently scraping the Kaggle blogs.

Finding Kaggle blog links (and titles)

I built a URL for the blog's index pages, each of which lists several posts. The site gives no visible page count that could tell the loop where to end, so on each pass I counted the post links on the next page and kept going only if there was at least one.

On each page, I used the CSS markers to grab the post titles and corresponding links, saving the result into a data frame.


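# For constructing the blog links
# (baseURL and page are set in code not shown here)
library(rvest)  # read_html(), html_nodes(), html_text(), html_attr() and %>%

pNum <- 1               # current page number
nextLinkCount <- 1      # assume the first page has at least one post link
blogDF <- data.frame()  # accumulates post titles and links

## Cycle through blog pages and get all post links
while (nextLinkCount >= 1) { # keep going until there are 0 links on the next page

  titleNodes <- read_html(paste0(baseURL, page, pNum)) %>%
    html_nodes("#homepage h1 a:nth-child(1)")
  # Get the post titles and links on the current page
  postTitles <- html_text(titleNodes)
  postLinks <- html_attr(titleNodes, "href")
  blogDF <- rbind(blogDF, data.frame(title = postTitles, link = postLinks,
                                     stringsAsFactors = FALSE))

  # Check the number of post links on the next page
  nextLinkCount <- length(read_html(paste0(baseURL, page, pNum + 1)) %>%
                            html_nodes("#homepage h1 a:nth-child(1)"))
  pNum <- pNum + 1
}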
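The look-ahead stopping rule can be tried offline. The sketch below swaps the live site for a hypothetical in-memory list, fakePages, and a hypothetical helper, getLinks(); neither is part of the original script.

```r
# Hypothetical stand-in for the blog: each element holds the post links on one page
fakePages <- list(
  c("/post-a", "/post-b"),
  c("/post-c"),
  c("/post-d", "/post-e")
)

# Hypothetical helper: return the links on page pNum, or none if the page does not exist
getLinks <- function(pNum) {
  if (pNum <= length(fakePages)) fakePages[[pNum]] else character(0)
}

pNum <- 1
allLinks <- character(0)
nextLinkCount <- 1  # assume the first page has at least one link

while (nextLinkCount >= 1) {
  # Collect the links on the current page
  allLinks <- c(allLinks, getLinks(pNum))
  # Look ahead: how many links does the next page have?
  nextLinkCount <- length(getLinks(pNum + 1))
  pNum <- pNum + 1
}

print(allLinks)
```

Note that this pattern requests every page twice, once as the "next" page and once as the current one. Against a live site, keeping the look-ahead result and reusing it on the following pass would be the gentler option.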


Just the Winners, please.

Kaggle is a great resource for seeing how folks are using data science techniques. However, they also blog about quite a few other things. The three main categories are:

  • Data Science News and Editorials (33 posts as of 3/28)
  • Kaggle News (118 posts as of 3/28)
  • Tutorials and Winners’ Interviews (116 posts as of 3/28)

Obviously, I only wanted the last of these, so for each of the blog links I checked the category, then filtered so that all I was left with was the Tutorials and Winners’ Interviews.

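library(dplyr)  # filter()

blogDF$category <- NA_character_
for (i in seq_len(nrow(blogDF))) {
  catNodes <- read_html(blogDF$link[i]) %>%
    html_nodes(".categories a")
  # Record the post's category (the first one, if several are listed)
  blogDF$category[i] <- html_text(catNodes)[1]
}

# get just the posts about winning teams
winPosts <- filter(blogDF, category == "Tutorials and Winners' Interviews")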

Idea: Check the tags!

Now I had the relevant posts. My first thought was to check the tags on each page. Some posts were tagged in a helpful way, e.g. “GBM, Python, scikitlearn”. However, I quickly saw that this would not work: relatively few posts were tagged, and even then some were tagged oddly. One post was even tagged “Celine Dion.” A dead end here, but My Heart Will Go On.

Grabbing the (important) Text

There was a sea of text, and as much as Kagglers like their SVMs and GBMs, those mentions would be easily swamped by competition-specific words in the word cloud. So I needed to isolate the answers to just the important questions. I had seen that a few posts marked their interview questions with helpful node ids or tags like “question” or even “h1”, but that others simply used strong or bold text. I resorted to something fairly brute-force: grabbing all the paragraph text (plus the h3 headings) and sorting it out afterwards.

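# get the text from the winning pages
entrytxt <- data.frame()
for (i in seq_len(nrow(winPosts))) {
  pNodes <- read_html(winPosts$link[i]) %>%
    html_nodes("div p, h3")
  postText <- html_text(pNodes)
  entrytxt <- rbind(entrytxt, data.frame(post = rep(i, length(postText)),
                                         text = postText, stringsAsFactors = FALSE))
}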

What was the question?

With all the text from the interviews scraped, I had to pick out the questions (and specifically, the questions that asked about methods and tools). I needed all of the questions because I would later be extracting all the text in the post between the important question and the *next* question. Spot checking the question list, I saw that most of those that asked about approaches involved the words “tools”, “methods” or “learning.”

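# Tag an entry as a question
# (code omitted here; this step sets the logical column entrytxt$isQuestion used below)

# Tag an entry as involving tools/methods/learning, e.g. "Which machine learning methods did you use?"
# The second condition approximates the word count (runs of non-word characters, plus one)
# and keeps only entries under 11 words, which filters out lengthy erroneous labels
entrytxt$techniques <- grepl(pattern = "tools|methods|learning", x = entrytxt$text) &
  (sapply(gregexpr("\\W+", entrytxt$text), length) + 1) < 11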
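To see how the two flags behave, here is a minimal sketch on a toy data frame. The toy text is made up, and the rule used for isQuestion (an entry that ends in a question mark) is an assumption for illustration, since the post does not show how questions were actually tagged.

```r
# Toy stand-in for entrytxt; the real data frame holds the scraped paragraphs
entrytxt <- data.frame(
  text = c(
    "What was your background prior to entering the Challenge?",
    "I studied statistics.",
    "Which tools did you use?",
    "Mostly R and Python.",
    "We compared many learning methods over several weeks and eventually settled on a blend of gradient boosting and random forests."
  ),
  stringsAsFactors = FALSE
)

# Assumption: a question is any entry that ends with a question mark
entrytxt$isQuestion <- grepl("\\?\\s*$", entrytxt$text)

# Approximate word count: runs of non-word characters, plus one
wordCount <- sapply(gregexpr("\\W+", entrytxt$text), length) + 1

# Technique questions mention tools/methods/learning and are short
entrytxt$techniques <- grepl("tools|methods|learning", entrytxt$text) & wordCount < 11

print(entrytxt[, c("isQuestion", "techniques")])
```

Only the third entry is flagged as a technique question. The last entry mentions “learning” and “methods” but is far too long to be a question label, which is the case the word-count filter is there to catch.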

Some answers

I set an index of questions and an index of machine learning questions, looping through the text data frame to extract only the text that was in response to the methods/tools questions of interest.

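# Index for all questions
Qindex <- which(entrytxt$isQuestion)

# Index for ML questions
MLindex <- which(entrytxt$techniques)

# Find the paragraphs between a ML question and the next question
MLpara <- data.frame()
for (i in seq_along(Qindex)) {
  if (Qindex[i] %in% MLindex) {
    startMLAns <- Qindex[i] + 1
    # The last question has no successor, so its answer runs to the end of the text
    endMLAns <- if (i < length(Qindex)) Qindex[i + 1] - 1 else nrow(entrytxt)
    # Skip questions that are followed immediately by another question
    if (startMLAns <= endMLAns) {
      MLpara <- rbind(MLpara, data.frame(start = startMLAns, end = endMLAns))
    }
  }
}

# Take the actual text from the index list
MLAns <- character(0)
for (i in seq_len(nrow(MLpara))) {
  MLAns <- c(MLAns, paste(entrytxt$text[MLpara$start[i]:MLpara$end[i]], collapse = " "))
}

answerText <- paste(MLAns, collapse = " ")

# Write the text file to a temporary directory for the wordcloud
# (the file name is a placeholder; DirSource("tmp/") below reads every file in the directory)
dir.create("tmp", showWarnings = FALSE)
writeLines(answerText, file.path("tmp", "answers.txt"))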
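The same "everything between an ML question and the next question" idea can also be written without an explicit loop. This is only a sketch on made-up toy data, and it assumes every ML question is followed by at least one answer paragraph.

```r
# Toy stand-in for entrytxt, with the two flags already set
entrytxt <- data.frame(
  text = c("Which tools did you use?", "Mostly R.", "Some Python too.",
           "What was your background?", "Physics.",
           "What methods did you use?", "Random forests."),
  isQuestion = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  techniques = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

Qindex <- which(entrytxt$isQuestion)                         # 1, 4, 6
MLindex <- which(entrytxt$isQuestion & entrytxt$techniques)  # 1, 6

# Each answer starts after its question and ends before the next question;
# appending nrow + 1 as a sentinel lets the last answer run to the end of the text
starts <- MLindex + 1
nextQ <- c(Qindex, nrow(entrytxt) + 1)
ends <- sapply(MLindex, function(q) min(nextQ[nextQ > q])) - 1

# Paste the paragraphs of each answer into one string
MLAns <- mapply(function(s, e) paste(entrytxt$text[s:e], collapse = " "),
                starts, ends)

print(MLAns)
```

The background answer (“Physics.”) is left out because its question is not flagged as a technique question, which is exactly the filtering the word cloud depends on.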

Finally, it was time for a word cloud. Using tm and wordcloud, I followed this tutorial to get up and running. I had to remove certain “meaningless” words before I was able to see the random forest through the trees.


library(tm)
library(wordcloud)
library(RColorBrewer)  # brewer.pal()
library(magrittr)      # %>%

meaninglessWords <- c("data","used", "using","features","code","different","models","table","problem","like","category", "first","second","third","one","two","three","approach","many","number","blog","model","feature", "also","learning","top","score","competition","create","user","much")

winSpeak <- Corpus(DirSource("tmp/")) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument) %>%  # stemming requires the SnowballC package
  tm_map(removeWords, meaninglessWords)

wordcloud(winSpeak, scale = c(3, 0.5),
          colors = brewer.pal(8, "Dark2"))
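A word cloud sizes each word by how often it appears, so a plain frequency table is a handy sanity check on what the picture will show. The base R sketch below uses a made-up answerText and a hypothetical short dropWords list in place of stopwords("english") and meaninglessWords, and it skips stemming.

```r
# Toy stand-in for the scraped answers
answerText <- "We used random forests and GBM. Random forests worked best, and a random search tuned GBM."

# Hypothetical short list of words to drop
dropWords <- c("we", "used", "and", "worked", "best", "a")

# Lowercase, split on anything that is not a letter, and drop empty strings
words <- strsplit(tolower(answerText), "[^a-z]+")[[1]]
words <- words[nchar(words) > 0]
words <- words[!(words %in% dropWords)]

# Count each word and sort from most to least frequent
wordFreq <- sort(table(words), decreasing = TRUE)
print(wordFreq)
```

Even in this tiny example “random” comes out on top, partly because of “random search” and not only “random forests”. A single-word count cannot tell those uses apart.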

The right tool for the right job… and a giant toolbox

Sort of unsurprisingly, no particular machine learning method or coding language pops out of the background. “Random” is gigantic, but that is not necessarily due to mentions of random trees or random forests. The cloud suggests that there is no single tool we should all be using and no clear-cut method for winning on Kaggle.

On a cool note, I recently stumbled across Kaggle’s beta wiki and posts like this one on Random Forests that link to relevant information and even to competition winners who primarily used random forests. This will be an awesome tool for even the casual Kaggler (or an aspiring data scientist seeking to expand his toolset).