Thursday 10 May 2012

My first experience with Mahout (as a researcher)

One of the problems in the area of Recommender Systems is that it is hard to reproduce someone else's experiments. I find that papers normally describe the implementation and configuration of their algorithms vaguely and incompletely. Little tweaks in an algorithm can lead to big changes in the output, and it's hard to repeat experiments if these small (but crucial) details are not known.

Recently I have seen researchers using frameworks to implement existing algorithms, which I think is a necessary step towards making experiments repeatable. For this reason I decided to try Mahout and implement some traditional recommendation algorithms, in particular User-based Collaborative Filtering (UBCF). I want to describe my experience with this framework in case there are other researchers who are thinking about using it, or who have already started using it.

The Mahout distribution I've been using is 0.6. I've been looking at the taste package which includes the code for recommendations.

The first thing I noticed when looking at some classes is that the default algorithms in Mahout aren't ideal. For example, the predictor used by Mahout for CF is just a simple weighted average predictor. Resnick's predictor, which works with each user's deviation from their own mean rating, is generally reported to work better, so I found it strange that they haven't implemented it. Indeed, after implementing a new Resnick predictor and running it on the Movielens dataset I got better results. So if you are going to run UBCF, first of all try to modify the predictor; in theory you should see better results with Resnick.
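
To make the difference between the two predictors concrete, here is a minimal, self-contained sketch in plain Java. It is not Mahout code and not the predictor I actually wrote: the class ResnickSketch, the Neighbour holder and both method names are hypothetical, and the fallback values for an empty neighbourhood are my own assumption.

```java
import java.util.List;

/** Illustrative sketch only; none of these names exist in Mahout. */
final class ResnickSketch {

    /** One neighbour of the target user, for a single target item. */
    static final class Neighbour {
        final double similarity;  // similarity between neighbour and target user
        final double rating;      // neighbour's rating of the target item
        final double meanRating;  // neighbour's mean rating over all their items

        Neighbour(double similarity, double rating, double meanRating) {
            this.similarity = similarity;
            this.rating = rating;
            this.meanRating = meanRating;
        }
    }

    /** Simple weighted average: sum(sim * rating) / sum(|sim|). */
    static double weightedAverage(List<Neighbour> neighbours) {
        double num = 0.0;
        double den = 0.0;
        for (Neighbour n : neighbours) {
            num += n.similarity * n.rating;
            den += Math.abs(n.similarity);
        }
        // Assumption: no usable neighbours means no prediction.
        return den == 0.0 ? Double.NaN : num / den;
    }

    /** Resnick: targetMean + sum(sim * (rating - neighbourMean)) / sum(|sim|). */
    static double resnick(double targetMean, List<Neighbour> neighbours) {
        double num = 0.0;
        double den = 0.0;
        for (Neighbour n : neighbours) {
            num += n.similarity * (n.rating - n.meanRating);
            den += Math.abs(n.similarity);
        }
        // Assumption: fall back to the target user's own mean.
        return den == 0.0 ? targetMean : targetMean + num / den;
    }
}
```

The only difference is that Resnick averages deviations from each neighbour's mean and adds the result to the target user's mean, so a generous rater and a strict rater no longer pull the prediction in different directions just because of their rating habits.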


The second thing I realised is that User-based CF was a bit slow when I ran it. Then I found out that Mahout computes the user similarities on the fly. This is because in a live system users tend to join and leave all the time. But when running experiments we don't need this! Computing similarities in advance is common practice, so I would recommend changing this in Mahout if your datasets are big and you don't want to wait for days to get your results.
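
As a rough idea of what "computing similarities in advance" means, here is a minimal sketch in plain Java. It is not Mahout code: the class PrecomputedSimilarities is hypothetical, ratings are assumed to fit in a dense users-by-items array with NaN marking a missing rating, and Pearson correlation over co-rated items is just one possible similarity.

```java
/** Illustrative sketch only; this class does not exist in Mahout. */
final class PrecomputedSimilarities {

    private final double[][] sims; // symmetric user-by-user matrix

    /** ratings[u][i] is user u's rating of item i, or NaN if unrated. */
    PrecomputedSimilarities(double[][] ratings) {
        int n = ratings.length;
        sims = new double[n][n];
        for (int u = 0; u < n; u++) {
            sims[u][u] = 1.0;
            // Each pair is computed once and stored in both directions.
            for (int v = u + 1; v < n; v++) {
                double s = pearson(ratings[u], ratings[v]);
                sims[u][v] = s;
                sims[v][u] = s;
            }
        }
    }

    /** Constant-time lookup during the experiment. */
    double similarity(int u, int v) {
        return sims[u][v];
    }

    /** Pearson correlation over the items both users have rated. */
    static double pearson(double[] a, double[] b) {
        int count = 0;
        double sumA = 0.0;
        double sumB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                count++;
                sumA += a[i];
                sumB += b[i];
            }
        }
        if (count < 2) {
            return 0.0; // assumption: too little overlap counts as "no similarity"
        }
        double meanA = sumA / count;
        double meanB = sumB / count;
        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                double da = a[i] - meanA;
                double db = b[i] - meanB;
                cov += da * db;
                varA += da * da;
                varB += db * db;
            }
        }
        if (varA == 0.0 || varB == 0.0) {
            return 0.0; // a constant rater has no defined correlation
        }
        return cov / Math.sqrt(varA * varB);
    }
}
```

The trade-off is memory: the matrix grows quadratically with the number of users, so for very large datasets you would probably store only the top neighbours of each user instead.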

Finally there are a couple of silly mistakes I made because of a non-obvious implementation in Mahout.

Be careful when loading files with the same name into a FileDataModel!

When loading a particular file into a FileDataModel I realised I was getting a model with a lot more users and items than the file contained. I almost went crazy trying to find out what was happening. Well, it seems that if you load a particular file (for example "data.test") and there is another file in the same folder with the same name but a different file extension (for example, "data.train"), it will also load that file into your model. Apparently, the reason they do this is to let you provide updates to the main file as separate files, so that new data can be pushed without having to copy the existing data again.

But researchers often use the same base name for their test and training files. Funnily enough, this is the format of the commonly used Movielens test and training files... You can imagine how wrong (and "better") your results can be if you think you are loading only your training data and you are in fact loading the test data as well!! So I think this is something to be extremely careful about!
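
One cheap safeguard is to check the folder before loading anything. The sketch below is plain Java and does not call Mahout at all: the class SiblingFileCheck and its method are hypothetical, and the matching rule (same file name up to the first dot) is my assumption based on the behaviour described above, not something taken from the Mahout documentation.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; this helper is not part of Mahout. */
final class SiblingFileCheck {

    /**
     * Returns the other regular files in dataFile's folder whose names start
     * with dataFile's name up to its first dot (assumed matching rule).
     */
    static List<Path> conflictingSiblings(Path dataFile) throws IOException {
        String name = dataFile.getFileName().toString();
        int dot = name.indexOf('.');
        String prefix = dot < 0 ? name : name.substring(0, dot);
        Path dir = dataFile.toAbsolutePath().getParent();

        List<Path> conflicts = new ArrayList<Path>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path candidate : stream) {
                String other = candidate.getFileName().toString();
                if (!other.equals(name)
                        && other.startsWith(prefix)
                        && Files.isRegularFile(candidate)) {
                    conflicts.add(candidate);
                }
            }
        }
        return conflicts;
    }
}
```

If the returned list is not empty, the simplest fix is to move the training and test files into separate folders (or rename them) before creating the model.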

Be careful when manipulating a PreferenceArray.

Do not create a PreferenceArray using the constructor GenericUserPreferenceArray(int size) if you don't know the final size of the array. It's not dynamic like a Vector, for example, so if you don't fill every position, the remaining elements keep their default values, which is dangerous and will most probably lead to wrong outputs!!
PreferenceArray newPrefs = new GenericUserPreferenceArray(100); //Avoid this!!
newPrefs.set(0, p0);
newPrefs.set(1, p1);

Instead fill an ArrayList of Preferences first and then use that to create the PreferenceArray.

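ArrayList<Preference> list = new ArrayList<Preference>();
list.add(p0); // add() appends; set() would throw on an empty list
list.add(p1);

// The array is sized from the list, so there are no unused slots.
PreferenceArray newPrefs = new GenericUserPreferenceArray(list);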

My conclusion after a first "taste" ;) of the framework is that although Mahout might be very useful for someone trying to deploy a live recommender system, I don't think it is appropriate for someone doing research. Researchers might use it as a framework to plug components into, but even in that case I'm not sure it's the best choice, since the default configuration might lead to errors in the results. It would have been interesting to see a spin-off of Mahout focused on research, but unfortunately that doesn't seem to be the direction they are taking with it.

Lately I have started hearing good things about other recommender systems frameworks: LensKit (created by the GroupLens research group) and MyMediaLite (developed at the University of Hildesheim, Germany). Although I haven't used them yet, the fact that they are focused on research gives me a good feeling and I'll probably be using one of them in the future. Both were presented at the last ACM Recommender Systems Conference (2011):


MyMediaLite: a free recommender system library


Rethinking the recommender research ecosystem: reproducibility, openness, and LensKit

LensKit: A modular Recommender Framework


Tuesday 16 August 2011

Reactions on Obama joining Foursquare


Today the White House announced that President Obama is now on Foursquare. So many things come to mind after hearing this. I wondered what Twitter users thought about it, and here are some of the users' reactions I found (to me the last one is the winner):

  • Paul Gillin: Obama is now on Foursquare. Do you think he'll be satisfied with aspiring to the position of "mayor"?
  • Irene Rojas
    Pres. Obama is the mayor of Area 51! RT : Obama's been fooled into thinking Foursquare is somehow useful
  • William Behrmann
  • The Dark Lord: Obama is on Foursquare? This makes less sense than if Harry Potter had decided to 'check in' every time he found a horcrux.

Some interesting workshop proceedings

I've just found these workshop proceedings from some of last year's IUI and WWW workshops. I think they can be useful since they are quite related and they are not on ACM.

Reviewing Papers with PaperCritic

Here is a tool you lab monkeys might find useful. It's called PaperCritic and it was created by a PhD student from Trinity College whom I met in Cambridge.

The idea behind PaperCritic is simple but quite interesting: it allows Mendeley users to review papers and also rate them on different aspects: references, originality, argumentation and readability. You can also add a button to your toolbar which lets you review papers in one click (similar to the "one-click-button" from Mendeley). They have just launched it, so of course there is still a lot of room for improvement, but it looks cool anyway. I like the way they are using the Mendeley API, although it seems they are also restricted by it.

You can try the tool here.


Facebook relationship status



Thursday 11 August 2011

Datasets and APIs

I've put together these API links and datasets (some from your emails) that can be useful for you at some point:

Movies:
Flixster/Rotten Tomatoes API: http://developer.rottentomatoes.com/docs
Netflix: http://narod.ru/disk/7133213001/netflix.7z.html

Products:
Blippr: http://www.blippr.com/api
Best Buy: http://bbyopen.com/
Shopsense: http://shopsense.shopstyle.com/
Amazon: http://aws.amazon.com/code

Others:
Clique datasets: http://www.cliquecluster.org/data
3Taps(eBay, Craigslist, Etsy, Twitter...): http://3taps.com/developers
Synthetic Datasets: http://code.richrelevance.com/reclab-core/

Sentiment Analysis
If you need to do sentiment analysis on Twitter, Stanford has a sentiment tool that seems to work quite well: https://sites.google.com/site/twittersentimenthelp/api
You can try their tool here: http://twittersentiment.appspot.com/



NOTE: You can update this post if you have any other useful links!

Let's get Social!


Hey guys!

I've created this private blog so we can share stuff about our research. I've noticed we keep sending emails with useful links but then it's hard to find those links again... So that's why I think we can organise all this information a bit better. So we can have some categories like news, datasets, conferences, spare time...

For now I've made the blog private. We don't want people to know about our amazing ideas!!! :p

Anyway, hope you like the idea and that you start blogging soon!

Sincerely Yours,

The Spanish Monkey