Netflix Prize: Machine Learning vs Microeconomics

Posted in Netflix, operations research, preferences by Francisco Marco-Serrano @ Aug 10, 2007

While I juggle the data set offered by Netflix in its quest to improve the Cinematch algorithm, I’m on a quest of my own: getting at the theory behind the real model, the structure that lies behind those 2 GB of user and movie ids, dates and so on.

Years ago I co-authored a paper about tastes and preferences, so I’d like to carry on with that research in order to shed some light on the matter (e.g. 40 movie features can be summarised in just one, “the rating”; people’s ratings are inconsistent; blah blah blah). After all, at the end of the day, ratings are just a set of preferences (ordinal, transitive, reflexive, but are they complete?). This doesn’t mean I’ll stop researching through machine learning, only that I’m opening a second front.

For those fighting along my side, I’d recommend the following readings:

_Varian, H. (1992). Microeconomic Analysis. W. W. Norton & Company; 3rd edition.

                 Chapters: 7 (Utility Maximization), 8 (Choice), 19 (Time).

_Rabin, M. (1998). “Psychology and Economics”. Journal of Economic Literature, Vol.XXXVI, pp.11-46.

_Rieskamp et al. (2006). “Extending the Bounds of Rationality: Evidence and Theories of Preferential Choice”. Journal of Economic Literature, Vol.XLIV, pp.631-661.

This doesn’t mean these readings are going to help solve the problem; they will, however, help us understand why, when we do this, this and that, the result is a given RMSE.
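For reference, the RMSE (root mean squared error) mentioned above can be written as follows. The symbols are my own notation, not taken from the readings: N is the number of (user, movie) pairs being scored, r is the actual rating and r-hat the predicted one.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(u,m)} \left( \hat{r}_{um} - r_{um} \right)^{2}}
```

A lower value means the predicted ratings sit, on average, closer to the ratings users actually gave.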

Netflix Prize for Dummies [A+]

Posted in Netflix, software by Francisco Marco-Serrano @ Aug 7, 2007

If you’re an A+ dummy (aka almost-not-a-dummy) you can try this software (Varozhka), created by Eugene Rymski, to play around with the Netflix Prize dataset.

Brilliant!

Netflix Prize for Dummies [III]

Posted in Netflix, databases, spreadsheets by Francisco Marco-Serrano @ Jul 30, 2007

Now that we’ve got the data into a MySQL database, the next step is accessing it from our preferred application (sometimes that means MS Excel, for example). So, let’s go:

1) Make sure your system is up to date (especially the Jet Engine).

2) Download and install the latest MySQL ODBC Connector.

3) Go to Start > Settings > Control Panel > 32bit ODBC and create a new DSN (choose the MySQL driver).

Once finished, you’ll be able to access the data from MS Excel. However, bear in mind that you can’t view the raw data in the spreadsheet, since the number of rows (over 100 million ratings) is far more than a worksheet can hold. The good thing is that you’ll still be able to calculate statistics from there, pull the data into a pivot table, etc.
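One way around the row limit is to let MySQL do the aggregation and bring only the summary into the spreadsheet. The queries below are a minimal sketch, assuming the `training_set` table and column names from part [II] of this series; the aliases `n_ratings` and `avg_rating` are just illustrative names.

```sql
-- How many ratings there are for each rating value
SELECT `rating`, COUNT(*) AS n_ratings
FROM `netflix`.`training_set`
GROUP BY `rating`
ORDER BY `rating`;

-- Number of ratings and average rating per movie
SELECT `idmovie`, COUNT(*) AS n_ratings, AVG(`rating`) AS avg_rating
FROM `netflix`.`training_set`
GROUP BY `idmovie`
ORDER BY `idmovie`;
```

The first result has one row per rating value and the second one row per movie (17,770 of them), so both fit comfortably in a worksheet.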

Hit Predictor

Posted in Netflix, maths, music, preferences, statistics by Francisco Marco-Serrano @ Jul 27, 2007

Yesterday I was talking to my friend and colleague Pau Rausell-Köster, from the Research Unit in Cultural Economics (Universitat de València), about the Netflix Prize. We were discussing the foundations of taste and preferences, and how difficult it seemed to create, by means of a devilish reductionism, a mathematical model that could predict how you’re going to rate a movie. The thing is: it works!

The conversation, though, led us to another mathematical model, one that has been used for a while by a company called Polyphonic HMI S.L. to predict whether a song will be successful (aka “a HIT”). They use a methodology they have named “Hit Song Science”, which basically uses “Spectral Decomposition” to extract different musical attributes from all the songs they have analysed (3.5 million to date). Then they apply clustering techniques to the songs that have been a success in the last 5 years (I imagine the time-frame is there to take out the trends and account for changes/evolution in people’s preferences). With that, they are able to predict whether a new song will succeed in the market, and they assign it a rating (controlling type-I error).

There’s only one downside: would the record companies invest in promoting songs with a low rating? A lack of promotion would hurt a song’s chances of becoming a hit, so, again, our beloved maths would be changing the course of events and distorting the model by feeding flawed data back into it (the reverse, type-II error, could happen as well: bad songs evaluated as possible hits being heavily promoted and succeeding). Moreover, if this happens on a large scale, innovation in music creation is aborted… unless… you’re brave and forget the model!

PS For the Netflix Prize Teams: food for thought.

Netflix Prize for Dummies [II]

Posted in Netflix, databases, operations research by Francisco Marco-Serrano @ Jul 18, 2007

Next is the database. In this example I’m going to use MySQL, although you could just as well use PostgreSQL or MS SQL, for example. I’m on Windows.

2) Creating the database and dumping the data into it.

a. Create the database:

CREATE DATABASE netflix;

b. Create the tables:

“training_set” is the table where I’m going to dump the training_set data, made up of the movie ids, user ids, the rating, and the date.

CREATE TABLE `netflix`.`training_set` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`iduser` INTEGER UNSIGNED NOT NULL,
`rating` INTEGER UNSIGNED NOT NULL,
`date` VARCHAR(10) NOT NULL,
PRIMARY KEY USING BTREE (`idmovie`, `iduser`)
)
ENGINE = MyISAM
COMMENT = 'User ratings';

“movies” is the table where I’ll dump the information from the movies file, made up of movie ids, release date, and title.

CREATE TABLE `netflix`.`movies` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`release` INTEGER UNSIGNED NOT NULL,
`title` VARCHAR(150) NOT NULL,
PRIMARY KEY (`idmovie`)
)
ENGINE = MyISAM
COMMENT = 'Movies List';

Try doing the same for “probe” and “qualifying”.
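As a starting point, here is one possible sketch for those two tables. It assumes that the probe and qualifying files have been flattened into plain comma-separated rows in the same way as the training set (movie id and user id, plus the date for qualifying); the table names, column names and comments are my own choices, picked to match the tables above.

```sql
-- Probe set: (movie, user) pairs, assuming a flattened file
CREATE TABLE `netflix`.`probe` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`iduser` INTEGER UNSIGNED NOT NULL,
PRIMARY KEY (`idmovie`, `iduser`)
)
ENGINE = MyISAM
COMMENT = 'Probe set';

-- Qualifying set: (movie, user, date) rows with no rating column,
-- again assuming a flattened file
CREATE TABLE `netflix`.`qualifying` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`iduser` INTEGER UNSIGNED NOT NULL,
`date` VARCHAR(10) NOT NULL,
PRIMARY KEY (`idmovie`, `iduser`)
)
ENGINE = MyISAM
COMMENT = 'Qualifying set';
```

If your files keep a different layout, adjust the columns accordingly.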

c. Dump the data into the tables:

LOAD DATA LOCAL INFILE 'C:/Netflix/training_set.txt'
REPLACE INTO TABLE netflix.training_set
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STARTING BY ''
(idmovie, iduser, rating, date);

LOAD DATA LOCAL INFILE 'C:/Netflix/movie_titles.txt'
REPLACE INTO TABLE netflix.movies
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STARTING BY '';

Try doing the same for “probe” and “qualifying”.
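A possible sketch for those two loads follows. It assumes that tables `probe` (idmovie, iduser) and `qualifying` (idmovie, iduser, date) already exist, and that both files have been flattened to one comma-separated row per line, like the training set. The file paths simply mirror the ones used above and are assumptions too.

```sql
-- Probe set: one "movie id,user id" pair per line (assumed layout)
LOAD DATA LOCAL INFILE 'C:/Netflix/probe.txt'
REPLACE INTO TABLE netflix.probe
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(idmovie, iduser);

-- Qualifying set: one "movie id,user id,date" row per line (assumed layout)
LOAD DATA LOCAL INFILE 'C:/Netflix/qualifying.txt'
REPLACE INTO TABLE netflix.qualifying
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(idmovie, iduser, date);
```

After each load, running SHOW WARNINGS; is a quick way to spot rows that did not match the expected layout.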

Any problems?