Netflix Prize for Dummies [III]

Posted in Netflix, databases, spreadsheets by Francisco Marco-Serrano @ Jul 30, 2007

Now we’ve got the data into a MySQL database, so the next step is accessing it from our preferred application (sometimes that means MS Excel, for example). So, let’s go:

1) Make sure your system is up to date (especially the Jet Engine).

2) Download and install the latest MySQL ODBC Connector.

3) Go to Start_Settings_Control Panel_32bit ODBC and create a new DSN (choose the MySQL driver).

Once finished, you’ll be able to access the data from MS Excel. However, keep in mind that you can’t view the raw data in the spreadsheet, since the number of rows is huge. The good thing is you’ll be able to calculate statistics from there, or pull the data into a pivot table, etc.
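As a rough sketch of what “calculating the statistics” can look like, here is the kind of query you could run through the ODBC connection so that MySQL does the heavy lifting and only a small summary reaches the spreadsheet. It assumes the `netflix.training_set` table and column names from part [II]; the aliases `n_ratings` and `avg_rating` are just names I made up for illustration.

```sql
-- One summary row per movie instead of one row per rating,
-- so the result is small enough to land in a worksheet.
SELECT idmovie,
       COUNT(*)    AS n_ratings,   -- how many users rated the movie
       AVG(rating) AS avg_rating   -- mean rating for the movie
FROM netflix.training_set
GROUP BY idmovie
ORDER BY idmovie;
```

The idea is to aggregate on the database side and let Excel (or a pivot table) work only on the aggregated result.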

Hit Predictor

Posted in Netflix, maths, music, preferences, statistics by Francisco Marco-Serrano @ Jul 27, 2007

Yesterday I was talking to my friend and colleague Pau Rausell-Köster, from the Research Unit in Cultural Economics (Universitat de València), about the Netflix Prize. We were discussing the foundations of taste and preferences, and how difficult it is to create, by means of a devilish reductionism, a mathematical model that can predict how you’re going to rate a movie. The catch was: it works!

This conversation, though, led to another mathematical model, one that has been used for a while by a company called Polyphonic HMI S.L. to predict whether a song will be successful (aka “a HIT”). They use a methodology they have named “Hit Song Science”, which basically uses “Spectral Decomposition” to extract different musical attributes from all the songs they have analysed (3.5 million to date). Then they apply clustering techniques to the songs that have been a success in the last 5 years (I imagine the time-frame is there to take out the trends and account for changes/evolution in people’s preferences). With that, they are able to predict whether a new song will succeed in the market, and they assign it a rating (controlling type-I error).

There’s only one downside: would the record companies invest in promoting songs with a low rating? Not promoting a song would keep it from becoming a hit, so once again our beloved maths would be changing the course of events and distorting the model by feeding flawed data back into it (the reverse, type-II error, could happen as well: bad songs evaluated as possible hits being highly promoted and succeeding). Moreover, if this happens on a big scale, innovation in music creation is aborted…, unless… you’re brave and forget the model!

PS For the Netflix Prize Teams: food for thought.

Netflix Prize for Dummies [II]

Posted in Netflix, databases, operations research by Francisco Marco-Serrano @ Jul 18, 2007

Next is the database. In this example I’m going to use MySQL, although you could use PostgreSQL or MS SQL, for example. I’m on Windows.

2) Creating the database and dumping the data into it.

a. Create the database:

CREATE DATABASE netflix;

b. Create the tables:

“training_set” is the table where I’m going to dump the training_set data, made up of the movie ids, user ids, the rating, and the date.

CREATE TABLE `netflix`.`training_set` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`iduser` INTEGER UNSIGNED NOT NULL,
`rating` INTEGER UNSIGNED NOT NULL,
`date` VARCHAR(10) NOT NULL,
PRIMARY KEY USING BTREE (`idmovie`, `iduser`)
)
ENGINE = MyISAM
COMMENT = 'User ratings';

“movies” is the table where I’ll dump the information from the movies file, made up of movie ids, release date, and title.

CREATE TABLE `netflix`.`movies` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`release` INTEGER UNSIGNED NOT NULL,
`title` VARCHAR(150) NOT NULL,
PRIMARY KEY (`idmovie`)
)
ENGINE = MyISAM
COMMENT = 'Movies List';

Try doing the same for “probe” and “qualifying”.
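If you want a starting point, here is a minimal sketch for those two tables. It rests on assumptions of mine, not on anything shown above: that you flatten “probe” into (movie id, user id) pairs and “qualifying” into (movie id, user id, date) rows, the same way the training set was flattened. The table comments are also just placeholders.

```sql
-- Assumed layout: one (movie, user) pair per row.
CREATE TABLE `netflix`.`probe` (
  `idmovie` INTEGER UNSIGNED NOT NULL,
  `iduser` INTEGER UNSIGNED NOT NULL,
  PRIMARY KEY (`idmovie`, `iduser`)
)
ENGINE = MyISAM
COMMENT = 'Probe pairs';

-- Assumed layout: same pair plus the rating date, with no rating column.
CREATE TABLE `netflix`.`qualifying` (
  `idmovie` INTEGER UNSIGNED NOT NULL,
  `iduser` INTEGER UNSIGNED NOT NULL,
  `date` VARCHAR(10) NOT NULL,
  PRIMARY KEY (`idmovie`, `iduser`)
)
ENGINE = MyISAM
COMMENT = 'Qualifying pairs';
```

If your flattened files end up with different columns, adjust the definitions to match before loading.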

c. Dump the data into the tables:

LOAD DATA LOCAL INFILE 'C:/Netflix/training_set.txt'
REPLACE INTO TABLE netflix.training_set
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STARTING BY ''
(idmovie, iduser, rating, date);

LOAD DATA LOCAL INFILE 'C:/Netflix/movie_titles.txt'
REPLACE INTO TABLE netflix.movies
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STARTING BY '';

Try doing the same for “probe” and “qualifying”.

Any problems?
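One way to find out is to run a few sanity checks after loading. This is only a sketch: it assumes the table and column names created above, and what counts as a “right” answer depends on your copy of the data.

```sql
-- Anything truncated or skipped during the load shows up here.
-- Run it in the same session, right after each LOAD DATA statement.
SHOW WARNINGS;

-- Row counts per table, to compare against the source files.
SELECT COUNT(*) AS n_rows FROM netflix.training_set;
SELECT COUNT(*) AS n_rows FROM netflix.movies;

-- Ratings should stay between 1 and 5, and the dates
-- should look like YYYY-MM-DD at both ends of the range.
SELECT MIN(rating) AS min_rating,
       MAX(rating) AS max_rating,
       MIN(`date`) AS first_date,
       MAX(`date`) AS last_date
FROM netflix.training_set;
```

If the counts look low, check the warnings first: mismatched line endings or extra commas inside a field are the usual suspects with `LOAD DATA`.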

Netflix Prize for Dummies [ I.b ]

Posted in Netflix, VBA, operations research by Francisco Marco-Serrano @ Jul 16, 2007

Yes, I wasn’t happy at all with the previous code, so I changed it. Processing time improved, coming down to 13 minutes and 59 seconds to aggregate all the files into a single one (though the size increased to 2.62GB). Moreover, I have modified the structure so it’ll be easier to load the data into a database. The new file is now divided into 4 (CSV) columns: movieid, userid, rating, date.

Here’s the VBA code:

Sub GroupData()

    ' Time the whole run.
    Dim T As Date
    T = Now

    Dim N As Long
    Dim Text1 As String
    Dim Text2 As String
    Dim Text3 As String

    ' Single output file: movieid,userid,rating,date
    Open "C:\Netflix\training_set.txt" For Output Access Write As #1

    For N = 1 To 17770
        Open "C:\Netflix\training_set\mv_00" & Format(N, "00000") & ".txt" For Input Access Read As #2

        ' For the first line: Text1 carries the "N:" header, a line break
        ' and the first user id, so strip the header part.
        Input #2, Text1, Text2, Text3
        Print #1, N & "," & Right(Text1, Len(Text1) - (Len(CStr(N)) + 2)) & "," & Text2 & "," & Left(Text3, 10)

        ' For the rest of lines: the previous field (Text3) holds the date,
        ' a line break and the next user id.
        Do While Not EOF(2)
            Input #2, Text1, Text2
            Print #1, N & "," & Right(Text3, Len(Text3) - 11) & "," & Text1 & "," & Left(Text2, 10)
            Text3 = Text2
        Loop

        Close #2

    Next N

    Close #1

    ' Show the elapsed time.
    MsgBox Format(Now - T, "hh:mm:ss")

End Sub

Not another gentle post on Numb3rs… maybe.

Posted in education, maths, movies by Francisco Marco-Serrano @ Jun 26, 2007

After reading a lot of posts praising Numb3rs I decided to give it a try. Regrettably, in Spain it was a total flop (where’s our education system going?), and in Brazil, where I’m currently living, I think it’s on cable (however, judging from recent press reports, maths education is in such bad shape that an 8th grader, about 14 years old, has problems calculating percentages). Well, but that’s not the point; the thing is I bought the First Season DVD set and watched the first two episodes. Here we go!

Minuses:

_ The acting is not very good…, I imagine… yet. It’s just the first two.

_ Why does the mathematician have to be presented as a troubled mind?

Pluses:

_ I liked the way maths is woven into the action…, it feels natural: we do use maths every day and this is shown.

_ What I liked most was the way they try to introduce “humanity” into maths. I think our colleagues from the INFORMS Section, Behavioral Process Management, will be happy.

Thanks! And remember, it’s just my opinion.