Netflix Prize for Dummies [A+]

Comments Off
Posted in Netflix, software by Francisco Marco-Serrano @ Aug 7, 2007

If you’re an A+ dummy (aka almost-not-a-dummy) you can try with this software (Varozhka), created by Eugene Rymski, to play around with the Netflix Prize dataset.

Brilliant!

Research (Management) 2.0, not Research 2.0

Posted in project management, software by Francisco Marco-Serrano @ Aug 2, 2007

For so many months I’ve been thinking and looking for a tool that would let me manage my activity as a researcher; firstly I looked into documental management systems and knowledge management systems, however, I always ended up doing nothing.
This idea was taken to life again during my last conversation with Pau Rausell-Köster, and how universities were starting to apply more advanced systems to the research activity (Research Management).

So, hands on!, a little bit of this plus a sprinkle of that and…, taking a look to the proposal of Julen Iturbe (Consultoría Artesana en la Red - Handcrafted Consulting in the Network), we can adapt a little bit the concept of Consulting 2.0 to obtain Research 2.0 (trendy too much, but mates, I’ve never been from the group that extends the knowledge frontier, but from the one that fills the gap - I’ve taken this idea from a conversation I had years ago with a professor called Jose Ramon Ruiz Tamarit).

So, in the words of Julen, “a project, a wiki”.

Netflix Prize for Dummies [III]

Posted in Netflix, databases, spreadsheets by Francisco Marco-Serrano @ Jul 30, 2007

MySQL ODBCNow we’ve got the data into a MySQL database, so next step is accessing it from our prefered application (sometimes, that means MS Excel, i.e.). So, let’s go:

1) Make sure you system is up to date (specially the Jet Engine).

2) Download and install the last MySQL ODBC Connector.

3) Go to Start_Settings_Control Panel_32bit ODBC and create a new DNS (choose MySQL driver).

Once finished, you’ll be able to access the data from MS Excel. However, you have to take into account you can’t view the data in the spreadsheet since the volume of rows is huge. The good thing is you’ll be able to calculate the statistics from there, or get data from pivot table, etc.

Netflix Prize for Dummies [II]

Posted in Netflix, databases, operations research by Francisco Marco-Serrano @ Jul 18, 2007

Next is the database. In this example I’m going to use MySQL, although you could use PostgreSQL, or MS SQL, for example. I’m in a Windows OS.

2) Creating the database and dumping the data into it.

a. Create the database:

CREATE DATABASE netflix;

b. Create the tables:

“training_ser” is the table where I’m going to dump the training_set data, made up of the movie ids, user ids, the rating, and the date.

CREATE TABLE `netflix`.`training_set` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`iduser` INTEGER UNSIGNED NOT NULL,
`rating` INTEGER UNSIGNED NOT NULL,
`date` VARCHAR(10) NOT NULL,
PRIMARY KEY USING BTREE(`idmovie`, `iduser`);
)
ENGINE = MyISAM
COMMENT = ‘User ratings’;

“movies” is the table where I’ll dump the information from the movies file, made up of movie ids, release date, and title.

CREATE TABLE `netflix`.`movies` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`release` INTEGER UNSIGNED NOT NULL,
`title` VARCHAR(150) NOT NULL,
PRIMARY KEY (`idmovie`)
)
ENGINE = MyISAM
COMMENT = ‘Movies List’;

Try doing the same for “probe” and “qualifying”.

c. Dump the data into the tables:

LOAD DATA LOCAL INFILE “C:/Netflix/training_set.txt”
REPLACE INTO TABLE netflix.training_set
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’ STARTING BY ”
(idmovie, iduser, rating, date);

LOAD DATA LOCAL INFILE “C:/Netflix/movie_titles.txt”
REPLACE INTO TABLE netflix.movies
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’ STARTING BY ”;

Try doing the same for “probe” and “qualifying”.

Any problems?.

Netflix Prize for Dummies [ I.b ]

Posted in Netflix, VBA, operations research by Francisco Marco-Serrano @ Jul 16, 2007

Yes, I wasn’t happy at all with the previous code so I changed it. It improved in processing time, coming down to 13 minutes and 59 seconds to aggregate all the files into a sole one (tough it increased size up to 2.62GB). Moreover, I have modified the structure so it’ll be easier to introduce the data into a database. Now the new file is divided into 4 (CSV) columns: movieid, userid, rating, date.

Here’s the VBA code:

Sub GroupData()

Dim T As Date
T = Now

Dim N As Double
Dim Text1 As String
Dim Text2 As String
Dim Text3 As String

Open “C:\Netflix\training_set.txt” For Output Access Write As #1

For N = 1 To 17770
Open “C:\Netflix\training_set\mv_00″ & Format(N, “00000″) & “.txt” For Input Access Read As #2

‘For the first line.
Input #2, Text1, Text2, Text3
Print #1, N & “,” & Right(Text1, Len(Text1) - (Len(CStr(N)) + 2)) & “,” & Text2 & “,” & Left(Text3, 10)

‘For the rest of lines.
Do While Not EOF(2)
Input #2, Text1, Text2
Print #1, N & “,” & Right(Text3, Len(Text3) - 11) & “,” & Text1 & “,” & Left(Text2, 10)
Text3 = Text2
Loop

Close #2

Next N

Close #1

MsgBox Format(Now - T, “hh:mm:ss”)

End Sub