Netflix Prize for Dummies [III]

Posted in Netflix, databases, spreadsheets by Francisco Marco-Serrano @ Jul 30, 2007

MySQL ODBCNow we’ve got the data into a MySQL database, so next step is accessing it from our prefered application (sometimes, that means MS Excel, i.e.). So, let’s go:

1) Make sure you system is up to date (specially the Jet Engine).

2) Download and install the last MySQL ODBC Connector.

3) Go to Start_Settings_Control Panel_32bit ODBC and create a new DNS (choose MySQL driver).

Once finished, you’ll be able to access the data from MS Excel. However, you have to take into account you can’t view the data in the spreadsheet since the volume of rows is huge. The good thing is you’ll be able to calculate the statistics from there, or get data from pivot table, etc.

Netflix Prize for Dummies [II]

Posted in Netflix, databases, operations research by Francisco Marco-Serrano @ Jul 18, 2007

Next is the database. In this example I’m going to use MySQL, although you could use PostgreSQL, or MS SQL, for example. I’m in a Windows OS.

2) Creating the database and dumping the data into it.

a. Create the database:

CREATE DATABASE netflix;

b. Create the tables:

“training_ser” is the table where I’m going to dump the training_set data, made up of the movie ids, user ids, the rating, and the date.

CREATE TABLE `netflix`.`training_set` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`iduser` INTEGER UNSIGNED NOT NULL,
`rating` INTEGER UNSIGNED NOT NULL,
`date` VARCHAR(10) NOT NULL,
PRIMARY KEY USING BTREE(`idmovie`, `iduser`);
)
ENGINE = MyISAM
COMMENT = ‘User ratings’;

“movies” is the table where I’ll dump the information from the movies file, made up of movie ids, release date, and title.

CREATE TABLE `netflix`.`movies` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`release` INTEGER UNSIGNED NOT NULL,
`title` VARCHAR(150) NOT NULL,
PRIMARY KEY (`idmovie`)
)
ENGINE = MyISAM
COMMENT = ‘Movies List’;

Try doing the same for “probe” and “qualifying”.

c. Dump the data into the tables:

LOAD DATA LOCAL INFILE “C:/Netflix/training_set.txt”
REPLACE INTO TABLE netflix.training_set
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’ STARTING BY ”
(idmovie, iduser, rating, date);

LOAD DATA LOCAL INFILE “C:/Netflix/movie_titles.txt”
REPLACE INTO TABLE netflix.movies
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’ STARTING BY ”;

Try doing the same for “probe” and “qualifying”.

Any problems?.

Netflix Prize for Dummies [ I ]

Posted in Netflix, VBA, databases, operations research by Francisco Marco-Serrano @ May 3, 2007

The Netflix Prize is in the company’s own words the”quest” for “substantially improve(ing) the accuracy of predictions about how much someone is going to love a movie based on their movie preferences”.

I read about the prize last february on Michael Trick’s blog and the first thing I saw was the $1 Million for the winner. However, although we’re on it for the money (YES!) we don’t thing we gonna get it. So, let’s mess about it!:

_For all of you that are, like me, amateur OR-ers, I’m starting a series of posts showing where the heck I am.

……………………………………………………….

1) The data: the training set (data you have to use to create the model) is made up of more than 17 thousand text files. So, although some experts are advising on Netflix’s forums not to group them, I’ll do.

Following my own weaknesses and economist-like-mind I’m going to group the data in a single file, in order to dump it into a database (PostgreSQL, probably). Even more, as I don’t have time to learn any other language, I’ll be using VBA for Excel.

Here we go…

Sub AgrupaDatos()

Dim N As Double
Dim TextoArchivo As String

Open “C:\training_set.txt” For Output As #1

For N = 1 To 17770
Open “C:\training_set\mv_00″ & Format(N, “00000″) & “.txt” For Input As #2

Do While Not EOF(2)
Line Input #2, TextoArchivo
Print #1, TextoArchivo
Loop

Close #2

Next N

Close #1

End Sub

The module above takes about 30 minutes (Pentium 1.73 Ghz, 1GB RAM) to process the data into a file with a size of 1,92GB.

Next, the database.