Netflix Prize for Dummies [A+]
If you’re an A+ dummy (aka almost-not-a-dummy) you can try with this software (Varozhka), created by Eugene Rymski, to play around with the Netflix Prize dataset.
Brilliant!
The dark side of an operations researcher. Or how operations research can be a funny subject. Wanna join me?.
If you’re an A+ dummy (aka almost-not-a-dummy) you can try with this software (Varozhka), created by Eugene Rymski, to play around with the Netflix Prize dataset.
Brilliant!
For so many months I’ve been thinking and looking for a tool that would let me manage my activity as a researcher; firstly I looked into documental management systems and knowledge management systems, however, I always ended up doing nothing.
This idea was taken to life again during my last conversation with Pau Rausell-Köster, and how universities were starting to apply more advanced systems to the research activity (Research Management).
So, hands on!, a little bit of this plus a sprinkle of that and…, taking a look to the proposal of Julen Iturbe (Consultoría Artesana en la Red - Handcrafted Consulting in the Network), we can adapt a little bit the concept of Consulting 2.0 to obtain Research 2.0 (trendy too much, but mates, I’ve never been from the group that extends the knowledge frontier, but from the one that fills the gap - I’ve taken this idea from a conversation I had years ago with a professor called Jose Ramon Ruiz Tamarit).
So, in the words of Julen, “a project, a wiki”.
Now we’ve got the data into a MySQL database, so next step is accessing it from our prefered application (sometimes, that means MS Excel, i.e.). So, let’s go:
1) Make sure you system is up to date (specially the Jet Engine).
2) Download and install the last MySQL ODBC Connector.
3) Go to Start_Settings_Control Panel_32bit ODBC and create a new DNS (choose MySQL driver).
Once finished, you’ll be able to access the data from MS Excel. However, you have to take into account you can’t view the data in the spreadsheet since the volume of rows is huge. The good thing is you’ll be able to calculate the statistics from there, or get data from pivot table, etc.
Next is the database. In this example I’m going to use MySQL, although you could use PostgreSQL, or MS SQL, for example. I’m in a Windows OS.
2) Creating the database and dumping the data into it.
a. Create the database:
CREATE DATABASE netflix;
b. Create the tables:
“training_ser” is the table where I’m going to dump the training_set data, made up of the movie ids, user ids, the rating, and the date.
CREATE TABLE `netflix`.`training_set` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`iduser` INTEGER UNSIGNED NOT NULL,
`rating` INTEGER UNSIGNED NOT NULL,
`date` VARCHAR(10) NOT NULL,
PRIMARY KEY USING BTREE(`idmovie`, `iduser`);
)
ENGINE = MyISAM
COMMENT = ‘User ratings’;
“movies” is the table where I’ll dump the information from the movies file, made up of movie ids, release date, and title.
CREATE TABLE `netflix`.`movies` (
`idmovie` INTEGER UNSIGNED NOT NULL,
`release` INTEGER UNSIGNED NOT NULL,
`title` VARCHAR(150) NOT NULL,
PRIMARY KEY (`idmovie`)
)
ENGINE = MyISAM
COMMENT = ‘Movies List’;
Try doing the same for “probe” and “qualifying”.
c. Dump the data into the tables:
LOAD DATA LOCAL INFILE “C:/Netflix/training_set.txt”
REPLACE INTO TABLE netflix.training_set
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’ STARTING BY ”
(idmovie, iduser, rating, date);
LOAD DATA LOCAL INFILE “C:/Netflix/movie_titles.txt”
REPLACE INTO TABLE netflix.movies
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’ STARTING BY ”;
Try doing the same for “probe” and “qualifying”.
Any problems?.
Yes, I wasn’t happy at all with the previous code so I changed it. It improved in processing time, coming down to 13 minutes and 59 seconds to aggregate all the files into a sole one (tough it increased size up to 2.62GB). Moreover, I have modified the structure so it’ll be easier to introduce the data into a database. Now the new file is divided into 4 (CSV) columns: movieid, userid, rating, date.
Here’s the VBA code:
Sub GroupData()
Dim T As Date
T = NowDim N As Double
Dim Text1 As String
Dim Text2 As String
Dim Text3 As StringOpen “C:\Netflix\training_set.txt” For Output Access Write As #1
For N = 1 To 17770
Open “C:\Netflix\training_set\mv_00″ & Format(N, “00000″) & “.txt” For Input Access Read As #2‘For the first line.
Input #2, Text1, Text2, Text3
Print #1, N & “,” & Right(Text1, Len(Text1) - (Len(CStr(N)) + 2)) & “,” & Text2 & “,” & Left(Text3, 10)‘For the rest of lines.
Do While Not EOF(2)
Input #2, Text1, Text2
Print #1, N & “,” & Right(Text3, Len(Text3) - 11) & “,” & Text1 & “,” & Left(Text2, 10)
Text3 = Text2
LoopClose #2
Next N
Close #1
MsgBox Format(Now - T, “hh:mm:ss”)
End Sub