Netflix Prize for Dummies [ I.b ]

Posted in Netflix, VBA, operations research by Francisco Marco-Serrano @ Jul 16, 2007

Yes, I wasn’t happy at all with the previous code, so I changed it. Processing time improved: aggregating all the files into a single one now takes 13 minutes and 59 seconds (though the output file grew to 2.62 GB). I have also changed the structure so the data will be easier to load into a database. The new file is a CSV with four columns: movieid, userid, rating, date.

Here’s the VBA code:

Sub GroupData()

    ' Start the timer.
    Dim T As Date
    T = Now

    Dim N As Long
    Dim Text1 As String
    Dim Text2 As String
    Dim Text3 As String

    Open "C:\Netflix\training_set.txt" For Output Access Write As #1

    For N = 1 To 17770
        Open "C:\Netflix\training_set\mv_00" & Format(N, "00000") & ".txt" For Input Access Read As #2

        ' For the first line: Text1 holds the "N:" header followed by the first user id,
        ' Text2 the rating, and Text3 the date followed by the next user id.
        Input #2, Text1, Text2, Text3
        Print #1, N & "," & Right(Text1, Len(Text1) - (Len(CStr(N)) + 2)) & "," & Text2 & "," & Left(Text3, 10)

        ' For the rest of the lines: the user id is the tail of the previous date field (Text3),
        ' Text1 is the rating, and the first 10 characters of Text2 are the date.
        Do While Not EOF(2)
            Input #2, Text1, Text2
            Print #1, N & "," & Right(Text3, Len(Text3) - 11) & "," & Text1 & "," & Left(Text2, 10)
            Text3 = Text2
        Loop

        Close #2

    Next N

    Close #1

    ' Show the elapsed time.
    MsgBox Format(Now - T, "hh:mm:ss")

End Sub
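
The string offsets in the two Print lines are easier to follow on a concrete value. Here is a minimal sketch, assuming each field arrives with a single line-feed character glued between the end of one record and the start of the next (which is what the offsets in the code imply, but I haven’t checked it against the raw files here). The user ids and dates are made up for illustration, and DemoOffsets is just a hypothetical name:

```vba
Sub DemoOffsets()
    ' Hypothetical sample values, not taken from the real data set.
    Dim N As Long
    Dim Text1 As String
    Dim Text3 As String

    N = 1
    ' Assumed shape of the first field: header "1:", one line feed, first user id.
    Text1 = CStr(N) & ":" & vbLf & "1234567"
    ' Assumed shape of a date field: 10-character date, one line feed, next user id.
    Text3 = "2005-01-01" & vbLf & "7654321"

    ' Drop the header: the digits of N, the colon, and the line feed.
    Debug.Print Right(Text1, Len(Text1) - (Len(CStr(N)) + 2))   ' 1234567
    ' Keep only the 10-character date.
    Debug.Print Left(Text3, 10)                                  ' 2005-01-01
    ' Drop the date and the line feed to get the next user id.
    Debug.Print Right(Text3, Len(Text3) - 11)                    ' 7654321
End Sub
```

Run it from the VBA editor and the three values show up in the Immediate window. If the real files separated lines differently, the +2 and -11 offsets would have to change accordingly.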

Netflix Prize for Dummies [ I ]

Posted in Netflix, VBA, databases, operations research by Francisco Marco-Serrano @ May 3, 2007

The Netflix Prize is, in the company’s own words, the “quest” for “substantially improve(ing) the accuracy of predictions about how much someone is going to love a movie based on their movie preferences”.

I read about the prize last February on Michael Trick’s blog, and the first thing I saw was the $1 million for the winner. However, although we’re in it for the money (YES!), we don’t think we’re going to get it. So, let’s mess about with it!

For all of you who are, like me, amateur OR-ers, I’m starting a series of posts showing where the heck I am.

……………………………………………………….

1) The data: the training set (the data you have to use to create the model) is made up of more than 17 thousand text files. Although some experts on Netflix’s forums advise against grouping them, I’ll do it anyway.

Following my own weaknesses and economist-like mind, I’m going to group the data into a single file in order to dump it into a database (probably PostgreSQL). What’s more, as I don’t have time to learn any other language, I’ll be using VBA for Excel.

Here we go…

Sub AgrupaDatos()

    Dim N As Long
    Dim TextoArchivo As String

    ' Single output file that collects every movie file.
    Open "C:\training_set.txt" For Output As #1

    For N = 1 To 17770
        Open "C:\training_set\mv_00" & Format(N, "00000") & ".txt" For Input As #2

        ' Copy the movie file to the output, one line at a time.
        Do While Not EOF(2)
            Line Input #2, TextoArchivo
            Print #1, TextoArchivo
        Loop

        Close #2

    Next N

    Close #1

End Sub

The module above takes about 30 minutes (Pentium 1.73 GHz, 1 GB RAM) to process the data into a single 1.92 GB file.

Next, the database.