Netflix Prize for Dummies [ I.b ]
Yes, I wasn’t happy at all with the previous code, so I changed it. Processing time improved: it now takes 13 minutes and 59 seconds to aggregate all the files into a single one (though the output grew to 2.62GB). I have also changed the structure so it’ll be easier to load the data into a database. The new file now has 4 comma-separated (CSV) columns: movieid, userid, rating, date.
Here’s the VBA code:
Sub GroupData()
    Dim T As Date
    T = Now

    Dim N As Double
    Dim Text1 As String
    Dim Text2 As String
    Dim Text3 As String

    Open "C:\Netflix\training_set.txt" For Output Access Write As #1
    For N = 1 To 17770
        Open "C:\Netflix\training_set\mv_00" & Format(N, "00000") & ".txt" For Input Access Read As #2

        'For the first line.
        'Text1 holds the "N:" header, one separator character and the first userid,
        'so the header and separator are cut off from the left.
        Input #2, Text1, Text2, Text3
        Print #1, N & "," & Right(Text1, Len(Text1) - (Len(CStr(N)) + 2)) & "," & Text2 & "," & Left(Text3, 10)

        'For the rest of lines.
        'Text3 holds the previous date (10 characters), one separator character
        'and the userid of the current line.
        Do While Not EOF(2)
            Input #2, Text1, Text2
            Print #1, N & "," & Right(Text3, Len(Text3) - 11) & "," & Text1 & "," & Left(Text2, 10)
            Text3 = Text2
        Loop

        Close #2
    Next N
    Close #1

    'Show the elapsed time.
    MsgBox Format(Now - T, "hh:mm:ss")
End Sub
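
If the Right/Left arithmetic above is hard to follow, here is a minimal sketch of the same per-file transformation written as a standalone function. It is not the code I timed. MovieTextToCsv and DemoMovieTextToCsv are hypothetical names, the sample values are made up, and the sketch assumes each file starts with a "movieid:" line followed by "userid,rating,date" lines.

```vba
' Hypothetical helper: turns the raw text of one mv_*.txt file into
' "movieid,userid,rating,date" rows.
' Assumption: the first line is "<movieid>:" and the rest are "userid,rating,date".
Function MovieTextToCsv(ByVal RawText As String) As String
    Dim TextLines() As String
    Dim MovieId As String
    Dim Result As String
    Dim I As Long

    ' Normalise line endings so CRLF and LF files are split the same way.
    RawText = Replace(RawText, vbCrLf, vbLf)
    TextLines = Split(RawText, vbLf)

    ' The header looks like "1:"; drop the trailing colon.
    MovieId = Left(TextLines(0), Len(TextLines(0)) - 1)

    For I = 1 To UBound(TextLines)
        ' Skip the empty entry left behind by a final line break.
        If Len(TextLines(I)) > 0 Then
            Result = Result & MovieId & "," & TextLines(I) & vbCrLf
        End If
    Next I

    MovieTextToCsv = Result
End Function

' Made-up sample input, printed to the Immediate window.
Sub DemoMovieTextToCsv()
    Debug.Print MovieTextToCsv("1:" & vbLf & "1488844,3,2005-09-06" & vbLf & "822109,5,2005-05-13" & vbLf)
End Sub
```

Building one big string is fine for a small sample, but for the real files it would probably be better to Print each row straight to the output file, as GroupData does.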



Remember, this is just for dummies. Don’t start on me with “it could be optimised!”, “what crappy code!”, blah, blah, blah…
Of course, I would accept suggestions! ; )
Comment by Francisco Marco-Serrano — July 16, 2007 @ 6:03 pm
Moreover, consider adapting the above code to transform the other files: “probe.txt” and “qualifying.txt”.
Comment by Francisco Marco-Serrano — July 17, 2007 @ 12:00 am
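
A rough sketch of what that adaptation might look like. It assumes (check the dataset's README before relying on this) that “probe.txt” and “qualifying.txt” are single files in which a "movieid:" line is followed by that movie's rows, then the next "movieid:" line, and so on. MultiMovieTextToCsv is a hypothetical name and the sample values in the comment are made up.

```vba
' Hypothetical helper: prefixes every data row with the movieid taken from
' the most recent "<movieid>:" header line.
' Assumption: header lines end with ":" and data lines never do.
Function MultiMovieTextToCsv(ByVal RawText As String) As String
    Dim TextLines() As String
    Dim MovieId As String
    Dim Result As String
    Dim I As Long

    ' Normalise line endings so CRLF and LF files are split the same way.
    RawText = Replace(RawText, vbCrLf, vbLf)
    TextLines = Split(RawText, vbLf)

    For I = 0 To UBound(TextLines)
        If Right(TextLines(I), 1) = ":" Then
            ' New header: remember the movieid without the colon.
            MovieId = Left(TextLines(I), Len(TextLines(I)) - 1)
        ElseIf Len(TextLines(I)) > 0 Then
            ' Data row: whatever columns it has are kept as they are.
            Result = Result & MovieId & "," & TextLines(I) & vbCrLf
        End If
    Next I

    MultiMovieTextToCsv = Result
End Function

' Example call with made-up values:
'   MultiMovieTextToCsv("1:" & vbLf & "30878" & vbLf & "10:" & vbLf & "1952305" & vbLf)
```

The rows come out with however many columns the input file has after the movieid, so the two files would need their own table layouts in the database.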