Introductory Demo
 http://www.caffeinatedrook.com/MovieRec/MovieRecServlet
1
Problem Statement
Data: User-Movie Ratings
Input: User number and Movie number
Output: Predicted Rating
Goal: Predict Ratings with the smallest RMSD possible. (Make
Customer happy.)
-0.07283353
0.11694841
-0.0622078
0.081832446
0.049034953
-0.008441236
0.004925302
0.001412398
0.05334269
2
+
0.001291469
0.06918105
-0.08339876
0.12175138
0.17088805
-0.1085485
-0.07531176
-0.083747916
0.12860337
=
4.312 / 5
Motivation – Netflix
 “The Netflix Prize seeks to substantially improve the accuracy of
predictions about how much someone is going to love a movie
based on their movie preferences. The Netflix Prize improves our
ability to connect people to the movies they love.”
www.netflix.com
3
Impact to Field
 Better Recommendations
= Happy Customers
 Happy Customers
= More Money
= Larger Market Share
4
Motivation - Personal
 One million of them
 Feature Extraction
 My uber-competitive nature (aka Justin’s Wife)
5
A problem with the problem statement
 Tackling the Netflix Challenge requires many hundreds
(thousands…more?) of hours of computation.
 Ultimately, it will require the solution to many subproblems.
 Sparcity
 Noise
 Memory Requirements
 Movie Similarity
 User Similarity
6
(more on these a little later)
The Problem Statement Redefined
Data: User-Movie Ratings
Goal: Discover the relationships between..
Movies to other movies
Users to other users
Movies to users
7
Related Work
 Netflix Prize forum http://www.netflixprize.com//community/
 Lots of info on strategies people are trying.
www.netflix.com
www.amazon.com
www.blockbuster.com
www.spout.com
 Singular value decomposition and least squares solutions, Numerische Mathematik Springer





8
Berlin / Heidelberg
Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing)
Predicting User Preference for Movies using NetFlix database, Dhiraj Goel and Dhruv Batra,
Carnegie Mellon University
The Netflix Prize, James Bennett Stan Lanning
Use of KNN for the Netflix Prize, Ted Hong, Dimitris Tsamis, Stanford University
How To Break Anonymity of the Netflix Prize Dataset, Arvind Narayanan, Vitaly Shmatikov
Domain Understanding
The success or failure of retailers rely on matching the
customer to the product. In the case of online retailers,
like Netflix, recommender systems can be built to utilize
the vast sums of data generated online.
Netflix keeps a record for each user, containing the rating
(1-5) for each film the user has rated.
9
Data Selection
 First, what the data does not contain.
 It does not contain
 Movie titles, directors, actors, studio, year
 Customer age, sex, income, favorite color
 Some contestants have written web-crawlers to mine this
information from the web.
10
Data Selection


















11
6:
2031561,1,2004-07-26
1176140,1,2004-02-16
2336133,2,2004-09-05
1521836,1,2004-08-11
117277,3,2004-10-12
326587,3,2004-09-06
1961542,3,2004-04-20
1041552,3,2004-10-19
1678346,3,2005-04-11
643182,2,2004-07-18
2182301,5,2004-08-04
2502669,2,2004-02-10
2211030,4,2004-05-26
603277,3,2004-12-13
214166,2,2005-10-09
……..
……..
• 17,770 Movies
• 480,189 users
• 100,480,507 Ratings
• 17,770 * 480,189
= 8,532,958,530
• 100,480,507 / 8,532,958,530
= 0.01177
(%98.8 sparse!)
Cleaning and Preprocessing
1)
2)
12
Transformed files from movie-view to user-view.
Normalized user ratings via Z-Score normalization.
Discovering Patterns
 Which Software to use? SPSS, SAS, Weka?
 8,532,958,530 ratings * 4 bytes / rating
 34,131,834,120 bytes
 33,331,869 kilobytes
 32,550 megabytes
 31 gigabytes
 Too big to hold the entire matrix
 Too big to hold condensed matrix
 Too “stupid” to manage memory without paging.
13
Discovering Patterns
 Which feature selection method to use?
 Principle Component Analysis
 Singular Value Decomposition
 Multifactor Dimensionality Reduction
 Latent Semantic Analysis
14
Discovering Patterns
M = 17,770 * 25
D = 17,770 * 480,189
444,250
12,004,725
12,448,975
8,532,958,530
Movie: a
User: b
vab = ∑i(Uai x Mbi)
U = 25 * 480,189
15
1000
.001
1c/5h
25c / ~5
 A little board work to explain the algorithm
16
Interpretation: Feature 1-movie view
Trailer Park Boys: Season 3
Trailer Park Boys: Season 4
The Lord of the Rings: The Fellowship of the Ring:
Extended Edition
Lord of the Rings: The Return of the King:
Extended Edition
Lord of the Rings: The Two Towers: Extended
Edition
Lost: Season 1
Veronica Mars: Season 1
House
4
As Time Goes By: Series 9
3
Gilmore Girls: Season 4
Sweet Potato Pie
Legion of the Dead
Dark Town
Comedy Only in Da Hood
Predator Island
Bad Bizness
Vampiyaz
My Big Phat Hip Hop Family
Jack O'Lantern
Desperate Souls
2
1
0
-1
-2
17
Interpretation: Feature 2-movie view
Lost in Translation
National Lampoon's Mr. Wong
Without You I'm Nothing
Punch-Drunk Love
Dogville
The Royal Tenenbaums
Whiteboyz
Pornografia
Spooks & Creeps
Kaaterskill Falls
1
0.8
0.6
0.4
0.2
0
-0.2
-0.4
-0.6
-0.8
18
-1
Dragon Ball Z: World Tournament
Dragon Ball: Piccolo Jr. Saga: Part 2
Dragon Ball: Tien Shinhan Saga
Dragon Ball Z: Fusion
Dragon Ball: Red Ribbon Army Saga
Dragon Ball Z: Garlic Jr.
Dragon Ball: Piccolo Jr. Saga: Part 1
Armageddon
Dragon Ball: The Path to Power
Pearl Harbor
Interpretation: Feature 3-movie view
Nostradamus: A Voice from the Past
1.5
Absolution
1
Ozzy Osbourne: Double O: Unauthorized
Monster-a-Go-Go!
0.5
Dark Harvest 2: The Maize
0
Jessica: A Ghost Story
Vanilla Sky
-0.5
Ivan Vasilievich: Back to the Future
-1
American Beauty
Still Bout It
-1.5
WWE: Rebellion 2002
Battle Athletes: Vol. 3: Go
Sailor Moon: Vol. 10: The Trouble With Rini
Battle Athletes Victory: Vol. 7: The Last Dance
Battle Athletes Victory: Vol. 1: Training
Battle Athletes Victory: Vol. 8: The Human Race!
ECW: Extreme Evolution: Extreme
Championship Wrestling
Battle Athletes Victory: Vol. 6: Willpower
Fushigi Yugi: The Mysterious Play: Eikoden
19
Lupin the 3rd: Dead or Alive
Interpretation: Nearest Neighbors
18 components
 Find the nearest neighbors using Euclidean (q=2) distance
20
q=1
1. American Beauty (1999)
2. Fight Club (1999)
3. Reservoir Dogs (1992)
4. Mystic River (2003)
q=3
1. American Beauty (1999)
2. Mystic River (2003)
3. Fight Club (1999)
4. Traffic (2000)
q=2
1. American Beauty (1999)
2. Fight Club (1999)
3. Mystic River (2003)
4. Reservoir Dogs (1992)
q=4
1. American Beauty (1999)
2. Mystic River (2003)
3. Fight Club (1999)
4. Traffic (2000)
Demo – Name a movie!
 http://www.caffeinatedrook.com/MovieRec/MovieRecServlet
21
Descargar

Title Slide - DePaul University