

Welcome back to Mining of Massive Datasets. We continue our discussion of recommender systems. In the previous lecture, we looked at content-based approaches to building recommender systems. In this lecture, we're going to look at another popular approach called collaborative filtering. The basic idea behind collaborative filtering is very simple. Suppose we have a user x to whom we want to make recommendations. What we're going to do is find a group of other users whose likes and dislikes are similar to those of user x. For example, suppose you're doing movie recommendations. This group of users likes the same movies that x likes and dislikes the same movies that x dislikes. We call this set of users the neighborhood of user x. Once we find the set N of users similar to user x, we find other movies that are liked by a lot of users in the set N and recommend those items to user x. So that's the basic idea behind collaborative filtering.
The key trick is to find the set of users that are similar to user x: the neighborhood of user x. To do that, we need to define a notion of similarity between users. Let's see how to do this. Here's a simple example with four users, a, b, c, and d, and a bunch of movies. HP1 stands for Harry Potter 1, HP2 stands for Harry Potter 2, and so on. You can imagine whatever names you want for the other movies; SW stands for Star Wars. For example, user a has rated three movies: Harry Potter 1, Twilight, and Star Wars 1, with ratings 4, 5, and 1. Imagine we are rating on a scale from zero to five stars here. User b has rated three movies, user c has rated three movies, and user d has rated just two movies. Let's call the set of ratings for a user the user's rating vector.
For example, user a's rating vector is what I've highlighted here. Now, if we have two users x and y with rating vectors rx and ry, what we really need is a similarity metric sim(x, y) that looks at the rating vectors rx and ry. The interesting thing here is that there are lots of movies that a has not rated and lots of movies that b has not rated, so the key in defining the similarity function is how we deal with these unknown values. We'd like to define the similarity so that it captures a very simple intuition: users with similar tastes should have higher similarity than users with dissimilar tastes. For example, in this case, notice that a and b have rated only one movie in common, Harry Potter 1, but they both rated that movie fairly highly. Users a and c, on the other hand, have rated two movies in common, Twilight and Star Wars 1, but their ratings are very dissimilar: a seems to like Twilight while c doesn't, and a hates Star Wars 1 while c loves it.
So intuitively, users a and c are dissimilar while users a and b are similar, and we'd like to capture this intuition when we define the notion of similarity: we'd like the similarity of a and b to be higher than the similarity of a and c. The first option we'll try is Jaccard similarity, which you are familiar with from previous lectures. Just as a reminder, the Jaccard similarity of a and b is the size of the intersection of ra and rb divided by the size of their union, where we treat the rating vectors as sets of rated items. Notice that the Jaccard similarity is nothing but one minus the Jaccard distance of ra and rb. When we use Jaccard similarity, a and b have one rating in common, Harry Potter 1, and they've rated five movies overall, so the similarity of a and b is 1/5. Similarly, since a and c have rated two movies in common out of four overall, the similarity of a and c is 2/4. Notice, though, that when we compute similarities this way, the Jaccard similarity of a and b is actually less than the Jaccard similarity of a and c. This is counter to the intuition we wanted to capture, that a and b are actually more similar than a and c, so we'll have to abandon this notion of using Jaccard similarities. The problem with Jaccard similarity that we'd like to fix is that it ignores the rating values. It only notices that a and b have watched one movie in common while a and c have watched two movies in common,
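To make this concrete, here's a minimal sketch in Python. Only a's ratings are read out in full in the lecture; b's and c's individual values below are assumptions, chosen to be consistent with the averages and similarity values quoted in this example.

```python
# Known ratings from the running example, keyed by movie; blanks are absent.
# b's and c's individual values are assumed, not stated in the lecture.
ratings = {
    "a": {"HP1": 4, "TW": 5, "SW1": 1},
    "b": {"HP1": 5, "HP2": 5, "HP3": 4},
    "c": {"TW": 2, "SW1": 4, "SW2": 5},
}

def jaccard_sim(rx, ry):
    """Jaccard similarity of the sets of rated items; rating values are ignored."""
    x, y = set(rx), set(ry)
    return len(x & y) / len(x | y)

print(jaccard_sim(ratings["a"], ratings["b"]))  # 1/5 = 0.2
print(jaccard_sim(ratings["a"], ratings["c"]))  # 2/4 = 0.5
```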
but it doesn't take into account how they actually liked the movies they watched. A way of capturing the rating values and using them to compute similarity is to notice that the ratings are vectors, so we can compute the cosine of the angle between the vectors and use cosine similarity. This is very similar to the way we used cosine distance in the case of content-based filtering. So let's define the similarity of a and b to be just the cosine of the angle between the rating vectors ra and rb, and let's compute the similarity of a and b in this case. Notice that to compute the cosine similarity we have to insert some value for the unknown ratings, and the simplest thing to do is to treat them as zeros. Suppose we treat all these unknown values as zeros and compute the cosine of the angle between a and b: the cosine turns out to be 0.38. Similarly, the cosine of a and c turns out to be 0.32. Notice that this does capture the fact that the similarity of a and b is greater than the similarity of a and c, as we wanted, but not by very much: the similarities are very close to each other. It just so happens that the similarity of a and b is marginally greater than the similarity of a and c. This still doesn't seem to capture that a and b are much more similar to each other than a and c are.
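Here is the same computation as a sketch, reusing the ratings dict from above and filling every blank with 0 (the movie list is an assumption about the slide's layout; unrated movies contribute zeros either way):

```python
from math import sqrt

MOVIES = ["HP1", "HP2", "HP3", "TW", "SW1", "SW2", "SW3"]

def as_vector(r):
    # Fill every missing rating with 0 (the assumption we revisit below).
    return [r.get(m, 0) for m in MOVIES]

def cosine_sim(vx, vy):
    dot = sum(x * y for x, y in zip(vx, vy))
    return dot / (sqrt(sum(x * x for x in vx)) * sqrt(sum(y * y for y in vy)))

print(round(cosine_sim(as_vector(ratings["a"]), as_vector(ratings["b"])), 2))  # 0.38
print(round(cosine_sim(as_vector(ratings["a"]), as_vector(ratings["c"])), 2))  # 0.32
```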
The problem we have with cosine similarity is that it treats the missing ratings as negative ratings. We've used 0 to fill in the blanks, and on our rating scale from 0 to 5, 0 is the worst possible rating. So we've effectively assumed that if a has not rated Harry Potter 2, a would give Harry Potter 2 a rating of 0, which is a bad assumption, given that a actually liked Harry Potter 1. This is the problem with cosine similarity that we'd like to fix. One way of fixing cosine similarity to accomplish what we want is to use something called centered cosine. The way we're going to do that is to normalize the ratings for a given user by subtracting the row mean, that is, the average rating of the user. To illustrate what I mean, let's go back to our example. Here is the rating vector for a, and notice that the average rating for a is 10/3, because a has rated three items and the sum of those ratings is 10. Similarly, the average rating of b is 14/3, and so on.
What we're going to do is go through the row for user a and subtract the row mean, which is 10/3, from each of the ratings, except the blank ratings, which we're going to treat as 0's. When I do that, here is the modified rating matrix we end up with. Notice that a's rating for Harry Potter 1, which was 4, has become 2/3, because we subtracted the row average of 10/3 from each of the ratings. The 5 has become 5/3, and the rating of 1 for Star Wars has actually become a negative rating, -7/3. And of course, we treat the blank values as 0's. Notice something interesting here: if you sum up the ratings in any row, you get 0. What we've actually done is center the ratings of each user around 0, so 0 becomes the average rating for every user.
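A small sketch of the centering step, using the ratings dict from earlier:

```python
def center(r):
    """Subtract the user's mean rating from each rated item; blanks stay blank."""
    mean = sum(r.values()) / len(r)
    return {m: v - mean for m, v in r.items()}

print(center(ratings["a"]))  # approx {'HP1': 0.67, 'TW': 1.67, 'SW1': -2.33}
                             # i.e. 2/3, 5/3, -7/3, which sum to 0
```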
Positive ratings indicate that the user liked a movie more than average, negative ratings indicate that the user liked the movie less than average, and the magnitude of a rating shows how much the user liked or disliked a specific item. Once we've done the centering, we can compute cosines using these centered ratings. When we do that, the similarity of a and b becomes the cosine of the centered rating vectors ra and rb, which turns out to be 0.09, and the similarity of a and c, using the centered rating vectors, turns out to be -0.56. This captures the fact that a and c are quite dissimilar users: the movies a likes, c doesn't like, and the movies a doesn't like, c actually likes. Notice also that there is a big gap now between the similarity of a and b and the similarity of a and c. It shows that a and b are much more alike than a and c are; in fact, a and c are very unlike each other. So the centered cosine captures
our intuition about similar users much better than the simple cosine. This is because the missing ratings, instead of being treated as negative ratings, are treated as average ratings. This also turns out to be a nice way to handle tough raters and easy raters. Some people tend to be tough raters and effectively rate movies on a scale of 0 to 3, while others tend to be easy raters and are much more liberal with the star ratings. By subtracting out the average rating of each user, we've centered users around an average of 0 and, to some extent, normalized away the difference between tough raters and easy raters. So the centered cosine has these two advantages: centering around 0 and capturing the intuition better. Now, it turns out there's another name for the centered cosine similarity in the world of statistics: it's also known as the Pearson correlation. So if you hear somebody talk about the Pearson correlation, they're just talking about centered cosine similarity.
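Putting the pieces together (center, as_vector, and cosine_sim from the sketches above), the centered cosine reproduces the numbers quoted here:

```python
def centered_cosine_sim(rx, ry):
    """Pearson correlation: center each user's ratings, then take the cosine
    of the zero-filled vectors, so blanks sit at the user's average."""
    return cosine_sim(as_vector(center(rx)), as_vector(center(ry)))

print(round(centered_cosine_sim(ratings["a"], ratings["b"]), 2))  # 0.09
print(round(centered_cosine_sim(ratings["a"], ratings["c"]), 2))  # -0.56
```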
Okay, now we've come up with a way of estimating similarity between users. Once we have this, how do we actually make rating predictions for a user? Here is the problem. Suppose rx is the vector of user x's ratings. What we're going to do is use centered cosine similarity to find the set N of users, which we'll call the neighborhood, consisting of the k users who are most similar to x. We go through the set of all users, compute the similarity between user x and every other user, and select the k users with the highest similarity values; we call that the set N. But we also have to be careful: since we are trying to estimate the rating of item i by user x, we want to make sure that the set N consists only of users who have actually rated item i. So N is the set of the k users most similar to x who also happen to have rated item i. Once we have this, we can make a prediction for user x and item i.
The simplest prediction is to just take the average rating from the neighborhood. Remember, the set N consists of users who rated item i and are similar to user x, so the simplest estimate is the average rating for item i over all users in the neighborhood, and we take that as our estimate of the rating for user x and item i. Option 1 is very simple, but it ignores the actual similarity values between users. While the neighborhood N consists of users who are similar to user x, there might be a range of similarity values within the neighborhood: it might contain users who are very highly similar to user x and a few users who are not that similar. What we'd really like to do is weight the average rating by the similarity values, and that gives us option 2, a weighted average. We look at the neighborhood N and, for each user y in N, we weight y's rating for item i by the similarity of x and y, and then we normalize by the sum of the similarities. That gives us a rating estimate for user x and item i. Notice that I've just used the shorthand sxy for the similarity of users x and y.
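As a minimal sketch of option 2, reusing the ratings dict and centered_cosine_sim from above (predict_user_user is a hypothetical helper name, and for simplicity it ignores ties and negative similarities):

```python
def predict_user_user(ratings, sim, x, i, k):
    """Similarity-weighted average of the neighborhood's ratings for item i."""
    # Candidates: every other user who has actually rated item i.
    candidates = [y for y in ratings if y != x and i in ratings[y]]
    # Neighborhood N: the k candidates most similar to user x.
    neighbors = sorted(candidates,
                       key=lambda y: sim(ratings[x], ratings[y]),
                       reverse=True)[:k]
    num = sum(sim(ratings[x], ratings[y]) * ratings[y][i] for y in neighbors)
    den = sum(sim(ratings[x], ratings[y]) for y in neighbors)
    return num / den

# For example, predict a's rating for HP2 (only b has rated it):
print(round(predict_user_user(ratings, centered_cosine_sim, "a", "HP2", k=2), 2))  # 5.0
```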
The technique that we've used so far is called user-user collaborative filtering, because given a user, we find other users who are similar to that user and use the ratings of those users to predict the ratings of the user we started out with. A dual approach to user-user collaborative filtering is item-item collaborative filtering. The basic idea is simple. Instead of starting with a user and finding similar users, we start with an item i and find items similar to item i, and then we estimate the rating for item i based on the ratings for those similar items. In fact, we can use the same similarity metrics and prediction functions as in the user-user model: we can use centered cosine similarity to find a neighborhood of items similar to our item i, and then use the average or the weighted-average prediction model from the previous slide to make rating predictions for the item. When we do that, here's what the rating function looks like. We're trying to predict the rating for user x and item i, so we start with item i and find a neighborhood of items. The neighborhood N(i, x) is the set of items that are both rated by user x and similar to the item i we're looking at. So we take item i, compute the similarity between item i and every other item we know about, restrict attention to the items that have been rated by user x, and then take the top k of those as the neighborhood N(i, x). Once we do that, we can use the same weighted-average formula from the previous slide to predict the rating of user x for item i. Let's work through an example to illustrate what this means.
Let's say the neighborhood size is 2: we'll look at the two nearest neighbors of item i, and we'll use item-item collaborative filtering. Here is a utility matrix, where the yellow entries are the known ratings. We have movies on the y-axis and users on the x-axis, and the blanks are the unknown ratings that we're trying to predict. Let's assume that the rating values are between 1 and 5. Our goal is to estimate the rating of movie 1 by user 5. Remember, the first step is to take movie 1 and find other movies that are similar to movie 1. We're going to use the Pearson correlation as our similarity, which is the same as the centered cosine similarity we computed before. So we take every other movie and compute its centered cosine similarity with movie 1. When we do that, movie 1's similarity with itself is of course 1.0,
movie 2 is somewhat dissimilar to movie 1, and so on. This vector lists the centered cosine similarities of all the different movies with respect to movie 1. Since our neighborhood size is 2, we need to find the two movies with the highest similarity to movie 1 that have also been rated by user 5. Those two happen to be movie 3, with a similarity of 0.41, and movie 6, with a similarity of 0.59. So we pick those two movies as our neighborhood for movie 1. The similarity values in this case are 0.41 and 0.59, and we just use a weighted-average estimate to predict the rating for movie 1 and user 5. The weighted average is 0.41 times user 5's rating of movie 3, plus 0.59 times user 5's rating of movie 6, which is 3, all divided by the normalizing value 0.41 + 0.59. This gives us 2.6, and 2.6 is the predicted rating for movie 1 and user 5.
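Spelled out as arithmetic (user 5's rating for movie 3 isn't read out in the lecture; a rating of 2 is the integer value consistent with the 2.6 result):

```python
sims = {3: 0.41, 6: 0.59}   # similarities of movies 3 and 6 to movie 1
user5 = {3: 2, 6: 3}        # user 5's known ratings; movie 3's value is assumed

estimate = sum(sims[j] * user5[j] for j in sims) / sum(sims.values())
print(round(estimate, 1))   # (0.41*2 + 0.59*3) / (0.41 + 0.59) = 2.6
```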
We've looked at two ways of doing collaborative filtering: user-user and item-item. Now, in theory, user-user and item-item are dual approaches and should have similar performance. In practice, though, they don't perform similarly at all. In fact, it's been observed in practice that item-item collaborative filtering hugely outperforms user-user collaborative filtering for most use cases. In use cases such as movies or books, for example, item-item clearly outperforms user-user. We might wonder why this is the case, and the answer turns out to be quite interesting: items are inherently simpler than users. Items belong to a small set of genres. For example, you can take a piece of music and classify it as classical, pop, rock, and so on. Users, on the other hand, tend to have very varied tastes. The same user might like, for example, both baroque classical music and acid rock. These are two very different genres, but the same user might actually like both, whereas it's very rare that a single item belongs to both genres of music. It therefore turns out that the notion of item similarity is inherently more meaningful than the notion of user similarity, and that's why item-item collaborative filtering works much better than user-user collaborative filtering for most use cases.