Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cosine similarity between two particular users is 1.0 whereas they have different ratings for a common item #483

Open
deil87 opened this issue Aug 20, 2024 · 1 comment

Comments

@deil87
Copy link

deil87 commented Aug 20, 2024

Description

I was trying to create my custom algo and before moving to something more complicated I wanted to make sure I understand how current code is working.
I couldn't understand why Cosine similarity between two particular users is 1.0 whereas they have different ratings for common item.

Steps/Code to Reproduce

from surprise import AlgoBase, BaselineOnly, NormalPredictor
from surprise import PredictionImpossible

class MyAlgorithm(AlgoBase):
    def __init__(self, sim_options={}, bsl_options={}):

        AlgoBase.__init__(self, sim_options=sim_options, bsl_options=bsl_options)

    def fit(self, trainset):

        AlgoBase.fit(self, trainset)

        # Compute baselines and similarities
        self.bu, self.bi = self.compute_baselines()
        self.sim = self.compute_similarities()

        return self

    def estimate(self, u, i):

        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible("User and/or item is unknown.")

        # Compute similarities between u and v, where v describes all other
        # users that have also rated item i.
        
        
        neighbors = [(v, self.sim[u, v]) for (v, r_ignored) in self.trainset.ir[i]]  
        # Sort these neighbors by similarity
        neighbors = sorted(neighbors, key=lambda x: x[1], reverse=True)

        print(f"estimate({u}, {i}):")
        print("The 3 nearest neighbors of user", str(u), "are:")
        for v, sim_uv in neighbors[:3]:
            print(f"user {v} with sim {sim_uv:1.2f}")

        # ... Aaaaand return the baseline estimate anyway ;)
        bsl = self.trainset.global_mean + self.bu[u] + self.bi[i]
        return bsl

Preparing the data ( taking subsample of 20K to make exploration/investigation faster):

# import urllib.request
# urllib.request.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip", "ml-1m.zip")

reader = Reader(rating_scale=(1, 5))
data_full_lr_20k = data_full_lr[:20000]
data_full = Dataset.load_from_df(data_full_lr_20k[["user", "item", "label"]], reader)

trainset_full, testset_full = train_test_split(data_full, test_size=0.25, random_state=42)

Then I'm running my new custom algo:

bsl_options = {"method": "als", "n_epochs": 5, "reg_u": 12, "reg_i": 5}
sim_options = {
    "name": "cosine",
    "user_based": True,  # compute  similarities between users
}

algo = MyAlgorithm(bsl_options=bsl_options, sim_options=sim_options)

# train and test algorithm.
algo.fit(trainset_full)
predictions = algo.test(testset_full)

# Compute and print Root Mean Squared Error
accuracy.rmse(predictions, verbose=True)

Expected Results

In the printout ( the one that is provided in docs ) I was looking for some neigbors with non zero similarity and for example took this one:

estimate(171, 428)
The 3 nearest neighbours of user 171 are:
user 1123 with sim 1.00
user 450 with sim 0.00
user 455 with sim 0.00

Note that I also added logging for the current estimate function parameters to know which item we are predicting for ( 428 in this example)

So I see that algo considered user 171 and 1123 to be similar. I decided to check it manually.

As "user_based": True then we are calculating similarity between user 171 and other users, that have ratings for 428.

So i checked trainset_full.ir[428]
Output:

[(450, 1.0),
 (455, 2.0),
 (2683, 3.0),
 (1715, 2.0),
 (2819, 2.0),
 (2928, 2.0),
 (1654, 1.0),
 (2179, 3.0),
 (49, 3.0),
 (621, 2.0),
 (3757, 3.0),
 (324, 1.0),
 (2780, 2.0),
 (1123, 2.0)]. <--------------------------- this one. User 1123 rated item 428 with rating 2.0. Looks ok, so far(ie it qualified into the neighbourhood) 

Then I decided to check rating for these 2 users 171 and 1123 to see whether they have similar ratings for common items.

# trainset_full.ur[171]
sorted(trainset_full.ur[171], key=lambda tup: tup[0])

Output:
[(32, 4.0),
 (116, 4.0),
 (160, 2.0),
 (322, 4.0),
 (476, 3.0),
 (513, 2.0),
 (540, 4.0),
 (891, 2.0),
 (896, 4.0),
 (919, 3.0),
 (1034, 3.0),
 (1277, 4.0),
 (1921, 4.0),
 (2106, 3.0),
 (2134, 1.0),
 (2152, 5.0),
 (2493, 3.0)]

sorted(trainset_full.ur[1123], key=lambda tup: tup[0])

[(54, 4.0),
 (182, 3.0),
 (281, 4.0),
 (292, 4.0),
 (428, 2.0),
 (459, 5.0),
 (654, 5.0),
 (891, 3.0),
 (1019, 2.0),
 (1167, 5.0),
 (1210, 3.0),
 (1557, 4.0),
 (1637, 5.0),
 (1666, 3.0)]

I found only one common item and the ratings are different

(891, 2.0)  vs   (891, 3.0),

I didn't specify min_support but it shouldn't matter as when we below it we get 0. It means we are greater or equal than min_support value ( probably default is 1 ).

I would expect similarity not to be equal 1 as ratings are not the same ( 2.0 vs 3.0)

Actual Results

The 3 nearest neighbours of user 171 are:
user 1123 with sim 1.00

Versions

macOS-10.16-x86_64-i386-64bit
Python 3.10.14 (main, May 6 2024, 14:47:20) [Clang 14.0.6 ]
surprise 1.1.4

@deil87
Copy link
Author

deil87 commented Aug 22, 2024

I think the problem here is that cosine similarity is failing to see the difference when we have 1D support vector

To highlight my point I will use 5 vs 1 ratings instead of 3 and 2 in my example above.

a= [5], b= [1]
(a·b) / (‖a‖ × ‖b‖) = 5 / (5.000000 × 1.000000) = 1.000000

Even if we had second dimension with equal values we would get a slightly better result ( it would start to notice some difference between vectors)

a= [3,1], b= [2,1]

a·b = 5×1 + 1×1 = 6.

‖a‖ = √[5² + 1²] = 5.099020 and
‖b‖ = √[1² + 1²] = 1.414214

(a·b) / (‖a‖ × ‖b‖) = 6 / (5.099020 × 1.414214) = 0.832050.

So I believe that min_support should be always >= 2 when cosine similarity is specified in parameters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant