Surprise scikit#


Surprise is a Python scikit for recommender systems based on explicit rating data. Thus, it does not support implicit ratings or content-based information.

It is an easy-to-use scikit to build, test, and compare different recommender-system algorithms. Complete documentation can be found in the Documentation section.

The name Surprise stands for Simple Python RecommendatIon System Engine.

Installation#

With pip:

$ pip install numpy
$ pip install scikit-surprise

With conda:

$ conda install -c conda-forge scikit-surprise

Getting started#

Here is a simple example showing how you can (down)load a dataset, split it for 5-fold cross-validation and compute the MAE and RMSE of the SVD algorithm.

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)


Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

            Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE        0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE         0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time    6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time   0.26    0.26    0.25    0.15    0.13    0.21    0.06

How to use#

The first thing to do is to select an algorithm and import it; it will be used later to predict ratings. A full list of available algorithms can be found in the documentation.

Then we need to load the data, either from a file or with the load_from_df() method. The pandas DataFrame must have three columns, in this order: users, items, and ratings.

To load the data from a pandas DataFrame:

from surprise import Dataset
from surprise import Reader
from surprise import KNNWithMeans

reader = Reader(rating_scale=(1, 5))
# df_alloys_feedback is an existing DataFrame with User, Item and Rating columns.
data = Dataset.load_from_df(df_alloys_feedback[["User", "Item", "Rating"]], reader)

Rather than running cross-validation, we could also simply fit our algorithm to the whole dataset. This can be done with the build_full_trainset() method, which builds a trainset object:

# Similarity parameters.
sim_options = {
    "name": "cosine",    # "pearson", "cosine" or "msd"
    "user_based": True,  # user-based or item-based similarity
}

algo = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()  # train on the full set of data
algo.fit(trainingSet)

You can now predict ratings by calling predict(). Let’s say you want the prediction of user 196 for item 302 (make sure they are both in the trainset!).

user_id = 196
item_id = 302

# Get a prediction for a specific user and item.
prediction = algo.predict(user_id, item_id)


This returns the rating estimate computed by the algorithm.


If you want more information on the prediction, you can pass the actual value of the rating:

user_id = 196
item_id = 302

# Get a prediction for a specific user and item.
prediction = algo.predict(user_id, item_id, r_ui=4, verbose=True)

This should print:

user: 196        item: 302        r_ui = 4.00   est = 4.06   {'actual_k': 40, 'was_impossible': False}

Create your own prediction algorithm#

If you are interested in creating your own prediction algorithm, the documentation provides a step-by-step guide on how to do it.


For additional information and documentation on how to use Surprise:

Surprise documentation
