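Recommendation
This notebook shows how to apply scikit-network to content recommendation.
We use the Movielens dataset, part of the netset collection, which contains ratings of 9066 movies by 671 users.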
[1]:
from IPython.display import SVG
[2]:
import numpy as np
from scipy.cluster.hierarchy import linkage
[3]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.utils import get_neighbors
from sknetwork.visualization import visualize_dendrogram
Data
[4]:
dataset = load_netset('movielens')
Downloading movielens from NetSet...
Unpacking archive...
Parsing files...
Done.
[5]:
biadjacency = dataset.biadjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels
[6]:
biadjacency
[6]:
<9066x671 sparse matrix of type '<class 'numpy.float64'>'
with 100004 stored elements in Compressed Sparse Row format>
[7]:
n_movies, n_users = biadjacency.shape
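For readers new to the term, the biadjacency matrix here has one row per movie and one column per user, with the rating as the entry. The sketch below builds such a matrix from a handful of made-up triples (the values and the `toy_` names are illustrative, not taken from Movielens), using a dense NumPy array for readability where the dataset uses a sparse matrix.

```python
import numpy as np

# Hypothetical (movie, user, rating) triples; not real Movielens data.
ratings = [(0, 0, 4.0), (0, 1, 2.5), (1, 1, 5.0), (2, 0, 3.0), (2, 2, 0.5)]

n_movies_toy = 1 + max(movie for movie, _, _ in ratings)
n_users_toy = 1 + max(user for _, user, _ in ratings)

# Rows are movies, columns are users; 0 means "no rating".
toy_biadjacency = np.zeros((n_movies_toy, n_users_toy))
for movie, user, rating in ratings:
    toy_biadjacency[movie, user] = rating

print(toy_biadjacency.shape)
```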
[8]:
# ratings
np.unique(biadjacency.data, return_counts=True)
[8]:
(array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
 array([ 1101,  3326,  1687,  7271,  4449, 20064, 10538, 28750,  7723,
        15095]))
[9]:
# positive ratings (3 or more)
positive = biadjacency >= 3
[10]:
positive
[10]:
<9066x671 sparse matrix of type '<class 'numpy.bool_'>'
with 82170 stored elements in Compressed Sparse Row format>
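The two stored-element counts can be cross-checked against the rating histogram printed earlier: all counts should add up to 100004, and the counts for ratings of 3 or more should add up to 82170. A small sketch, with the values and counts copied by hand from that output:

```python
import numpy as np

# Rating values and their counts, copied from the np.unique output above.
values = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
counts = np.array([1101, 3326, 1687, 7271, 4449,
                   20064, 10538, 28750, 7723, 15095])

n_total = int(counts.sum())                   # all stored ratings
n_positive = int(counts[values >= 3].sum())   # ratings kept by the threshold

print(n_total, n_positive)  # 100004 82170
```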
[11]:
names_labels
[11]:
array(['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
'Western'], dtype='<U11')
[12]:
labels.shape
[12]:
(9066, 19)
PageRank
We first use (personalized) PageRank to get the most popular movies of each category.
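As background, personalized PageRank can be sketched as a random walk on the bipartite movie-user graph that restarts from a chosen set of movies. The code below is a minimal dense NumPy illustration of that general technique, not the library's implementation: the function name `personalized_pagerank`, the damping factor of 0.85, the fixed number of iterations, and the toy matrix are all assumptions made for the example.

```python
import numpy as np

def personalized_pagerank(biadjacency, restart, damping=0.85, n_iter=100):
    """Power iteration on the bipartite graph of movies (rows) and users (columns)."""
    n_row, n_col = biadjacency.shape
    # Adjacency of the bipartite graph: movies first, then users.
    adjacency = np.block([[np.zeros((n_row, n_row)), biadjacency],
                          [biadjacency.T, np.zeros((n_col, n_col))]])
    degrees = adjacency.sum(axis=1)
    # Row-stochastic transition matrix (assumes no isolated node).
    transition = adjacency / degrees[:, None]
    # Restart distribution, supported on movies only.
    restart_full = np.concatenate([restart, np.zeros(n_col)]).astype(float)
    restart_full /= restart_full.sum()
    scores = restart_full.copy()
    for _ in range(n_iter):
        scores = damping * (transition.T @ scores) + (1 - damping) * restart_full
    # Return the scores of movies only.
    return scores[:n_row]

# Toy graph: 4 movies, 3 users, forming a path m1 - u0 - m0 - u1 - m2 - u2 - m3.
toy = np.array([[1, 1, 0],
                [1, 0, 0],
                [0, 1, 1],
                [0, 0, 1]], dtype=float)

uniform = personalized_pagerank(toy, np.ones(4))             # restart from any movie
seeded = personalized_pagerank(toy, np.array([1, 0, 0, 0]))  # restart from movie 0
print(uniform, seeded)
```

With a uniform restart the scores reflect overall popularity; with a restart concentrated on one movie or one genre, the scores favor movies close to that seed in the graph.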
[13]:
pagerank = PageRank()
[14]:
# top-10 movies
scores = pagerank.fit_predict(positive)
names[top_k(scores, 10)]
[14]:
array(['Forrest Gump (1994)', 'Pulp Fiction (1994)',
'Shawshank Redemption, The (1994)',
'Silence of the Lambs, The (1991)',
'Star Wars: Episode IV - A New Hope (1977)', 'Matrix, The (1999)',
'Jurassic Park (1993)', "Schindler's List (1993)",
'Back to the Future (1985)',
'Star Wars: Episode V - The Empire Strikes Back (1980)'],
dtype=object)
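The call to `top_k` returns the indices of the highest scores, which are then used to index `names`. A rough NumPy equivalent is sketched below; the helper name `top_k_indices` is hypothetical, and the library may order ties differently.

```python
import numpy as np

def top_k_indices(values, k):
    """Indices of the k largest values, in decreasing order of value."""
    values = np.asarray(values)
    order = np.argsort(-values, kind="stable")
    return order[:k]

toy_scores = np.array([0.1, 0.4, 0.05, 0.3, 0.15])
best = top_k_indices(toy_scores, 3)
print(best)  # [1 3 4]
```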
[15]:
# number of movies per genre
n_selection = 10
[16]:
# selection
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(positive, weights=labels[:, label])
    scores = ppr * labels[:, label]
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
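In the loop above, multiplying the personalized scores by the genre indicator sets the score of every movie outside the genre to zero, so the top entries are always movies of that genre. A toy illustration with made-up numbers (both arrays are hypothetical):

```python
import numpy as np

toy_ppr = np.array([0.30, 0.25, 0.20, 0.15, 0.10])  # personalized scores
toy_genre = np.array([0, 1, 0, 1, 1])               # 1 if the movie has the genre

# Movies outside the genre get a score of 0 and drop to the bottom.
masked = toy_ppr * toy_genre
top_in_genre = np.argsort(-masked, kind="stable")[:2]
print(top_in_genre)  # [1 3]
```

Movie 0 has the highest personalized score but is not in the genre, so it is not selected.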
[17]:
# show selection (some movies may have several genres)
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])
---
0 Action
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
'Jurassic Park (1993)'
'Star Wars: Episode V - The Empire Strikes Back (1980)'
'Terminator 2: Judgment Day (1991)']
---
1 Adventure
['Star Wars: Episode IV - A New Hope (1977)' 'Jurassic Park (1993)'
'Star Wars: Episode V - The Empire Strikes Back (1980)'
'Back to the Future (1985)' 'Toy Story (1995)']
---
2 Animation
['SpongeBob SquarePants Movie, The (2004)' 'Tangled Ever After (2012)'
'Space Chimps (2008)' 'Pokémon 3: The Movie (2001)' 'Valiant (2005)']
---
3 Children
['Thomas and the Magic Railroad (2000)' 'Smurfs 2, The (2013)'
'Like Mike (2002)' 'Hey Arnold! The Movie (2002)'
'Race to Witch Mountain (2009)']
---
4 Comedy
['Forrest Gump (1994)' 'Pulp Fiction (1994)' 'Back to the Future (1985)'
'Toy Story (1995)' 'Fargo (1996)']
---
5 Crime
['Pulp Fiction (1994)' 'Shawshank Redemption, The (1994)'
'Silence of the Lambs, The (1991)' 'Fargo (1996)' 'Godfather, The (1972)']
---
6 Documentary
['SOMM: Into the Bottle (2016)' 'Cocaine Cowboys: Reloaded (2014)'
"Cocaine Cowboys II: Hustlin' With the Godmother (2008)"
'Agony and the Ecstasy of Phil Spector, The (2009)' 'Promises (2001)']
---
7 Drama
['Pulp Fiction (1994)' 'Forrest Gump (1994)'
'Shawshank Redemption, The (1994)' "Schindler's List (1993)"
'American Beauty (1999)']
---
8 Fantasy
['Twilight Saga: Eclipse, The (2010)' 'Fat Albert (2004)'
'Nightbreed (1990)' 'Beastmaster 2: Through the Portal of Time (1991)'
'Solace (2015)']
---
9 Film-Noir
['Kiss Before Dying, A (1956)' 'T-Men (1947)' 'No Way Out (1950)'
'Force of Evil (1948)' 'Bullet to the Head (2012)']
---
10 Horror
['Silence of the Lambs, The (1991)' 'Rogue (2007)'
'Paranormal Activity: The Marked Ones (2014)' 'Ring of Terror (1962)'
'Carnosaur 3: Primal Species (1996)']
---
11 IMAX
['Jack the Giant Slayer (2013)' "Dr. Seuss' The Lorax (2012)"
'After Earth (2013)' 'Resident Evil: Retribution (2012)'
'Mars Needs Moms (2011)']
---
12 Musical
['First Nudie Musical, The (1976)' 'Zoot Suit (1981)' 'Yentl (1983)'
"Dr. Seuss' The Lorax (2012)" 'Singing Detective, The (2003)']
---
13 Mystery
['Spirits of the Dead (1968)' 'Oscar (1991)' 'Solace (2015)'
'Nomads (1986)'
'Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)']
---
14 Romance
['Forrest Gump (1994)' 'American Beauty (1999)'
'Princess Bride, The (1987)' 'Beauty and the Beast (1991)'
'Good Will Hunting (1997)']
---
15 Sci-Fi
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
'Star Wars: Episode V - The Empire Strikes Back (1980)'
'Jurassic Park (1993)' 'Back to the Future (1985)']
---
16 Thriller
['Pulp Fiction (1994)' 'Silence of the Lambs, The (1991)'
'Matrix, The (1999)' 'Jurassic Park (1993)' 'Fargo (1996)']
---
17 War
['Iron Eagle II (1988)' 'Dark Blue World (Tmavomodrý svet) (2001)'
'Wind That Shakes the Barley, The (2006)' 'Pathfinder (2007)'
'Night of the Generals, The (1967)']
---
18 Western
['The Ridiculous 6 (2015)' 'Shakiest Gun in the West, The (1968)'
"'Neath the Arizona Skies (1934)" 'Stagecoach (1966)'
'Missing, The (2003)']
We now apply PageRank to get the movies most relevant to a given movie.
[18]:
target = {i: name for i, name in enumerate(names) if 'Cherbourg' in name}
[19]:
target
[19]:
{175: 'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)'}
[20]:
scores_ppr = pagerank.fit_predict(positive, weights={175:1})
[21]:
names[top_k(scores_ppr - scores, 10)]
[21]:
array(['Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)',
'Fargo (1996)', 'Pulp Fiction (1994)',
'Star Wars: Episode IV - A New Hope (1977)',
'L.A. Confidential (1997)', 'Matrix, The (1999)',
'Shawshank Redemption, The (1994)', 'American Beauty (1999)',
'Clockwork Orange, A (1971)', 'Jurassic Park (1993)'], dtype=object)
We can also apply PageRank to recommend movies to a user.
[22]:
user = 1
targets = get_neighbors(positive, user, transpose=True)
[23]:
# seen movies (sample)
names[targets][:10]
[23]:
array(['GoldenEye (1995)', 'Sense and Sensibility (1995)',
'Clueless (1995)', 'Seven (a.k.a. Se7en) (1995)',
'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)',
"Mr. Holland's Opus (1995)", 'Braveheart (1995)',
'Brothers McMullen, The (1995)', 'Apollo 13 (1995)'], dtype=object)
[24]:
mask = np.zeros(len(names), dtype=bool)
mask[targets] = 1
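The two cells above collect the movies that the user rated positively and turn them into a boolean mask over all movies. The sketch below mirrors this on a small dense array, assuming that `get_neighbors(..., transpose=True)` returns the row indices with a nonzero entry in the user's column, which is how the surrounding code uses it. The `toy_` arrays are made up for the example.

```python
import numpy as np

# Rows are movies, columns are users; True means a positive rating.
toy_positive = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]], dtype=bool)
toy_user = 1

# Movies rated positively by the user: nonzero entries of the user's column.
toy_targets = np.flatnonzero(toy_positive[:, toy_user])

# Boolean mask over movies, True for the movies already seen.
toy_mask = np.zeros(toy_positive.shape[0], dtype=bool)
toy_mask[toy_targets] = True
print(toy_targets, toy_mask)
```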
[25]:
scores_ppr = pagerank.fit_predict(positive, weights=mask)
[26]:
# top-10 recommendation
names[top_k((scores_ppr - scores) * (1 - mask), 10)]
[26]:
array(['Shawshank Redemption, The (1994)', 'True Lies (1994)',
'Star Wars: Episode IV - A New Hope (1977)',
'Beauty and the Beast (1991)', 'Toy Story (1995)',
'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 'Fargo (1996)',
'Independence Day (a.k.a. ID4) (1996)', 'Matrix, The (1999)',
'Star Wars: Episode V - The Empire Strikes Back (1980)'],
dtype=object)
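The last cell combines two steps: score each movie by its gain over a baseline, then zero out the movies the user has already seen. The sketch below reproduces that logic on made-up numbers and compares it with a variant, not used in the notebook, that sets seen movies to negative infinity. The function names and all arrays are hypothetical. The two differ only when fewer than k unseen movies have a positive gain, because a zeroed score still ranks above a negative one.

```python
import numpy as np

def recommend_zeroed(personalized, baseline, seen, k):
    """Notebook-style: zero the gain of seen items, then take the k best."""
    gain = np.where(seen, 0.0, personalized - baseline)
    return np.argsort(-gain, kind="stable")[:k]

def recommend_excluded(personalized, baseline, seen, k):
    """Variant (an assumption, not from the notebook): exclude seen items."""
    gain = np.where(seen, -np.inf, personalized - baseline)
    return np.argsort(-gain, kind="stable")[:k]

personalized = np.array([0.4, 0.3, 0.2, 0.1])
baseline = np.full(4, 0.25)
seen = np.array([True, False, False, False])

zeroed = recommend_zeroed(personalized, baseline, seen, 2)
excluded = recommend_excluded(personalized, baseline, seen, 2)
print(zeroed, excluded)  # [1 0] [1 2]
```

In this toy case the zeroed version returns the seen movie 0 in second place, while the variant returns only unseen movies.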
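Embedding
We now represent each movie by a low-dimensional vector and use hierarchical clustering to visualize the structure of this embedding for the top-100 movies.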
[27]:
# embedding
spectral = Spectral(10)
embedding = spectral.fit_transform(positive)
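To give an idea of what a spectral-style embedding of a bipartite graph looks like, the sketch below embeds the rows of a small matrix using the singular vectors of its degree-normalized form. This is an illustration of the general idea only: the `Spectral` class may use a different normalization and scaling, and the function name `svd_embedding` and the toy matrix are assumptions.

```python
import numpy as np

def svd_embedding(biadjacency, n_components):
    """Embed rows using singular vectors of the degree-normalized matrix."""
    biadjacency = np.asarray(biadjacency, dtype=float)
    row_deg = biadjacency.sum(axis=1)
    col_deg = biadjacency.sum(axis=0)
    # Divide by square roots of degrees (assumes no empty row or column).
    normalized = biadjacency / np.sqrt(row_deg)[:, None] / np.sqrt(col_deg)[None, :]
    u, s, _ = np.linalg.svd(normalized, full_matrices=False)
    # On a connected graph the first singular vector only encodes degrees,
    # so it is skipped and the next n_components are kept.
    return u[:, 1:n_components + 1] * s[1:n_components + 1]

# Toy matrix: movies 0 and 1 share the same users, as do movies 3 and 4.
toy = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]])

toy_embedding = svd_embedding(toy, 2)
print(toy_embedding.shape)  # (5, 2)
```

Movies liked by the same users end up at the same point, while movies with disjoint audiences are far apart, which is the property the clustering step relies on.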
[28]:
# top-100 movies
scores = pagerank.fit_predict(positive)
index = top_k(scores, 100)
dendrogram = linkage(embedding[index], method='ward')
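The dendrogram returned by `linkage` is an array with one row per merge. The sketch below applies Ward linkage to a few made-up 2D points and cuts the tree into two clusters with SciPy's `fcluster`; the points and the `toy_` names are hypothetical, and the notebook itself cuts the tree through the `n_clusters` argument of the visualization instead.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated groups of points (hypothetical embedding vectors).
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0]])

# Ward linkage: n - 1 merges, each row is (node, node, distance, size).
toy_dendrogram = linkage(toy_points, method='ward')

# Cut the tree into at most 2 flat clusters.
toy_clusters = fcluster(toy_dendrogram, t=2, criterion='maxclust')
print(toy_dendrogram.shape, toy_clusters)
```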
[29]:
# visualization
image = visualize_dendrogram(dendrogram, names=names[index], rotate=True,
                             width=200, height=1000, n_clusters=6)
SVG(image)