教程

此笔记本概述了scikit-network，一个用于图上机器学习的开源 Python 包。

1. 安装

安装scikit-network最简单的方法是通过pip。

[1]:

# uncomment to install
# !pip install scikit-network

2. 数据

邻接矩阵

每个具有 \(n\) 个节点的图都由一个大小为 \(n \times n\) 的方阵邻接矩阵 \(A\) 表示。在最简单的形式中，矩阵 \(A\) 是二进制的，并且只有当从节点 \(i\) 到节点 \(j\) 存在链接时，我们才有 \(A_{ij} =1\)。如果图是无向的，则矩阵 \(A\) 是对称的。

[2]:

import numpy as np

[3]:

adjacency = np.array([[0, 1, 1, 1, 0], [1, 0, 0, 0, 1], [1, 0, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]])

[4]:

adjacency

[4]:

array([[0, 1, 1, 1, 0],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0]])

稀疏矩阵

在 scikit-network 中，邻接矩阵以 scipy 的稀疏 CSR 格式表示。

[5]:

from scipy import sparse

[6]:

adjacency = sparse.csr_matrix(adjacency)

[7]:

adjacency

[7]:

<5x5 sparse matrix of type '<class 'numpy.int64'>'
        with 12 stored elements in Compressed Sparse Row format>

可视化

以下是上面图的可视化。

[8]:

from IPython.display import SVG
from sknetwork.visualization import visualize_graph

[9]:

# name nodes by indices
n_nodes, _ = adjacency.shape
names = np.arange(n_nodes)

[10]:

# visualization
SVG(visualize_graph(adjacency, names=names))

[10]:

../../_images/tutorials_overview_get_started_16_0.svg

有向图

邻接矩阵不一定是对称的。可能存在从节点 \(i\) 到节点 \(j\) 的链接，但不存在从节点 \(j\) 到节点 \(i\) 的链接。

[11]:

adjacency = np.array([[0, 1, 1, 1, 0], [0, 0, 0, 1, 1], [0, 1, 0, 0, 0], [1, 0, 0, 0, 1], [0, 1, 0, 0, 0]])

[12]:

adjacency

[12]:

array([[0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0]])

[13]:

adjacency = sparse.csr_matrix(adjacency)

[14]:

# visualization
SVG(visualize_graph(adjacency, names=names))

[14]:

../../_images/tutorials_overview_get_started_21_0.svg

二部图

二部图是特殊的图，其边在两组节点之间。二部图可以用其二部邻接矩阵 \(B\) 表示，它是一个矩形矩阵，表示两组节点之间的链接。当且仅当第一组（由 \(B\) 的行表示）中的节点 \(i\) 与第二组（由 \(B\) 的列表示）中的节点 \(j\) 之间存在边时，我们才有 \(B_{ij}=1\)。

[15]:

biadjacency = np.array([[1, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 1, 1, 1, 0], [0, 0, 0, 1, 1]])

[16]:

biadjacency

[16]:

array([[1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1],
       [1, 1, 1, 1, 0],
       [0, 0, 0, 1, 1]])

[17]:

biadjacency = sparse.csr_matrix(biadjacency)

[18]:

biadjacency

[18]:

<4x5 sparse matrix of type '<class 'numpy.int64'>'
        with 11 stored elements in Compressed Sparse Row format>

[19]:

from sknetwork.visualization import visualize_bigraph

[20]:

# name nodes by indices
n_row, n_col = biadjacency.shape
names_row = np.arange(n_row)
names_col = np.arange(n_col)

[21]:

# visualization
SVG(visualize_bigraph(biadjacency, names_row=names_row, names_col=names_col, reorder=False))

[21]:

../../_images/tutorials_overview_get_started_29_0.svg

加权图

边上的权重可以用具有非负项的邻接矩阵（或二部图的二部邻接矩阵）表示。

[22]:

adjacency = np.array([[0, 2, 1, 3, 0], [2, 0, 0, 0, 3], [1, 0, 0, 2, 0], [3, 0, 2, 0, 1], [0, 3, 0, 1, 0]])

[23]:

adjacency

[23]:

array([[0, 2, 1, 3, 0],
       [2, 0, 0, 0, 3],
       [1, 0, 0, 2, 0],
       [3, 0, 2, 0, 1],
       [0, 3, 0, 1, 0]])

[24]:

adjacency = sparse.csr_matrix(adjacency)

[25]:

# visualization
SVG(visualize_graph(adjacency, names=names))

[25]:

../../_images/tutorials_overview_get_started_34_0.svg

子图

子图可以通过切片邻接矩阵轻松获得。

[26]:

index = [0, 2, 3, 4]
sub_adjacency = adjacency[index][:, index]

[27]:

# visualization
SVG(visualize_graph(sub_adjacency, names=names[index]))

[27]:

../../_images/tutorials_overview_get_started_37_0.svg

3. 算法

基本工具

一些用于加载、处理和可视化图的基本工具可用作函数。

模块	描述	函数
数据	加载和保存图	`from_csv`, `save`, `load`, `load_netset`, …
拓扑	连接性和结构	`get_connected_components`, `is_acyclic`, …
路径	最短路径和距离	`get_distances`, `get_shortest_path`, …
线性代数	矩阵运算	`normalize`, `diagonal_pseudo_inverse`, …
实用工具	有用工具	`directed2undirected`, `get_degrees`, `get_neighbors`, …
可视化	可视化工具	`visualize_graph`, `svg_bigraphs`, …

[28]:

from sknetwork.data import karate_club
from sknetwork.utils import get_degrees

[29]:

adjacency = karate_club()

[30]:

get_degrees(adjacency)

[30]:

array([16,  9, 10,  6,  3,  4,  4,  4,  5,  2,  3,  1,  2,  5,  2,  2,  2,
        2,  2,  3,  2,  2,  2,  5,  3,  3,  2,  4,  3,  4,  4,  6, 12, 17],
      dtype=int32)

机器学习

Scikit-network 适用于图上的机器学习。

每个算法都作为一个对象提供，其中包含以下方法中的某些方法

fit: 将算法拟合到图上。此方法是强制性的。
predict: 预测输出（例如，节点的标签）。
predict_proba: 预测输出的概率（例如，标签的概率分布）。
transform: 转换图。
fit_predict: 应用拟合 + 预测。
fit_predict_proba: 应用拟合 + predict_proba。
fit_transform: 应用拟合 + 转换。

输出是带有下划线的对象的属性（例如，labels_）。

模块	描述	算法	输出
聚类	形成节点的簇	`Louvain`, `PropagationClustering`	`labels_`
分类	对节点进行分类	`DiffusionClassifier`, `NNClassifier`, …	`labels_`
GNN	图神经网络	`GNNClassifier`	`labels_`
回归	将值分配给节点	`Diffusion`, `Dirichlet`	`values_`
层次结构	获取节点的层次结构表示	`Paris`, `LouvainHierarchy`	`dendrogram_`
嵌入	获取节点的向量表示	`Spectral`, `SVD`, `LouvainEmbedding`, …	`embedding_`
排名	为节点提供重要性得分	`PageRank`, `Katz`, `Betweenness`, …	`scores_`
链接预测	预测节点之间的链接	`NNLinker`	`links_`

[31]:

from sknetwork.data import karate_club
from sknetwork.classification import DiffusionClassifier

[32]:

dataset = karate_club(metadata=True)
adjacency = dataset.adjacency
position = dataset.position
labels = dataset.labels

[33]:

classifier = DiffusionClassifier()
classifier.fit(adjacency, {node: labels[node] for node in [0, 1, 33]})

[33]:

DiffusionClassifier(n_iter=10, centering=True, scale=5)

[34]:

labels_pred = classifier.predict()

[35]:

SVG(visualize_graph(adjacency, position, labels=labels_pred))

[35]:

../../_images/tutorials_overview_get_started_51_0.svg

[36]:

probs = classifier.predict_proba()

[37]:

scores = probs[:, 1]

[38]:

SVG(visualize_graph(adjacency, position, scores=scores))

[38]:

../../_images/tutorials_overview_get_started_54_0.svg

4. 示例

我们在这里展示如何使用 scikit-network 对图进行聚类，图最初以文本文件的形式存储为边列表。

数据

[39]:

# show first lines
with open('miserables.tsv', 'r') as f:
    data = f.readlines()
    print(''.join(data[:5]))

Myriel  Napoleon
Myriel  Mlle Baptistine
Myriel  Mme Magloire
Myriel  Countess de Lo
Myriel  Geborand

加载

[40]:

from sknetwork.data import from_csv

[41]:

graph = from_csv('miserables.tsv')

[42]:

adjacency = graph.adjacency
names = graph.names

[43]:

adjacency

[43]:

<77x77 sparse matrix of type '<class 'numpy.int64'>'
        with 508 stored elements in Compressed Sparse Row format>

嵌入

我们在这里计算二维嵌入以进行可视化。

[44]:

from sknetwork.embedding import Spring

[45]:

algorithm = Spring()

[46]:

position = algorithm.fit_transform(adjacency)

[47]:

SVG(visualize_graph(adjacency, position, names=names))

[47]:

../../_images/tutorials_overview_get_started_67_0.svg

聚类

最后，我们通过 Louvain 算法对图进行聚类。

[48]:

from sknetwork.clustering import Louvain

[49]:

algorithm = Louvain()
labels = algorithm.fit_predict(adjacency)

[50]:

SVG(visualize_graph(adjacency, position, names=names, labels=labels))

[50]:

../../_images/tutorials_overview_get_started_72_0.svg

二部图

请注意，大多数算法都适用于二部图，其中 fit 方法应用于二部邻接矩阵。我们以电影和演员之间的二部图为例。

[51]:

# show first lines
with open('movie_actor.tsv', 'r') as f:
    data = f.readlines()
    print(''.join(data[:5]))

Inception       Leonardo DiCaprio
Inception       Marion Cotillard
Inception       Joseph Gordon Lewitt
The Dark Knight Rises   Marion Cotillard
The Dark Knight Rises   Joseph Gordon Lewitt

[52]:

graph = from_csv('movie_actor.tsv', bipartite=True)

[53]:

biadjacency = graph.biadjacency
movies = graph.names_row
actors = graph.names_col

[54]:

algorithm = Louvain()
algorithm.fit(biadjacency)
labels_row = algorithm.labels_row_
labels_col = algorithm.labels_col_

[55]:

SVG(visualize_bigraph(biadjacency, names_row=movies, names_col=actors, labels_row=labels_row, labels_col=labels_col))

[55]:

../../_images/tutorials_overview_get_started_79_0.svg