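Load your data

In scikit-network, a graph is represented by its adjacency matrix (or its biadjacency matrix, for a bipartite graph), stored in SciPy's Compressed Sparse Row (CSR) format.

In this tutorial, we present several ways to instantiate a graph in this format.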
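As a quick illustration of what the CSR format stores, here is a minimal sketch using only NumPy and SciPy. The small matrix is an arbitrary example; `indptr`, `indices` and `data` are the standard attributes of a SciPy CSR matrix.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]])
adjacency = sparse.csr_matrix(dense)

# CSR keeps three flat arrays: row pointers, column indices and values.
indptr = adjacency.indptr
indices = adjacency.indices
data = adjacency.data

# The neighbors of node 1 are the column indices stored
# between indptr[1] and indptr[2].
neighbors_of_1 = indices[indptr[1]:indptr[2]]
print(indptr, neighbors_of_1)
```

Because each row is a contiguous slice, reading the neighbors of a node is cheap, which is one reason this format suits graph algorithms.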

[1]:
from IPython.display import SVG

import numpy as np
from scipy import sparse
import pandas as pd

from sknetwork.data import from_edge_list, from_adjacency_list, from_graphml, from_csv
from sknetwork.visualization import visualize_graph, visualize_bigraph

From a NumPy array

For small graphs, you can instantiate the adjacency matrix as a dense NumPy array and convert it into a sparse matrix in CSR format.

[2]:
adjacency = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]])
adjacency = sparse.csr_matrix(adjacency)

image = visualize_graph(adjacency)
SVG(image)
[2]:
../../_images/tutorials_data_load_data_4_0.svg
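When you type an adjacency matrix by hand, a quick sanity check can catch mistakes. The sketch below uses only NumPy and SciPy and assumes the graph is meant to be undirected, so the matrix should be symmetric.

```python
import numpy as np
from scipy import sparse

adjacency = sparse.csr_matrix(np.array([[0, 1, 1, 0],
                                        [1, 0, 1, 1],
                                        [1, 1, 0, 0],
                                        [0, 1, 0, 0]]))

# An undirected graph has a symmetric adjacency matrix.
is_symmetric = (adjacency != adjacency.T).nnz == 0

# The degree of each node is the sum of its row.
degrees = adjacency.dot(np.ones(adjacency.shape[0], dtype=int))

# Each undirected edge is stored twice, once per direction.
n_edges = adjacency.nnz // 2
print(is_symmetric, degrees, n_edges)
```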

From an edge list

Another natural way to build a graph is from a list of edges.

[3]:
edge_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
adjacency = from_edge_list(edge_list)

image = visualize_graph(adjacency)
SVG(image)
[3]:
../../_images/tutorials_data_load_data_6_0.svg

By default, the graph is undirected, but you can easily make it directed.

[4]:
adjacency = from_edge_list(edge_list, directed=True)

image = visualize_graph(adjacency)
SVG(image)
[4]:
../../_images/tutorials_data_load_data_8_0.svg
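To make the difference between the two cases concrete, here is a minimal sketch that builds both matrices with SciPy alone. It illustrates the idea and is not the implementation of `from_edge_list`. It assumes no edge is listed in both directions, since otherwise the symmetrized weights would be doubled.

```python
import numpy as np
from scipy import sparse

edge_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rows, cols = np.array(edge_list).T
n = int(max(rows.max(), cols.max())) + 1
ones = np.ones(len(edge_list), dtype=int)

# Directed graph: one stored entry per edge, from source (row) to target (column).
directed = sparse.csr_matrix((ones, (rows, cols)), shape=(n, n))

# Undirected graph: add the transpose so that each edge appears in both directions.
undirected = directed + directed.T

print(directed.nnz, undirected.nnz)
```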

You might also want to add weights to your edges. Just use triplets instead of pairs!

[5]:
edge_list = [(0, 1, 1), (1, 2, 0.5), (2, 3, 1), (3, 0, 0.5), (0, 2, 2)]
adjacency = from_edge_list(edge_list)

image = visualize_graph(adjacency)
SVG(image)
[5]:
../../_images/tutorials_data_load_data_10_0.svg
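In matrix terms, the third element of each triplet becomes the stored value of the corresponding entry. The following sketch shows this with SciPy only. It is an illustration under that assumption, not the library's own code.

```python
import numpy as np
from scipy import sparse

edge_list = [(0, 1, 1), (1, 2, 0.5), (2, 3, 1), (3, 0, 0.5), (0, 2, 2)]
rows, cols, weights = (np.array(column) for column in zip(*edge_list))
n = int(max(rows.max(), cols.max())) + 1

# The weight of edge (i, j) is stored at entry (i, j) of the matrix.
directed = sparse.csr_matrix((weights, (rows, cols)), shape=(n, n))
adjacency = directed + directed.T

# Weighted degree of each node: sum of the weights of its incident edges.
weighted_degrees = adjacency.dot(np.ones(n))
print(adjacency.toarray())
```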

You can also instantiate a bipartite graph.

[6]:
edge_list = [(0, 0), (1, 0), (1, 1), (2, 1)]
biadjacency = from_edge_list(edge_list, bipartite=True)

image = visualize_bigraph(biadjacency)
SVG(image)
[6]:
../../_images/tutorials_data_load_data_12_0.svg
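A biadjacency matrix has one row per node of the first part and one column per node of the second part, so it is generally not square. The sketch below (SciPy only, for illustration) builds it for the same edges and then assembles the adjacency matrix of the whole bipartite graph as a block matrix.

```python
import numpy as np
from scipy import sparse

edge_list = [(0, 0), (1, 0), (1, 1), (2, 1)]
rows, cols = np.array(edge_list).T
ones = np.ones(len(edge_list), dtype=int)

# Rows index the first set of nodes, columns the second set.
biadjacency = sparse.csr_matrix((ones, (rows, cols)))

# Adjacency matrix of the same graph seen as a plain undirected graph:
# [[0, B], [B^T, 0]], with the two node sets stacked one after the other.
adjacency = sparse.bmat([[None, biadjacency], [biadjacency.T, None]], format='csr')
print(biadjacency.shape, adjacency.shape)
```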

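If the nodes are not indexed (for example, if they are given as strings), you get an object of type Bunch that contains the adjacency matrix together with graph attributes (the node names).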

[7]:
edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"), ("Carey", "David"), ("Bob", "David")]
graph = from_edge_list(edge_list)
[8]:
graph
[8]:
{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
 'adjacency': <4x4 sparse matrix of type '<class 'numpy.int64'>'
        with 10 stored elements in Compressed Sparse Row format>}
[9]:
adjacency = graph.adjacency
names = graph.names
[10]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[10]:
../../_images/tutorials_data_load_data_17_0.svg
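The mapping from names to row indices can be reproduced with `np.unique`, which returns the sorted unique names and the index of each input in that array. This is a sketch of the idea with NumPy and SciPy, not a description of how scikit-network does it internally.

```python
import numpy as np
from scipy import sparse

edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"),
             ("Carey", "David"), ("Bob", "David")]

# Flatten all endpoints, then index them by their rank among the unique names.
endpoints = np.array(edge_list).ravel()
names, inverse = np.unique(endpoints, return_inverse=True)
rows, cols = inverse.reshape(-1, 2).T

n = len(names)
ones = np.ones(len(edge_list), dtype=int)
directed = sparse.csr_matrix((ones, (rows, cols)), shape=(n, n))
adjacency = directed + directed.T
print(names, adjacency.nnz)
```

The node in row `i` of the matrix is `names[i]`, which is why the names must be passed along to the visualization function.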

By default, the weight of each edge is the number of occurrences of the corresponding link.

[11]:
edge_list_new = edge_list + [("Alice", "Bob"), ("Alice", "David"), ("Alice", "Bob")]
graph = from_edge_list(edge_list_new)
[12]:
adjacency = graph.adjacency
names = graph.names
[13]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[13]:
../../_images/tutorials_data_load_data_21_0.svg

You can make the graph unweighted.

[14]:
graph = from_edge_list(edge_list_new, weighted=False)
[15]:
adjacency = graph.adjacency
names = graph.names
[16]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[16]:
../../_images/tutorials_data_load_data_25_0.svg
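Both behaviors, counting repeated links and discarding the counts, have simple counterparts in SciPy. In the sketch below, integer indices stand in for the names (an assumption made to keep the example short).

```python
import numpy as np
from scipy import sparse

# Node indices stand in for the names: Alice=0, Bob=1, Carey=2, David=3.
edge_list = [(0, 1), (1, 2), (0, 3), (2, 3), (1, 3)]
edge_list_new = edge_list + [(0, 1), (0, 3), (0, 1)]

rows, cols = np.array(edge_list_new).T
ones = np.ones(len(edge_list_new), dtype=int)

# SciPy sums the values of repeated (row, col) pairs when building the matrix,
# so each entry ends up holding the number of occurrences of the link.
counts = sparse.csr_matrix((ones, (rows, cols)), shape=(4, 4))
weighted = counts + counts.T

# Unweighted version: keep the sparsity pattern, reset every stored value to 1.
unweighted = weighted.copy()
unweighted.data = np.ones_like(unweighted.data)
print(weighted.toarray())
```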

Again, you can make the graph directed.

[17]:
graph = from_edge_list(edge_list, directed=True)
[18]:
graph
[18]:
{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
 'adjacency': <4x4 sparse matrix of type '<class 'numpy.int64'>'
        with 5 stored elements in Compressed Sparse Row format>}
[19]:
adjacency = graph.adjacency
names = graph.names
[20]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[20]:
../../_images/tutorials_data_load_data_30_0.svg

The graph can also have explicit weights.

[21]:
edge_list = [("Alice", "Bob", 3), ("Bob", "Carey", 2), ("Alice", "David", 1), ("Carey", "David", 2), ("Bob", "David", 3)]
graph = from_edge_list(edge_list)
[22]:
adjacency = graph.adjacency
names = graph.names
[23]:
image = visualize_graph(adjacency, names=names, display_edge_weight=True, display_node_weight=True)
SVG(image)
[23]:
../../_images/tutorials_data_load_data_34_0.svg

The same works for a bipartite graph.

[24]:
edge_list = [("Alice", "Football"), ("Bob", "Tennis"), ("David", "Football"), ("Carey", "Tennis"), ("Carey", "Football")]
graph = from_edge_list(edge_list, bipartite=True)
[25]:
biadjacency = graph.biadjacency
names = graph.names
names_col = graph.names_col
[26]:
image = visualize_bigraph(biadjacency, names_row=names, names_col=names_col)
SVG(image)
[26]:
../../_images/tutorials_data_load_data_38_0.svg
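In the bipartite case, rows and columns are indexed separately, which is why there are two arrays of names. A minimal sketch of this double indexing, using NumPy and SciPy only and not taken from the library's code:

```python
import numpy as np
from scipy import sparse

edge_list = [("Alice", "Football"), ("Bob", "Tennis"), ("David", "Football"),
             ("Carey", "Tennis"), ("Carey", "Football")]
sources, targets = zip(*edge_list)

# Index each side of the graph independently.
names_row, rows = np.unique(sources, return_inverse=True)
names_col, cols = np.unique(targets, return_inverse=True)

ones = np.ones(len(edge_list), dtype=int)
biadjacency = sparse.csr_matrix((ones, (rows, cols)),
                                shape=(len(names_row), len(names_col)))
print(names_row, names_col)
print(biadjacency.toarray())
```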

From an adjacency list

You can also load a graph from an adjacency list, given either as a list of lists or as a dictionary of lists.

[27]:
adjacency_list = [[0, 1, 2], [2, 3]]
adjacency = from_adjacency_list(adjacency_list, directed=True)
[28]:
image = visualize_graph(adjacency)
SVG(image)
[28]:
../../_images/tutorials_data_load_data_41_0.svg
[29]:
adjacency_dict = {"Alice": ["Bob", "David"], "Bob": ["Carey", "David"]}
graph = from_adjacency_list(adjacency_dict, directed=True)
[30]:
adjacency = graph.adjacency
names = graph.names
[31]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[31]:
../../_images/tutorials_data_load_data_44_0.svg
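An adjacency list carries the same information as an edge list. The plain-Python sketch below makes the correspondence explicit, assuming that entry `i` of the list (or the value stored under a key of the dictionary) holds the neighbors of that node.

```python
adjacency_list = [[0, 1, 2], [2, 3]]
adjacency_dict = {"Alice": ["Bob", "David"], "Bob": ["Carey", "David"]}

# List of lists: the position in the outer list is the source node.
edges_from_list = [(source, target)
                   for source, targets in enumerate(adjacency_list)
                   for target in targets]

# Dictionary of lists: the key is the source node.
edges_from_dict = [(source, target)
                   for source, targets in adjacency_dict.items()
                   for target in targets]

print(edges_from_list)
print(edges_from_dict)
```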

From a dataframe

Your dataframe might consist of a list of edges.

[32]:
df = pd.read_csv('miserables.tsv', sep='\t', names=['character_1', 'character_2'])
[33]:
df.head()
[33]:
   character_1  character_2
0  Myriel       Napoleon
1  Myriel       Mlle Baptistine
2  Myriel       Mme Magloire
3  Myriel       Countess de Lo
4  Myriel       Geborand
[34]:
edge_list = list(df.itertuples(index=False))
[35]:
graph = from_edge_list(edge_list)
[36]:
graph
[36]:
{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
        'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
        'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
        'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
        'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
        'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
        'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
        'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
        'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
        'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
        'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
        'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
        'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
        'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
        'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
        'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
 'adjacency': <77x77 sparse matrix of type '<class 'numpy.int64'>'
        with 508 stored elements in Compressed Sparse Row format>}
[37]:
df = pd.read_csv('movie_actor.tsv', sep='\t', names=['movie', 'actor'])
[38]:
df.head()
[38]:
   movie                  actor
0  Inception              Leonardo DiCaprio
1  Inception              Marion Cotillard
2  Inception              Joseph Gordon Lewitt
3  The Dark Knight Rises  Marion Cotillard
4  The Dark Knight Rises  Joseph Gordon Lewitt
[39]:
edge_list = list(df.itertuples(index=False))
[40]:
graph = from_edge_list(edge_list, bipartite=True)
[41]:
graph
[41]:
{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
        'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
        'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
        'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
        'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
       dtype='<U28'),
 'biadjacency': <15x16 sparse matrix of type '<class 'numpy.int64'>'
        with 41 stored elements in Compressed Sparse Row format>}
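If you do not have the TSV files at hand, the dataframe-to-edge-list step can be tried on an in-memory sample. The sketch below uses the first rows shown above and reads them from a string instead of a file. Passing `name=None` to `itertuples` is an optional variation that yields plain tuples rather than named tuples.

```python
import io

import pandas as pd

# A few rows of tab-separated data, standing in for the file on disk.
sample = ("Myriel\tNapoleon\n"
          "Myriel\tMlle Baptistine\n"
          "Myriel\tMme Magloire\n"
          "Myriel\tCountess de Lo\n"
          "Myriel\tGeborand\n")

df = pd.read_csv(io.StringIO(sample), sep='\t', names=['character_1', 'character_2'])

# One tuple per row, without the dataframe index.
edge_list = list(df.itertuples(index=False, name=None))
print(edge_list)
```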

For categorical data, you can use pandas to get the bipartite graph between samples and features. We show an example taken from the Adult Income dataset.

[42]:
df = pd.read_csv('adult-income.csv')
[43]:
df.head()
[43]:
   age    workclass         occupation         relationship   gender  income
0  40-49  State-gov         Adm-clerical       Not-in-family  Male    <=50K
1  50-59  Self-emp-not-inc  Exec-managerial    Husband        Male    <=50K
2  40-49  Private           Handlers-cleaners  Not-in-family  Male    <=50K
3  50-59  Private           Handlers-cleaners  Husband        Male    <=50K
4  30-39  Private           Prof-specialty     Wife           Female  <=50K
[44]:
df_binary = pd.get_dummies(df, sparse=True)
[45]:
df_binary.head()
[45]:
age_20-29 age_30-39 age_40-49 age_50-59 age_60-69 age_70-79 age_80-89 age_90-99 workclass_ ? workclass_ Federal-gov ... relationship_ Husband relationship_ Not-in-family relationship_ Other-relative relationship_ Own-child relationship_ Unmarried relationship_ Wife gender_ Female gender_ Male income_ <=50K income_ >50K
0 False False True False False False False False False False ... False True False False False False False True True False
1 False False False True False False False False False False ... True False False False False False False True True False
2 False False True False False False False False False False ... False True False False False False False True True False
3 False False False True False False False False False False ... True False False False False False False True True False
4 False True False False False False False False False False ... False False False False False True True False True False

5 rows × 42 columns

[46]:
biadjacency = df_binary.sparse.to_coo()
[47]:
biadjacency = sparse.csr_matrix(biadjacency)
[48]:
# biadjacency matrix of the bipartite graph
biadjacency
[48]:
<32561x42 sparse matrix of type '<class 'numpy.bool_'>'
        with 195366 stored elements in Compressed Sparse Row format>
[49]:
# names of columns
names_col = list(df_binary)
[50]:
len(names_col)
[50]:
42
[51]:
names_col[:8]
[51]:
['age_20-29',
 'age_30-39',
 'age_40-49',
 'age_50-59',
 'age_60-69',
 'age_70-79',
 'age_80-89',
 'age_90-99']
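The same pipeline can be tried on a tiny hand-made dataframe. The column names and values below are made up for illustration. Each sample is linked to exactly one feature per categorical column, so the number of stored elements equals the number of rows times the number of original columns (in the output above, 32561 × 6 = 195366).

```python
import pandas as pd
from scipy import sparse

# Toy categorical data: 3 samples, 2 categorical columns.
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"],
                   "gender": ["Male", "Male", "Female"]})

# One binary column per (column, value) pair, stored as sparse columns.
df_binary = pd.get_dummies(df, sparse=True)

# Samples are rows, binary features are columns.
biadjacency = sparse.csr_matrix(df_binary.sparse.to_coo())
names_col = list(df_binary)
print(names_col, biadjacency.shape, biadjacency.nnz)
```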

From a CSV file

You can load a graph directly from a CSV or TSV file.

[52]:
graph = from_csv('miserables.tsv')
[53]:
graph
[53]:
{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
        'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
        'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
        'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
        'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
        'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
        'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
        'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
        'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
        'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
        'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
        'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
        'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
        'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
        'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
        'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
 'adjacency': <77x77 sparse matrix of type '<class 'numpy.int64'>'
        with 508 stored elements in Compressed Sparse Row format>}
[54]:
graph = from_csv('movie_actor.tsv', bipartite=True)
[55]:
graph
[55]:
{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
        'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
        'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
        'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
        'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
       dtype='<U28'),
 'biadjacency': <15x16 sparse matrix of type '<class 'numpy.int64'>'
        with 41 stored elements in Compressed Sparse Row format>}

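The graph can also be given in the form of adjacency lists (see the documentation of the function from_csv).

From a GraphML file

You can also load a graph stored in the GraphML format.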

[56]:
graph = from_graphml('miserables.graphml')
adjacency = graph.adjacency
names = graph.names
[57]:
# Directed graph
graph = from_graphml('painters.graphml')
adjacency = graph.adjacency
names = graph.names
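For readers unfamiliar with GraphML, the sketch below parses a tiny hand-written document with Python's standard library. The three-node graph is made up for illustration, and the code only shows where the node identifiers, the edges and the `edgedefault` direction flag live in the file. It is not how `from_graphml` is implemented.

```python
import xml.etree.ElementTree as ET

# A minimal, made-up GraphML document with three nodes and two directed edges.
GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n0" target="n1"/>
    <edge source="n1" target="n2"/>
  </graph>
</graphml>"""

ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(GRAPHML)
graph_element = root.find("g:graph", ns)

# The edgedefault attribute tells whether edges are directed by default.
directed = graph_element.get("edgedefault") == "directed"
names = [node.get("id") for node in graph_element.findall("g:node", ns)]
edge_list = [(edge.get("source"), edge.get("target"))
             for edge in graph_element.findall("g:edge", ns)]
print(directed, names, edge_list)
```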

From NetworkX

NetworkX has import and export functions from and to the CSR format.
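As a sketch of the round trip, assuming a recent version of NetworkX (2.7 or later, where `from_scipy_sparse_array` and `to_scipy_sparse_array` are available):

```python
import networkx as nx
import numpy as np
from scipy import sparse

adjacency = sparse.csr_matrix(np.array([[0, 1, 1, 0],
                                        [1, 0, 1, 1],
                                        [1, 1, 0, 0],
                                        [0, 1, 0, 0]]))

# CSR matrix -> NetworkX graph (nodes are labeled 0, ..., n-1).
nx_graph = nx.from_scipy_sparse_array(adjacency)

# NetworkX graph -> sparse array in CSR format, converted back to a CSR matrix.
adjacency_back = sparse.csr_matrix(nx.to_scipy_sparse_array(nx_graph, format='csr'))

print(nx_graph.number_of_nodes(), nx_graph.number_of_edges())
```

For a directed graph, `create_using=nx.DiGraph` can be passed to `from_scipy_sparse_array`.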

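Other options

  • You want to test our toy graphs.

  • You want to generate a graph from a model.

  • You want to load a graph from an existing repository (see NetSet and KONECT).

Take a look at the other tutorials of the data section!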