Wikipedia

This notebook shows how to use scikit-network to analyze the graph structure of Wikipedia through its hyperlinks.

We consider the Wikivitals dataset from the netset collection. This dataset contains the (roughly) top 10,000 ("vital") articles of Wikipedia.

[1]:
from IPython.display import SVG
[2]:
import numpy as np
from scipy.cluster.hierarchy import linkage
[3]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.clustering import Louvain
from sknetwork.classification import DiffusionClassifier
from sknetwork.utils import get_neighbors
from sknetwork.visualization import visualize_dendrogram

Data

All datasets from the netset collection can easily be imported with scikit-network.

[4]:
wikivitals = load_netset('wikivitals')
Parsing files...
Done.
[5]:
# graph of links
adjacency = wikivitals.adjacency
names = wikivitals.names
labels = wikivitals.labels
names_labels = wikivitals.names_labels
[6]:
adjacency
[6]:
<10011x10011 sparse matrix of type '<class 'numpy.bool_'>'
        with 824999 stored elements in Compressed Sparse Row format>
[7]:
# categories
print(names_labels)
['Arts' 'Biological and health sciences' 'Everyday life' 'Geography'
 'History' 'Mathematics' 'People' 'Philosophy and religion'
 'Physical sciences' 'Society and social sciences' 'Technology']
[8]:
# get label
label_id = {name: i for i, name in enumerate(names_labels)}

Example

Let us take a look at one article.

[9]:
i = 10000
print(names[i])
Édouard Manet
[10]:
# label
label = labels[i]
print(names_labels[label])
People
[11]:
# some hyperlinks
neighbors = get_neighbors(adjacency, i)
print(names[neighbors[:10]])
['Adolphe Thiers' 'American Civil War' 'Bordeaux' 'Camille Pissarro'
 'Carmen' 'Charles Baudelaire' 'Claude Monet' 'Diego Velázquez'
 'Edgar Allan Poe' 'Edgar Degas']
[12]:
len(neighbors)
[12]:
38
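Under the hood, `get_neighbors` on a CSR adjacency amounts to slicing the `indices` array between consecutive `indptr` entries. A minimal sketch of this equivalence on a toy 4-node graph (illustrative, not part of the dataset):

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy directed graph: 0->1, 0->2, 1->2, 2->3
adjacency = csr_matrix(
    (np.ones(4, dtype=bool), ([0, 0, 1, 2], [1, 2, 2, 3])),
    shape=(4, 4),
)

def neighbors(adjacency, i):
    # successors of node i = column indices of the non-zeros of row i
    return adjacency.indices[adjacency.indptr[i]:adjacency.indptr[i + 1]]

print(neighbors(adjacency, 0))  # [1 2]
```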

PageRank

We first use (personalized) PageRank to select typical articles of each category.

[13]:
pagerank = PageRank()
[14]:
# number of articles per category
n_selection = 50
[15]:
# selection of articles
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(adjacency, weights=(labels==label))
    scores = ppr * (labels==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[16]:
selection.shape
[16]:
(11, 50)
[17]:
# show selection
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])
---
0 Arts
['Encyclopædia Britannica' 'Romanticism' 'Jazz' 'Modernism' 'Baroque']
---
1 Biological and health sciences
['Taxonomy (biology)' 'Animal' 'Chordate' 'Plant' 'Species']
---
2 Everyday life
['Olympic Games' 'Association football' 'Basketball' 'Baseball' 'Softball']
---
3 Geography
['Geographic coordinate system' 'United States' 'China' 'France' 'India']
---
4 History
['World War II' 'World War I' 'Roman Empire' 'Ottoman Empire'
 'Middle Ages']
---
5 Mathematics
['Real number' 'Function (mathematics)' 'Complex number'
 'Set (mathematics)' 'Integer']
---
6 People
['Aristotle' 'Plato' 'Augustine of Hippo' 'Winston Churchill'
 'Thomas Aquinas']
---
7 Philosophy and religion
['Christianity' 'Islam' 'Buddhism' 'Hinduism' 'Catholic Church']
---
8 Physical sciences
['Oxygen' 'Hydrogen' 'Earth' 'Kelvin' 'Density']
---
9 Society and social sciences
['The New York Times' 'Latin' 'English language' 'French language'
 'United Nations']
---
10 Technology
['NASA' 'Internet' 'Operating system' 'Computer network' 'Computer']
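The personalized PageRank used above can be sketched as a power iteration in which the random walker restarts on the selected category instead of restarting uniformly. The toy implementation below (hypothetical helper names, not the sknetwork internals) also mimics `top_k`, which returns the indices of the largest scores:

```python
import numpy as np

def personalized_pagerank(adjacency, weights, damping=0.85, n_iter=100):
    """Power iteration with restarts drawn from `weights` (a non-negative mask)."""
    out_degree = adjacency.sum(axis=1)
    restart = weights / weights.sum()
    scores = restart.copy()
    for _ in range(n_iter):
        # each node spreads its score evenly over its out-neighbors
        push = np.where(out_degree > 0, scores / np.maximum(out_degree, 1), 0)
        scores = damping * adjacency.T @ push + (1 - damping) * restart
    return scores

def top_k_indices(scores, k):
    # indices of the k largest scores, best first
    return np.argsort(-scores)[:k]

# toy graph: 0->1, 0->2, 1->2, 2->0; restart on node 0 only
adjacency = np.array([[0, 1, 1],
                      [0, 0, 1],
                      [1, 0, 0]], dtype=float)
weights = np.array([1.0, 0.0, 0.0])
scores = personalized_pagerank(adjacency, weights)
print(top_k_indices(scores, 2))  # [0 2]
```

Multiplying the scores by the category mask before calling `top_k`, as in cell [15], keeps only articles of that category in the ranking.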

Embedding

We now represent each node of the graph by a low-dimensional vector, and use hierarchical clustering to visualize the structure of this embedding.

[18]:
# dimension of the embedding
n_components = 20
[19]:
# embedding
spectral = Spectral(n_components)
embedding = spectral.fit_transform(adjacency)
[20]:
embedding.shape
[20]:
(10011, 20)
[21]:
# hierarchy of articles
label = label_id['Physical sciences']
index = selection[label]
dendrogram_articles = linkage(embedding[index], method='ward')
[22]:
# visualization
image = visualize_dendrogram(dendrogram_articles, names=names[index], rotate=True, width=200, scale=2, n_clusters=4)
SVG(image)
[22]:
../_images/use_cases_wikipedia_27_0.svg
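The hierarchy above is scipy's Ward linkage applied to the embedding vectors of the selected articles. On toy 2-d points (values are illustrative), the same pipeline looks like:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two well-separated groups of toy "embedding" vectors
embedding = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Ward linkage merges, at each step, the pair of clusters
# whose fusion least increases the within-cluster variance
dendrogram = linkage(embedding, method='ward')

# cutting the dendrogram into 2 clusters recovers the two groups
clusters = fcluster(dendrogram, t=2, criterion='maxclust')
print(clusters)
```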

Clustering

We now apply Louvain to cluster the graph, independently of the known labels.

[23]:
algo = Louvain()
[24]:
labels_pred = algo.fit_predict(adjacency)
[25]:
np.unique(labels_pred, return_counts=True)
[25]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 array([1836, 1800, 1315, 1262, 1225, 1067,  804,  672,   30]))

We use PageRank again to get the top pages of each cluster.

[26]:
n_selection = 5
[27]:
selection = []
for label in np.arange(len(set(labels_pred))):
    ppr = pagerank.fit_predict(adjacency, weights=(labels_pred==label))
    scores = ppr * (labels_pred==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[28]:
# show selection
for label in np.arange(len(set(labels_pred))):
    print('---')
    print(label)
    print(names[selection[label]])
---
0
['Physics' 'Hydrogen' 'Oxygen' 'Kelvin' 'Albert Einstein']
---
1
['Taxonomy (biology)' 'Animal' 'Plant' 'Protein' 'Species']
---
2
['Latin' 'World War I' 'Roman Empire' 'Middle Ages' 'Greek language']
---
3
['Christianity' 'Aristotle' 'Catholic Church' 'Plato'
 'Age of Enlightenment']
---
4
['United States' 'World War II' 'Geographic coordinate system'
 'United Kingdom' 'France']
---
5
['China' 'India' 'Buddhism' 'Islam' 'Chinese language']
---
6
['The New York Times' 'New York City' 'Time (magazine)' 'BBC'
 'The Washington Post']
---
7
['Earth' 'Atlantic Ocean' 'Europe' 'Drainage basin' 'Pacific Ocean']
---
8
['Handbag' 'Hat' 'Veil' 'Uniform' 'Clothing']

Classification

Finally, we use heat diffusion to predict the closest category for each page of the "People" category.

[29]:
algo = DiffusionClassifier()
[30]:
people = label_id['People']
[31]:
labels_people = algo.fit_predict(adjacency, labels={i: label for i, label in enumerate(labels) if label != people})
[32]:
n_selection = 5
[33]:
selection = []
for label in np.arange(len(names_labels)):
    if label != people:
        ppr = pagerank.fit_predict(adjacency, weights=(labels==people)*(labels_people==label))
        scores = ppr * (labels==people)*(labels_people==label)
        selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[34]:
# show selection
i = 0
for label, name_label in enumerate(names_labels):
    if label != people:
        print('---')
        print(label, name_label)
        print(names[selection[i]])
        i += 1

---
0 Arts
['Richard Wagner' 'Igor Stravinsky' 'Bob Dylan' 'Fred Astaire'
 'Ludwig van Beethoven']
---
1 Biological and health sciences
['Charles Darwin' 'Francis Crick' 'Robert Koch' 'Alexander Fleming'
 'Carl Linnaeus']
---
2 Everyday life
['Wayne Gretzky' 'Jim Thorpe' 'Jackie Robinson' 'LeBron James'
 'Willie Mays']
---
3 Geography
['Elizabeth II' 'Carl Lewis' 'Dwight D. Eisenhower' 'Vladimir Putin'
 'Muhammad Ali']
---
4 History
['Alexander the Great' 'Napoleon' 'Charlemagne' 'Philip II of Spain'
 'Charles V, Holy Roman Emperor']
---
5 Mathematics
['Euclid' 'Augustin-Louis Cauchy' 'Archimedes' 'John von Neumann'
 'Pierre de Fermat']
---
7 Philosophy and religion
['Augustine of Hippo' 'Aristotle' 'Thomas Aquinas' 'Plato' 'Immanuel Kant']
---
8 Physical sciences
['Albert Einstein' 'Isaac Newton' 'J. J. Thomson' 'Marie Curie'
 'Niels Bohr']
---
9 Society and social sciences
['Barack Obama' 'Noam Chomsky' 'Karl Marx' 'Ralph Waldo Emerson'
 'Jean-Paul Sartre']
---
10 Technology
['Tim Berners-Lee' 'Donald Knuth' 'Edsger W. Dijkstra' 'Douglas Engelbart'
 'Dennis Ritchie']
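The diffusion classifier spreads the known labels along the edges of the graph until convergence, then assigns each unlabeled node the label with the highest temperature. A minimal sketch of this idea (one score vector per label, seed nodes clamped; simplified neighbor averaging, not the sknetwork internals) on a toy path graph:

```python
import numpy as np

def diffuse_labels(adjacency, seeds, n_iter=50):
    """Seeds: dict {node: label}. Unlabeled nodes repeatedly average their
    neighbors' label scores; seed nodes stay clamped to their own label."""
    n = adjacency.shape[0]
    labels_set = sorted(set(seeds.values()))
    scores = np.zeros((n, len(labels_set)))
    for node, label in seeds.items():
        scores[node, labels_set.index(label)] = 1.0
    degree = adjacency.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        scores = adjacency @ scores / np.maximum(degree, 1)
        for node, label in seeds.items():  # clamp the seeds
            scores[node] = 0.0
            scores[node, labels_set.index(label)] = 1.0
    return np.array(labels_set)[scores.argmax(axis=1)]

# path graph 0-1-2-3-4, with seed labels at both ends
adjacency = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(diffuse_labels(adjacency, {0: 'A', 4: 'B'}))
```

Each unlabeled node ends up with the label of the nearest seed, which is how the unlabeled "People" pages inherit their closest category above.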