Wikipedia

This notebook shows how to use scikit-network to analyze the graph structure of Wikipedia through its hyperlinks.

We consider the Wikivitals dataset from the netset collection. This dataset contains the (roughly) top 10,000 ("vital") articles of Wikipedia.

[1]:
from IPython.display import SVG
[2]:
import numpy as np
from scipy.cluster.hierarchy import linkage
[3]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.clustering import Louvain
from sknetwork.classification import DiffusionClassifier
from sknetwork.utils import get_neighbors
from sknetwork.visualization import visualize_dendrogram

Data

All datasets from the netset collection can easily be imported with scikit-network.

[4]:
wikivitals = load_netset('wikivitals')
Parsing files...
Done.
[5]:
# graph of links
adjacency = wikivitals.adjacency
names = wikivitals.names
labels = wikivitals.labels
names_labels = wikivitals.names_labels
[6]:
adjacency
[6]:
<10011x10011 sparse matrix of type '<class 'numpy.bool_'>'
        with 824999 stored elements in Compressed Sparse Row format>
[7]:
# categories
print(names_labels)
['Arts' 'Biological and health sciences' 'Everyday life' 'Geography'
 'History' 'Mathematics' 'People' 'Philosophy and religion'
 'Physical sciences' 'Society and social sciences' 'Technology']
[8]:
# get label
label_id = {name: i for i, name in enumerate(names_labels)}

Example

Let us take a look at one article.

[9]:
i = 10000
print(names[i])
Édouard Manet
[10]:
# label
label = labels[i]
print(names_labels[label])
People
[11]:
# some hyperlinks
neighbors = get_neighbors(adjacency, i)
print(names[neighbors[:10]])
['Adolphe Thiers' 'American Civil War' 'Bordeaux' 'Camille Pissarro'
 'Carmen' 'Charles Baudelaire' 'Claude Monet' 'Diego Velázquez'
 'Edgar Allan Poe' 'Edgar Degas']
[12]:
len(neighbors)
[12]:
38
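Under the hood, `get_neighbors` on a CSR adjacency amounts to slicing the `indices` array between consecutive `indptr` entries. A minimal sketch of this equivalence on a toy 4-node graph (illustrative, not part of the dataset):

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy directed graph: 0->1, 0->2, 1->2, 2->3
adjacency = csr_matrix(
    (np.ones(4, dtype=bool), ([0, 0, 1, 2], [1, 2, 2, 3])),
    shape=(4, 4),
)

def neighbors(adjacency, i):
    # successors of node i = column indices of the non-zeros of row i
    return adjacency.indices[adjacency.indptr[i]:adjacency.indptr[i + 1]]

print(neighbors(adjacency, 0))  # [1 2]
```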

PageRank

We first use (personalized) PageRank to select typical articles of each category.

[13]:
pagerank = PageRank()
[14]:
# number of articles per category
n_selection = 50
[15]:
# selection of articles
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(adjacency, weights=(labels==label))
    scores = ppr * (labels==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[16]:
selection.shape
[16]:
(11, 50)
[17]:
# show selection
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])
---
0 Arts
['Encyclopædia Britannica' 'Romanticism' 'Jazz' 'Modernism' 'Baroque']
---
1 Biological and health sciences
['Taxonomy (biology)' 'Animal' 'Chordate' 'Plant' 'Species']
---
2 Everyday life
['Olympic Games' 'Association football' 'Basketball' 'Baseball' 'Softball']
---
3 Geography
['Geographic coordinate system' 'United States' 'China' 'France' 'India']
---
4 History
['World War II' 'World War I' 'Roman Empire' 'Ottoman Empire'
 'Middle Ages']
---
5 Mathematics
['Real number' 'Function (mathematics)' 'Complex number'
 'Set (mathematics)' 'Integer']
---
6 People
['Aristotle' 'Plato' 'Augustine of Hippo' 'Winston Churchill'
 'Thomas Aquinas']
---
7 Philosophy and religion
['Christianity' 'Islam' 'Buddhism' 'Hinduism' 'Catholic Church']
---
8 Physical sciences
['Oxygen' 'Hydrogen' 'Earth' 'Kelvin' 'Density']
---
9 Society and social sciences
['The New York Times' 'Latin' 'English language' 'French language'
 'United Nations']
---
10 Technology
['NASA' 'Internet' 'Operating system' 'Computer network' 'Computer']
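The personalized PageRank used above can be sketched as a power iteration in which the random walker restarts on the selected category instead of restarting uniformly. The toy implementation below (hypothetical helper names, not the sknetwork internals) also mimics `top_k`, which returns the indices of the largest scores:

```python
import numpy as np

def personalized_pagerank(adjacency, weights, damping=0.85, n_iter=100):
    """Power iteration with restarts drawn from `weights` (a non-negative mask)."""
    out_degree = adjacency.sum(axis=1)
    restart = weights / weights.sum()
    scores = restart.copy()
    for _ in range(n_iter):
        # each node spreads its score evenly over its out-neighbors
        push = np.where(out_degree > 0, scores / np.maximum(out_degree, 1), 0)
        scores = damping * adjacency.T @ push + (1 - damping) * restart
    return scores

def top_k_indices(scores, k):
    # indices of the k largest scores, best first
    return np.argsort(-scores)[:k]

# toy graph: 0->1, 0->2, 1->2, 2->0; restart on node 0 only
adjacency = np.array([[0, 1, 1],
                      [0, 0, 1],
                      [1, 0, 0]], dtype=float)
weights = np.array([1.0, 0.0, 0.0])
scores = personalized_pagerank(adjacency, weights)
print(top_k_indices(scores, 2))  # [0 2]
```

Multiplying the scores by the category mask before calling `top_k`, as in cell [15], keeps only articles of that category in the ranking.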

Embedding

We now represent each node of the graph by a low-dimensional vector, and use hierarchical clustering to visualize the structure of this embedding.

[18]:
# dimension of the embedding
n_components = 20
[19]:
# embedding
spectral = Spectral(n_components)
embedding = spectral.fit_transform(adjacency)
[20]:
embedding.shape
[20]:
(10011, 20)
[21]:
# hierarchy of articles
label = label_id['Physical sciences']
index = selection[label]
dendrogram_articles = linkage(embedding[index], method='ward')
[22]:
# visualization
image = visualize_dendrogram(dendrogram_articles, names=names[index], rotate=True, width=200, scale=2, n_clusters=4)
SVG(image)
[22]:
../_images/use_cases_wikipedia_27_0.svg
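The hierarchy above is scipy's Ward linkage applied to the embedding vectors of the selected articles. On toy 2-d points (values are illustrative), the same pipeline looks like:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two well-separated groups of toy "embedding" vectors
embedding = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Ward linkage merges, at each step, the pair of clusters
# whose fusion least increases the within-cluster variance
dendrogram = linkage(embedding, method='ward')

# cutting the dendrogram into 2 clusters recovers the two groups
clusters = fcluster(dendrogram, t=2, criterion='maxclust')
print(clusters)
```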

Clustering

We now apply Louvain to cluster the graph, independently of the known labels.

[23]:
algo = Louvain()
[24]:
labels_pred = algo.fit_predict(adjacency)
[25]:
np.unique(labels_pred, return_counts=True)
[25]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 array([1836, 1800, 1315, 1262, 1225, 1067,  804,  672,   30]))

We use PageRank again to get the top pages of each cluster.

[26]:
n_selection = 5
[27]:
selection = []
for label in np.arange(len(set(labels_pred))):
    ppr = pagerank.fit_predict(adjacency, weights=(labels_pred==label))
    scores = ppr * (labels_pred==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[28]:
# show selection
for label in np.arange(len(set(labels_pred))):
    print('---')
    print(label)
    print(names[selection[label]])
---
0
['Physics' 'Hydrogen' 'Oxygen' 'Kelvin' 'Albert Einstein']
---
1
['Taxonomy (biology)' 'Animal' 'Plant' 'Protein' 'Species']
---
2
['Latin' 'World War I' 'Roman Empire' 'Middle Ages' 'Greek language']
---
3
['Christianity' 'Aristotle' 'Catholic Church' 'Plato'
 'Age of Enlightenment']
---
4
['United States' 'World War II' 'Geographic coordinate system'
 'United Kingdom' 'France']
---
5
['China' 'India' 'Buddhism' 'Islam' 'Chinese language']
---
6
['The New York Times' 'New York City' 'Time (magazine)' 'BBC'
 'The Washington Post']
---
7
['Earth' 'Atlantic Ocean' 'Europe' 'Drainage basin' 'Pacific Ocean']
---
8
['Handbag' 'Hat' 'Veil' 'Uniform' 'Clothing']

Classification

Finally, we use heat diffusion to predict the closest category for each page of the "People" category.

[29]:
algo = DiffusionClassifier()
[30]:
people = label_id['People']
[31]:
labels_people = algo.fit_predict(adjacency, labels={i: label for i, label in enumerate(labels) if label != people})
[32]:
n_selection = 5
[33]:
selection = []
for label in np.arange(len(names_labels)):
    if label != people:
        ppr = pagerank.fit_predict(adjacency, weights=(labels==people)*(labels_people==label))
        scores = ppr * (labels==people)*(labels_people==label)
        selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[34]:
# show selection
i = 0
for label, name_label in enumerate(names_labels):
    if label != people:
        print('---')
        print(label, name_label)
        print(names[selection[i]])
        i += 1

---
0 Arts
['Richard Wagner' 'Igor Stravinsky' 'Bob Dylan' 'Fred Astaire'
 'Ludwig van Beethoven']
---
1 Biological and health sciences
['Charles Darwin' 'Francis Crick' 'Robert Koch' 'Alexander Fleming'
 'Carl Linnaeus']
---
2 Everyday life
['Wayne Gretzky' 'Jim Thorpe' 'Jackie Robinson' 'LeBron James'
 'Willie Mays']
---
3 Geography
['Elizabeth II' 'Carl Lewis' 'Dwight D. Eisenhower' 'Vladimir Putin'
 'Muhammad Ali']
---
4 History
['Alexander the Great' 'Napoleon' 'Charlemagne' 'Philip II of Spain'
 'Charles V, Holy Roman Emperor']
---
5 Mathematics
['Euclid' 'Augustin-Louis Cauchy' 'Archimedes' 'John von Neumann'
 'Pierre de Fermat']
---
7 Philosophy and religion
['Augustine of Hippo' 'Aristotle' 'Thomas Aquinas' 'Plato' 'Immanuel Kant']
---
8 Physical sciences
['Albert Einstein' 'Isaac Newton' 'J. J. Thomson' 'Marie Curie'
 'Niels Bohr']
---
9 Society and social sciences
['Barack Obama' 'Noam Chomsky' 'Karl Marx' 'Ralph Waldo Emerson'
 'Jean-Paul Sartre']
---
10 Technology
['Tim Berners-Lee' 'Donald Knuth' 'Edsger W. Dijkstra' 'Douglas Engelbart'
 'Dennis Ritchie']
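The diffusion classifier spreads the known labels along the edges of the graph until convergence, then assigns each unlabeled node the label with the highest temperature. A minimal sketch of this idea (one score vector per label, seed nodes clamped; simplified neighbor averaging, not the sknetwork internals) on a toy path graph:

```python
import numpy as np

def diffuse_labels(adjacency, seeds, n_iter=50):
    """Seeds: dict {node: label}. Unlabeled nodes repeatedly average their
    neighbors' label scores; seed nodes stay clamped to their own label."""
    n = adjacency.shape[0]
    labels_set = sorted(set(seeds.values()))
    scores = np.zeros((n, len(labels_set)))
    for node, label in seeds.items():
        scores[node, labels_set.index(label)] = 1.0
    degree = adjacency.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        scores = adjacency @ scores / np.maximum(degree, 1)
        for node, label in seeds.items():  # clamp the seeds
            scores[node] = 0.0
            scores[node, labels_set.index(label)] = 1.0
    return np.array(labels_set)[scores.argmax(axis=1)]

# path graph 0-1-2-3-4, with seed labels at both ends
adjacency = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(diffuse_labels(adjacency, {0: 'A', 4: 'B'}))
```

Each unlabeled node ends up with the label of the nearest seed, which is how the unlabeled "People" pages inherit their closest category above.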