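Text mining
We show how to use scikit-network for text mining. We consider the famous novel Les Misérables by Victor Hugo (Project Gutenberg, translated by Isabel F. Hapgood). By considering the graph between words and paragraphs, we can embed words and paragraphs in the same vector space and compute cosine similarities between them.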
Each word is taken as it appears in the text; a more advanced tokenizer could be used.
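As an illustration of what a slightly more careful tokenizer might look like, here is a minimal sketch based on the standard `re` module. The `tokenize` helper is hypothetical and not part of scikit-network; it simply keeps runs of letters (including accented ones) and, unlike a plain `split(' ')`, never produces empty tokens.

```python
import re

def tokenize(paragraph):
    """Return lowercase tokens made of letters only (hypothetical helper)."""
    # [^\W\d_]+ matches runs of word characters that are neither digits
    # nor underscores, so punctuation and whitespace act as separators.
    return re.findall(r"[^\W\d_]+", paragraph.lower())

tokens = tokenize("They reached the barrier of l'Étoile at five o'clock.")
print(tokens)
```

The resulting token lists could be passed to `from_adjacency_list` in place of the lists obtained with `split(' ')`; the counts reported in this notebook would then change.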
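Other graphs could be considered, such as the graph of co-occurrence of words within a window of 5 words, or the graph between chapters and words. These graphs could be combined to get richer information and better embeddings.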
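The co-occurrence graph mentioned above could be built along the following lines. This is a sketch under stated assumptions, not the method used in this notebook: the helper `cooccurrence_counts` is hypothetical, pairs are unordered, and two words are linked when they fall within the same run of `window` consecutive tokens.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered pairs of distinct words within `window` consecutive tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Pair the current token with the next (window - 1) tokens.
        for other in tokens[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

tokens = "on sunday fatigue does not work said favourite on sunday".split()
counts = cooccurrence_counts(tokens, window=5)
for pair, weight in counts.most_common(3):
    print(pair, weight)
```

Each `(word, other)` key with its count can be read as a weighted edge, so the result is effectively a weighted edge list that could be converted to a sparse adjacency matrix before embedding.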
[1]:
from re import sub
[2]:
import numpy as np
[3]:
from sknetwork.data import from_adjacency_list
from sknetwork.embedding import Spectral
Load data
[4]:
filename = 'miserables-en.txt'
[5]:
with open(filename, 'r') as f:
    text = f.read()
[6]:
len(text)
[6]:
3254333
[7]:
print(text[:494])
The Project Gutenberg EBook of Les Misérables, by Victor Hugo
This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.org
Title: Les Misérables
Complete in Five Volumes
Author: Victor Hugo
Translator: Isabel F. Hapgood
Release Date: June 22, 2008 [EBook #135]
Last Updated: January 18, 2016
Preprocessing
[8]:
# extract main text
main = text.split('LES MISÉRABLES')[-2].lower()
[9]:
len(main)
[9]:
3215017
[10]:
# remove punctuation
main = sub(r"[,.;:()@#?!&$'_*]", " ", main)
main = sub(r'["-]', ' ', main)
[11]:
# extract paragraphs
sep = '|||'
main = sub(r'\n\n+', sep, main)
main = sub('\n', ' ', main)
paragraphs = main.split(sep)
[12]:
len(paragraphs)
[12]:
13499
[13]:
paragraphs[1000]
[13]:
'after leaving the asses there was a fresh delight they crossed the seine in a boat and proceeding from passy on foot they reached the barrier of l étoile they had been up since five o clock that morning as the reader will remember but bah there is no such thing as fatigue on sunday said favourite on sunday fatigue does not work '
Build graph
[14]:
paragraph_words = [paragraph.split(' ') for paragraph in paragraphs]
[15]:
graph = from_adjacency_list(paragraph_words, bipartite=True)
[16]:
biadjacency = graph.biadjacency
words = graph.names_col
[17]:
biadjacency
[17]:
<13499x23093 sparse matrix of type '<class 'numpy.int64'>'
with 416331 stored elements in Compressed Sparse Row format>
[18]:
len(words)
[18]:
23093
Statistics
[19]:
n_row, n_col = biadjacency.shape
[20]:
paragraph_lengths = biadjacency.dot(np.ones(n_col))
[21]:
np.quantile(paragraph_lengths, [0.1, 0.5, 0.9, 0.99])
[21]:
array([ 6., 23., 127., 379.])
[22]:
word_counts = biadjacency.T.dot(np.ones(n_row))
[23]:
np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])
[23]:
array([ 1. , 2. , 23. , 282.08])
Embedding
[24]:
dimension = 50
spectral = Spectral(dimension, regularization=100)
[25]:
spectral.fit(biadjacency)
[25]:
Spectral(n_components=50, decomposition='rw', regularization=100, normalized=True)
[26]:
embedding_paragraph = spectral.embedding_row_
embedding_word = spectral.embedding_col_
[27]:
# some word (take the first matching index as a scalar)
i = int(np.flatnonzero(words == 'love')[0])
[28]:
# most similar words
cosines_word = embedding_word.dot(embedding_word[i])
words[np.argsort(-cosines_word)[:20]]
[28]:
array(['love', 'kiss', 'ye', 'celestial', 'hearts', 'loved', 'tender',
'roses', 'joys', 'sweet', 'wedded', 'charming', 'angelic', 'adore',
'aurora', 'pearl', 'voluptuousness', 'chaste', 'innumerable',
'heart'], dtype='<U21')
[29]:
np.quantile(cosines_word, [0.01, 0.1, 0.5, 0.9, 0.99])
[29]:
array([-0.24307366, -0.14047851, -0.02607974, 0.14319717, 0.42843234])
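The dot products above are read as cosine similarities. This relies on the embedding vectors having unit norm, which is what `normalized=True` in the fitted model's parameters suggests (an assumption here, not verified in this notebook). The dependency-free sketch below, with hypothetical helpers `normalize` and `cosine`, shows why the two quantities coincide for unit vectors.

```python
import math

def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine(u, v):
    """Cosine similarity computed from the definition."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [3.0, 4.0]
v = [4.0, 3.0]

# Dot product of the normalized vectors.
dot_normalized = sum(a * b for a, b in zip(normalize(u), normalize(v)))

print(cosine(u, v), dot_normalized)
```

If the embedding were not normalized, each row would have to be divided by its norm before taking dot products.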
[30]:
# some paragraph
i = 1000
print(paragraphs[i])
after leaving the asses there was a fresh delight they crossed the seine in a boat and proceeding from passy on foot they reached the barrier of l étoile they had been up since five o clock that morning as the reader will remember but bah there is no such thing as fatigue on sunday said favourite on sunday fatigue does not work
[31]:
# most similar paragraphs
cosines_paragraph = embedding_paragraph.dot(embedding_paragraph[i])
for j in np.argsort(-cosines_paragraph)[:3]:
    print(paragraphs[j])
    print()
after leaving the asses there was a fresh delight they crossed the seine in a boat and proceeding from passy on foot they reached the barrier of l étoile they had been up since five o clock that morning as the reader will remember but bah there is no such thing as fatigue on sunday said favourite on sunday fatigue does not work
he was a man of lofty stature half peasant half artisan he wore a huge leather apron which reached to his left shoulder and which a hammer a red handkerchief a powder horn and all sorts of objects which were upheld by the girdle as in a pocket caused to bulge out he carried his head thrown backwards his shirt widely opened and turned back displayed his bull neck white and bare he had thick eyelashes enormous black whiskers prominent eyes the lower part of his face like a snout and besides all this that air of being on his own ground which is indescribable
this was the state which the shepherd idyl begun at five o clock in the morning had reached at half past four in the afternoon the sun was setting their appetites were satisfied
[32]:
np.quantile(cosines_paragraph, [0.01, 0.1, 0.5, 0.9, 0.99])
[32]:
array([-0.30671191, -0.17309593, -0.00319729, 0.21574375, 0.45969887])
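Since words and paragraphs are embedded in the same vector space, the same computation can in principle compare a word with paragraphs. The sketch below uses small random arrays as hypothetical stand-ins for `spectral.embedding_row_` and `spectral.embedding_col_`, so that it runs on its own; the shapes and values are illustrative only.

```python
import numpy as np

# Hypothetical toy embeddings: 6 paragraphs and 10 words in dimension 4.
rng = np.random.default_rng(0)
toy_paragraphs = rng.normal(size=(6, 4))
toy_words = rng.normal(size=(10, 4))

# Normalize rows to unit norm so that dot products are cosine similarities.
toy_paragraphs /= np.linalg.norm(toy_paragraphs, axis=1, keepdims=True)
toy_words /= np.linalg.norm(toy_words, axis=1, keepdims=True)

# Cosine similarity between one word and every paragraph.
word_index = 3
cosines_cross = toy_paragraphs.dot(toy_words[word_index])

# Indices of the 3 paragraphs closest to that word.
closest = np.argsort(-cosines_cross)[:3]
print(closest)
```

With the real embeddings, `toy_paragraphs` and `toy_words` would be replaced by `embedding_paragraph` and `embedding_word`, and `word_index` by the index of the word of interest.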