Crawling a specified Weibo user with Python and generating a word cloud from their posts

😄 Purpose of the program

This program analyzes a Weibo blogger you are interested in and generates a word cloud from the content of their posts. A sample result is shown below:

📌 I. Crawling the posts of a specified blogger

The crawler mainly uses the Requests package to fetch the relevant data. If you are interested, you can read that code yourself; this post focuses on generating the word cloud, so the crawler is not covered in detail.

Full crawler code: https://github.com/YUTING0907/pythonTools/tree/main/WeiboCrawler
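For a rough idea of what the crawler does, here is a minimal sketch (not the repository's actual code; it assumes Weibo's public m.weibo.cn mobile API, whose parameters and JSON layout are not guaranteed and may change):

import re

import requests


def fetch_weibo_texts(uid, pages=5):
    """Sketch: fetch recent post texts of one Weibo user via the m.weibo.cn API.
    The endpoint and response structure are assumptions, not the repo's code."""
    url = 'https://m.weibo.cn/api/container/getIndex'
    headers = {'User-Agent': 'Mozilla/5.0'}
    texts = []
    for page in range(1, pages + 1):
        params = {'type': 'uid', 'value': uid,
                  'containerid': '107603{}'.format(uid), 'page': page}
        resp = requests.get(url, params=params, headers=headers, timeout=10)
        for card in resp.json().get('data', {}).get('cards', []):
            mblog = card.get('mblog')
            if mblog:
                # Drop HTML tags embedded in the post body
                texts.append(re.sub(r'<[^>]+>', '', mblog.get('text', '')))
    return texts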

📌 II. Generating a word cloud from the Weibo content

1. Chinese word segmentation
import codecs

import jieba


def stop_word_list(filepath):
    """Load the stop-word list from a file, one word per line."""
    f = codecs.open(filepath, 'r', encoding='utf-8')
    stopwords = [line.strip() for line in f.readlines()]
    f.close()
    return stopwords


def seg_sentence(sentence):
    """Segment a Chinese sentence with jieba and drop stop words."""
    sentence_seged = jieba.cut(sentence.strip())
    stopwords = stop_word_list('stopword.txt')
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

seg_sentence uses the jieba library to split the Chinese text into words, while stop_word_list loads the stop words, i.e., filler words or other words you do not want counted, which are then excluded from the statistics.
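A quick illustration (a minimal sketch; it assumes a stopword.txt file with one stop word per line next to the script, and the exact output depends on that list):

# Hypothetical input; the result varies with the stop-word list
print(seg_sentence("今天天气真好,我们一起去海边看日落吧!"))
# e.g. -> "今天 天气 真好 一起 海边 看 日落 "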

2. Word frequency counting
import codecs
import re
from collections import Counter

import jieba


def write_word_count(filename):
    """Count word frequencies and write them to <filename>_word_count.txt."""
    with open('seged_' + filename + '.txt', 'r', encoding="utf-8-sig") as fr:
        data = jieba.cut(fr.read())
    data = dict(Counter(data))
    sorted_data = []
    for k, v in data.items():
        sorted_data.append([k, v])
    sorted_data = sorted(sorted_data, key=lambda x: x[1], reverse=True)
    with open(filename + '_word_count.txt', 'w', encoding="utf-8-sig") as fw:
        for k, v in sorted_data:
            # Skip spaces, newlines, single characters and pure digits
            if k != ' ' and k != '\n' and len(k) != len(u"一") and not k.isdigit():
                fw.write("{0:25}{1:>25}\n".format(k, v))
    print("Word count finished, written to:")
    print(filename + "_word_count.txt")


def read_counter(filename):
    """Read <filename>_word_count.txt back into a {word: count} dict."""
    file = codecs.open(filename + "_word_count.txt", "r", "utf-8")
    tmp_count = file.readlines()
    word_count = {}
    for row in tmp_count:
        # Turn every non-Chinese, non-alphanumeric character into a comma,
        # then collapse runs of commas
        row = re.sub(r',{2,}', ',',
                     re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])",
                            ',', row), flags=re.S)
        row = row.split(',')
        print(row)
        if row[0] == '':
            continue
        else:
            row = row[:-1]
        word_count[row[0]] = int(row[1])
    return word_count

This code counts word frequencies in the given text file and saves the result; the statistics can then be read back and turned into a dictionary.
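A minimal usage sketch (the filename rmrb_hot_weibo_content is only the example used in the main program below, and the counts shown are made up):

# Assumes seged_rmrb_hot_weibo_content.txt was produced by the segmentation step
write_word_count('rmrb_hot_weibo_content')
word_freq = read_counter('rmrb_hot_weibo_content')
print(word_freq)   # e.g. {'中国': 120, '加油': 95, ...}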

3. Generating the word cloud
from os import path

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import ImageColorGenerator, STOPWORDS, WordCloud


def gen_tag_cloud(frequencies, filename):
    """Generate a tag cloud from a {word: frequency} dict and save it as PNG files."""
    d = path.dirname(path.abspath(__file__))

    # The mask image decides the overall shape (and colours) of the word cloud
    mask = np.array(Image.open(path.join(d, "view.jpg")))
    stopwords = STOPWORDS.copy()

    wc = WordCloud(background_color="white", max_words=2000, mask=mask,
                   stopwords=stopwords, margin=10, random_state=42,
                   font_path="msyh.ttf", width=1280, height=1024).fit_words(frequencies)
    image_colors = ImageColorGenerator(mask)

    # Default colouring
    plt.imshow(wc)
    plt.axis("off")
    plt.figure()
    wc.to_file(filename + "_tag_cloud_default.png")

    # Recoloured with the colours taken from the mask image
    plt.imshow(wc.recolor(color_func=image_colors))
    plt.axis("off")
    plt.figure()
    wc.to_file(filename + "_tag_cloud_colored.png")

This code generates a tag cloud from the word-frequency data and saves it as image files. You can replace the view.jpg mask image with any picture you like; the resulting word cloud will take the shape of that image.
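A usage sketch (assumes view.jpg and the msyh.ttf font are available next to the script; the frequencies are invented for illustration):

word_freq = {'中国': 120, '加油': 95, '疫情': 80, '人民': 60}
gen_tag_cloud(word_freq, 'demo')
# Produces demo_tag_cloud_default.png and demo_tag_cloud_colored.png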

4. Main program
import codecs
import traceback


def main():
    print("Usage:[filename] e.g. rmrb_hot_weibo_content")
    print("filename: ", end='')
    filename = input()
    try:
        inputs = codecs.open(filename + '.txt', 'r', 'utf-8')
        outputs = codecs.open('seged_' + filename + '.txt', 'w', 'utf-8')
        for line in inputs:
            # Remove zero-width spaces that Weibo text often contains, then segment
            line_seg = seg_sentence(line.replace(u'\u200b', ''))
            outputs.write(line_seg + '\n')
        print("Segmentation finished, written to:")
        print("seged_" + filename + '.txt')
        outputs.close()
        inputs.close()
        write_word_count(filename)
        word_freq = read_counter(filename)
        gen_tag_cloud(word_freq, filename)
    except Exception as e:
        print("Error: ", e)
        traceback.print_exc()

Put all the functions above together and you have the complete word-cloud generator.
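To run the script directly, add the usual entry point at the bottom of the file:

if __name__ == '__main__':
    main()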

Full code: https://github.com/YUTING0907/pythonTools/tree/main/WordCloud

