As for how the LDA topic model is actually implemented inside the computer, we don't need to dig into it. Plenty of packages can run an LDA topic analysis directly, so we can simply use one of them. (That's right, I just call packages.)

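II. Python implementation
Before running the LDA topic model in Python, I segmented the documents and removed stop words. For details, see my earlier article "Segmenting a single Weibo document with Python: jieba segmentation (with reserved words and stop words)" on the CSDN blog 阿丢是丢心心.
The input file used below has therefore already been segmented into words.
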
1. Import the packages
import gensim
from gensim import corpora
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # ignore all warnings to keep the output readable

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

2. Load the data
First convert the document into a nested list (a list of lists), where each sub-list holds the tokens of one Weibo post:

PATH = "E:/data/output.csv"

# Read the whole file and split it into lines; each line is one pre-segmented post
file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')
data_set = []  # list that stores the tokens of every post
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()  # tokens on line i, separated by whitespace
    for w in seg_list:  # read the tokens on each line
        result.append(w)
    data_set.append(result)
print(data_set)

Build the dictionary and convert the corpus into its vector (bag-of-words) representation:

dictionary = corpora.Dictionary(data_set)  # build the dictionary (token -> integer id)
corpus = [dictionary.doc2bow(text) for text in data_set]  # each post becomes a list of (token id, count) pairs
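
To make the `(token id, count)` representation concrete, here is a minimal pure-Python sketch of the idea behind `corpora.Dictionary` and `doc2bow`. It is not gensim's implementation: the helper names `build_vocab` and `to_bow` and the two sample posts are made up for illustration, and gensim may assign the ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Turn one tokenized document into sorted (token id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Two hypothetical pre-segmented posts, one per line, tokens separated by spaces
raw = "weather very good weather\ntoday weather nice"
docs = [line.split() for line in raw.split("\n")]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

The first post mentions "weather" twice, so its pair for that token carries a count of 2; every other token appears once.
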
3. Build the LDA model
ldamodel = LdaModel(corpus, num_topics=10, id2word=dictionary, passes=30, random_state=1)  # split into 10 topics
print(ldamodel.print_topics(num_topics=10, num_words=15))  # print the top 15 words of each topic
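
`print_topics` returns a list of `(topic id, string)` pairs, where each string looks like `0.045*"word" + 0.030*"other"`. If you want the weights as numbers, a small parser along these lines may help. This is only a sketch: `parse_topic` is a hypothetical helper, and the sample string is invented rather than real model output. (gensim's `show_topic` can also return the word and weight pairs directly.)

```python
import re

# Matches fragments such as 0.045*"word"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]*)"')

def parse_topic(topic_string):
    """Convert a print_topics-style string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

sample = '0.045*"weather" + 0.030*"today" + 0.012*"rain"'  # invented example string
pairs = parse_topic(sample)
print(pairs)
```
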
The LdaModel call above builds the model when the number of topics is already fixed. In general we can evaluate a model with metrics, and the same metrics can be used to pick the optimal number of topics. The usual metrics for LDA topic models are perplexity and topic coherence: the lower the perplexity, or the higher the coherence, the better the model. Some studies suggest that perplexity is not a good indicator, so I normally use coherence to evaluate the model and choose the number of topics, but the code below implements both.

# Compute perplexity
def perplexity(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=15))
    # Note: log_perplexity returns gensim's per-word likelihood bound, not the perplexity itself.
    # Perplexity is 2 ** (-bound), so a higher bound means a lower perplexity.
    bound = ldamodel.log_perplexity(corpus)
    print(bound)
    return bound

# Compute coherence
def coherence(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=10))
    ldacm = CoherenceModel(model=ldamodel, texts=data_set, dictionary=dictionary, coherence='c_v')
    score = ldacm.get_coherence()
    print(score)
    return score
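
One detail worth keeping in mind: gensim's `log_perplexity` returns a per-word likelihood bound (usually a negative number), and the perplexity itself is 2 raised to the negative of that bound. The sketch below does the conversion. `bound_to_perplexity` is a hypothetical helper, and the bound values are made-up numbers, not results from this corpus.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity value (2 ** -bound)."""
    return 2.0 ** (-per_word_bound)

# Hypothetical bounds for three candidate numbers of topics
bounds = {5: -8.0, 10: -7.5, 15: -9.0}
perplexities = {k: bound_to_perplexity(b) for k, b in bounds.items()}
best_k = min(perplexities, key=perplexities.get)  # the lowest perplexity wins
print(perplexities, best_k)
```

Because the conversion is monotonic, choosing the highest bound and choosing the lowest perplexity select the same number of topics.
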
4. Plot the topic-coherence curve and choose the best number of topics
# Font settings: SimHei is only needed when the labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False

x = range(1, 15)  # candidate numbers of topics: 1 to 14
# z = [perplexity(i) for i in x]  # use this line instead if you prefer perplexity
y = [coherence(i) for i in x]
plt.plot(x, y)
plt.xlabel('Number of topics')
plt.ylabel('Coherence')
plt.title('Coherence by number of topics')
plt.show()

This prints the word distribution of each topic and produces a plot like this:

[Figure: coherence score plotted against the number of topics]

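Instead of reading the best value off the plot, you can also pick it programmatically. A minimal sketch, assuming `x` and `y` have the same shape as the lists built for the plot: the scores here are invented for illustration, and `best_num_topics` is a hypothetical helper.

```python
def best_num_topics(topic_counts, scores):
    """Return the topic count with the highest coherence score (the first one on ties)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return topic_counts[best_index]

x = list(range(1, 8))
y = [0.31, 0.35, 0.38, 0.42, 0.47, 0.44, 0.41]  # invented coherence scores
best = best_num_topics(x, y)
print(best)
```
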
5. Output the results and visualize them
From the evaluation above, 5 turns out to be a reasonable number of topics. We can now run the model again with the number of topics set to 5 and output the most likely topic for each document:

from gensim.models import LdaModel
import pandas as pd
from gensim.corpora import Dictionary
from gensim import corpora, models
import csv

# Prepare the data
PATH = "E:/data/output1.csv"

file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')  # read the file line by line
data_set = []  # list that stores the tokens of every post
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()
    for w in seg_list:  # read the tokens on each line
        result.append(w)
    data_set.append(result)

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=30, random_state=1)
topic_list = lda.print_topics()
print(topic_list)

# For each document, print the id of its most probable topic
for i in lda.get_document_topics(corpus):
    listj = []
    for j in i:
        listj.append(j[1])  # j is a (topic id, probability) pair
    bz = listj.index(max(listj))  # position of the highest probability
    print(i[bz][0])  # topic id at that position

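The loop above boils down to an argmax over `(topic id, probability)` pairs. The following sketch shows the same logic with `max` and a key function, run on invented topic distributions. `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability in a list of (topic id, probability) pairs."""
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topic_id

# Invented distributions in the shape returned by get_document_topics
docs_topics = [
    [(0, 0.10), (2, 0.75), (4, 0.15)],
    [(1, 0.55), (3, 0.45)],
]
result_list = [dominant_topic(d) for d in docs_topics]
print(result_list)
```
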
We can also visualize the LDA results with pyLDAvis:

import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
pyLDAvis.enable_notebook()  # only needed inside a Jupyter notebook
data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/3topic.html')

The result looks roughly like this:

[Figure: pyLDAvis interactive topic visualization]

The circles on the left represent the topics, and the panel on the right shows how much each word contributes to a topic.

The complete code is as follows:

import gensim
from gensim import corpora
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # ignore all warnings to keep the output readable

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

# Prepare the data
PATH = "E:/data/output.csv"

file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')  # read the file line by line
data_set = []  # list that stores the tokens of every post
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()
    for w in seg_list:  # read the tokens on each line
        result.append(w)
    data_set.append(result)
print(data_set)

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # build the document-term matrix
# Lda = gensim.models.ldamodel.LdaModel  # create the LDA object

# Compute perplexity
def perplexity(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=15))
    # log_perplexity returns the per-word likelihood bound; perplexity is 2 ** (-bound)
    bound = ldamodel.log_perplexity(corpus)
    print(bound)
    return bound

# Compute coherence
def coherence(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=10))
    ldacm = CoherenceModel(model=ldamodel, texts=data_set, dictionary=dictionary, coherence='c_v')
    score = ldacm.get_coherence()
    print(score)
    return score

# Plot the coherence line chart (swap in z to plot perplexity instead)
plt.rcParams['font.sans-serif'] = ['SimHei']  # only needed when the labels contain Chinese characters
matplotlib.rcParams['axes.unicode_minus'] = False

x = range(1, 15)
# z = [perplexity(i) for i in x]
y = [coherence(i) for i in x]
plt.plot(x, y)
plt.xlabel('Number of topics')
plt.ylabel('Coherence')
plt.title('Coherence by number of topics')
plt.show()

from gensim.models import LdaModel
import pandas as pd
from gensim.corpora import Dictionary
from gensim import corpora, models
import csv

# Prepare the data
PATH = "E:/data/output1.csv"

file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')  # read the file line by line
data_set = []  # list that stores the tokens of every post
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()
    for w in seg_list:  # read the tokens on each line
        result.append(w)
    data_set.append(result)

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # build the document-term matrix

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=30, random_state=1)
topic_list = lda.print_topics()
print(topic_list)

# Collect the id of the most probable topic for every document
result_list = []
for i in lda.get_document_topics(corpus):
    listj = []
    for j in i:
        listj.append(j[1])  # j is a (topic id, probability) pair
    bz = listj.index(max(listj))  # position of the highest probability
    result_list.append(i[bz][0])
print(result_list)

import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
pyLDAvis.enable_notebook()  # only needed inside a Jupyter notebook
data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/topic.html')

Help yourself if you need it~
----------------
Copyright notice: This is an original article by CSDN blogger 「阿丢是丢心心」, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/weixin_41168304/article/details/122389948