3. Parameters of LDA
α: the document-topic density. The higher α is, the more topics a document contains; the lower it is, the fewer.
β: the topic-word density. The higher β is, the more words a topic contains; the lower it is, the fewer.

Number of topics: the number of topics to extract from the corpus. The Kullback-Leibler divergence score can be used to find the best number of topics.
Number of topic words: the number of words that make up a topic. It is usually set by the use case: keep it small when extracting features or keywords, and make it larger when extracting concepts or arguments.

Number of iterations: the maximum number of iterations allowed for the LDA algorithm to converge.
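As a sketch of where these four knobs live in practice, here is how they map onto gensim's LdaModel. The choice of gensim, the toy corpus, and every numeric value below are illustrative assumptions, not part of the original text:

from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Toy tokenized corpus, purely for illustration
docs = [["sugar", "bad", "consume"], ["father", "driving", "sister"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,   # number of topics, e.g. chosen via a divergence/coherence score
    alpha=0.1,      # document-topic density: larger -> more topics per document
    eta=0.01,       # topic-word density (the beta above): larger -> more words per topic
    iterations=50,  # cap on the iterations used to converge
    passes=10,      # full sweeps over the corpus
)

# The number of topic words only matters at display time:
print(lda.print_topics(num_topics=2, num_words=4))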
4. Running in Python
Prepare the document collection
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."
# Combine the documents into a single corpus
doc_complete = [doc1, doc2, doc3, doc4, doc5]
Clean and preprocess the data
Data cleaning is essential for any text-mining task. In this step we remove punctuation and stop words and normalize the corpus (for English, a lemmatizer reduces each word to its base form).
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# nltk.download('stopwords'); nltk.download('wordnet')  # run once if the data is missing

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    # Lowercase and drop stop words, strip punctuation, then lemmatize each word
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc).split() for doc in doc_complete]
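A quick sanity check of the pipeline (a hypothetical run; the exact tokens depend on the installed NLTK data):

print(doc_clean[0])
# e.g. ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father']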