Number of iterations: the maximum number of iterations allowed for the LDA algorithm to converge.
4. Running in Python
Preparing the document collection
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# Combine the documents into a single list
doc_complete = [doc1, doc2, doc3, doc4, doc5]

Data cleaning and preprocessing
Data cleaning is important for any text mining task. In this task it consists of removing punctuation, removing stop words, and normalizing the corpus (with a lemmatizer, which for English reduces each word to its base form).

# The NLTK 'stopwords' and 'wordnet' corpora must already be downloaded, e.g. via nltk.download()
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase, split on whitespace, and drop stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]
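
To see what the first two cleaning stages do without needing the NLTK data downloads, here is a minimal standard-library sketch. The stop-word set TOY_STOP is a tiny hand-picked subset chosen only for illustration (an assumption, not NLTK's English list), toy_clean is a hypothetical helper name, and lemmatization is left out entirely.

```python
import string

# Hypothetical, hand-picked stop-word subset for illustration only;
# the article itself uses NLTK's full English stop-word list.
TOY_STOP = {"is", "to", "my", "but", "not", "have"}
PUNCT = set(string.punctuation)

def toy_clean(doc):
    """Mirror the first two stages of clean(): stop-word filter, then
    punctuation removal. Lemmatization is omitted so no NLTK data is needed."""
    stop_free = " ".join(w for w in doc.lower().split() if w not in TOY_STOP)
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCT)
    return punc_free.split()

tokens = toy_clean(
    "Sugar is bad to consume. My sister likes to have sugar, but not my father."
)
print(tokens)
# ['sugar', 'bad', 'consume', 'sister', 'likes', 'sugar', 'father']
```

Because stop words are filtered before punctuation is stripped, a token such as "consume." is compared against the stop list with its trailing period still attached. A stop word followed directly by punctuation would therefore pass through the filter.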
Preparing the Document-Term matrix
A corpus is made up of all the documents. To run a mathematical model on it, it is convenient to represent the corpus as a matrix. The LDA model looks for repeated term patterns across the whole Document-Term (DT) matrix. Python offers many good libraries for text mining, and "gensim" is a good one for handling text data. The code below demonstrates how to convert the corpus into a Document-Term matrix:

import gensim
from gensim import corpora
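
As a rough picture of what a Document-Term matrix holds, the following standard-library sketch builds one by hand for two toy documents. The documents, the vocab mapping, and the to_bow helper are all hypothetical names made up for illustration. gensim stores each row in a similar sparse form, as a list of (term id, count) pairs, but its id assignment may differ from the sorted order assumed here.

```python
from collections import Counter

# Two toy tokenized documents (hypothetical, not the article's cleaned corpus).
docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]

# Map every distinct term to a column index, sorted for a deterministic order.
vocab = {term: idx for idx, term in enumerate(sorted({t for d in docs for t in d}))}

def to_bow(doc):
    """Return one sparse matrix row: sorted (term_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc)
    return sorted(counts.items())

doc_term = [to_bow(d) for d in docs]
print(vocab)     # {'bad': 0, 'father': 1, 'sugar': 2}
print(doc_term)  # [[(0, 1), (2, 2)], [(1, 1), (2, 1)]]
```

Each row lists only the terms that occur in that document, which keeps the matrix compact when the vocabulary is much larger than any single document.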