Number of topic terms: the number of terms that make up a single topic. This is usually chosen according to the task: if the goal is to extract features or keywords, a small number of topic terms is enough; if the goal is to extract concepts or themes, a larger number works better.

Number of iterations: the maximum number of iterations allowed for the LDA algorithm to converge.

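To illustrate what the number of topic terms controls, here is a minimal sketch using only the standard library. The `top_terms` helper and the term weights are hypothetical, made up for illustration rather than produced by a trained model. The point is that a topic is a weighting over the vocabulary, and this parameter only decides how many of the highest-weighted terms are kept to describe it.

```python
def top_terms(topic_weights, num_terms):
    """Return the num_terms highest-weighted (term, weight) pairs of one topic."""
    ranked = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:num_terms]

# Hypothetical term weights for a single topic (illustrative values only)
topic = {"sugar": 0.30, "health": 0.25, "lifestyle": 0.20, "expert": 0.15, "good": 0.10}

few_terms = top_terms(topic, 2)   # compact, keyword-style description
many_terms = top_terms(topic, 4)  # broader, concept-style description
print([term for term, _ in few_terms])
print([term for term, _ in many_terms])
```
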
4. Running in Python

Preparing the documents

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# Combine the documents into a single corpus
doc_complete = [doc1, doc2, doc3, doc4, doc5]

Data cleaning and preprocessing

Data cleaning is important for any text mining task. In this task it consists of removing punctuation and stopwords and normalizing the corpus (lemmatization: for English, reducing each word to its base form).

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Requires the NLTK 'stopwords' and 'wordnet' data to be downloaded (nltk.download)
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase the text and drop stopwords
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]
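
To see what the cleaning steps produce without installing NLTK or downloading its data, the same pipeline can be sketched with the standard library alone. This is only an approximation under stated assumptions: `simple_clean` is a hypothetical helper, `STOPWORDS` is a tiny hand-picked set rather than NLTK's English list, and lemmatization is skipped entirely.

```python
import string

# Assumption: a tiny illustrative stopword set, NOT NLTK's full English list
STOPWORDS = {"is", "to", "my", "but", "not", "a", "of", "the", "have"}

def simple_clean(doc):
    """Lowercase, drop stopwords, strip punctuation; no lemmatization."""
    kept = [w for w in doc.lower().split() if w not in STOPWORDS]
    table = str.maketrans("", "", string.punctuation)
    stripped = [w.translate(table) for w in kept]
    return [w for w in stripped if w]  # discard tokens that were only punctuation

sample = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
tokens = simple_clean(sample)
print(tokens)
```

Each document ends up as a list of lowercase tokens, which is the shape that `doc_clean` has after the NLTK-based version runs.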
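
Preparing the Document-Term matrix

The corpus is made up of all the documents. To run a mathematical model on it, it is good practice to represent the corpus as a matrix. The LDA model looks for repeating term patterns across the entire DT matrix. Python offers many good libraries for text mining, and "gensim" is one that handles text data well. The gensim code in this section shows how to convert the corpus into a Document-Term matrix.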
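
Before the gensim version, the idea of a Document-Term matrix can be sketched with the standard library alone. The two short token lists, `term_to_id`, and the `to_row` helper are assumptions made for illustration; they stand in for the cleaned documents and are not gensim's API. Each row is a document, each column is a term, and each cell counts how often the term occurs in that document.

```python
from collections import Counter

# Assumption: two tiny pre-tokenized documents, standing in for cleaned text
docs = [["sugar", "bad", "sugar"], ["father", "sugar", "driving"]]

# Assign an integer index to every unique term (sorted for a stable order)
vocab = sorted({term for doc in docs for term in doc})
term_to_id = {term: i for i, term in enumerate(vocab)}

def to_row(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

dt_matrix = [to_row(doc) for doc in docs]
print(vocab)
print(dt_matrix)
```

A dense matrix like this is only meant for intuition. With a realistic vocabulary most cells are zero, which is why libraries typically keep just the nonzero (term index, count) pairs for each document.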

import gensim
from gensim import corpora

# Build a dictionary of the corpus terms; every unique term is assigned an index
dictionary = corpora.Dictionary(doc_clean)