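Number of topic terms: the number of terms that make up a single topic. This is usually chosen according to the requirement. If the goal is to extract features or keywords, a small number of topic terms is appropriate; if the goal is to extract concepts or themes, a larger number works better.

Number of iterations: the maximum number of iterations allowed for the LDA algorithm to converge.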
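As a rough illustration of the "number of topic terms" setting, the sketch below keeps the N highest-weighted terms of one topic. The term weights and the helper name `top_terms` are made up for this example; they are not the output of a trained LDA model.

```python
import heapq

# Hypothetical term weights for a single topic (illustrative values only,
# not produced by a trained LDA model).
topic_weights = {
    "sugar": 0.30,
    "health": 0.22,
    "lifestyle": 0.18,
    "pressure": 0.12,
    "stress": 0.10,
    "blood": 0.08,
}

def top_terms(weights, n):
    """Return the n terms with the largest weights, highest first."""
    return heapq.nlargest(n, weights, key=weights.get)

keywords = top_terms(topic_weights, 3)  # few terms: keyword-style summary
concept = top_terms(topic_weights, 5)   # more terms: broader concept
```

With a small N the topic reads like a short keyword list, while a larger N brings in lower-weighted terms that help describe a broader concept.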

4. Running in Python

Preparing the documents
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# Combine the documents into a single list
doc_complete = [doc1, doc2, doc3, doc4, doc5]

Data cleaning and preprocessing
Data cleaning is important for any text mining task. In this task, cleaning means removing punctuation and stop words and normalizing the corpus (lemmatization, which for English reduces each word to its base form).

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Requires the NLTK data packages: nltk.download('stopwords') and nltk.download('wordnet')
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase the text and drop stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Reduce each remaining word to its lemma
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# Each cleaned document becomes a list of tokens
doc_clean = [clean(doc).split() for doc in doc_complete]

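To see the shape of the cleaned output without downloading NLTK data, here is a reduced sketch of the same idea using only the standard library. The stop-word set `TINY_STOP` is a small hand-picked assumption rather than NLTK's English list, the helper name `simple_clean` is hypothetical, and lemmatization is left out, so the result will differ from what `clean` produces.

```python
import string

# Assumption: a tiny hand-picked stop-word set, far smaller than NLTK's English list.
TINY_STOP = {"is", "to", "my", "but", "not", "a", "of", "that", "for", "your"}

def simple_clean(doc):
    """Lowercase, strip punctuation, and drop stop words (no lemmatization)."""
    no_punct = doc.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in TINY_STOP]

tokens = simple_clean("Health experts say that Sugar is not good for your lifestyle.")
```

This sketch strips punctuation before filtering stop words, so a stop word directly followed by a comma or period is still recognized as one.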
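Preparing the Document-Term matrix

The corpus consists of all the documents. To run a mathematical model on it, it is convenient to represent the corpus as a matrix. The LDA model looks for repeating term patterns across the entire DT matrix. Python offers many good libraries for text mining, and "gensim" is one that handles text data well. The following code demonstrates how to convert the corpus into a Document-Term matrix:
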
import gensim
from gensim import corpora

# Create the term dictionary of the corpus; every unique term is assigned an index
dictionary = corpora.Dictionary(doc_clean)
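
To make the term dictionary and the Document-Term representation more concrete, the following standard-library sketch builds both by hand for two tiny documents. It is not gensim's implementation: the helper names `build_vocab` and `to_bow` are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter

# Two tiny tokenized documents, standing in for doc_clean.
docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Represent one document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc)
    return sorted(counts.items())

vocab = build_vocab(docs)
matrix = [to_bow(doc, vocab) for doc in docs]
```

Each row of `matrix` lists only the terms that occur in that document, together with their counts, which is a sparse way of storing one row of the Document-Term matrix.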