3. LDA Parameters

α: the document-topic density. The higher α is, the more topics each document contains; the lower it is, the fewer.
β: the topic-word density. The higher β is, the more words each topic is composed of; the lower it is, the fewer.

Number of topics: the number of topics to extract from the corpus. A Kullback-Leibler divergence score can be used to choose a suitable number of topics.

Number of topic terms: the number of words that make up a single topic. This usually depends on the requirement: when the goal is to extract features or keywords, fewer terms per topic are enough; when the goal is to extract concepts or themes, more terms are needed.

Iterations: the maximum number of iterations allowed for the LDA algorithm to converge.
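
To make the effect of the document-topic density α more concrete, here is a minimal sketch that is not part of the original tutorial. It assumes α is the parameter of a symmetric Dirichlet prior over 10 topics and counts how many topics receive more than 5% of a document's weight. The helper names (sample_dirichlet, mean_active_topics), the number of topics, and the 5% threshold are all illustrative assumptions; only the Python standard library is used.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics.

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_active_topics(alpha, k=10, n_docs=500, threshold=0.05, seed=0):
    """Average number of topics whose weight exceeds `threshold` per document.

    k, n_docs, threshold and seed are illustrative assumptions, not values
    taken from the tutorial.
    """
    rng = random.Random(seed)
    total_active = 0
    for _ in range(n_docs):
        theta = sample_dirichlet(alpha, k, rng)
        total_active += sum(1 for p in theta if p > threshold)
    return total_active / n_docs


if __name__ == "__main__":
    # A small alpha concentrates each document on few topics,
    # a large alpha spreads it over many.
    for alpha in (0.1, 1.0, 10.0):
        print(alpha, mean_active_topics(alpha))
```

Under these assumptions, a small α should give documents dominated by one or two topics, while a large α spreads each document across most of the topics. The same reasoning applies to β and the words within a topic.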

4. Running in Python
Preparing the documents
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# Combine the documents into one corpus
doc_complete = [doc1, doc2, doc3, doc4, doc5]

Cleaning and preprocessing
Data cleaning is important for any text mining task. In this task it consists of removing punctuation and stop words and normalizing the corpus (lemmatization: for English, reducing each word to its base form).

# Requires the NLTK data packages 'stopwords' and 'wordnet' (e.g. via nltk.download)
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))  # English stop words
exclude = set(string.punctuation)       # punctuation characters
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase the text and drop stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# One list of cleaned tokens per document
doc_clean = [clean(doc).split() for doc in doc_complete]
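
Preparing the Document-Term matrix
A corpus is made up of all the documents. To run a mathematical model on it, it is good practice to represent the corpus as a matrix. The LDA model looks for recurring term patterns across the whole DT matrix. Python offers many good libraries for text mining, and "gensim" is a good one for handling text data. The code below demonstrates how to convert the corpus into a Document-Term matrix: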