4. Running in Python
Prepare the document collection

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# Combine the documents into a single collection
doc_complete = [doc1, doc2, doc3, doc4, doc5]
Data cleaning and preprocessing
Cleaning is essential for any text mining task. In this step we remove punctuation and stopwords, and normalize the corpus with a lemmatizer (for English, this reduces each word to its base form).
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Drop stopwords, strip punctuation, then lemmatize each remaining word
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
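Note: the stopword list and WordNet data used above ship as separate NLTK corpora and may need to be downloaded once before clean() will run. A minimal sketch:

# One-time download of the NLTK resources assumed by the cleaning step
import nltk
nltk.download('stopwords')   # word list behind stopwords.words('english')
nltk.download('wordnet')     # lexical database behind WordNetLemmatizer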
doc_clean = [clean(doc).split() for doc in doc_complete]

Prepare the Document-Term matrix
The corpus is made up of all the documents. To run a mathematical model, it is convenient to represent the corpus as a matrix. The LDA model looks for repeating word patterns across the whole DT matrix. Python offers many good libraries for text mining tasks; gensim is a particularly good one for working with text data. The code below shows how to convert the corpus into a Document-Term matrix:

import gensim
from gensim import corpora
# Create a term dictionary for the corpus, where every unique term is assigned an index
dictionary = corpora.Dictionary(doc_clean)

# Using the dictionary above, convert the list of documents (the corpus) into a DT matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
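Each entry of doc_term_matrix is a bag-of-words encoding of one document: a list of (token_id, token_count) tuples. As a quick sanity check (the exact ids depend on the dictionary built from your corpus), something like:

# Inspect the bag-of-words encoding of the first cleaned document
print(doc_term_matrix[0])
# e.g. [(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), ...]  (ids vary with the dictionary)
print(dictionary[0])   # map a token id back to its word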
Build the LDA model
Create an LDA object and train it on the DT matrix. Training requires the hyperparameters discussed above; the gensim module lets the LDA model be estimated from the training corpus and then used to infer the topic distribution of new documents.
# Use gensim to create an LDA model object
Lda = gensim.models.ldamodel.LdaModel
# Run and train the LDA model on the DT matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)
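Once trained, the topics can be inspected and the model applied to unseen text. A minimal sketch; the word weights will vary from run to run, and new_doc below is a hypothetical example, not part of the original corpus:

# Print the 3 learned topics, showing the top 3 words of each
print(ldamodel.print_topics(num_topics=3, num_words=3))

# Infer the topic distribution of a new, unseen document (hypothetical example)
new_doc = "My sister practices dance while my father drives."
new_bow = dictionary.doc2bow(clean(new_doc).split())
print(ldamodel.get_document_topics(new_bow))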