pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple beautifulsoup4

Then simply import it:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "lxml")
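The second argument to BeautifulSoup names the parser, and "lxml" only works if the lxml package is installed. The sketch below shows one possible way to fall back to Python's built-in "html.parser". The helper name make_soup is made up for this illustration and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the stdlib parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


demo = make_soup('<p class="title"><b>The Dormouse\'s story</b></p>')
print(demo.b.string)
```

The two parsers can build slightly different trees for broken markup, so it is probably best to pick one and use it consistently.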
A few simple ways to navigate the parsed structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.text
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p.name
# 'p'

soup.p["class"]
# ['title']

soup.a
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

soup.find("a")
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

soup.find_all("a")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
#  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
#  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
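find() and find_all() also accept attribute filters, which is usually what you want on a real page. A minimal sketch on a shortened copy of the sample markup (the variable names here are only for illustration, and "html.parser" is used so that lxml is not required):

```python
from bs4 import BeautifulSoup

doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

# class is a reserved word in Python, so the keyword is spelled class_.
sisters = soup.find_all("a", class_="sister")
names = [a.string for a in sisters]

# Any other attribute can be passed as a keyword argument.
third = soup.find(id="link3")

# find() returns None when nothing matches; find_all() returns an empty list.
missing = soup.find("table")

print(names, third["href"], missing)
```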
Find the URLs of all <a> tags in the document:

for link in soup.find_all("a"):
    print(link.get("href"))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
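link.get("href") returns None for an <a> tag that has no href. If you only want real links, one option is to filter with href=True, as in this small sketch (the markup is invented for the example):

```python
from bs4 import BeautifulSoup

doc = '<a href="http://example.com/elsie">Elsie</a><a name="top">no link here</a>'
soup = BeautifulSoup(doc, "html.parser")

# .get() never raises; a missing attribute comes back as None.
maybe_urls = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only the tags that actually carry an href attribute.
urls = [a["href"] for a in soup.find_all("a", href=True)]

print(maybe_urls)
print(urls)
```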
Extract all the text from the document:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
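The raw get_text() output keeps every line break and space from the markup. If that is too noisy, get_text() takes a separator and a strip flag, and .stripped_strings yields the pieces one by one. A short sketch with made-up markup:

```python
from bs4 import BeautifulSoup

doc = "<div><h1> Title </h1>\n<p> First paragraph. </p>\n<p> Second paragraph. </p></div>"
soup = BeautifulSoup(doc, "html.parser")

# Join the text pieces with a separator and strip whitespace from each one.
flat = soup.get_text(" | ", strip=True)

# The same pieces as a list; whitespace-only strings are skipped.
pieces = list(soup.stripped_strings)

print(flat)
print(pieces)
```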
Accessing tags and their attributes

A Tag has many methods and attributes, which are covered in detail in "Navigating the tree" and "Searching the tree". For now, the two most important features of a tag are its name and its attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
tag
# <b class="boldest">Extremely bold</b>

type(tag)
# <class 'bs4.element.Tag'>
The name attribute

Every tag has a name, available as .name:

tag.name
# 'b'

If you change a tag's name, the change shows up in any HTML markup generated from the current Beautiful Soup object:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
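To see that a rename really changes the tree and not just the variable, here is a small sketch that prints the whole soup afterwards (the surrounding <p> is added only for this example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Some <b class="boldest">bold</b> text</p>', "html.parser")
tag = soup.b

tag.name = "blockquote"

# The document itself now contains <blockquote>, and there is no <b> left.
print(soup)
print(soup.b)
```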
Attributes

A tag can have any number of attributes. The tag above has a "class" attribute whose value is "boldest". Attributes are accessed the same way as dictionary keys:

tag["class"]
# ['boldest']

tag.attrs
# {'class': ['boldest']}

Attributes can be added, removed, or modified, again exactly as with a dictionary:

tag["class"] = "verybold"
tag["id"] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag["class"]
tag
# <blockquote id="1">Extremely bold</blockquote>
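Because a tag behaves like a dictionary, reading a missing attribute with square brackets raises KeyError. A brief sketch of the safer alternatives, .get() and .has_attr() (the id value "quote-1" is made up):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser").b

# .get() returns None instead of raising when the attribute is absent.
missing_id = tag.get("id")

# .has_attr() tests for an attribute without reading it.
has_class = tag.has_attr("class")

try:
    tag["id"]
    raised = False
except KeyError:
    raised = True

# Assigning a new key adds the attribute to the tag.
tag["id"] = "quote-1"
print(missing_id, has_class, raised, tag)
```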
Multi-valued attributes

Some attributes, such as class, can hold several values at once. Beautiful Soup returns the value of such an attribute as a list, even when it contains only one item:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "lxml")
css_soup.p['class']
# ['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', "lxml")
css_soup.p['class']
# ['body']
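Only attributes that HTML defines as multi-valued are split into lists. An attribute like id is left as a single string even if it contains a space, and a list assigned to a multi-valued attribute is joined back together on output. A sketch along the lines of the official documentation:

```python
from bs4 import BeautifulSoup

# id is not a multi-valued attribute, so the value stays one string.
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_value = id_soup.p["id"]

# rel is multi-valued: it is read as a list and written back space-separated.
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
rel_before = rel_soup.a["rel"]
rel_soup.a["rel"] = ["index", "contents"]

print(id_value)
print(rel_before)
print(rel_soup.p)
```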
NavigableString

Text usually lives inside a tag. Beautiful Soup wraps that text in the NavigableString class:

tag.string
# 'Extremely bold'

type(tag.string)
# <class 'bs4.element.NavigableString'>

A NavigableString behaves like a Python Unicode string, and additionally supports some of the features described in "Navigating the tree" and "Searching the tree". It can be converted to a plain Unicode string with str() (unicode() in Python 2).
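A minimal sketch of that conversion in Python 3. The practical difference is that a NavigableString still points back into the parse tree, while the plain str does not:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup("<b>Extremely bold</b>", "html.parser").b

nav = tag.string        # NavigableString, still attached to the tree
plain = str(nav)        # ordinary Python string, detached from the tree

print(type(nav).__name__, type(plain).__name__)
print(nav.parent is tag, hasattr(plain, "parent"))
```

Converting is worth doing if you keep strings around after you are done with the soup, since a NavigableString holds a reference to the tree it came from.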
The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote id="1">No longer bold</blockquote>
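Two related details, sketched below on a fresh tag: replace_with() hands back the string it removed, and assigning to .string is another way to swap a tag's text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

# replace_with() returns the element that was taken out of the tree.
old = soup.b.string.replace_with("No longer bold")
after_replace = str(soup.b)

# Assigning to .string replaces the tag's contents with the new text.
soup.b.string = "Bold again"
after_assign = str(soup.b)

print(old, "->", after_replace, "->", after_assign)
```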
Comments and other special strings

A comment in a document:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
comment
# 'Hey, buddy. Want to buy a used parser?'

type(comment)
# <class 'bs4.element.Comment'>

A Comment object is just a special type of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'
But when it appears as part of an HTML document, a Comment is output with special formatting:

print(soup.prettify())
# <html>
#  <body>
#   <b>
#    <!--Hey, buddy. Want to buy a used parser?-->
#   </b>
#  </body>
# </html>

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
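Since Comment is its own class, comments can be picked out of a document with an isinstance check. The sketch below (with invented markup) collects every comment, shows that get_text() leaves them out, and then removes them from the tree:

```python
from bs4 import BeautifulSoup, Comment

markup = "<p>Visible text<!--first note--></p><div><!--second note--></div>"
soup = BeautifulSoup(markup, "html.parser")

# string= accepts a function; here it keeps only Comment objects.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# get_text() does not include comments.
visible = soup.get_text()

# extract() removes each comment from the tree.
for c in comments:
    c.extract()
cleaned = str(soup)

print(comment_texts)
print(visible)
print(cleaned)
```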
Navigating the tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>