
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "lxml")
A few simple ways to navigate the parsed data structure:

soup.title
<title>The Dormouse's story</title>

soup.title.name
'title'

soup.title.string
"The Dormouse's story"

soup.title.text
"The Dormouse's story"

soup.title.parent.name
'head'

soup.p
<p class="title"><b>The Dormouse's story</b></p>

soup.p.name
'p'

soup.p["class"]
['title']

soup.a
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

soup.find("a")
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

soup.find_all("a")
[<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
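In the listing above, .string and .text return the same value for the title, which can hide the difference between them. The sketch below uses a made-up fragment (the `demo` markup and all variable names are assumptions for illustration, and "html.parser" is chosen only so that no extra parser needs to be installed) to show where the two diverge.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, not taken from the story document above.
demo = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")

bold_string = demo.b.string   # <b> has exactly one child, a string
para_string = demo.p.string   # <p> has two children, so .string is None
para_text = demo.p.text       # .text joins the text of all descendants

print(bold_string, para_string, para_text, sep=" | ")
```

As a rule of thumb, .string suits tags known to hold a single piece of text, while .text (or get_text()) is the safer choice when a tag may contain nested tags.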
Find the URLs of all the <a> tags in the document:

for link in soup.find_all("a"):
    print(link.get("href"))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
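The loop above assumes that every <a> tag carries an href. As a minimal sketch of what happens when one does not, the example below uses invented markup (`sample`, `link_soup` and the other names are hypothetical) and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: the second link has no href attribute.
sample = '<a href="http://example.com/elsie">Elsie</a><a id="nolink">Lacie</a>'
link_soup = BeautifulSoup(sample, "html.parser")

# .get() behaves like dict.get(): a missing attribute comes back as None.
hrefs = [link.get("href") for link in link_soup.find_all("a")]
print(hrefs)

# Passing href=True keeps only the tags that actually have an href.
present = [link["href"] for link in link_soup.find_all("a", href=True)]
print(present)
```

Filtering with href=True is one way to avoid None entries; checking each value after .get() would work just as well.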
Extract all the text from the document:

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
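The output above keeps the whitespace of the original markup. get_text() also accepts a separator and a strip flag, which may help when cleaner text is wanted. The following is only a sketch on a made-up fragment (`fragment` and `text_soup` are hypothetical names).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
fragment = "<p> Once upon a time <b>three</b> sisters </p>"
text_soup = BeautifulSoup(fragment, "html.parser")

# Default: the text pieces are concatenated exactly as they appear.
raw_text = text_soup.get_text()
print(repr(raw_text))

# separator joins the pieces; strip=True trims whitespace around each piece
# and drops pieces that are empty after trimming.
clean_text = text_soup.get_text(separator="|", strip=True)
print(clean_text)
```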
Accessing tags and attributes

A Tag has many methods and attributes, which are covered in detail in "Navigating the tree" and "Searching the tree". For now, the most important features of a tag are its name and its attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
tag
<b class="boldest">Extremely bold</b>

type(tag)
bs4.element.Tag
The name attribute

Every tag has a name, available as .name:

tag.name
'b'

If you change a tag's name, the change shows up in any HTML markup generated by the current Beautiful Soup object:

tag.name = "blockquote"
tag
<blockquote class="boldest">Extremely bold</blockquote>

Attributes

A tag may have any number of attributes. The tag above has a "class" attribute whose value is "boldest". A tag's attributes are accessed the same way as dictionary entries:

tag["class"]
['boldest']

tag.attrs
{'class': ['boldest']}

A tag's attributes can be added, removed, or modified. Again, this works exactly as it does for a dictionary:

tag["class"] = "verybold"
tag["id"] = 1
tag
<blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag["class"]
tag
<blockquote id="1">Extremely bold</blockquote>
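To make the dictionary analogy concrete, the sketch below repeats the add, modify, and delete steps on a fresh tag and then checks the result through .attrs and through the serialized document. The names `attr_soup` and `b_tag` and the id value "intro" are invented for this illustration.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
b_tag = attr_soup.b

b_tag["id"] = "intro"         # add a new attribute
b_tag["class"] = "verybold"   # overwrite an existing one
del b_tag["class"]            # remove it again

# .attrs is the underlying dictionary, so the usual membership test works.
has_class = "class" in b_tag.attrs
# Edits made through the tag are visible when the document is serialized.
serialized = str(attr_soup)
print(has_class, serialized)
```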
Multi-valued attributes

Some attributes, such as class, can hold several values at once. Beautiful Soup presents the value of such an attribute as a list, even when there is only one value:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "lxml")
css_soup.p['class']
['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', "lxml")
css_soup.p['class']
['body']
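Not every attribute is treated this way. As a small sketch on a made-up tag (the markup and the name `multi` are assumptions for illustration), an id containing a space stays a single string, while a list assigned to class is joined with spaces when the tag is written out.

```python
from bs4 import BeautifulSoup

multi = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

class_value = multi.p["class"]   # multi-valued: parsed into a list
id_value = multi.p["id"]         # not multi-valued: one string, space kept
print(class_value, repr(id_value))

# Assigning a list to a multi-valued attribute joins the values on output.
multi.p["class"] = ["body", "bold"]
rendered = str(multi.p)
print(rendered)
```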
Navigable strings

A string is usually a bit of text inside a tag. Beautiful Soup uses the NavigableString class to wrap these strings:

tag.string
'Extremely bold'

type(tag.string)
bs4.element.NavigableString

A NavigableString behaves just like a Python Unicode string, and it additionally supports some of the features described in "Navigating the tree" and "Searching the tree".
A NavigableString can be converted directly into a plain Unicode string by calling str() on it (unicode() in Python 2).
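A minimal sketch of that conversion under Python 3 follows. The names `str_soup`, `nav`, and `plain` are hypothetical; the point is that the converted copy is an ordinary str that no longer knows where it sat in the tree.

```python
from bs4 import BeautifulSoup

str_soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
nav = str_soup.b.string

plain = str(nav)  # Python 3 spelling of the unicode() call
print(type(nav).__name__, type(plain).__name__)

# The NavigableString is still attached to the tree; the plain copy is not.
print(nav.parent.name)
print(hasattr(plain, "parent"))
```

Converting to a plain string is worth considering when the text is kept after parsing is finished, since a NavigableString holds a reference to the tree it came from.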
A string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
<blockquote id="1">No longer bold</blockquote>

Comments and other special strings

Comments in a document:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
comment
'Hey, buddy. Want to buy a used parser?'

type(comment)
bs4.element.Comment

A Comment object is just a special type of NavigableString:

comment
'Hey, buddy. Want to buy a used parser?'
But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.prettify())
<html>
 <body>
  <b>
   <!--Hey, buddy. Want to buy a used parser?-->
  </b>
 </body>
</html>
Other special strings, such as CData, work the same way. For example, replacing the comment with a CDATA block:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
<b>
 <![CDATA[A CDATA block]]>
</b>
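Because a Comment is a subclass of NavigableString, a comment can be told apart from ordinary text with isinstance. The sketch below, on invented markup (`comment_soup` and the other names are hypothetical), collects the comments in a document and then removes them.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Visible text<!-- hidden note --></p>"
comment_soup = BeautifulSoup(markup, "html.parser")

# Search the string nodes and keep only those that are Comment instances.
comments = comment_soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]
print(comment_texts)

# extract() removes a node from the tree, so it no longer appears in the output.
for c in comments:
    c.extract()
cleaned = str(comment_soup.p)
print(cleaned)
```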
Navigating the tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
Child nodes

A Tag may contain strings and other Tags; these are the Tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

soup.head
<head><title>The Dormouse's story</title></head>

soup.title
<title>The Dormouse's story</title>

This is a handy trick for getting at tags, and it can be applied repeatedly to zoom in on a part of the tree. The following code gets the first <b> tag inside the <body> tag:

soup.body.b
<b>The Dormouse's story</b>
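Chained attribute access of this kind appears to behave like a series of find() calls. The following sketch, on a made-up document (`doc` and `nav_soup` are hypothetical names), compares the two and shows what comes back for a tag name that does not occur.

```python
from bs4 import BeautifulSoup

doc = (
    "<html><head><title>T</title></head>"
    "<body><p><b>one</b></p><p><b>two</b></p></body></html>"
)
nav_soup = BeautifulSoup(doc, "html.parser")

# Attribute access returns the first matching descendant, like find().
first_b = nav_soup.body.b
same_node = first_b is nav_soup.find("body").find("b")
print(first_b, same_node)

# A tag name that does not occur yields None rather than raising an error.
missing = nav_soup.body.table
print(missing)
```

Since a missing tag gives None, a longer chain such as nav_soup.body.table.tr would raise AttributeError on the None value, so it may be worth checking intermediate results when the document structure is not guaranteed.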
Accessing a tag by attribute name only gives you the first tag with that name: