<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "lxml")

A few simple ways to navigate the structured data:

soup.title
<title>The Dormouse's story</title>

soup.title.name
'title'

soup.title.string
"The Dormouse's story"

soup.title.text
"The Dormouse's story"

soup.title.parent.name
'head'

soup.p
<p class="title"><b>The Dormouse's story</b></p>

soup.p.name
'p'

soup.p["class"]
['title']

soup.a
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

soup.find("a")
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

soup.find_all("a")
[<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
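
The same lookups can be tried on a much smaller document. The following is a minimal sketch: `sample_html` and the variable names are made up for illustration, and it uses the built-in "html.parser" so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, much smaller than the tutorial's html_doc.
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'><b>Hello</b></p>"
    "<a id='first'>one</a><a id='second'>two</a></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

title_name = soup.title.name          # name of the <title> tag
title_text = soup.title.string        # the text inside <title>
parent_name = soup.title.parent.name  # the tag that encloses <title>
intro_classes = soup.p["class"]       # class values come back as a list
first_link = soup.find("a")           # first <a>, like soup.a
link_ids = [a["id"] for a in soup.find_all("a")]  # every <a>, in order

print(title_name, title_text, parent_name, intro_classes, link_ids)
```

Attribute-style access (`soup.title`, `soup.p`) and `find()` both stop at the first match, while `find_all()` returns every match as a list.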

Extract the URLs of all <a> tags in the document:

for link in soup.find_all("a"):
    print(link.get("href"))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
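
Real pages often contain <a> tags without an href. The sketch below, on made-up markup, shows that `get("href")` returns None in that case, and that passing `href=True` to `find_all()` keeps only the tags that actually have the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href attribute.
sample_html = (
    '<p><a href="http://example.com/elsie">Elsie</a> '
    '<a id="no-href">placeholder</a></p>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# .get() returns None instead of raising when the attribute is missing.
all_hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True keeps only <a> tags that have an href attribute at all.
real_hrefs = [link["href"] for link in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```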

Extract all the text from the document:

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

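As the output above shows, `get_text()` keeps the whitespace of the source markup. A small sketch, using made-up markup, of two ways to get tidier text: the `separator` and `strip` arguments of `get_text()`, and the `stripped_strings` generator.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with stray whitespace around the text.
sample_html = "<div><p> Hello </p><p>world</p></div>"
soup = BeautifulSoup(sample_html, "html.parser")

raw_text = soup.get_text()                             # whitespace kept as-is
clean_text = soup.get_text(separator=" ", strip=True)  # pieces stripped, then joined
pieces = list(soup.stripped_strings)                   # the stripped pieces themselves

print(repr(raw_text))
print(repr(clean_text))
print(pieces)
```
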
Accessing tags and attributes

A Tag has many methods and attributes, which are covered in detail in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and its attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag
<b class="boldest">Extremely bold</b>

type(tag)
bs4.element.Tag

The name attribute

Every tag has a name, accessible as .name:

tag.name
'b'

If you change a tag's name, the change is reflected in any HTML markup generated from the Beautiful Soup object:

tag.name = "blockquote"
tag
<blockquote class="boldest">Extremely bold</blockquote>

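To see that a rename really changes the tree and not only the one variable, the following sketch (made-up markup, "html.parser" assumed) renames a nested tag and then serializes the whole document.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <b> tag is nested inside a <p>.
soup = BeautifulSoup('<p><b class="boldest">Extremely bold</b></p>', "html.parser")
tag = soup.b
tag.name = "blockquote"  # rename the tag in place

old_lookup = soup.b      # no <b> is left in the tree, so this is None
whole_document = str(soup)

print(old_lookup)
print(whole_document)
```
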
Attributes

A tag may have any number of attributes. The tag above has a "class" attribute whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

tag["class"]
['boldest']

tag.attrs
{'class': ['boldest']}

A tag's attributes can be added, removed, or modified. Again, this works exactly as it does for a dictionary:

tag["class"] = "verybold"
tag["id"] = 1
tag
<blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag["class"]
tag
<blockquote id="1">Extremely bold</blockquote>

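Because attribute access follows dictionary rules, reading a deleted attribute with square brackets raises KeyError. A minimal sketch of the safer lookups, `get()` and `has_attr()`; the id value "headline" is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["id"] = "headline"    # add an attribute (hypothetical value)
tag["class"] = "verybold" # overwrite an existing attribute
del tag["class"]          # remove it again

missing_class = tag.get("class")  # None, no exception
has_id = tag.has_attr("id")       # True

try:
    tag["class"]                  # dictionary-style access to a missing key
    lookup_failed = False
except KeyError:
    lookup_failed = True

rendered = str(tag)
print(rendered, missing_class, has_id, lookup_failed)
```
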
Multi-valued attributes

Some attributes, such as class, can hold more than one value, so Beautiful Soup returns their values as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.p['class']
['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', "html.parser")
css_soup.p['class']
['body']

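The list treatment applies only to attributes that HTML defines as multi-valued. The sketch below, on made-up markup, contrasts class with id (which stays a single string even when it contains a space) and shows that a list assigned to class is joined with spaces on output.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: both attribute values contain a space.
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

class_value = soup.p["class"]       # multi-valued: split into a list
id_value = soup.p["id"]             # not multi-valued: left as one string
class_string = " ".join(class_value)

soup.p["class"] = ["body", "bold"]  # a list is joined with spaces when rendered
rendered = str(soup.p)

print(class_value, repr(id_value), repr(class_string))
print(rendered)
```
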
Navigable strings

Strings are usually contained within tags. Beautiful Soup uses the NavigableString class to wrap the strings inside a tag:

tag.string
'Extremely bold'

type(tag.string)
bs4.element.NavigableString

A NavigableString behaves like a Python Unicode string, and it also supports some of the features described in Navigating the tree and Searching the tree.
A NavigableString can be converted directly to a plain Unicode string with str() (unicode() in Python 2).

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
<blockquote id="1">No longer bold</blockquote>
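
A short sketch of both points, assuming Python 3 and the built-in "html.parser": converting a NavigableString with `str()`, then swapping the text of the tag with `replace_with()`. The variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
text_node = soup.b.string

is_navigable = isinstance(text_node, NavigableString)  # wrapped by Beautiful Soup
is_also_str = isinstance(text_node, str)               # and still a Python string
parent_name = text_node.parent.name                    # it knows where it sits in the tree

plain_text = str(text_node)  # plain copy of the text

# The text cannot be edited in place, so replace the whole node.
text_node.replace_with("No longer bold")
rendered = str(soup.b)

print(plain_text, parent_name)
print(rendered)
```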

Comments and other special strings

The comment portion of a document:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
comment
'Hey, buddy. Want to buy a used parser?'

type(comment)
bs4.element.Comment

A Comment object is just a special type of NavigableString:

comment
'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.prettify())
<html>
 <body>
  <b>
   <!--Hey, buddy. Want to buy a used parser?-->
  </b>
 </body>
</html>
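
A common practical task is stripping comments before extracting text. The following is one possible sketch, not something the tutorial prescribes: it finds every Comment with a `string=` filter in `find_all()` and removes each one with `extract()`. The markup is made up, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup with a comment in the middle of the text.
markup = "<p>Visible<!-- hidden note --> text</p>"
soup = BeautifulSoup(markup, "html.parser")

# Match string nodes whose type is Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# extract() removes each comment node from the tree.
for c in comments:
    c.extract()

cleaned = str(soup)
print(comment_texts)
print(cleaned)
```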

CData is another NavigableString subclass; here it replaces the comment:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
<b>
 <![CDATA[A CDATA block]]>
</b>

Navigating the tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

Child nodes

A Tag may contain strings and other Tags; these are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

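Two of those attributes are `.contents` (a list of the direct children) and `.children` (an iterator over the same nodes). This excerpt does not show them, so the sketch below is only an illustration on made-up markup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a <p> with a string, a tag, and another string.
soup = BeautifulSoup("<p>Hello <b>bold</b> world</p>", "html.parser")
paragraph = soup.p

children = paragraph.contents  # list of direct children
kinds = [
    "tag" if isinstance(child, Tag) else "string"
    for child in paragraph.children  # iterator over the same nodes
]

print(children)
print(kinds)
```

Strings and tags are both children, which is why code that walks `.children` usually checks the node type first.
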
soup.head
<head><title>The Dormouse's story</title></head>

soup.title
<title>The Dormouse's story</title>

This is a handy trick for reaching a tag, and it can be applied repeatedly to zoom in on one part of the parse tree. The following code gets the first <b> tag beneath the <body> tag:

soup.body.b
<b>The Dormouse's story</b>
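
One detail worth knowing when chaining tag names: a name that does not exist yields None, so a longer chain fails with AttributeError at the next step. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <b> tags and no <table>.
sample_html = (
    "<html><head><title>T</title></head>"
    "<body><p><b>first</b></p><p><b>second</b></p></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

first_bold = soup.body.b.string  # only the first <b> under <body>
missing = soup.table             # no <table> in the document, so None

# Guard before chaining further through a tag that may be absent.
cell_text = soup.table.td.string if soup.table is not None else None

print(first_bold, missing, cell_text)
```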

Using a tag name as an attribute gives you only the first tag by that name:

soup.a
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

The find_all method

To get all the <a> tags, or anything more complicated than the first tag with a given name, you need one of the methods described in Searching the tree, such as find_all().
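
As a rough sketch of what `find_all()` accepts, the example below filters by tag name, by CSS class (`class_`), by id, by a list of names, and with a `limit`. The markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <b> and three <a> tags.
sample_html = (
    '<p><b>x</b>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="brother" id="link3">Tom</a></p>'
)
soup = BeautifulSoup(sample_html, "html.parser")

all_links = soup.find_all("a")                   # every <a> tag
sisters = soup.find_all("a", class_="sister")    # filter by CSS class
by_id = soup.find_all(id="link2")                # filter by attribute value
first_two = soup.find_all("a", limit=2)          # stop after two matches
mixed = [t.name for t in soup.find_all(["a", "b"])]  # several names at once

print(len(all_links), len(sisters), by_id[0].string, len(first_two), mixed)
```

Results come back in document order, and an unmatched search returns an empty list, not None.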