pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple beautifulsoup4

Then just import it:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "lxml")
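The "lxml" parser named above is a separate package, so the constructor fails if it is not installed. The following is a minimal sketch of one way to fall back to Python's built-in parser; the helper name make_soup and the shortened sample markup are assumptions made for illustration, not part of the original post.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened, hypothetical sample markup.
sample = "<html><head><title>The Dormouse's story</title></head><body></body></html>"


def make_soup(markup):
    """Parse with lxml when it is available, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(sample)
print(soup.title.string)  # The Dormouse's story
```

The two parsers can build slightly different trees for broken markup, so it is usually better to pick one explicitly once you know what is installed.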
A few simple ways to navigate the parsed data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.text
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p.name
# 'p'

soup.p["class"]
# ['title']

soup.a
# <a class="sister" id="link1">Elsie</a>

soup.find("a")
# <a class="sister" id="link1">Elsie</a>

soup.find_all("a")
# [<a class="sister" id="link1">Elsie</a>,
#  <a class="sister" id="link2">Lacie</a>,
#  <a class="sister" id="link3">Tillie</a>]
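The calls above all match something. As a small sketch of what happens when nothing matches, the snippet below uses a shortened sample document and the built-in "html.parser" (both are choices made here, not taken from the post): find() returns None, while find_all() returns an empty list.

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a class="sister" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

missing_one = soup.find("table")      # there is no <table> in the markup
missing_all = soup.find_all("table")

print(missing_one)  # None
print(missing_all)  # []

# Guard before dereferencing: attribute access on None raises AttributeError.
first_link = soup.find("a")
if first_link is not None:
    print(first_link["id"])  # link1
```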
Find the links of all <a> tags in the document. link.get("href") returns None for a tag that has no href attribute, so the output below assumes each <a> tag in the sample carries one, e.g. href="http://example.com/elsie":

for link in soup.find_all("a"):
    print(link.get("href"))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
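The loop above depends on every <a> tag having an href. The sketch below, which uses a hypothetical three-link sample where the second tag deliberately lacks an href, shows the difference between collecting every value with get() and filtering with find_all("a", href=True).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second tag deliberately has no href.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# get() yields None for the tag without an href.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only tags that actually define the attribute.
only_linked = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)    # ['http://example.com/elsie', None, 'http://example.com/tillie']
print(only_linked)  # ['http://example.com/elsie', 'http://example.com/tillie']
```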
Get all the text content in the document:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
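As the output above shows, get_text() keeps the whitespace found in the markup. A minimal sketch of the optional separator and strip arguments follows; the two-paragraph sample and the "|" separator are arbitrary choices made for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>  Elsie </p>\n<p> Lacie </p>"
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.get_text()
clean_text = soup.get_text("|", strip=True)

print(repr(raw_text))  # '  Elsie \n Lacie '
print(clean_text)      # Elsie|Lacie
```

With strip=True each string is stripped and whitespace-only strings are dropped before the pieces are joined with the separator.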
Accessing data through tags and attributes

A Tag has many methods and attributes, which are explained in detail in "Navigating the tree" and "Searching the tree". For now, here are the two most important features of a tag: its name and its attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
tag
# <b class="boldest">Extremely bold</b>

type(tag)
# bs4.element.Tag
The name attribute

Every tag has a name, accessible as .name:

tag.name
# 'b'

If you change a tag's name, the change is reflected in any HTML markup generated by the current Beautiful Soup object:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

Attributes
A tag may have any number of attributes. This tag has one attribute, "class", whose value is "boldest" (returned as a list, because class can hold several values). You access a tag's attributes by treating the tag like a dictionary:

tag["class"]
# ['boldest']

tag.attrs
# {'class': ['boldest']}
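Because attribute access follows dictionary rules, looking up a missing attribute with square brackets raises KeyError, while get() returns None or a default. A short sketch, using the built-in "html.parser" and a made-up "no-id" default:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

missing = tag.get("id")             # None: the attribute is absent
fallback = tag.get("id", "no-id")   # explicit default value
has_class = "class" in tag.attrs    # membership test on the attrs dict

print(missing)    # None
print(fallback)   # no-id
print(has_class)  # True

try:
    tag["id"]
    raised = False
except KeyError:
    raised = True
print(raised)     # True
```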
A tag's attributes can be added, removed, or modified. Again, this works exactly like a dictionary:

tag["class"] = "verybold"
tag["id"] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag["class"]
tag
# <blockquote id="1">Extremely bold</blockquote>

Multi-valued attributes

Some attributes, such as class, can hold more than one value. Beautiful Soup returns the value of such an attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "lxml")
css_soup.p['class']
# ['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', "lxml")
css_soup.p['class']
# ['body']

Navigable strings
Strings usually sit inside tags. Beautiful Soup uses the NavigableString class to wrap the text within a tag:

tag.string
# 'Extremely bold'

type(tag.string)
# bs4.element.NavigableString

A NavigableString behaves like a Python Unicode string, and it also supports some of the features described in "Navigating the tree" and "Searching the tree". You can convert a NavigableString directly to a plain Unicode string with str() (unicode() in Python 2).
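A minimal sketch of that conversion on Python 3, assuming the built-in "html.parser" and a shortened sample tag: str() yields a plain string with the same text, while the NavigableString still knows which tag it belongs to.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
navigable = soup.b.string
plain = str(navigable)

print(type(navigable) is NavigableString)  # True
print(type(plain) is str)                  # True
print(plain == "Extremely bold")           # True
print(navigable.parent.name)               # b
```

Converting is also useful when a string needs to outlive the parse tree, since a NavigableString keeps a reference to the tree it came from.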
A string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote id="1">No longer bold</blockquote>
Comments and other special strings

The comment portion of a document:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
comment
# 'Hey, buddy. Want to buy a used parser?'

type(comment)
# bs4.element.Comment
A Comment object is just a special type of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'
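Since a Comment is a NavigableString subclass, a search for strings returns comments as well. The sketch below shows one way to separate the two with isinstance(); the extra "Visible text" in the sample and the use of "html.parser" are assumptions added for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?-->Visible text</b>"
soup = BeautifulSoup(markup, "html.parser")

# Select only the comment nodes.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))                             # 1
print(isinstance(comments[0], NavigableString))  # True

# Keep every string that is not a comment.
visible = [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]
print(visible)                                   # ['Visible text']
```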
But when it appears as part of an HTML document, a Comment is output with special formatting:

print(soup.prettify())
# <html>
#  <body>
#   <b>
#    <!--Hey, buddy. Want to buy a used parser?-->
#   </b>
#  </body>
# </html>
Other special strings, such as CDATA blocks, are handled in the same way:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
3 3 `6 i9 M9 k, s0 y9 Y3 d遍历文档树 3 w/ i6 `5 y: _ t; O2 k' V* V4 shtml_doc = """( ?* c. N+ O6 @' i( X& F& ]
<html><head><title>The Dormouse's story</title></head> " O# a3 \1 A; [; s. | <body> " j% X5 O. a1 G+ |9 B( _+ L<p class="title"><b>The Dormouse's story</b></p>) [: H3 J) r# G8 \+ f8 H' p
- u! I- T# F7 c9 Z7 [ + |4 m3 j3 k; H* `<p class="story">Once upon a time there were three little sisters; and their names were7 k' h+ L: `. [8 ^; \: w' V
<a class="sister" id="link1">Elsie</a>, 0 @: V1 p2 R% v q<a class="sister" id="link2">Lacie</a> and$ J# n1 i+ l* G8 ~) }
<a class="sister" id="link3">Tillie</a>; 2 s. K: B' v! Mand they lived at the bottom of a well.</p>4 G, e) c; }$ S# ?! G