
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "lxml")

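The second argument to BeautifulSoup names the parser. As a rough sketch (the names snippet, builtin and with_lxml are illustrative and not from the original post), the following compares Python's built-in "html.parser" with "lxml". It assumes bs4 is installed and treats lxml as optional:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Assumption: bs4 is installed; lxml may or may not be available.
snippet = "<b>bold</b>"

# The built-in parser keeps the fragment as it is.
builtin = str(BeautifulSoup(snippet, "html.parser"))

# lxml wraps a fragment in <html><body>...</body></html>.
try:
    with_lxml = str(BeautifulSoup(snippet, "lxml"))
except FeatureNotFound:
    with_lxml = None  # lxml is not installed in this environment

print(builtin)
print(with_lxml)
```

Naming the parser explicitly keeps results the same from one machine to the next, since the default depends on which parsers happen to be installed.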
A few simple ways to navigate the structured data:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.text
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p.name
# 'p'

soup.p["class"]
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find("a")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

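The calls above suggest that soup.a is shorthand for soup.find("a"), while find_all() returns every match. A small sketch under that assumption, using a made-up two-link document rather than the tutorial's html_doc:

```python
from bs4 import BeautifulSoup

# Assumption: a tiny illustrative document, not the tutorial's html_doc.
html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_by_attr = soup.a          # attribute-style access
first_by_find = soup.find("a")  # explicit search for the first match
all_links = soup.find_all("a")  # every match, in document order

ids = [a["id"] for a in all_links]

# A tag that does not exist yields None rather than raising an error.
missing = soup.find("table")

print(first_by_attr is first_by_find, ids, missing)
```

Because a missing tag gives None, chained access such as soup.table.tr fails with an AttributeError, so it is worth checking for None when the document structure is not guaranteed.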
Find the links of all <a> tags in the document:

for link in soup.find_all("a"):
    print(link.get("href"))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

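The loop uses link.get("href") rather than link["href"]. A short sketch of why that matters, using an illustrative document in which one <a> tag has no href (the variable names are assumptions, not from the original):

```python
from bs4 import BeautifulSoup

# Assumption: illustrative markup; the second <a> deliberately has no href.
html = '<a href="http://example.com/elsie">Elsie</a><a id="top">Top</a>'
soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute instead of raising KeyError.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True restricts the search to tags that actually carry the attribute,
# so dictionary-style access is safe here.
real_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```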
Extract all the text from the document:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...

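get_text() also accepts a separator and a strip flag, which helps when the raw output carries stray whitespace. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Assumption: illustrative markup with padding inside the first <p>.
soup = BeautifulSoup("<div><p> Hello </p><p>world</p></div>", "html.parser")

# Default: the strings are concatenated exactly as they appear.
raw_text = soup.get_text()

# With a separator and strip=True, each string is stripped first
# and the pieces are joined with the separator.
clean_text = soup.get_text("|", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```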
Accessing tags and attributes
A Tag has many methods and attributes, which are covered in detail in "Navigating the tree" and "Searching the tree". For now, here are the two most important features of a tag: its name and its attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
tag
# <b class="boldest">Extremely bold</b>

type(tag)
# bs4.element.Tag

The name attribute
Every tag has a name, accessible as .name:

tag.name
# 'b'

If you change a tag's name, the change shows up in any HTML markup generated by the current Beautiful Soup object:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

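To check that a rename affects the whole document and not only the tag variable, one can render the soup itself afterwards. A small sketch (using "html.parser" so that no <html>/<body> wrapper is added; the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used so the fragment is rendered without wrappers.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"

# The rename is visible in the markup generated from the soup...
rendered = str(soup)

# ...and in later lookups: there is no <b> tag any more.
old_lookup = soup.b
new_lookup = soup.blockquote

print(rendered)
```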
Attributes
A tag may have any number of attributes. This tag has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

tag["class"]
# ['boldest']

tag.attrs
# {'class': ['boldest']}

You can add, remove, and modify a tag's attributes. Again, this works just as it does with a dictionary:

tag["class"] = "verybold"
tag["id"] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag["class"]
tag
# <blockquote id="1">Extremely bold</blockquote>

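The dictionary analogy extends to lookups of attributes that are absent. A brief sketch, assuming a fresh <b> tag rather than the renamed one above:

```python
from bs4 import BeautifulSoup

# Assumption: a fresh illustrative tag, independent of the session above.
tag = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser").b

tag["id"] = "1"       # add an attribute
del tag["class"]      # remove an attribute

# Like dict.get(), .get() returns None for a missing attribute,
# whereas tag["class"] would now raise KeyError.
missing = tag.get("class")
has_id = tag.has_attr("id")

rendered = str(tag)
print(rendered)
```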
Multi-valued attributes
Some attributes, class being the most common, can hold several values at once. Beautiful Soup returns the value of such an attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "lxml")
css_soup.p['class']
# ['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', "lxml")
css_soup.p['class']
# ['body']

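Not every attribute is treated as multi-valued. In the sketch below (illustrative markup, not from the original post), class is split into a list, while id stays a single string even though it contains a space. Assigning a list back to class joins the values again on output:

```python
from bs4 import BeautifulSoup

# Assumption: illustrative markup with a space inside the id value.
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

classes = soup.p["class"]  # multi-valued: returned as a list
ident = soup.p["id"]       # single-valued: returned as one string

# A list assigned to a multi-valued attribute is joined with spaces on output.
soup.p["class"] = ["body", "bold"]
rendered = str(soup.p)

print(classes, repr(ident))
print(rendered)
```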
Navigable strings
A string is usually contained within a tag. Beautiful Soup uses the NavigableString class to wrap these strings:

tag.string
# 'Extremely bold'

type(tag.string)
# bs4.element.NavigableString

A NavigableString behaves just like a Python Unicode string, and it also supports some of the features described in "Navigating the tree" and "Searching the tree". You can convert a NavigableString to a plain Unicode string with str() (unicode() in Python 2).
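
The conversion described above can be sketched as follows, assuming Python 3 (where str() takes the place of unicode()). The variable names are illustrative:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
navigable = soup.b.string

# A NavigableString is a str subclass that also knows its place in the tree.
is_navigable = isinstance(navigable, NavigableString)
is_str_subclass = isinstance(navigable, str)
keeps_tree_link = navigable.parent is soup.b

# str() produces a plain string with no reference back to the parse tree.
plain = str(navigable)
plain_has_parent = hasattr(plain, "parent")

print(type(plain), plain_has_parent)
```

Converting to a plain string is useful when the text should outlive the soup, since a NavigableString keeps a reference to the whole parse tree.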

You cannot edit a string in place, but you can replace one string with another using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote id="1">No longer bold</blockquote>

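A small sketch of what happens to the old string after replace_with(), using a fresh illustrative tag. It relies on replace_with() returning the element that was removed from the tree:

```python
from bs4 import BeautifulSoup

# Assumption: a fresh illustrative tag, independent of the session above.
soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
old = soup.b.string

# replace_with() swaps the string in the tree and hands back the old one.
returned = old.replace_with("No longer bold")

rendered = str(soup)
detached = old.parent is None  # the old string no longer belongs to the tree

print(rendered, repr(str(returned)))
```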
Comments and other special strings
The comment part of a document:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
comment
# 'Hey, buddy. Want to buy a used parser?'

type(comment)
# bs4.element.Comment

A Comment object is just a special type of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.prettify())
# <html>
#  <body>
#   <b>
#    <!--Hey, buddy. Want to buy a used parser?-->
#   </b>
#  </body>
# </html>

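Because Comment is its own class, comments can be picked out with isinstance() and removed. A hedged sketch with made-up markup; it assumes a bs4 version whose find_all() accepts the string argument (older releases call it text):

```python
from bs4 import BeautifulSoup, Comment

# Assumption: illustrative markup containing one comment.
soup = BeautifulSoup("<p>Visible<!--hidden note--> text</p>", "html.parser")

# Collect every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
found = [str(c) for c in comments]

# Detach each comment from the tree.
for c in comments:
    c.extract()

rendered = str(soup)
print(found)
print(rendered)
```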
Beautiful Soup also defines classes for other special strings, such as CData. Here the comment is replaced with a CDATA block:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>

Navigating the tree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

Child nodes
A tag may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

This is a handy trick for getting at a tag: you can use it again and again to zoom in on a specific part of the parse tree. The code below gets the first <b> tag beneath the <body> tag: