[Python爬蟲] 05. BeautifulSoup 解析 HTML 程式碼

BeautifulSoup 1 2

運作方式就是讀取 HTML 原始碼，自動進行解析並產生一個 BeautifulSoup 物件，此物件中包含了整個 HTML 文件的結構樹。

本篇文章需要有 HTML 跟 CSS 基礎，可參考 02-2. HTML 跟 CSS 基礎。
Beautiful Soup 4.4.0 文檔。

爬網頁的流程

選著要爬的網址（url）。
使用python 登錄上這個網址（urlopen 等）。
讀取網頁信息（read() 出來）。
將讀取的信息放入 BeautifulSoup。
使用 BeautifulSoup 選取 tag 信息等(代替正則表達式)。

win7 開發環境

python 3.5
beautifulsoup4
```
  $ pip3 install beautifulsoup4
```

引入 Beautiful Soup 模組

原始 HTML 程式碼。

  from bs4 import BeautifulSoup

  # 原始HTML程式碼
  html_doc = """
  <html><head><title>The Dormouse's story</title></head>
  <body>
      <p class="title"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.</p>
      <p class="story">...</p>

BeautifulSoup()：以 Beautiful Soup 解析 HTML 程式碼。

  soup = BeautifulSoup(html_doc, 'html.parser')   #使用html.parser這個python解析程序
  soap = BeautifulSoup(soap,fromEncoding="UTF-8") #加入這個UTF-8編碼，可以讓編碼順利

prettify()：輸出物件美化後的 HTML 程式碼。

  print(soup.prettify())

  >>>
      <html>
      <head>
          <title>
          The Dormouse's story
          </title>
      </head>
      <body>
          <p class="title">
              <b>
              The Dormouse's story
              </b>
          </p>
          <p class="story">
          Once upon a time there were three little sisters; and their names were
              <a class="sister" href="http://example.com/elsie" id="link1">
              Elsie
              </a>
              ,
              <a class="sister" href="http://example.com/lacie" id="link2">
              Lacie
              </a>
              and
              <a class="sister" href="http://example.com/tillie" id="link2">
              Tillie
              </a>
              ; and they lived at the bottom of a well.
          </p>
          <p class="story">
          ...
          </p>
      </body>
  </html>

幾個簡單的瀏覽結構化數據的方法

soup.title & soup.head

  #1.
  title_tag = soup.title  
  print(title_tag)

  >>> 
  <title>The Dormouse's story</title>

  #2.
  soup.head

  >>> 
  <title>The Dormouse's story</title>

  #3.可以看到內部還有一層，也可以這樣取出title：
  soup.head.title

  #4.若只想取字串內容
  soup.title.string

  >>> 
  "Hello World"

  #5.
  soup.title.name

  >>> 
  u'title'

  #6.
  soup.title.string

  >>> 
  u'The Dormouse's story'

  #7.
  soup.title.parent.name

  >>> 
  u'head'

tag 標籤

  #1.
  soup.p

  >>> 
  <p class="title"><b>The Dormouse's story</b></p>

  #2.
  soup.p['class']

  >>> 
  u'title'

  #3.     
  soup.a

  >>> 
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

  #4. 
  soup.find_all('a')

  >>> 
  [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  #5.      
  soup.find(id="link3")

  >>> 
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> 

深入用法

Attributes

tag[]：一個 tag 可能有很多個屬性，tag  有一個 “class” 的屬性，值為 “boldest” .tag 的属性的操作方法與字典相同。
```
 tag['class']

 >>> 
 u'boldest'
```

tag.attrs 屬性：也可以直接用 .attrs 取屬性。

  tag.attrs

  >>> 
  {u'class': u'boldest'}

get()：若不確定屬性是否存在，可以改用 get() 方法

  print(soup.find(id='p1')['style'])      #會出現錯誤訊息, 因為<p id="p1">沒有style屬性
  print(soup.find(id='p1').get('style'))  #None

tag[屬性]：多值屬性

  css_soup = BeautifulSoup('<p class="body strikeout"></p>')
  css_soup.p['class']

  >>> 
  ["body", "strikeout"]

  css_soup = BeautifulSoup('<p class="body"></p>')
  css_soup.p['class']

  >>> 
  ["body"]

select()

是使用 CSS 選擇器的語法找到 tag。

找出含有 h1 標籤的元素。
```
  header = soup.select('h1')
```
找出所有 id 為 title 的元素。（id 前面需要加 #）
```
  alink = soup.select('#title')
```
找出所有 class 為 link 的元素。（class 前面需要加 .）
```
  alink = soup.select('.link'):
```

find_all()

方法搜索當前 tag 的所有 tag 子節點，並判斷是否符合過濾器的條件。會以遞迴的方式尋找所有的子節點。

tag.string：使用 find_all 找出所有特定的 HTML 標籤節點。

  a_tags = soup.find_all('a')
  for tag in a_tags:
      # 輸出超連結的文字
      print(tag.string)

  >>> 
  Link 1
  Link 2

link.get('href')：找出所有超連結 a 的標籤，get href 屬性。

  for link in soup.find_all('a'):
      print(link.get('href'))

  >>> 
  http://example.com/elsie
  http://example.com/lacie
  http://example.com/tillie

className.get('class')：找出所有標籤內的 class 名稱。

  for className in soup.find_all('p'):
      print(className.get('class'))

  >>> 
  ['title']
  ['story']
  ['story']

find_all([ {tag_a}, {tag_b} ])：同時搜尋多種標籤。

  # 預設會以遞迴搜尋  
  # 搜尋所有超連結與粗體字
  tags = soup.find_all(["a", "b"])
  print(tags)

  >>> 
  [<a href="/my_link1" id="link1">Link 1</a>, <a href="/my_link2" id="link2">Link 2</a>, <b class="boldtext">Bold Text</b>]

find_all( {tag}, recursive=False)：限制 find_all 只找尋次一層的子節點。

  # 不使用遞迴搜尋，僅尋找次一層的子節點
  soup.html.find_all("title", recursive=False)

  >>> 
  []

find_all( {tag_a}, {tag_b} ], limit={數量} )：限制搜尋節點數量。

  #用limit參數指定搜尋節點數量的上限
  tags = soup.find_all(["a", "b"], limit=2)
  print(tags)

  >>> 
  [<a href="/my_link1" id="link1">Link 1</a>, <a href="/my_link2" id="link2">Link 2</a>]

find_all(href=re.compile( {比對內容} ))：以正規表示法比對超連結網址。

  import re 
  links = soup.find_all(href=re.compile("^/my_link\d"))
  print(links)

  >>> 
  [<a href="/my_link1" id="link1">Link 1</a>, <a href="/my_link2" id="link2">Link 2</a>]

find_all("a", href= {href內容} )：搜尋 href 屬性。

  # 搜尋href屬性為 "/my_link1" 的a節點
  a_tag = soup.find_all("a", href="/my_link1")
  print(a_tag)

  >>> 
  [<a href="/my_link1" id="link1">Link 1</a>]

find_all(href=re.compile( {href內容} ), id= {id內容} )：以多個屬性條件來篩選。

  link = soup.find_all(href=re.compile("^/my_link\d"), id="link1")
  print(link)

  >>> 
  [<a href="/my_link1" id="link1">Link 1</a>]

find_all( {tag_b}, class_= {class內容} )：搜尋 class 為 boldtext 的 b 節點。
一個 HTML 標籤元素可以同時有多個 CSS 的 class 屬性值，而我們在以 class_比對時，只要其中一個 class 符合就算比對成功，不過如果多個 class 名稱排列順序不同時，就會失敗，遇到多個 CSS class 的狀況，建議改用 CSS 選擇器來篩選。
```
 b_tag = soup.find_all("b", class_="boldtext")
 print(b_tag)

 >>>
 [Bold Text]
```
```
 # 以正規表示法搜尋 class 屬性
 b_tag = soup.find_all(class_=re.compile("^bold"))
 print(b_tag)

 >>>
 [Bold Text]
```

find_all( {tag_a}, string= {字串} )：以文字內容搜尋。

  links_html = """
  <a id="link1" href="/my_link1">Link One</a>
  <a id="link2" href="/my_link2">Link Two</a>
  <a id="link3" href="/my_link3">Link Three</a>
  """
  soup = BeautifulSoup(links_html, 'html.parser')
    
  # 搜尋文字為「Link One」的超連結
  soup.find_all("a", string="Link One")
  >>> 
  [<a href="/my_link1" id="link1">Link One</a>]
    
  # 以正規表示法搜尋文字為「Link」開頭的超連結
  soup.find_all("a", string=re.compile("^Link"))
  >>> 
  [<a href="/my_link1" id="link1">Link One</a>, <a href="/my_link2" id="link2">Link Two</a>, <a href="/my_link3" id="link3">Link Three</a>]

find()

如果只需要抓出第一個符合條件的節點，可以直接使用 find。

find( {tag} )：只抓出第一個符合條件的節點。

  a_tag = soup.find("a")
  print(a_tag)

  >>> 
  <a href="/my_link1" id="link1">Link 1</a>

find(id= {id內容} )：搜尋 id 屬性為 link2 的節點。

  # 根據id搜尋
  link2_tag = soup.find(id='link2')
  print(link2_tag)

  >>> 
  <a href="/my_link2" id="link2">Link 2</a>

get_text()

取出文字內容。

  print(soup.get_text())

  >>> 
  The Dormouse's story
  The Dormouse's story
  Once upon a time there were three little sisters; and their names were
  Elsie,

向上、向前與向後搜尋

find_all 都是向下搜尋子節點，如果需要向上搜尋父節點的話，可以改用 find_parents 函數。

find_parents( {tag} )：往上層尋找 p 節點。

  html_doc = """
  <body><p class="my_par">
  <a id="link1" href="/my_link1">Link 1</a>
  <a id="link2" href="/my_link2">Link 2</a>
  <a id="link3" href="/my_link3">Link 3</a>
  <a id="link3" href="/my_link4">Link 4</a>
  </p></body>
  """
  soup = BeautifulSoup(html_doc, 'html.parser')
  link2_tag = soup.find(id="link2")
    
  # 往上層尋找 p 節點
  p_tag = link2_tag.find_parents("p")
  print(p_tag)

  >>> 
  [<p class="my_par">
      <a href="/my_link1" id="link1">Link 1</a>
      <a href="/my_link2" id="link2">Link 2</a>
      <a href="/my_link3" id="link3">Link 3</a>
      <a href="/my_link4" id="link3">Link 4</a>
  </p>]

find_previous_siblings( {tag} )：同一層往前尋找特定節點。

  # 在同一層往前尋找a節點
  link_tag = link2_tag.find_previous_siblings("a")
  print(link_tag)

  >>>
  [<a href="/my_link1" id="link1">Link 1</a>]

find_next_siblings( {tag} )：在同一層往後尋找特定節點。

  # 在同一層往後尋找a節點
  link_tag = link2_tag.find_next_siblings("a")
  print(link_tag)

  >>> 
  [<a href="/my_link3" id="link3">Link 3</a>, <a href="/my_link4" id="link3">Link 4</a>]

開啟檔案

from bs4 import BeautifulSoup
# 從檔案讀取HTML程式碼進行解析
with open("index.html") as f:
    soup = BeautifulSoup(f)