Scraping with BeautifulSoup (1) | IT工房 | Introduction to AI and Web Development

BeautifulSoup makes scraping easy.
The name comes from a poem that appears in Alice in Wonderland.
It certainly is a delicious soup.

Installing beautifulsoup4

Install it with pip from a terminal or similar shell.

pip install beautifulsoup4

On Colaboratory it appears to be preinstalled, so you can use it just by importing it.
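A quick way to confirm that the package is available in a given environment is to import it and print its version. This is only a minimal check and assumes nothing beyond the bs4 package itself.

```python
# Confirm that Beautiful Soup is importable and report which version is installed.
import bs4

print(bs4.__version__)
```

If the import fails with ModuleNotFoundError, the pip command above has not been run in that environment.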

Usage

Imports

BeautifulSoup itself is imported with "from bs4 import BeautifulSoup", but opening a URL and handling exceptions require a few other modules as well.

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

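Using find_all()

When scraping, you usually want to collect one specific part of a web page.
The find functions are how you locate that part.

find() returns only the first element that matches its arguments.
find_all() returns every element that matches its arguments.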
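The difference between the two can be seen in a small sketch. The HTML snippet below is invented for this illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this illustration.
html = """
<ul>
  <li class="mytitle">First post</li>
  <li class="mytitle">Second post</li>
  <li class="mydate">2020-01-01</li>
</ul>
"""

bs = BeautifulSoup(html, 'html.parser')

first = bs.find('li')          # a single Tag, or None when nothing matches
all_items = bs.find_all('li')  # a list-like ResultSet holding every match

print(first.get_text())                     # First post
print([li.get_text() for li in all_items])  # ['First post', 'Second post', '2020-01-01']
print(bs.find('table'))                     # None
print(bs.find_all('table'))                 # []
```

Because find() can return None, code that calls methods on its result should check for None first.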

Arguments of find_all()

find_all() takes the following arguments.
find() takes the same arguments, except that it has no limit.

find_all(tag, attributes, recursive, text, limit, keywords)

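tag : the name of the tag to search for (in the library itself this parameter is called name).
attributes : the attributes to match (in the library itself this parameter is called attrs). For the class attribute, for example, write {'class': {'mydate', 'mytitle'}}. To specify several values, enclose them in {}.
recursive : if True (the default), the search covers all descendants in the DOM.
If False, only the direct children are checked.
text : matches elements whose text content equals the given value. In newer versions this argument is named string.
limit : used only by find_all(). Specifies how many matching elements to retrieve, counting from the start of the page.
keywords : selects elements by attribute, written as keyword arguments.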
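The arguments listed above can be tried out on a small document. The following is a sketch on invented markup; the ids and class names are assumptions made for the example, and the string argument assumes a reasonably recent Beautiful Soup release.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<div id="top">
  <p class="mydate">2020-01-01</p>
  <p class="mytitle">Hello</p>
  <section><p class="other">Nested</p></section>
</div>
"""
bs = BeautifulSoup(html, 'html.parser')
top = bs.find('div')

# attributes: a set of values matches elements that carry any one of them
dated_or_titled = bs.find_all('p', {'class': {'mydate', 'mytitle'}})

# recursive=False: look only at the direct children of `top`
direct = top.find_all('p', recursive=False)

# string (formerly text): match on the element's text content
hello = bs.find_all('p', string='Hello')

# limit: stop after the first N matches in document order
first_two = bs.find_all('p', limit=2)

for label, result in [('attributes', dated_or_titled), ('recursive', direct),
                      ('string', hello), ('limit', first_two)]:
    print(label, [p.get_text() for p in result])
```

The nested paragraph inside the section element is found by an ordinary search but skipped when recursive=False is used.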

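The keywords argument is a little tricky to use. An example is id='title'.
However, this is equivalent to specifying attributes: find_all(id='title') returns the same elements as find_all(attrs={'id': 'title'}).

The word "class" is a reserved word in Python, so it cannot be used as a keyword argument name.
When you specify a CSS class through attributes, the key must be written in quotation marks, as 'class'.
When you specify it as a keyword, you can write "class_" to refer to the CSS class.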
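The equivalence of the two forms can be checked directly. This is a small sketch on an invented snippet; the id values are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = '<h1 id="title">Site</h1><p id="body">Text</p>'
bs = BeautifulSoup(html, 'html.parser')

by_keyword = bs.find_all(id='title')           # keyword form
by_attrs = bs.find_all(attrs={'id': 'title'})  # attributes form

print(by_keyword == by_attrs)  # True
```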
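Both ways of selecting by CSS class are shown side by side in the sketch below, again on invented markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = '<ul><li class="mytitle">A</li><li class="mydate">B</li></ul>'
bs = BeautifulSoup(html, 'html.parser')

# Writing bs.find_all('li', class='mytitle') would be a SyntaxError,
# because class is a reserved word.
via_attrs = bs.find_all('li', {'class': 'mytitle'})  # quoted key in a dict
via_keyword = bs.find_all('li', class_='mytitle')    # trailing underscore

print([li.get_text() for li in via_attrs])    # ['A']
print([li.get_text() for li in via_keyword])  # ['A']
```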

BeautifulSoup examples

The first example retrieves the text of the "li" elements whose class is "mytitle" on the IT工房 home page.
It also includes exception handling in case some error occurs.

def getTitle(url):
  # Fetch the page; give up if the server returns an HTTP error
  try:
    html = urlopen(url)
  except HTTPError:
    return None
  # Parse the HTML and collect every <li class="mytitle"> element
  try:
    bs = BeautifulSoup(html,'html.parser')
    nameList = bs.find_all('li',{'class':'mytitle'})
  except AttributeError:
    return None
  return nameList

nameList = getTitle('https://itstudio.co')
if nameList is None:
  print('Title could not be found')
else:
  for name in nameList:
    print(name.get_text())
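HTTPError only covers the case where the server answers with an error status. As a possible extension, the sketch below also handles an unreachable server and a malformed URL. The function name fetch_soup and the 10-second timeout are assumptions made for this illustration, not part of the original example.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup


def fetch_soup(url):
    """Return the parsed page, or None when the page cannot be retrieved.

    fetch_soup is a hypothetical helper written for this sketch.
    """
    try:
        with urlopen(url, timeout=10) as response:
            return BeautifulSoup(response.read(), 'html.parser')
    except HTTPError:
        # The server answered, but with an error status such as 404 or 500.
        return None
    except URLError:
        # No usable response was obtained, e.g. the host name did not resolve.
        return None
    except ValueError:
        # urlopen rejects strings that are not URLs, e.g. a missing scheme.
        return None


print(fetch_soup('not-a-url'))  # None
```

HTTPError is a subclass of URLError, so it has to be listed first if the two cases are to be told apart.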

The next example retrieves only the first three "li" elements with the class "mytitle" from the IT工房 home page.

def getTitle(url):
  # Fetch the page; give up if the server returns an HTTP error
  try:
    html = urlopen(url)
  except HTTPError:
    return None
  # limit=3 stops the search after the first three matches
  try:
    bs = BeautifulSoup(html,'html.parser')
    nameList = bs.find_all('li',{'class':'mytitle'},limit=3)
  except AttributeError:
    return None
  return nameList

nameList = getTitle('https://itstudio.co')
if nameList is None:
  print('Title could not be found')
else:
  for name in nameList:
    print(name.get_text())

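The next example retrieves the h2 elements on the IT工房 home page whose content is "最近の投稿（Lecture）".
Admittedly, there is little point in retrieving content you already know, but it could be useful as a starting point for traversing the document.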

def getTitle(url):
  # Fetch the page; give up if the server returns an HTTP error
  try:
    html = urlopen(url)
  except HTTPError:
    return None
  # string= (called text= in older versions) matches on the text content
  try:
    bs = BeautifulSoup(html,'html.parser')
    nameList = bs.find_all('h2',string='最近の投稿（Lecture）')
  except AttributeError:
    return None
  return nameList

nameList = getTitle('https://itstudio.co')
if nameList is None:
  print('Title could not be found')
else:
  for name in nameList:
    print(name.get_text())
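The idea of using a known heading as a starting point for traversal might look like the following sketch. The markup is invented for this illustration, and the real structure of the page may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
html = """
<h2>最近の投稿（Lecture）</h2>
<ul>
  <li class="mytitle">Post A</li>
  <li class="mytitle">Post B</li>
</ul>
<h2>Other</h2>
<ul><li class="mytitle">Post C</li></ul>
"""
bs = BeautifulSoup(html, 'html.parser')

titles = []
heading = bs.find('h2', string='最近の投稿（Lecture）')
if heading is not None:
    # find_next_sibling() skips the whitespace text nodes between tags
    post_list = heading.find_next_sibling('ul')
    if post_list is not None:
        titles = [li.get_text() for li in post_list.find_all('li', class_='mytitle')]

print(titles)  # ['Post A', 'Post B']
```

Only the list that directly follows the matched heading is read, so "Post C" under the other heading is left out.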