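A Comprehensive Guide to Web Scraping and PDF Conversion with Python

Introduction

Hello! In this article, I explain in detail how to use Python to extract article links from a specific website, save each article as HTML, and then convert it to PDF. The technique is useful in many situations, such as archiving blog posts, saving online documentation, and collecting research material.

Beyond explaining the code, this article covers each step in detail, the problems you may run into and how to solve them, and ways to extend the script. Whether you are a beginner or an intermediate user of web scraping, you should find something new here.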

Required Environment and Libraries

First, you need the following environment and libraries:

Python 3.7 or later
Selenium
WebDriver Manager
Requests
pdfkit
wkhtmltopdf

Installation

Installing Python: Download and install it from the official Python website.

Installing the required Python libraries:

pip install selenium webdriver-manager requests pdfkit

Installing wkhtmltopdf:
- Windows: download the installer from the official wkhtmltopdf website and run it
- Mac: brew install wkhtmltopdf
- Linux (Debian/Ubuntu): sudo apt-get install wkhtmltopdf
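
Before running the main script, it can help to confirm that everything above is actually installed. The following is a minimal sketch using only the standard library. The helper names missing_modules and find_executable are hypothetical and not part of the original script. Note that the pip package webdriver-manager is imported as webdriver_manager.

```python
import importlib.util
import shutil

# Import names, not pip names: "webdriver-manager" is imported as "webdriver_manager".
REQUIRED_MODULES = ["selenium", "webdriver_manager", "requests", "pdfkit"]


def missing_modules(names):
    """Return the module names that cannot be found in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def find_executable(name):
    """Return the full path of an executable on PATH, or None if it is not found."""
    return shutil.which(name)


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing Python modules:", missing or "none")
    print("wkhtmltopdf on PATH:", find_executable("wkhtmltopdf") or "not found")
```

If wkhtmltopdf is reported as not found even though it is installed, it is probably installed outside PATH, which is the case the script handles by passing an explicit path to pdfkit.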

The Complete Script

The complete script is below. Each part is explained in detail later.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import requests
import os
import time
from urllib.parse import urlparse
import pdfkit

def setup_driver():
    options = Options()
    options.add_argument("--start-maximized")
    return webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

def extract_links(driver, url):
    driver.get(url)
    print(f"Accessing URL: {url}")
    
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
    print("Page loaded successfully")
    
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    print("Scrolled to bottom of page")
    time.sleep(5)
    
    try:
        toc_button = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), '目次')]"))
        )
        driver.execute_script("arguments[0].click();", toc_button)
        print("Table of Contents button clicked")
        time.sleep(5)
    except Exception as e:
        print(f"Error clicking Table of Contents button: {str(e)}")
    
    links = driver.find_elements(By.XPATH, "//a[contains(@class, 'css-17g7y85')]")
    print(f"Found {len(links)} links")
    
    return [(link.text.strip(), link.get_attribute('href')) for link in links]

def save_html(links):
    html_dir = "html_files"
    os.makedirs(html_dir, exist_ok=True)
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    saved_files = []
    for i, (title, url) in enumerate(links, 1):
        print(f"\nProcessing {i}/{len(links)}: {title}")
        try:
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            
            domain = urlparse(url).netloc
            filename = f"{i:02d}_{domain}_{title[:50]}.html"
            filename = "".join(c for c in filename if c.isalnum() or c in (' ', '_', '-', '.')).rstrip()
            file_path = os.path.join(html_dir, filename)
            
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(response.text)
            
            saved_files.append((title, file_path))
            print(f"Saved HTML: {file_path}")
        except Exception as e:
            print(f"Error saving HTML for {title}: {str(e)}")
    
    return saved_files

def convert_to_pdf(saved_files):
    pdf_dir = "pdf_files"
    os.makedirs(pdf_dir, exist_ok=True)
    
    config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
    options = {
        'quiet': '',
        'enable-local-file-access': None
    }
    
    for i, (title, html_path) in enumerate(saved_files, 1):
        print(f"\nConverting to PDF {i}/{len(saved_files)}: {title}")
        try:
            pdf_filename = os.path.splitext(os.path.basename(html_path))[0] + ".pdf"
            pdf_path = os.path.join(pdf_dir, pdf_filename)
            
            pdfkit.from_file(html_path, pdf_path, configuration=config, options=options)
            print(f"Saved PDF: {pdf_path}")
        except Exception as e:
            print(f"Error converting to PDF for {title}: {str(e)}")

def main():
    base_url = "https://hide.ac/magazines/5DMLm3jiO"
    
    try:
        driver = setup_driver()
        links = extract_links(driver, base_url)
        driver.quit()
        
        if links:
            saved_files = save_html(links)
            if saved_files:
                convert_to_pdf(saved_files)
                print("\nAll processes completed!")
            else:
                print("No HTML files were saved. Cannot proceed to PDF conversion.")
        else:
            print("No links found. Exiting.")
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")

if __name__ == "__main__":
    main()

Detailed Walkthrough of the Script

1. Setting up the driver (the `setup_driver()` function)

def setup_driver():
    options = Options()
    options.add_argument("--start-maximized")
    return webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

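This function configures the WebDriver that Selenium uses to control the Chrome browser.

Options() holds the browser options.
The --start-maximized option launches the browser maximized.
ChromeDriverManager().install() automatically downloads a ChromeDriver version that matches the installed Chrome and returns its path, which is passed to Service.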

2. Extracting the links (the `extract_links()` function)

def extract_links(driver, url):
    driver.get(url)
    print(f"Accessing URL: {url}")
    
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
    print("Page loaded successfully")
    
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    print("Scrolled to bottom of page")
    time.sleep(5)
    
    try:
        toc_button = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), '目次')]"))
        )
        driver.execute_script("arguments[0].click();", toc_button)
        print("Table of Contents button clicked")
        time.sleep(5)
    except Exception as e:
        print(f"Error clicking Table of Contents button: {str(e)}")
    
    links = driver.find_elements(By.XPATH, "//a[contains(@class, 'css-17g7y85')]")
    print(f"Found {len(links)} links")
    
    return [(link.text.strip(), link.get_attribute('href')) for link in links]

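This function opens the given URL and extracts the links we need.

WebDriverWait waits up to 30 seconds for the body element to appear in the DOM. This confirms that the page has started rendering, not that all dynamic content has finished loading, which is why fixed time.sleep(5) pauses follow the scroll and the click.
The page is scrolled to the bottom so that any dynamically loaded content is displayed.
The '目次' (table of contents) button is clicked to reveal all article links. If the button cannot be clicked within 20 seconds, the error is printed and extraction continues with whatever links are visible.
An XPath expression selects the link elements that carry a specific class (css-17g7y85). This class name looks auto-generated, so it may change when the site is updated.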
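
The list returned by extract_links() can contain entries whose href is missing or the same article linked more than once. As a minimal sketch, the pairs could be cleaned before downloading. The helper clean_links is hypothetical and not part of the original script.

```python
def clean_links(pairs):
    """Drop entries without an href and remove duplicate URLs, keeping the original order."""
    seen = set()
    cleaned = []
    for title, href in pairs:
        if not href:
            continue  # get_attribute('href') can return None for anchors without a target
        if href in seen:
            continue  # the same article was linked more than once
        seen.add(href)
        cleaned.append((title, href))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("First", "https://example.com/a"),
        ("No target", None),
        ("First again", "https://example.com/a"),
        ("Second", "https://example.com/b"),
    ]
    print(clean_links(sample))
```

Calling clean_links() on the result of extract_links() before save_html() would avoid downloading the same page twice.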

3. Saving the HTML (the `save_html()` function)

def save_html(links):
    html_dir = "html_files"
    os.makedirs(html_dir, exist_ok=True)
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    saved_files = []
    for i, (title, url) in enumerate(links, 1):
        print(f"\nProcessing {i}/{len(links)}: {title}")
        try:
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            
            domain = urlparse(url).netloc
            filename = f"{i:02d}_{domain}_{title[:50]}.html"
            filename = "".join(c for c in filename if c.isalnum() or c in (' ', '_', '-', '.')).rstrip()
            file_path = os.path.join(html_dir, filename)
            
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(response.text)
            
            saved_files.append((title, file_path))
            print(f"Saved HTML: {file_path}")
        except Exception as e:
            print(f"Error saving HTML for {title}: {str(e)}")
    
    return saved_files

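This function saves the HTML content behind each extracted link.

The requests library fetches the HTML content of each URL. It receives only the HTML returned by the server, so content that the page renders with JavaScript will not be in the saved file.
The file name combines the index, the domain name, and the article title (first 50 characters).
Characters other than letters, digits, spaces, '_', '-', and '.' are removed to produce a safe file name.
The HTML content is written to the file as UTF-8.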
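
To see what the file-name rule does in practice, the same logic can be pulled out into a small function and tried on a few titles. This is a sketch: build_filename is a hypothetical name, and the URLs are placeholders, but the rule itself is the one used in save_html().

```python
from urllib.parse import urlparse

# Characters kept in addition to letters and digits, as in save_html().
SAFE_EXTRA_CHARS = (" ", "_", "-", ".")


def build_filename(index, url, title, max_title_len=50):
    """Build '<index>_<domain>_<title>.html' and strip characters unsafe for file names."""
    domain = urlparse(url).netloc
    raw = f"{index:02d}_{domain}_{title[:max_title_len]}.html"
    return "".join(c for c in raw if c.isalnum() or c in SAFE_EXTRA_CHARS).rstrip()


if __name__ == "__main__":
    print(build_filename(1, "https://example.com/articles/1", "Hello: World?"))
    # prints: 01_example.com_Hello World.html
```

Because str.isalnum() is Unicode-aware, Japanese titles survive the filter, and the numeric prefix keeps file names unique even when two titles reduce to the same text.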

4. Converting to PDF (the `convert_to_pdf()` function)

def convert_to_pdf(saved_files):
    pdf_dir = "pdf_files"
    os.makedirs(pdf_dir, exist_ok=True)
    
    config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
    options = {
        'quiet': '',
        'enable-local-file-access': None
    }
    
    for i, (title, html_path) in enumerate(saved_files, 1):
        print(f"\nConverting to PDF {i}/{len(saved_files)}: {title}")
        try:
            pdf_filename = os.path.splitext(os.path.basename(html_path))[0] + ".pdf"
            pdf_path = os.path.join(pdf_dir, pdf_filename)
            
            pdfkit.from_file(html_path, pdf_path, configuration=config, options=options)
            print(f"Saved PDF: {pdf_path}")
        except Exception as e:
            print(f"Error converting to PDF for {title}: {str(e)}")

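This function converts the saved HTML files to PDF.

The pdfkit library converts each HTML file to PDF.
The path to wkhtmltopdf is set and the required options are specified. The path in the script is a Windows install location; change it to match your environment.
For each HTML file, a PDF with the same name (only the extension changes) is generated.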
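
Since the wkhtmltopdf path differs between machines, one option is to look it up instead of hard-coding it. The sketch below checks PATH first and then a list of fallback locations, and also shows how the PDF name is derived from the HTML name. The helpers locate_wkhtmltopdf and pdf_path_for are hypothetical names, not part of the original script.

```python
import os
import shutil


def locate_wkhtmltopdf(fallback_paths=(), name="wkhtmltopdf"):
    """Return a wkhtmltopdf path: PATH lookup first, then the given fallback locations."""
    found = shutil.which(name)
    if found:
        return found
    for path in fallback_paths:
        if os.path.isfile(path):
            return path
    return None  # caller decides how to report a missing installation


def pdf_path_for(html_path, pdf_dir="pdf_files"):
    """Map 'html_files/foo.html' to 'pdf_files/foo.pdf' (only the extension changes)."""
    base = os.path.splitext(os.path.basename(html_path))[0]
    return os.path.join(pdf_dir, base + ".pdf")


if __name__ == "__main__":
    windows_default = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
    print(locate_wkhtmltopdf(fallback_paths=[windows_default]))
    print(pdf_path_for("html_files/01_example.com_Hello World.html"))
```

The returned path can then be passed to pdfkit.configuration(wkhtmltopdf=...) as in convert_to_pdf(). If the function returns None, it is clearer to stop with a message than to let every conversion fail.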

5. Main processing (the `main()` function)

def main():
    base_url = "https://hide.ac/magazines/5DMLm3jiO"
    
    try:
        driver = setup_driver()
        links = extract_links(driver, base_url)
        driver.quit()
        
        if links:
            saved_files = save_html(links)
            if saved_files:
                convert_to_pdf(saved_files)
                print("\nAll processes completed!")
            else:
                print("No HTML files were saved. Cannot proceed to PDF conversion.")
        else:
            print("No links found. Exiting.")
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")

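The main() function controls the overall flow.

It sets up the WebDriver and extracts the links.
It saves an HTML file for each extracted link.
It converts the saved HTML files to PDF.

Each step has its own error handling, so processing continues as far as possible even when a problem occurs.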
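
One gap in main() is that driver.quit() is skipped if extract_links() raises, which can leave a browser window open. A try/finally block closes the browser on both success and failure. The sketch below illustrates the pattern with a stand-in driver so it runs without Selenium. Both run_with_driver and FakeDriver are hypothetical and exist only for this illustration.

```python
def run_with_driver(make_driver, work):
    """Create a driver, run work(driver), and always quit the driver afterwards."""
    driver = make_driver()
    try:
        return work(driver)
    finally:
        driver.quit()  # runs whether work() returned normally or raised


class FakeDriver:
    """Stand-in for a Selenium WebDriver, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    fake = FakeDriver()

    def failing_work(driver):
        raise RuntimeError("simulated failure while extracting links")

    try:
        run_with_driver(lambda: fake, failing_work)
    except RuntimeError as exc:
        print(f"Caught: {exc}")
    print(f"Driver closed: {fake.closed}")
```

In the real script this would be called as run_with_driver(setup_driver, lambda d: extract_links(d, base_url)).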

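Troubleshooting

Problems can occur while the script runs. Common problems and their solutions are listed below:

WebDriver problems
- Symptom: WebDriverException: Message: 'chromedriver' executable needs to be in PATH.
- Solution: Confirm that webdriver_manager is being used. If the problem persists, download ChromeDriver manually and add it to your PATH.
Element not found
- Symptom: NoSuchElementException: Message: no such element: Unable to locate element. In this script, the same cause can also appear as a timeout reported by WebDriverWait or as "Found 0 links".
- Solution: The structure of the website may have changed. Check the XPath expressions and CSS selectors and update them as needed.
PDF conversion errors
- Symptom: OSError: No wkhtmltopdf executable found
- Solution: Confirm that wkhtmltopdf is installed correctly and that its path is set correctly.
Network errors
- Symptom: requests.exceptions.ConnectionError
- Solution: Check your internet connection and implement retry logic if needed.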
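
The retry logic mentioned for network errors could look like the following minimal sketch, which retries a callable with exponential backoff. The helper name retry and its parameters are assumptions for illustration. The sleep function is injectable so the example can be tried without actually waiting.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to simulate a temporary network problem.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    waits = []
    print(retry(flaky, sleep=waits.append), waits)
    # prints: ok [1.0, 2.0]
```

In save_html(), the download could be wrapped as retry(lambda: requests.get(url, headers=headers, timeout=30), exceptions=(requests.exceptions.ConnectionError,)). Passing a timeout is also worthwhile because requests waits indefinitely by default.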

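Applications and Extensions

The script can be adapted to many uses. Some examples:

Scraping multiple websites
- Prepare a list of URLs for different websites and run the script against each one to collect information from multiple sources. Each site will need its own selectors.
Building an archive through scheduled runs
- Use cron (Linux) or Task Scheduler (Windows) to run the script periodically and track changes to a website.
Content filtering
- Add a feature that filters which content is saved based on specific keywords or conditions.
Extracting and saving metadata
- Add a feature that extracts metadata such as each article's author, date, and category and saves it as a CSV or JSON file.
PDF optimization
- Adjust the pdfkit options to customize the size, quality, and layout of the PDFs.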
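
Two of the ideas above, content filtering and saving metadata, can be sketched with the (title, url) pairs that extract_links() already returns. The helpers filter_by_keywords and links_to_json are hypothetical. Richer metadata such as author or date would need site-specific selectors that are not shown here.

```python
import json


def filter_by_keywords(links, keywords):
    """Keep (title, url) pairs whose title contains at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [(t, u) for t, u in links if any(k in t.lower() for k in lowered)]


def links_to_json(links):
    """Serialize (title, url) pairs as a JSON array of records with a 1-based index."""
    records = [{"index": i, "title": t, "url": u} for i, (t, u) in enumerate(links, 1)]
    # ensure_ascii=False keeps Japanese titles readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    links = [
        ("Python入門", "https://example.com/1"),
        ("Cooking notes", "https://example.com/2"),
        ("Advanced python tips", "https://example.com/3"),
    ]
    selected = filter_by_keywords(links, ["python"])
    print(links_to_json(selected))
```

Writing the JSON string to a file next to the HTML and PDF folders gives a simple index of what was archived.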

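Security and Ethical Considerations

Keep the following points in mind when web scraping:

Check the terms of use: Review the target website's terms of use and robots.txt to confirm that scraping is permitted.
Limit access frequency: Space out your requests appropriately so that you do not place excessive load on the server.
Handle personal information carefully: If the collected data includes personal information, process it appropriately and anonymize it where necessary.
Respect copyright: Respect the copyright of the collected content; quote it properly or obtain permission.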
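
The first two points can be partly automated. The sketch below uses the standard library's urllib.robotparser to skip URLs that robots.txt disallows and pauses between the remaining ones. The helper names, the two-second delay, and the sample robots.txt content are assumptions for illustration. The rules are parsed from a string here so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_iter(urls, parser, user_agent="*", delay=2.0, sleep=time.sleep):
    """Yield only URLs allowed by robots.txt, pausing `delay` seconds between them."""
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt
        if not first:
            sleep(delay)
        first = False
        yield url


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    parser = build_robot_parser(rules)
    urls = [
        "https://example.com/a",
        "https://example.com/private/b",
        "https://example.com/c",
    ]
    for url in polite_iter(urls, parser, delay=0.1):
        print("Would fetch:", url)
```

Against a live site, the parser would instead be pointed at the real file with set_url() and read(). Passing robots.txt does not replace reading the terms of use.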

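Conclusion

This article explained in detail how to use Python to extract articles from a website, save them as HTML, and convert them to PDF. The technique is useful in many situations, such as gathering information, building archives, and collecting research material.

By understanding each part of the script and customizing it as needed, you can build a tool that fits your specific needs. The world of web scraping is vast, and there are always new challenges.

Finally, always keep ethical considerations in mind when web scraping. It is important to respect the rights of website owners and to handle the data you collect responsibly.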

Compiling Hide articles into PDFs (practice on my own magazine)