Python Web Scraping: A Hands-On Introduction

Course Units

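Environment Setup and a First Look at Web Scraping

1-1. Environment setup: installing Python and using venv
1-2. Using Visual Studio Code
1-3. Browser-based development environments
1-4. Deconstructing web documents and a first look at web scraping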
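
As a rough illustration of what "deconstructing a web document" in unit 1-4 can look like, the sketch below walks an HTML string with the standard library's `html.parser` and collects every link. The sample page, the `LinkCollector` class, and the `extract_links` helper are made up for this example and are not taken from the course material.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this text first.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Headlines</h1>
    <a href="/news/1">First story</a>
    <a href="/news/2">Second story</a>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> tag being read, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

The same tag, attribute, and text structure is what a library such as Beautiful Soup exposes through a more convenient interface.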

Using Beautiful Soup

2-1. Don't reinvent the wheel: before you write a scraper
2-2. Using BeautifulSoup: locating tag elements
2-3. Using BeautifulSoup: traversing the page structure
2-4. Regular expressions
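
Unit 2-4 covers regular expressions, which are often used to pull structured values out of text once it has been extracted from a page. The following is a minimal sketch using the standard `re` module; the sample text, the two patterns, and the helper names are assumptions chosen for illustration.

```python
import re

# Hypothetical text, standing in for content already extracted from a page.
SAMPLE_TEXT = "Posted 2024-03-15, updated 2024-04-01. Price: NT$1,299 (was NT$1,590)."

# ISO-style dates such as 2024-03-15, captured as year, month, day.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

# Prices written as NT$ followed by digits with optional thousands separators.
PRICE_PATTERN = re.compile(r"NT\$(\d{1,3}(?:,\d{3})*)")


def find_dates(text):
    """Return every date in the text as a (year, month, day) tuple of ints."""
    return [tuple(int(part) for part in m.groups()) for m in DATE_PATTERN.finditer(text)]


def find_prices(text):
    """Return every NT$ price in the text as an int, with separators removed."""
    return [int(m.group(1).replace(",", "")) for m in PRICE_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_dates(SAMPLE_TEXT))
    print(find_prices(SAMPLE_TEXT))
```

Compiling the patterns once and reusing them keeps the helpers short and avoids repeating the pattern strings.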

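Hands-On Web Scraping Examples

3-1. Liberty Times (自由時報): today's popular news
3-2. EBC News (東森新聞): today's popular news
3-3. WordPress blog posts
3-4. momo shopping site: search results
3-5. Yahoo Kimo Movies: new releases this week
3-6. PTT Gossiping board: today's popular posts
3-7. GitHub repositories list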

Using APIs

4-1. Introduction to APIs
4-2. Fetching blog posts with the WordPress API
4-3. Liberty Times Net API
4-4. IMDB API
4-5. GitHub API: repositories list
4-6. YouTube Data API: channel view counts and video lists
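
The API units share a common pattern: build a URL with query parameters, request it, and decode the JSON response. The sketch below shows that pattern with the standard library only. The endpoint `https://example.com/api/posts`, the parameter names, and the response fields are hypothetical and do not describe any of the real APIs listed above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only to illustrate the request pattern.
API_BASE = "https://example.com/api/posts"

# Hypothetical response body, shaped like a typical JSON list of records.
SAMPLE_RESPONSE = (
    '[{"id": 1, "title": "Hello", "views": 120},'
    ' {"id": 2, "title": "World", "views": 45}]'
)


def build_url(base, **params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base}?{urlencode(params)}"


def parse_posts(body):
    """Decode a JSON list of posts into (id, title) tuples."""
    return [(item["id"], item["title"]) for item in json.loads(body)]


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # No network call here: the sample response stands in for a real one.
    print(build_url(API_BASE, page=1, per_page=10))
    print(parse_posts(SAMPLE_RESPONSE))
```

Real APIs usually add authentication (an API key or token) and pagination, and those details differ per service.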

Data Storage

5-1. Saving as JSON and CSV files
5-2. Saving images (downloading images from the PTT Beauty board)
5-3. Saving data to a SQLite database
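
To make the three storage targets concrete, here is a minimal sketch that serializes the same records to JSON, to CSV, and into a SQLite table using only the standard library. The record fields, the `articles` table, and the function names are assumptions made for this example, not the course's own schema.

```python
import csv
import io
import json
import sqlite3

# Hypothetical scraped records.
RECORDS = [
    {"title": "First story", "url": "https://example.com/1"},
    {"title": "第二則新聞", "url": "https://example.com/2"},
]


def to_json(records):
    """Serialize records to JSON; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_sqlite(records, conn):
    """Insert records into an 'articles' table, skipping URLs already stored."""
    conn.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT UNIQUE)")
    conn.executemany(
        "INSERT OR IGNORE INTO articles (title, url) VALUES (:title, :url)",
        records,
    )
    conn.commit()


if __name__ == "__main__":
    print(to_json(RECORDS))
    print(to_csv(RECORDS))
    connection = sqlite3.connect(":memory:")  # use a file path to persist data
    to_sqlite(RECORDS, connection)
    print(connection.execute("SELECT COUNT(*) FROM articles").fetchone()[0])
```

When writing CSV to a real file, open it with `newline=""` and an explicit encoding such as `utf-8` so that line endings and non-ASCII text are handled consistently.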

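Scraping Dynamic Websites

6-1. Bank of Taiwan foreclosed properties: using Selenium
6-2. PCHome search: using Selenium and analyzing the API endpoint
6-3. Taiwan Stock Exchange daily closing quotes: using Selenium and analyzing the API endpoint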

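Lessons Learned from Writing Scrapers

Common reasons for getting blocked: timing, policy violation (robots.txt)
Commonly used header fields, hidden fields on websites
Using proxy servers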
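
The first two points, timing and robots.txt policy, can be sketched with the standard library's `urllib.robotparser`. The example below parses a robots.txt text, skips disallowed URLs, and waits between requests. The bot name, the sample rules, the header values, and the helper functions are all hypothetical, and the `fetch` callable is supplied by the caller, so no network access happens here.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-course-bot"  # hypothetical bot name

# Hypothetical robots.txt content; a real crawler would download the site's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Example request headers; the values are illustrative only.
REQUEST_HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept-Language": "zh-TW,zh;q=0.9,en;q=0.8",
}


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url):
    """Return True if the rules allow USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(parser, default=1.0):
    """Return the Crawl-delay in seconds, or a default when none is declared."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def crawl(urls, parser, fetch, sleep=time.sleep):
    """Fetch each allowed URL with the given callable, pausing after each request."""
    results = []
    delay = polite_delay(parser)
    for url in urls:
        if not allowed(parser, url):
            continue  # respect the site's policy and skip this URL
        results.append(fetch(url))
        sleep(delay)
    return results
```

Passing `sleep` as a parameter makes the pacing easy to replace in tests; in normal use the default `time.sleep` applies the delay between requests.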

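Notes

Sample code is in each chapter's directory; lecture notes are in the lecture directory
Install the packages the sample code requires with pip install -r requirements.txt (Python 3)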

About

Python Web Scraping: A Hands-On Introduction

compthinking.dev/courses/py-web-scraping