
Chapter 11. Scrapy - Python web scraping and crawling framework

Contents

11.1. Installing the Scrapy development environment
11.1.1. Mac
11.1.2. Ubuntu
11.1.3. Installing scrapy with pip
11.1.4. Testing scrapy
11.2. scrapy commands
11.2.1.
11.2.2. Creating a new spider
11.2.3. Listing available spiders
11.2.4. Running a spider
11.3. Scrapy Shell
11.3.1. response
11.3.1.1. Current URL
11.3.1.2. status (HTTP status)
11.3.1.3. text (body)
11.3.1.4. css
11.3.1.4.1. Getting HTML attributes
11.3.1.5. xpath
11.3.1.6. headers
11.4. Crawler project
11.4.1. Creating a project
11.4.2. Spider
11.4.2.1. Pagination
11.4.2.2. Saving scraped content to a file
11.4.3. settings.py, the crawler configuration file
11.4.3.1. Ignoring robots.txt rules
11.4.4. Item
11.4.5. Pipeline
11.5. Downloading images
11.5.1. Configuring settings.py
11.5.2. Modifying pipelines.py
11.5.3. Editing items.py
11.5.4. The spider file
11.6. xpath
11.6.1. Logical operators
11.6.1.1. and
11.6.1.2. or
11.6.2. function
11.6.2.1. text()
11.6.2.2. contains()

https://scrapy.org

11.1. Installing the Scrapy development environment

11.1.1. Mac

neo@MacBook-Pro ~ % brew install python3
neo@MacBook-Pro ~ % pip3 install scrapy
			

11.1.2. Ubuntu

Search for the scrapy package. Scrapy supports Python 2.7 and Python 3; we only need the Python 3 version.

neo@netkiller ~ % apt-cache search scrapy | grep python3
python3-scrapy - Python web scraping and crawling framework (Python 3)
python3-scrapy-djangoitem - Scrapy extension to write scraped items using Django models (Python3 version)
python3-w3lib - Collection of web-related functions (Python 3)			
			

On Ubuntu 17.04 the default scrapy version is 1.3.0-1. If you need the latest release, 1.4.0, install it with pip instead.

neo@netkiller ~ % apt search python3-scrapy
Sorting... Done
Full Text Search... Done
python3-scrapy/zesty,zesty 1.3.0-1~exp2 all
  Python web scraping and crawling framework (Python 3)

python3-scrapy-djangoitem/zesty,zesty 1.1.1-1 all
  Scrapy extension to write scraped items using Django models (Python3 version)
			

Install scrapy

neo@netkiller ~ % sudo apt install python3-scrapy
[sudo] password for neo: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
  python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml python3-mysqldb
  python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch
  python3-pygments python3-queuelib python3-serial python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth
  python3-webencodings python3-zope.interface
Suggested packages:
  python-pexpect-doc python-attr-doc python-cryptography-doc python3-cryptography-vectors python3-genshi python3-lxml-dbg python-lxml-doc default-mysql-server | virtual-mysql-server
  python-egenix-mxdatetime python3-mysqldb-dbg python-openssl-doc python3-openssl-dbg python3-pam-dbg python-pil-doc python3-pil-dbg doc-base python-pydispatch-doc ttf-bitstream-vera python-scrapy-doc
  python3-wxgtk3.0 | python3-wxgtk python-setuptools-doc python3-tk python3-gtk2 python3-glade2 python3-qt4 python3-wxgtk2.8 python3-twisted-bin-dbg
The following NEW packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
  python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml python3-mysqldb
  python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch
  python3-pygments python3-queuelib python3-scrapy python3-serial python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets python3-twisted python3-twisted-bin python3-w3lib
  python3-wcwidth python3-webencodings python3-zope.interface
0 upgraded, 49 newly installed, 0 to remove and 0 not upgraded.
Need to get 7,152 kB of archives.
After this operation, 40.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
			

Type an uppercase "Y" and press Enter.

11.1.3. Installing scrapy with pip

neo@netkiller ~ % sudo apt install python3-pip
neo@netkiller ~ % pip3 install scrapy
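
Because apt and pip can install different scrapy versions, it can help to check which version the current Python interpreter actually sees. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is made up for this illustration and is not part of scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("scrapy")
    if found is None:
        print("scrapy is not installed for this interpreter")
    else:
        print(f"scrapy {found} is installed")
```

Run it with the same `python3` that `pip3` belongs to; otherwise the result may describe a different environment.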
			

11.1.4. Testing scrapy

Create a test program to verify that scrapy was installed correctly.

			
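$ cat > myspider.py <<'EOF'
import scrapy


class BlogSpider(scrapy.Spider):
    # Name that identifies this spider
    name = 'blogspider'
    # The crawl starts from these URLs
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield one item per post title found on the page
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow the link to the previous posts and parse that page the same way
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
EOF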
			
			

Run the spider

$ scrapy runspider myspider.py
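
The command above only logs the scraped items to the console. As a rough sketch, the same run can be driven from Python, optionally passing scrapy's `-o` option to write the items to a feed file. The function names `build_runspider_command` and `run_spider` and the file name `titles.json` are assumptions made for this example.

```python
import shlex
import shutil
import subprocess
import sys


def build_runspider_command(spider_file, output_file=None):
    """Build the argument list for `scrapy runspider`, optionally with `-o`."""
    cmd = ["scrapy", "runspider", spider_file]
    if output_file:
        # -o writes the scraped items to a feed file
        cmd += ["-o", output_file]
    return cmd


def run_spider(spider_file, output_file=None):
    """Run the spider in a child process and return its exit code.

    Returns None without running anything if the scrapy executable
    is not found on PATH.
    """
    if shutil.which("scrapy") is None:
        print("scrapy executable not found on PATH", file=sys.stderr)
        return None
    completed = subprocess.run(build_runspider_command(spider_file, output_file))
    return completed.returncode


if __name__ == "__main__":
    # Only print the command here; call run_spider() to actually start the crawl.
    print(shlex.join(build_runspider_command("myspider.py", "titles.json")))
```

Calling `run_spider("myspider.py", "titles.json")` starts the crawl and, assuming the site is reachable and the selectors still match, leaves the collected titles in `titles.json`.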