https://scrapy.org
Search for the scrapy package. Scrapy supports both Python 2.7 and Python 3; we only need the Python 3 version.
neo@netkiller ~ % apt-cache search scrapy | grep python3
python3-scrapy - Python web scraping and crawling framework (Python 3)
python3-scrapy-djangoitem - Scrapy extension to write scraped items using Django models (Python3 version)
python3-w3lib - Collection of web-related functions (Python 3)
The default scrapy version on Ubuntu 17.04 is 1.3.0-1. If you need the latest release, 1.4.0, install it with the pip command instead.
neo@netkiller ~ % apt search python3-scrapy
Sorting... Done
Full Text Search... Done
python3-scrapy/zesty,zesty 1.3.0-1~exp2 all
  Python web scraping and crawling framework (Python 3)

python3-scrapy-djangoitem/zesty,zesty 1.1.1-1 all
  Scrapy extension to write scraped items using Django models (Python3 version)
Install scrapy
neo@netkiller ~ % sudo apt install python3-scrapy
[sudo] password for neo:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess
  python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama
  python3-constantly python3-cryptography python3-cssselect python3-decorator python3-html5lib
  python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2
  python3-lxml python3-mysqldb python3-openssl python3-pam python3-parsel python3-pexpect
  python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1
  python3-pyasn1-modules python3-pydispatch python3-pygments python3-queuelib python3-serial
  python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets
  python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth python3-webencodings
  python3-zope.interface
Suggested packages:
  python-pexpect-doc python-attr-doc python-cryptography-doc python3-cryptography-vectors
  python3-genshi python3-lxml-dbg python-lxml-doc default-mysql-server | virtual-mysql-server
  python-egenix-mxdatetime python3-mysqldb-dbg python-openssl-doc python3-openssl-dbg
  python3-pam-dbg python-pil-doc python3-pil-dbg doc-base python-pydispatch-doc
  ttf-bitstream-vera python-scrapy-doc python3-wxgtk3.0 | python3-wxgtk python-setuptools-doc
  python3-tk python3-gtk2 python3-glade2 python3-qt4 python3-wxgtk2.8 python3-twisted-bin-dbg
The following NEW packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess
  python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama
  python3-constantly python3-cryptography python3-cssselect python3-decorator python3-html5lib
  python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2
  python3-lxml python3-mysqldb python3-openssl python3-pam python3-parsel python3-pexpect
  python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1
  python3-pyasn1-modules python3-pydispatch python3-pygments python3-queuelib python3-scrapy
  python3-serial python3-service-identity python3-setuptools python3-simplegeneric
  python3-traitlets python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth
  python3-webencodings python3-zope.interface
0 upgraded, 49 newly installed, 0 to remove and 0 not upgraded.
Need to get 7,152 kB of archives.
After this operation, 40.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Type an uppercase "Y" and press Enter.
To get the latest release instead, install pip and then use it to install scrapy:
neo@netkiller ~ % sudo apt install python3-pip
neo@netkiller ~ % pip3 install scrapy
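The apt and pip routes can leave different scrapy versions on the system, and the test spider in this section calls response.follow, which was added in Scrapy 1.4. The following is a minimal sketch for checking which version Python actually imports. The helper names (installed_scrapy_version, version_tuple, supports_follow) are hypothetical and are not part of Scrapy; only scrapy.__version__ comes from the library.

```python
import re


def installed_scrapy_version():
    """Return the version string of the importable scrapy, or None if absent."""
    try:
        import scrapy
    except ImportError:
        return None
    return scrapy.__version__


def version_tuple(version):
    """Turn '1.3.0' into (1, 3, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def supports_follow(version):
    """response.follow is available from Scrapy 1.4 onwards."""
    return version_tuple(version) >= (1, 4)


if __name__ == "__main__":
    version = installed_scrapy_version()
    if version is None:
        print("scrapy is not importable from this Python interpreter")
    elif supports_follow(version):
        print("scrapy", version, "is installed; response.follow is available")
    else:
        print("scrapy", version, "is installed; response.follow needs 1.4 or later")
```

Save it under any file name and run it with python3. If it reports a version below 1.4, the pip installation above is the one to use for the test spider.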
Create a test program to verify that scrapy was installed correctly.
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
EOF
Run the spider
$ scrapy runspider myspider.py
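Run this way, the scraped items only show up in the log output. As a small sketch, assume the spider was instead run with Scrapy's feed export option, for example scrapy runspider myspider.py -o titles.jl, which writes one JSON object per line. The file name titles.jl and the helper load_titles are hypothetical choices made for this illustration. The script below reads that file back and lists the titles.

```python
import json
from pathlib import Path


def load_titles(path):
    """Read a JSON Lines feed and return the non-empty 'title' values."""
    titles = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        title = item.get("title")
        # extract_first() yields None when the selector matches nothing
        if title:
            titles.append(title.strip())
    return titles


if __name__ == "__main__":
    feed = Path("titles.jl")  # assumed output file name
    if feed.exists():
        for title in load_titles(feed):
            print(title)
    else:
        print("titles.jl not found; run the spider with -o titles.jl first")
```

An empty list here, while the crawl itself finished without errors, would suggest that the CSS selectors in the spider no longer match the target page rather than a problem with the installation.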