https://scrapy.org
Search for the scrapy package. Scrapy supports both Python 2.7 and Python 3; we only need the Python 3 version.
neo@netkiller ~ % apt-cache search scrapy | grep python3
python3-scrapy - Python web scraping and crawling framework (Python 3)
python3-scrapy-djangoitem - Scrapy extension to write scraped items using Django models (Python3 version)
python3-w3lib - Collection of web-related functions (Python 3)
Ubuntu 17.04 ships scrapy 1.3.0-1 by default. If you need the latest version, 1.4.0, install it with pip instead.
neo@netkiller ~ % apt search python3-scrapy
Sorting... Done
Full Text Search... Done
python3-scrapy/zesty,zesty 1.3.0-1~exp2 all
  Python web scraping and crawling framework (Python 3)

python3-scrapy-djangoitem/zesty,zesty 1.1.1-1 all
  Scrapy extension to write scraped items using Django models (Python3 version)
Install scrapy
neo@netkiller ~ % sudo apt install python3-scrapy
[sudo] password for neo:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr
  python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
  python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna
  python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml
  python3-mysqldb python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare
  python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules
  python3-pydispatch python3-pygments python3-queuelib python3-serial python3-service-identity
  python3-setuptools python3-simplegeneric python3-traitlets python3-twisted python3-twisted-bin
  python3-w3lib python3-wcwidth python3-webencodings python3-zope.interface
Suggested packages:
  python-pexpect-doc python-attr-doc python-cryptography-doc python3-cryptography-vectors
  python3-genshi python3-lxml-dbg python-lxml-doc default-mysql-server | virtual-mysql-server
  python-egenix-mxdatetime python3-mysqldb-dbg python-openssl-doc python3-openssl-dbg
  python3-pam-dbg python-pil-doc python3-pil-dbg doc-base python-pydispatch-doc ttf-bitstream-vera
  python-scrapy-doc python3-wxgtk3.0 | python3-wxgtk python-setuptools-doc python3-tk python3-gtk2
  python3-glade2 python3-qt4 python3-wxgtk2.8 python3-twisted-bin-dbg
The following NEW packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr
  python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
  python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna
  python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml
  python3-mysqldb python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare
  python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules
  python3-pydispatch python3-pygments python3-queuelib python3-scrapy python3-serial
  python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets
  python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth python3-webencodings
  python3-zope.interface
0 upgraded, 49 newly installed, 0 to remove and 0 not upgraded.
Need to get 7,152 kB of archives.
After this operation, 40.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Type an uppercase "Y" and press Enter.
Alternatively, install the latest release with pip:
neo@netkiller ~ % sudo apt install python3-pip
neo@netkiller ~ % pip3 install scrapy
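Whichever route was used, it can help to confirm which Scrapy version the Python interpreter actually sees. The following is a minimal sketch that uses only the standard library module `importlib.metadata`, which requires Python 3.8 or newer (newer than the 3.6 that appears in the logs in this document). The helper name `installed_version` is invented for this illustration; `Scrapy` is the distribution name used on PyPI.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("Scrapy")
if version is None:
    print("Scrapy is not installed for this interpreter")
else:
    print("Scrapy", version)
```

The check only reads package metadata, so it does not import Scrapy itself and does nothing else when the package is missing.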
Create a test program to verify that the scrapy installation works.
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
EOF
Run the spider
$ scrapy runspider myspider.py
Scrapy Shell is an interactive command-line debugging tool for crawlers. You can use it to analyze the pages you want to crawl.
neo@MacBook-Pro /tmp % scrapy shell http://www.netkiller.cn
2017-09-01 15:23:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-09-01 15:23:05 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-01 15:23:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-01 15:23:05 [scrapy.core.engine] INFO: Spider opened
2017-09-01 15:23:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netkiller.cn> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x103b2afd0>
[s]   item       {}
[s]   request    <GET http://www.netkiller.cn>
[s]   response   <200 http://www.netkiller.cn>
[s]   settings   <scrapy.settings.Settings object at 0x1049019e8>
[s]   spider     <DefaultSpider 'default' at 0x104be2a90>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>
response is the page returned by the crawler. You can extract the content you need from it with methods such as css() and xpath().
The css() method selects HTML elements using CSS selectors.
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Netkiller ebook - Linux ebook</ti'>]
>>> response.css('title').extract()
['<title>Netkiller ebook - Linux ebook</title>']
>>> response.css('title::text').extract()
['Netkiller ebook - Linux ebook']
Selecting by class
>>> response.css('a.ulink')[1].extract()
'<a class="ulink" href="http://netkiller.github.io/" target="_top">http://netkiller.github.io</a>'
>>> response.css('a.ulink::text')[3].extract()
'http://netkiller.sourceforge.net'
Working with lists
>>> response.css('a::text').extract_first()
'簡體中文'
>>> response.css('a::text')[1].extract()
'繁體中文'
>>> response.css('div.blockquote')[1].css('a.ulink::text').extract()
['Netkiller Architect 手札', 'Netkiller Developer 手札', 'Netkiller PHP 手札', 'Netkiller Python 手札', 'Netkiller Testing 手札', 'Netkiller Java 手札', 'Netkiller Cryptography 手札', 'Netkiller Linux 手札', 'Netkiller FreeBSD 手札', 'Netkiller Shell 手札', 'Netkiller Security 手札', 'Netkiller Web 手札', 'Netkiller Monitoring 手札', 'Netkiller Storage 手札', 'Netkiller Mail 手札', 'Netkiller Docbook 手札', 'Netkiller Project 手札', 'Netkiller Database 手札', 'Netkiller PostgreSQL 手札', 'Netkiller MySQL 手札', 'Netkiller NoSQL 手札', 'Netkiller LDAP 手札', 'Netkiller Network 手札', 'Netkiller Cisco IOS 手札', 'Netkiller H3C 手札', 'Netkiller Multimedia 手札', 'Netkiller Perl 手札', 'Netkiller Amateur Radio 手札']
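The two access styles above behave differently when nothing matches: extract_first() returns None for an empty result, whereas indexing with [0] or [1] raises IndexError. The plain-Python sketch below mirrors that difference with ordinary lists. The helper `first_or_default` is a name invented for this illustration and is not part of Scrapy.

```python
def first_or_default(values, default=None):
    """Return the first element, or `default` when the list is empty."""
    return values[0] if values else default


links = ['簡體中文', '繁體中文']
empty = []

print(first_or_default(links))   # first element
print(first_or_default(empty))   # no exception; falls back to the default
print(links[1])                  # plain indexing works while the element exists

try:
    empty[0]
except IndexError:
    print('indexing an empty result raises IndexError')
```

In a spider, this is a reason to prefer extract_first() for optional fields and to reserve indexing for elements that are known to exist.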
Regular expressions
>>> response.css('title::text').re(r'Netkiller.*')
['Netkiller ebook - Linux ebook']
>>> response.css('title::text').re(r'N\w+')
['Netkiller']
>>> response.css('title::text').re(r'(\w+) (\w+)')
['Netkiller', 'ebook', 'Linux', 'ebook']
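The last result may look surprising: a pattern with two groups yields one flat list rather than pairs. As a rough illustration using only the standard library `re` module (this is not Scrapy's own implementation), the same shape can be reproduced by flattening the tuples that `re.findall` returns. The helper `flatten_groups` is a name invented for this sketch.

```python
import re

title = 'Netkiller ebook - Linux ebook'


def flatten_groups(pattern, text):
    """Mimic the flat list shape shown above: one entry per captured group."""
    flat = []
    for match in re.findall(pattern, text):
        # findall yields a str for zero or one group, a tuple for several groups
        if isinstance(match, tuple):
            flat.extend(match)
        else:
            flat.append(match)
    return flat


print(re.findall(r'(\w+) (\w+)', title))      # one tuple of groups per match
print(flatten_groups(r'(\w+) (\w+)', title))  # the flat shape seen in the shell
print(flatten_groups(r'N\w+', title))
```

If the pairing matters, keep the tuples from `re.findall` on the extracted text instead of relying on the flat list.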
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Netkiller ebook - Linux ebook</ti'>]
>>> response.xpath('//title/text()').extract_first()
'Netkiller ebook - Linux ebook'
Results from xpath() also support the re() method for regular expression processing.
>>> response.xpath('//title/text()').re(r'(\w+)')
['Netkiller', 'ebook', 'Linux', 'ebook']
>>> response.xpath('//img/@src').extract()
['graphics/spacer.gif', 'graphics/note.gif', 'graphics/by-nc-sa.png', '/images/weixin.jpg', 'images/neo.jpg', '/images/weixin.jpg']
>>> response.xpath('//a/@href')[0].extract()
'http://netkiller.github.io/'
>>> response.xpath('//a/text()')[0].extract()
'簡體中文'
>>> response.xpath('//div[@class="blockquote"]')[1].css('a.ulink::text').extract()
['Netkiller Architect 手札', 'Netkiller Developer 手札', 'Netkiller PHP 手札', 'Netkiller Python 手札', 'Netkiller Testing 手札', 'Netkiller Java 手札', 'Netkiller Cryptography 手札', 'Netkiller Linux 手札', 'Netkiller FreeBSD 手札', 'Netkiller Shell 手札', 'Netkiller Security 手札', 'Netkiller Web 手札', 'Netkiller Monitoring 手札', 'Netkiller Storage 手札', 'Netkiller Mail 手札', 'Netkiller Docbook 手札', 'Netkiller Project 手札', 'Netkiller Database 手札', 'Netkiller PostgreSQL 手札', 'Netkiller MySQL 手札', 'Netkiller NoSQL 手札', 'Netkiller LDAP 手札', 'Netkiller Network 手札', 'Netkiller Cisco IOS 手札', 'Netkiller H3C 手札', 'Netkiller Multimedia 手札', 'Netkiller Perl 手札', 'Netkiller Amateur Radio 手札']
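To experiment with the same "select a block by attribute, then index into the result" idea without a live page, Python's standard library `xml.etree.ElementTree` offers a small XPath subset. The sketch below is a stand-in and not Scrapy's selector: ElementTree needs well-formed XML and has no `text()` or `@href` result steps, so text and attributes are read from the element objects instead. The `PAGE` markup and the "Mirror" link text are invented for this example; the two book entries reuse names and hrefs that appear in this document's output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page structure used above.
PAGE = """
<html>
  <body>
    <div class="blockquote">
      <a class="ulink" href="http://netkiller.github.io/">Mirror</a>
    </div>
    <div class="blockquote">
      <a class="ulink" href="../architect/index.html">Netkiller Architect 手札</a>
      <a class="ulink" href="../developer/index.html">Netkiller Developer 手札</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Same idea as //div[@class="blockquote"] followed by Python indexing [1]
blocks = root.findall(".//div[@class='blockquote']")
links = blocks[1].findall(".//a[@class='ulink']")

names = [a.text for a in links]          # counterpart of a.ulink::text
hrefs = [a.get('href') for a in links]   # counterpart of @href
print(names)
print(hrefs)
```

Note that the `[1]` here is ordinary zero-based Python list indexing, as in the shell session, and not an XPath position predicate.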
Create a crawler project
scrapy startproject project
Before crawling, you need to create a new Scrapy project.
neo@MacBook-Pro ~/Documents % scrapy startproject crawler
New Scrapy project 'crawler', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/neo/Documents/crawler

You can start your first spider with:
    cd crawler
    scrapy genspider example example.com
neo@MacBook-Pro ~/Documents % cd crawler
neo@MacBook-Pro ~/Documents/crawler % find .
.
./crawler
./crawler/__init__.py
./crawler/__pycache__
./crawler/items.py
./crawler/middlewares.py
./crawler/pipelines.py
./crawler/settings.py
./crawler/spiders
./crawler/spiders/__init__.py
./crawler/spiders/__pycache__
./scrapy.cfg
A Scrapy project directory mainly consists of the following files:
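scrapy.cfg: deploy configuration file for the project
middlewares.py: the project's middlewares file
items.py: the project's items file
pipelines.py: the project's pipelines file
settings.py: the project's settings file
spiders: directory where the spiders are placed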
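To make the role of settings.py more concrete, here is roughly what the active part of the generated file looks like for this project. The four values are taken from the "Overridden settings" line in the crawl log later in this section. The generated file also contains many commented-out options that are omitted here, so treat this as an excerpt and not the full file.

```python
# crawler/settings.py (excerpt): values as reported under "Overridden settings"
BOT_NAME = 'crawler'

# Where Scrapy looks for spiders, and where genspider creates new ones
SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'

# Fetch and respect robots.txt before crawling a site
ROBOTSTXT_OBEY = True
```

With ROBOTSTXT_OBEY enabled, the first request in the crawl log goes to /robots.txt before the start URL is fetched.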
Create a spider named netkiller that crawls netkiller.cn.
neo@MacBook-Pro ~/Documents/crawler % scrapy genspider netkiller netkiller.cn
Created spider 'netkiller' using template 'basic' in module:
  crawler.spiders.netkiller
neo@MacBook-Pro ~/Documents/crawler % find .
.
./crawler
./crawler/__init__.py
./crawler/__pycache__
./crawler/__pycache__/__init__.cpython-36.pyc
./crawler/__pycache__/settings.cpython-36.pyc
./crawler/items.py
./crawler/middlewares.py
./crawler/pipelines.py
./crawler/settings.py
./crawler/spiders
./crawler/spiders/__init__.py
./crawler/spiders/__pycache__
./crawler/spiders/__pycache__/__init__.cpython-36.pyc
./crawler/spiders/netkiller.py
./scrapy.cfg
Open crawler/spiders/netkiller.py and change its contents as follows:
# -*- coding: utf-8 -*-
import scrapy


class NetkillerSpider(scrapy.Spider):
    name = 'netkiller'
    allowed_domains = ['netkiller.cn']
    start_urls = ['http://www.netkiller.cn/']

    def parse(self, response):
        # The second div.blockquote on the page holds the list of e-books
        for link in response.xpath('//div[@class="blockquote"]')[1].css('a.ulink'):
            # self.log('This url is %s' % link)
            # extract() returns a list, so each field is a one-element list
            yield {
                'name': link.css('a::text').extract(),
                'url': link.css('a.ulink::attr(href)').extract()
            }
Run the spider
neo@MacBook-Pro ~/Documents/crawler % scrapy crawl netkiller -o output.json
2017-09-08 11:42:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: crawler)
2017-09-08 11:42:30 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'FEED_FORMAT': 'json', 'FEED_URI': 'output.json', 'NEWSPIDER_MODULE': 'crawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['crawler.spiders']}
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-08 11:42:30 [scrapy.core.engine] INFO: Spider opened
2017-09-08 11:42:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-08 11:42:30 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-08 11:42:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netkiller.cn/robots.txt> (referer: None)
2017-09-08 11:42:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netkiller.cn/> (referer: None)
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Architect 手札'], 'url': ['../architect/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Developer 手札'], 'url': ['../developer/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller PHP 手札'], 'url': ['../php/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Python 手札'], 'url': ['../python/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Testing 手札'], 'url': ['../testing/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Java 手札'], 'url': ['../java/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Cryptography 手札'], 'url': ['../cryptography/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Linux 手札'], 'url': ['../linux/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller FreeBSD 手札'], 'url': ['../freebsd/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Shell 手札'], 'url': ['../shell/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Security 手札'], 'url': ['../security/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Web 手札'], 'url': ['../www/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Monitoring 手札'], 'url': ['../monitoring/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Storage 手札'], 'url': ['../storage/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Mail 手札'], 'url': ['../mail/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Docbook 手札'], 'url': ['../docbook/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Project 手札'], 'url': ['../project/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Database 手札'], 'url': ['../database/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller PostgreSQL 手札'], 'url': ['../postgresql/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller MySQL 手札'], 'url': ['../mysql/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller NoSQL 手札'], 'url': ['../nosql/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller LDAP 手札'], 'url': ['../ldap/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Network 手札'], 'url': ['../network/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Cisco IOS 手札'], 'url': ['../cisco/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller H3C 手札'], 'url': ['../h3c/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Multimedia 手札'], 'url': ['../multimedia/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Perl 手札'], 'url': ['../perl/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Amateur Radio 手札'], 'url': ['../radio/index.html']}
2017-09-08 11:42:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-08 11:42:31 [scrapy.extensions.feedexport] INFO: Stored json feed (28 items) in: output.json
2017-09-08 11:42:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 438,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 6075,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 8, 3, 42, 31, 157395),
 'item_scraped_count': 28,
 'log_count/DEBUG': 31,
 'log_count/INFO': 8,
 'memusage/max': 49434624,
 'memusage/startup': 49434624,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 9, 8, 3, 42, 30, 931267)}
2017-09-08 11:42:31 [scrapy.core.engine] INFO: Spider closed (finished)
In the output you will see the scraped results, for example:
{'name': ['Netkiller Architect 手札'], 'url': ['../architect/index.html']}
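Each field is a one-element list and each URL is relative, so the feed usually needs a small post-processing step before it is useful elsewhere. The sketch below uses only the standard library to read a JSON feed, unwrap the lists, and resolve the URLs against the start URL with `urllib.parse.urljoin`. To stay self-contained it writes two records copied from the log into a temporary file in place of the real output.json. The function name `load_books` and the constant `BASE_URL` are invented for this illustration.

```python
import json
import os
import tempfile
from urllib.parse import urljoin

BASE_URL = 'http://www.netkiller.cn/'

# Two records copied from the log above stand in for the real output.json.
sample = [
    {'name': ['Netkiller Architect 手札'], 'url': ['../architect/index.html']},
    {'name': ['Netkiller Developer 手札'], 'url': ['../developer/index.html']},
]


def load_books(path, base_url):
    """Read a JSON feed and return (name, absolute_url) pairs."""
    with open(path, encoding='utf-8') as fh:
        items = json.load(fh)
    books = []
    for item in items:
        # Each field is a one-element list because the spider calls extract()
        name = item['name'][0]
        url = urljoin(base_url, item['url'][0])
        books.append((name, url))
    return books


with tempfile.TemporaryDirectory() as tmp:
    feed = os.path.join(tmp, 'output.json')
    with open(feed, 'w', encoding='utf-8') as fh:
        json.dump(sample, fh, ensure_ascii=False)
    books = load_books(feed, BASE_URL)

for name, url in books:
    print(name, url)
```

An alternative is to avoid the cleanup altogether by calling extract_first() in the spider, which yields plain strings instead of lists.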
Next we demonstrate how a spider follows pagination. For example, to traverse the e-book at https://netkiller.cn/linux/index.html page by page, first create a spider:
neo@MacBook-Pro ~/Documents/crawler % scrapy genspider book netkiller.cn
Created spider 'book' using template 'basic' in module:
  crawler.spiders.book
Edit the spider
# -*- coding: utf-8 -*-
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['netkiller.cn']
    start_urls = ['https://netkiller.cn/linux/index.html']

    def parse(self, response):
        yield {'title': response.css('title::text').extract()}

        # Extract the address of the next page
        next_page = response.xpath('//a[@accesskey="n"]/@href').extract_first()
        self.log('Next page: %s' % next_page)

        # If there is a next page, hand it to response.follow to crawl it
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
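The control flow of this spider can be tried without network access. In the sketch below, a dictionary stands in for the site, and a loop plays the role of the repeated parse callback: emit an item, resolve the relative "next" link against the current URL (which is what response.follow does for relative hrefs), and stop when no link is left. The page names `pr01.html` and `ch01.html`, the titles, and the function `crawl` are all invented for this illustration. The `seen` set loosely mirrors Scrapy's default filtering of duplicate requests.

```python
from urllib.parse import urljoin

# Hypothetical site: URL -> (title, relative href of the "next" link or None)
SITE = {
    'https://netkiller.cn/linux/index.html': ('Index', 'pr01.html'),
    'https://netkiller.cn/linux/pr01.html': ('Preface', 'ch01.html'),
    'https://netkiller.cn/linux/ch01.html': ('Chapter 1', None),
}


def crawl(start_url, site):
    """Yield one item per page, following the next link until none is left."""
    url = start_url
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)
        title, next_href = site[url]
        yield {'title': title, 'url': url}
        # Relative hrefs are resolved against the current page
        url = urljoin(url, next_href) if next_href is not None else None


items = list(crawl('https://netkiller.cn/linux/index.html', SITE))
for item in items:
    print(item)
```

The real spider differs in one important way: Scrapy schedules the followed requests asynchronously, so the loop here only models the order of a single chain of next links.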