Create the project:

    scrapy startproject tutorial

Create a spider:

    scrapy genspider first www.baidu.com

This generates a first.py file:
    import scrapy

    class FirstSpider(scrapy.Spider):
        name = 'first'
        allowed_domains = ['www.baidu.com']
        start_urls = ['https://www.baidu.com/']

        def parse(self, response):
            pass
Modify the configuration file settings.py: only output ERROR-level logs, do not obey the robots.txt protocol, and specify a User-Agent.

    USER_AGENT = 'tutorial (+http://www.yourdomain.com)'
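The corresponding log-level and robots settings are the same two lines that appear later in these notes' settings.py:

    LOG_LEVEL = 'ERROR'       # only show ERROR-level log output
    ROBOTSTXT_OBEY = False    # do not obey the robots.txt protocol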
Run the program (scrapy crawl first); a response object is returned:

    <200 https://www.baidu.com/>
Data parsing:

    import scrapy

    class FirstSpider(scrapy.Spider):
        name = 'first'
        start_urls = ['https://ishuo.cn/']

        def parse(self, response):
            # xpath() returns a list of Selector objects
            title_list = response.xpath('//*[@id="list"]/ul/li/div[1]/text()')
            for title in title_list:
                print(title)
You can see that Selector objects are returned; the data we want is in the data attribute:

    chenci@MacBook-Pro tutorial % scrapy crawl first
    <Selector xpath='//*[@id="list"]/ul/li/div[1]/text()' data='如果你得罪了老板,失去的只是一份工作;如果你得罪了客户,失去的不过是一份订...'>
    <Selector xpath='//*[@id="list"]/ul/li/div[1]/text()' data='有位非常漂亮的女同事,有天起晚了没有时间化妆便急忙冲到公司。结果那天她被记...'>
    <Selector xpath='//*[@id="list"]/ul/li/div[1]/text()' data='悟空和唐僧一起上某卫视非诚勿扰,悟空上台,24盏灯全灭。理由:1.没房没车...'>
Extract the data we want from the data attribute:

    import scrapy

    class FirstSpider(scrapy.Spider):
        name = 'first'
        start_urls = ['https://ishuo.cn/']

        def parse(self, response):
            title_list = response.xpath('//*[@id="list"]/ul/li/div[1]/text()')
            for title in title_list:
                # extract() pulls the string out of the Selector's data attribute
                title = title.extract()
                print(title)
Persistent storage

1. Storage via a terminal command

    import scrapy

    class FirstSpider(scrapy.Spider):
        name = 'first'
        start_urls = ['https://ishuo.cn/']

        def parse(self, response):
            data_all = []
            title_list = response.xpath('//*[@id="list"]/ul/li/div[1]/text()')
            for title in title_list:
                title = title.extract()
                dic = {
                    'title': title
                }
                data_all.append(dic)
            # the returned list of dicts is what the -o option serializes
            return data_all

Run it, writing the returned data to a file with -o (CSV here; JSON and XML are also supported):

    chenci@MacBook-Pro tutorial % scrapy crawl first -o test.csv
2. Pipeline-based persistent storage

Enable the pipeline in settings.py:

    ITEM_PIPELINES = {
        # 300 is the priority; lower values run first
        'tutorial.pipelines.TutorialPipeline': 300,
    }
Define the relevant fields in items.py:

    import scrapy

    class TutorialItem(scrapy.Item):
        # define the fields for your item here
        title = scrapy.Field()
Submit the data extracted in first.py to the pipeline:

    import scrapy
    from tutorial.items import TutorialItem

    class FirstSpider(scrapy.Spider):
        name = 'first'
        start_urls = ['https://ishuo.cn/']

        def parse(self, response):
            title_list = response.xpath('//*[@id="list"]/ul/li/div[1]/text()')
            for title in title_list:
                title = title.extract()
                # create an item object and store the parsed data in it
                item = TutorialItem()
                item['title'] = title
                # submit the item to the pipeline
                yield item
Override the parent-class methods in pipelines.py to store the data locally:

    class TutorialPipeline:
        f = None

        def open_spider(self, spider):
            print('I am open_spider, executed only once when the spider starts')
            self.f = open('./text1.txt', 'w', encoding='utf-8')

        def close_spider(self, spider):
            print('I am close_spider, executed only once when the spider closes')
            self.f.close()

        # called once for every item the spider yields
        def process_item(self, item, spider):
            self.f.write(item['title'] + '\n')
            return item
Backing up the data with a second pipeline (here to MySQL)

pipelines.py:

    import pymysql

    class MysqlPipeline(object):
        conn = None
        cursor = None

        def open_spider(self, spider):
            # open the database connection once, when the spider starts
            self.conn = pymysql.Connect(host='localhost', port=3306, user='root', password='123456',
                                        charset='utf8', db='spider')

        def process_item(self, item, spider):
            self.cursor = self.conn.cursor()
            sql = 'insert into duanzi values("%s")' % item['title']
            try:
                self.cursor.execute(sql)
                self.conn.commit()
            except Exception as e:
                print(e)
                self.conn.rollback()
            return item

        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()
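The string-formatted SQL above breaks if a title contains quotes and is open to SQL injection. A minimal safer sketch using pymysql's parameterized execute(), assuming the same single-column duanzi table:

    sql = 'insert into duanzi values(%s)'
    # pymysql escapes the value itself when it is passed as a parameter
    self.cursor.execute(sql, (item['title'],))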
Register the additional pipeline in settings.py:

    ITEM_PIPELINES = {
        # both pipelines receive every item; TutorialPipeline (300) runs before MysqlPipeline (301)
        'tutorial.pipelines.TutorialPipeline': 300,
        'tutorial.pipelines.MysqlPipeline': 301,
    }
Manual request sending

Create a new project:

    chenci@MacBook-Pro scrapy % scrapy startproject HandReq
    chenci@MacBook-Pro scrapy % cd HandReq
    chenci@MacBook-Pro HandReq % scrapy genspider duanzi www.xxx.com

duanzi.py:

    import scrapy
    from HandReq.items import HandreqItem

    class DuanziSpider(scrapy.Spider):
        name = 'duanzi'
        start_urls = ['https://duanzixing.com/page/1/']
        # URL template for the remaining pages
        url = 'https://duanzixing.com/page/%d/'
        page_num = 2

        def parse(self, response):
            title_list = response.xpath('/html/body/section/div/div/article[1]/header/h2/a/text()')
            for title in title_list:
                title = title.extract()
                item = HandreqItem()
                item['title'] = title
                yield item
            if self.page_num < 5:
                # build the next page URL and send the request manually,
                # parsing it with the same callback
                new_url = format(self.url % self.page_num)
                self.page_num += 1
                yield scrapy.Request(url=new_url, callback=self.parse)
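The HandreqItem class imported above is not shown in these notes; presumably HandReq/items.py defines a single title field, mirroring the earlier TutorialItem. A minimal sketch under that assumption:

    import scrapy

    class HandreqItem(scrapy.Item):
        # assumed definition: one field for the extracted title
        title = scrapy.Field()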
Workflow of the five core components

Engine (Scrapy Engine)
Handles the data flow of the entire system and triggers events (the core of the framework).

Scheduler
Accepts requests sent by the engine and pushes them into a queue, returning them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.

Downloader
Downloads page content and returns it to the spiders (the downloader is built on twisted, an efficient asynchronous model).

Spiders
The spiders do the main work: they extract the information they need, i.e. the items (entities), from specific pages. They can also extract links from those pages so that Scrapy continues to crawl the next page.

Item Pipeline
Processes the items extracted by the spiders; its main jobs are persisting items, validating them, and removing unneeded data. After a page is parsed by a spider, its items are sent to the pipeline and processed through several specific steps in order.
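A rough, hypothetical sketch of how these components meet inside a spider (not part of this tutorial's projects): anything yielded as an item flows from the engine to the item pipelines, while anything yielded as a request goes back through the scheduler and downloader before reaching a spider callback.

    import scrapy

    class FlowSpider(scrapy.Spider):
        # hypothetical spider, used only to illustrate the data flow
        name = 'flow'
        start_urls = ['https://example.com/']

        def parse(self, response):
            # a yielded item goes: spider -> engine -> item pipelines
            yield {'url': response.url}
            # a yielded request goes: spider -> engine -> scheduler -> downloader -> spider callback
            yield scrapy.Request(url='https://example.com/page2', callback=self.parse)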
Deep crawling with request parameter passing (meta) - 4567kan.com

meta is a dict that can be passed along with a request to its callback:

    scrapy.Request(url, callback, meta)

The callback takes the dict back out of the response:

    item = response.meta['item']
move.py (the spider file):

    import scrapy
    from move_4567kan.items import Move4567KanItem

    class MoveSpider(scrapy.Spider):
        name = 'move'
        start_urls = ['https://www.4567kan.com/frim/index1-1.html']
        # URL template for the remaining list pages
        url = 'https://www.4567kan.com/frim/index1-%d.html'
        page_num = 2

        def parse(self, response):
            li_list = response.xpath('/html/body/div[2]/div/div[3]/div/div[2]/ul/li')
            for li in li_list:
                url = 'https://www.4567kan.com' + li.xpath('./div/a/@href').extract()[0]
                title = li.xpath('./div/a/@title').extract()[0]
                item = Move4567KanItem()
                item['title'] = title
                # request the detail page and pass the item along via meta
                yield scrapy.Request(url=url, callback=self.get_details, meta={'item': item})

            if self.page_num < 5:
                new_url = format(self.url % self.page_num)
                self.page_num += 1
                yield scrapy.Request(url=new_url, callback=self.parse)

        def get_details(self, response):
            details = response.xpath('//*[@class="detail-content"]/text()').extract()
            if details:
                details = details[0]
            else:
                details = None
            # take the item back out of meta and fill in the detail field
            item = response.meta['item']
            item['details'] = details
            yield item
items.py defines two fields:

    import scrapy

    class Move4567KanItem(scrapy.Item):
        title = scrapy.Field()
        details = scrapy.Field()
pipelines.py simply prints the items:

    class Move4567KanPipeline:
        def process_item(self, item, spider):
            print(item)
            return item
Middleware

Purpose: intercept requests and responses.

Spider middleware: omitted here.

Downloader middleware (recommended):

Intercepting requests:
    1. Tamper with the request URL
    2. Forge request header information:
        UA
        Cookie
    3. Set a request proxy
Intercepting responses:
    Tamper with the response data
Edit the middleware file middlewares.py:

    from scrapy import signals
    from itemadapter import is_item, ItemAdapter

    class MiddleDownloaderMiddleware:

        # intercepts every outgoing request
        def process_request(self, request, spider):
            print('I am process_request()')
            return None

        # intercepts every incoming response
        def process_response(self, request, response, spider):
            print('I am process_response()')
            return response

        # intercepts requests that raised an exception
        def process_exception(self, request, exception, spider):
            print('I am process_exception()')
Write the spider file:

    import scrapy

    class MidSpider(scrapy.Spider):
        name = 'mid'
        start_urls = ['https://www.baidu.com', 'https://www.sogou.com']

        def parse(self, response):
            print(response)
Enable the middleware in the configuration file settings.py:

    ROBOTSTXT_OBEY = True

    DOWNLOADER_MIDDLEWARES = {
        'middle.middlewares.MiddleDownloaderMiddleware': 543,
    }
Start the project:

    chenci@MacBook-Pro middle % scrapy crawl mid
    I am process_request()
    I am process_request()
    I am process_response()
    I am process_exception()
    I am process_response()
    I am process_exception()
Set a proxy in process_exception():

    def process_exception(self, request, exception, spider):
        # set a proxy for the failed request (the meta key must be 'proxy')
        request.meta['proxy'] = 'https://ip:port'
        print('I am process_exception()')
        # return the request so it is rescheduled and sent again
        return request
Set headers in process_request():

    def process_request(self, request, spider):
        # forge the request header information
        request.headers['User-Agent'] = 'xxx'
        request.headers['Cookie'] = 'xxx'
        print('I am process_request()')
        return None
Tamper with the response data in process_response():

    def process_response(self, request, response, spider):
        # response.text is read-only, so build a modified copy of the response instead
        response = response.replace(body=b'xxx')
        print('I am process_response()')
        return response
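A common alternative is to return a brand-new HtmlResponse built from whatever body you want. A minimal sketch; the body here is placeholder HTML, not content from any of these projects:

    from scrapy.http import HtmlResponse

    def process_response(self, request, response, spider):
        # replace the downloaded response with a freshly built one
        return HtmlResponse(url=request.url, body='<html>xxx</html>', encoding='utf-8', request=request)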
Large file download: crawling images from jdlingyu.com

img.py:

    import scrapy
    from imgdownload.items import ImgdownloadItem

    class ImgSpider(scrapy.Spider):
        name = 'img'
        start_urls = ['https://www.jdlingyu.com']

        def parse(self, response):
            li_list = response.xpath('/html/body/div[1]/div[2]/div[1]/div/div[6]/div/div[1]/div/div[2]/ul/li')
            for a in li_list:
                url = a.xpath('./div/div[2]/h2/a/@href').extract()[0]
                title = a.xpath('./div/div[2]/h2/a/text()').extract()[0]
                item = ImgdownloadItem()
                item['title'] = title
                # request the detail page and pass the item along via meta
                yield scrapy.Request(url=url, callback=self.get_img_url, meta={'item': item})

        def get_img_url(self, response):
            page = 0
            item = response.meta['item']
            img_list = response.xpath('//*[@id="primary-home"]/article/div[2]/img')
            for scr in img_list:
                img_url = scr.xpath('./@src').extract()[0]
                page += 1
                item['img_url'] = img_url
                item['page'] = page
                yield item
Add configuration in settings.py:

    USER_AGENT = 'ua'
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'ERROR'
    # directory where the ImagesPipeline stores downloaded images
    IMAGES_STORE = './imgs'
Add the fields in items.py:

    import scrapy

    class ImgdownloadItem(scrapy.Item):
        title = scrapy.Field()
        img_url = scrapy.Field()
        page = scrapy.Field()
Add a pipeline class in pipelines.py (the ImagesPipeline requires the Pillow library to be installed):

    import scrapy
    from itemadapter import ItemAdapter
    from scrapy.pipelines.images import ImagesPipeline


    class ImgdownloadPipeline:
        def process_item(self, item, spider):
            return item


    class img_download(ImagesPipeline):

        # issue a download request for each image URL
        def get_media_requests(self, item, info):
            yield scrapy.Request(url=item['img_url'], meta={'title': item['title'], 'page': item['page']})

        # decide the storage path, relative to IMAGES_STORE
        def file_path(self, request, response=None, info=None, *, item=None):
            title = request.meta['title']
            page = request.meta['page']
            path = f'{title}/{page}.jpg'
            return path

        # pass the item on to the next pipeline class
        def item_completed(self, results, item, info):
            return item
Register the pipeline class in settings.py:

    ITEM_PIPELINES = {
        'imgdownload.pipelines.img_download': 300,
    }

Run result: the images are saved under ./imgs, one folder per title.
CrawlSpider: deep crawling

What it is:
A subclass of Spider, i.e. a parent class for spider files.
Purpose: crawling a whole site, for example crawling every page-number link under a page.

Basic usage:
1. Create a project
2. Create a spider file based on the CrawlSpider class:
    scrapy genspider -t crawl main www.xxx.com
3. Run the project
Write the spider file main.py:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class MainSpider(CrawlSpider):
        name = 'main'
        start_urls = ['https://www.mn52.com/fj/']

        rules = (
            # the LinkExtractor collects every link matching the regex;
            # follow=True applies the rule again to the pages it finds
            Rule(LinkExtractor(allow=r'list_8_\d.html'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            print(response)
            item = {}
            return item
Run the project. You can see that the URLs of all the page numbers were crawled:

    chenci@MacBook-Pro crawl % scrapy crawl main
    <200 https://www.mn52.com/fj/list_8_2.html>
    <200 https://www.mn52.com/fj/list_8_3.html>
    <200 https://www.mn52.com/fj/list_8_4.html>
    <200 https://www.mn52.com/fj/list_8_8.html>
    <200 https://www.mn52.com/fj/list_8_5.html>
    <200 https://www.mn52.com/fj/list_8_7.html>
    <200 https://www.mn52.com/fj/list_8_9.html>
    <200 https://www.mn52.com/fj/list_8_6.html>
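The parse_item() above only prints the response. A minimal sketch of what a real callback might do instead, using a placeholder XPath that is not taken from the actual page:

    def parse_item(self, response):
        item = {}
        # hypothetical extraction; replace the XPath with the real page structure
        item['title'] = response.xpath('//title/text()').get()
        item['url'] = response.url
        yield item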