Scrapy Crawler Framework: Scraping Taobao and Tmall Data

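With the groundwork from the previous two posts, this post walks through scraping Taobao and Tmall data to show in detail how to crawl the content you want with Scrapy. The complete code can be downloaded with the password wgq5pv; the link is at the end of this post.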

Requirements

Using Taobao search, collect the sales volume, number of favorites, and price of every product in the search results.

Approach

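First, open the Taobao search page, search for 硬盘 (hard disk), and switch to list mode, because list mode shows no ads.
The browser's address bar now shows:
https://s.taobao.com/search?q=硬盘&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170316&style=list
The result list contains many hard disks. For each one we need the detail page, that is, the link it points to, for example: //detail.tmall.com/item.htm?spm=a230r.1.14.19.QzLRla&id=40000831870&ad_id=&am_id=&cm_id=140105335569ed55e27b&pm_id=&abbucket=14
We then request each detail page in full; it contains the sales volume, price, and number of favorites.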

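So the goal comes down to fetching two kinds of pages: the search results, from which we extract each product's detail URL, and the product detail pages, from which we extract the sales volume, price, and so on.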
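As a rough illustration of the first step, the search address can be built from a keyword instead of being copied from the browser. This is only a sketch: the helper name build_search_url is made up for this example, and it assumes that q, ie and style are the only parameters the search page really needs, which the original address does not confirm.

```python
from urllib.parse import urlencode

# Base address taken from the URL shown in the browser above.
SEARCH_BASE = "https://s.taobao.com/search"


def build_search_url(keyword):
    """Build a list-mode search URL for one keyword (hypothetical helper)."""
    # Assumption: the tracking parameters (spm, ssid, initiative_id, ...)
    # from the copied address are optional and can be left out.
    query = urlencode({"q": keyword, "ie": "utf8", "style": "list"})
    return SEARCH_BASE + "?" + query


if __name__ == "__main__":
    for keyword in ["硬盘", "光驱"]:
        print(build_search_url(keyword))
```

urlencode percent-encodes the Chinese keyword as UTF-8, which matches the ie=utf8 parameter in the original address.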

Downloading the pages

With the approach settled, we first download the search results page and then the detail page of each item on it.

    def _parse_handler(self, response):
        """Download the page."""
        # Let the Selenium-driven browser load the URL so that its JavaScript runs
        self.driver.get(response.url)

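It is that simple: self.driver.get(response.url) downloads the page through Selenium, so the JavaScript-rendered content is available in self.driver.page_source. The body of response itself is only the static HTML that the server returned, without the data loaded later by JavaScript.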

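Extracting the content we want (Selector)

That covers downloading. Once a page is downloaded, we need to pull the useful information out of it, and this is what selectors are for. A Selector can be constructed in several ways; only one is shown here (see the Scrapy selector documentation for the rest):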

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
['good']

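This uses XPath to extract the word good; for more on XPath, see a dedicated XPath tutorial.
Besides XPath, Selector also supports CSS selectors and regular expressions. We will not go into them here; it is enough to know that they let us get at the content we need quickly.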

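Processing the content

Now that we have seen how to extract content, let us pull the product detail links out of the search results page. The page source shows that each product link sits here: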

...
<p class="title">
      <a class="J_ClickStat" data-nid="523242229702" href="//detail.tmall.com/item.htm?spm=a230r.1.14.46.Mnbjq5&id=523242229702&ns=1&abbucket=14" target="_blank" trace="msrp_auction" traceidx="5" trace-pid="" data-spm-anchor-id="a230r.1.14.46">WD/西部数据 WD30EZRZ台式机3T电脑<span class="H">硬盘</span> 西数蓝盘3TB 替绿盘</a>
</p>
...

Applying the rules above, the href attribute of the a element is what we need:

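# Pass text= explicitly: the first positional parameter of Selector is a response, not a string.
# self.driver.page_source is the HTML of the page currently loaded in the browser.
selector = Selector(text=self.driver.page_source)
selector.css(".title").css(".J_ClickStat").xpath("./@href").extract()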

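In short, css selects the p element whose class is title, then the a element whose class is J_ClickStat, and finally the XPath rule reads the href attribute of that a element. As an aside, selecting by id would be selector.css("#title"), which follows the same syntax as selectors in CSS itself.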
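The href values extracted this way are protocol-relative (they start with //), so they need to be resolved against the search page address before they can be requested. The sketch below shows one possible way to do that with the standard library and to drop duplicate products by their id parameter; normalize_detail_links is a hypothetical helper, not part of the original project.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def normalize_detail_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep one link per product id."""
    seen_ids = set()
    links = []
    for href in hrefs:
        # urljoin fills in the scheme for links such as //detail.tmall.com/...
        full_url = urljoin(page_url, href)
        item_ids = parse_qs(urlparse(full_url).query).get("id")
        if not item_ids or item_ids[0] in seen_ids:
            continue  # no product id, or product already collected
        seen_ids.add(item_ids[0])
        links.append((item_ids[0], full_url))
    return links


if __name__ == "__main__":
    href = ("//detail.tmall.com/item.htm?spm=a230r.1.14.46.Mnbjq5"
            "&id=523242229702&ns=1&abbucket=14")
    page = "https://s.taobao.com/search?q=%E7%A1%AC%E7%9B%98&style=list"
    print(normalize_detail_links(page, [href, href]))
```

Keying on the id parameter rather than the whole URL is a design choice for this sketch: the spm tracking value differs between links to the same product.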
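In the same way, once a product detail page is loaded we can extract the sales volume, for example. The page source looks like this: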

<ul class="tm-ind-panel">
    <li class="tm-ind-item tm-ind-sellCount" data-label="月销量"><div class="tm-indcon"><span class="tm-label">月销量</span><span class="tm-count">881</span></div></li>
    <li class="tm-ind-item tm-ind-reviewCount canClick tm-line3" id="J_ItemRates"><div class="tm-indcon"><span class="tm-label">累计评价</span><span class="tm-count">4593</span></div></li>
    <li class="tm-ind-item tm-ind-emPointCount" data-spm="1000988"><div class="tm-indcon"><a href="//vip.tmall.com/vip/index.htm" target="_blank"><span class="tm-label">送天猫积分</span><span class="tm-count">55</span></a></div></li>
 </ul>

Get the monthly sales:

selector.css(".tm-ind-sellCount").xpath("./div/span[@class='tm-count']/text()").extract_first()

Get the total number of reviews:

selector.css(".tm-ind-reviewCount").xpath("./div[@class='tm-indcon']/span[@class='tm-count']/text()").extract_first()

Finally, wrap the extracted data in an Item and return it. Taobao and Tmall pages are structured differently, so each needs its own set of rules.
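One simple way to keep the two rule sets apart is to pick the parsing function from the host name of the detail URL. The following is a sketch only: the function names are placeholders, the returned dictionaries stand in for real Items, and item.taobao.com is assumed to be the Taobao detail host (only detail.tmall.com appears in the examples above).

```python
from urllib.parse import urlparse


def parse_tmall(url):
    """Placeholder for the Tmall rules (for example the tm-ind-sellCount selectors)."""
    return {"source": "tmall", "url": url}


def parse_taobao(url):
    """Placeholder for the Taobao rules, which use different selectors."""
    return {"source": "taobao", "url": url}


# Assumption: these two hosts serve the product detail pages.
PARSERS = {
    "detail.tmall.com": parse_tmall,
    "item.taobao.com": parse_taobao,
}


def pick_parser(url):
    """Return the parser for this URL's host, or None if the host is unknown."""
    return PARSERS.get(urlparse(url).netloc)


if __name__ == "__main__":
    parser = pick_parser("https://detail.tmall.com/item.htm?id=523242229702")
    print(parser("https://detail.tmall.com/item.htm?id=523242229702"))
```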

Using Items

An Item holds the results that Scrapy extracts, so that they can be processed later.

Definition

Items are usually defined in items.py:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Creating

>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)

Reading values

>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'

>>> product['price']
1000

>>> product['last_updated']
Traceback (most recent call last):
    ...
KeyError: 'last_updated'

>>> product.get('last_updated', 'not set')
'not set'

>>> product['lala'] # getting unknown field
Traceback (most recent call last):
    ...
KeyError: 'lala'

>>> product.get('lala', 'unknown field')
'unknown field'

>>> 'name' in product  # is name field populated?
True

>>> 'last_updated' in product  # is last_updated populated?
False

>>> 'last_updated' in product.fields  # is last_updated a declared field?
True

>>> 'lala' in product.fields  # is lala a declared field?
False

Setting values

>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'

>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

The one thing to watch out for: fields cannot be read as attributes (product.name), nor set with product.name = "name"; only the dict-style access shown above works.

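Adding a Pipeline to filter results

After an Item has been collected by the Spider, it is passed to the Item Pipeline, where several components process it in a defined order.

Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and processed no further.

Typical uses of an item pipeline:

Cleaning HTML data
Validating the scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing the scraped results in a database

We now implement an Item filter: fields that came back as None are given default values (0 for the counts), and if the Item itself is None the record is dropped.
Pipelines are usually placed in pipelines.py: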

    # Requires: from scrapy.exceptions import DropItem
    # text_utils is a helper module from this project.
    def process_item(self, item, spider):
        if item is not None:
            # If one of the two prices is missing, fall back to the other
            if item["p_standard_price"] is None:
                item["p_standard_price"] = item["p_shop_price"]
            if item["p_shop_price"] is None:
                item["p_shop_price"] = item["p_standard_price"]

            # Convert the counters to integers and the prices to strings
            item["p_collect_count"] = text_utils.to_int(item["p_collect_count"])
            item["p_comment_count"] = text_utils.to_int(item["p_comment_count"])
            item["p_month_sale_count"] = text_utils.to_int(item["p_month_sale_count"])
            item["p_sale_count"] = text_utils.to_int(item["p_sale_count"])
            item["p_standard_price"] = text_utils.to_string(item["p_standard_price"], "0")
            item["p_shop_price"] = text_utils.to_string(item["p_shop_price"], "0")
            # Compare strings with !=; "is not" tests identity and is unreliable here
            item["p_pay_count"] = item["p_pay_count"] if item["p_pay_count"] != "-" else "0"
            return item
        else:
            raise DropItem("Item is None %s" % item)
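The text_utils module is part of the downloadable project and is not shown in this post. Purely as an illustration of what such helpers might look like, here is a hypothetical version of to_int and to_string; the real implementations may behave differently.

```python
import re


def to_int(value, default=0):
    """Return the first run of digits in value as an int, or default."""
    if value is None:
        return default
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else default


def to_string(value, default=""):
    """Return value as a stripped string, or default if it is None or blank."""
    if value is None:
        return default
    text = str(value).strip()
    return text if text else default


if __name__ == "__main__":
    print(to_int("881"), to_int(None), to_int("-"))
    print(to_string(None, "0"), to_string(" 12.50 ", "0"))
```

Falling back to a default instead of raising keeps the pipeline from failing on a single missing counter, which fits the goal of replacing None values with 0.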

Finally, register the pipelines in settings.py:

ITEM_PIPELINES = {
    'TaoBao.pipelines.TTDataHandlerPipeline': 250,
    'TaoBao.pipelines.MysqlPipeline': 300,
}

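The smaller the number, the earlier the pipeline runs. Here the data is filtered and cleaned first, and only then does TaoBao.pipelines.MysqlPipeline write the corrected data to the database.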
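The ordering rule can be checked with a few lines of plain Python. This sketch only mimics the rule (sort by the number, ascending); it is not how Scrapy loads pipelines internally, and execution_order is a name invented for this example.

```python
# Same setting as above, deliberately written in the "wrong" order.
ITEM_PIPELINES = {
    "TaoBao.pipelines.MysqlPipeline": 300,
    "TaoBao.pipelines.TTDataHandlerPipeline": 250,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by their order value, lowest first."""
    return sorted(pipelines, key=pipelines.get)


if __name__ == "__main__":
    for path in execution_order(ITEM_PIPELINES):
        print(ITEM_PIPELINES[path], path)
```

The position of an entry in the dictionary does not matter; only the number does.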

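Problems you may run into

Debugging in an IDE

So far the crawler has always been started with the command scrapy crawl tts. How can we use an IDE's debugger instead? Simply start the crawler from a main block: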

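# Put this in the Spider module
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    spider = TmallAndTaoBaoSpider
    process.crawl(spider)
    process.start()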

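The 302 redirect problem

Pages are often redirected while data is being fetched. If Scrapy returns the 302 response without following it, redirect handling is switched off. Scrapy's settings can turn it on, after which Scrapy follows the redirect, even to another domain, and fetches the content from the final address.
Solution: make sure settings.py contains REDIRECT_ENABLED = True. This is Scrapy's default, so also check that nothing else sets it to False.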

Passing command-line arguments

Crawlers often need custom input. The keyword 硬盘 was hard-coded earlier; how do we pass it in as an argument instead?
Solution:

Override the initializer def __init__(self, *args, **kwargs): and add the custom arguments directly to its parameter list:
    def __init__(self, dt=None, keys=None, *args, **kwargs):
        super(TmallAndTaoBaoSpider, self).__init__(*args, **kwargs)
dt and keys are the custom arguments.
On the command line, arguments are passed with -a. Note that each -a carries exactly one argument; to pass several, repeat -a:
scrapy crawl tts -a keys="硬盘,光驱" -a dt="20170316"
From the main block in an IDE:
    if __name__ == "__main__":
        settings = get_project_settings()
        process = CrawlerProcess(settings)
        spider = TmallAndTaoBaoSpider
        process.crawl(spider, keys="硬盘,光驱", dt="20170316")
        process.start()
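Values passed with -a arrive in __init__ as plain strings, so a comma-separated keys value still has to be split before it can be used. A possible helper is sketched below; parse_keys is a made-up name, and how the original spider actually splits its keys is not shown in this post.

```python
def parse_keys(keys):
    """Split a comma-separated keys argument into a list of clean keywords."""
    if not keys:
        return []
    # Strip whitespace and skip empty entries such as "硬盘,,光驱"
    return [key.strip() for key in keys.split(",") if key.strip()]


if __name__ == "__main__":
    # Equivalent to: scrapy crawl tts -a keys="硬盘,光驱"
    print(parse_keys("硬盘,光驱"))
```

Inside the spider this could be called from __init__, for example self.keys = parse_keys(keys), so that the rest of the code works with a list.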

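Incomplete data (Selenium does not know when the AJAX requests have finished): waiting

Most of the time we get the complete page. But when a page makes many AJAX requests or the network is slow, Selenium cannot tell when those requests have finished. If we call self.driver.get(response.url) and immediately read data with a Selector, the page may not have finished loading and the data may be missing.
Solution: use the wait utility that Selenium provides to keep trying until the data is available or a timeout expires.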

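    # Requires: from selenium.webdriver.support.ui import WebDriverWait
    def _wait_get(self, method):
        """
        Wait for a value; give up and return None if it is not available within 10 seconds.
        :param method: callable that takes the driver and returns the value, or None if it is not ready
        :return: the value returned by method, or None on timeout
        """
        result = None
        try:
            result = WebDriverWait(self.driver, 10).until(method)
        except Exception:
            # self.__error and log.e are logging helpers from this project
            self.__error("Timed out fetching: %s  %s" % (self.driver.current_url, self.driver.title))
            log.e()
        return result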

Taking the review count as an example:

item['p_comment_count'] = self._wait_get(lambda dr: Selector(text=self.driver.page_source).xpath("//li/div/div[@class='tb-rate-counter']/a/strong/text()").extract_first())

For up to 10 seconds, this lambda function is called repeatedly:

lambda dr: Selector(text=self.driver.page_source).xpath("//li/div/div[@class='tb-rate-counter']/a/strong/text()").extract_first()

until it returns a truthy value (here, a string rather than None), or until the 10 seconds are up and the wait times out.
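To make the polling behaviour concrete, here is a small standard-library sketch of the same idea: call a function repeatedly until it returns something truthy or the time runs out. It is a simplified stand-in, not Selenium's implementation; in particular, WebDriverWait raises a TimeoutException on timeout, whereas this sketch returns None in the way _wait_get does. The name wait_until and its parameters are made up for this example.

```python
import time


def wait_until(method, timeout=10.0, poll_interval=0.5):
    """Call method until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = method()
        if value:
            return value  # the data is available
        if time.monotonic() >= deadline:
            return None  # timed out without getting a value
        time.sleep(poll_interval)


if __name__ == "__main__":
    attempts = []

    def fake_lookup():
        # Simulates a page whose review count appears on the third check
        attempts.append(1)
        return "4593" if len(attempts) >= 3 else None

    print(wait_until(fake_lookup, timeout=2, poll_interval=0.1))
```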

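Crawling blocked by robots.txt

Scrapy follows the robots protocol, through which a site declares what may and may not be crawled. What if the site disallows crawling and you still want to crawl it?
Solution:
Tell Scrapy to ignore the robots protocol by adding this to settings.py: ROBOTSTXT_OBEY = False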

Configuring the number of concurrent requests

The default is 16. It can be raised in settings.py, for example: CONCURRENT_REQUESTS = 50

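Complete code: https://share.weiyun.com/5FOskms (password: wgq5pv).

** Disclaimer: this content is shared for educational purposes only. If you use it for anything else, you bear the consequences. **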

10 comments

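Max said:

June 27, 2018, 8:35 AM

Hello,
In the ShopDao class:
def select_by_shop_id(self, shop_id, key):
    select_sql = SELECT.format(table="shop", s="id") + "WHERE dt = 0 and shop_id = '" + str(shop_id) + "' and search_key='" + str(key) + "'"
    print("mysql select:\n" + select_sql)
    num = self.cursor.execute(select_sql)
    return num, self.cursor.fetchall()

In the "WHERE dt = 0" part of the SELECT statement, I don't see a dt column in the shop table. What is this for?
Thank you!

1. carlton said:
  
  June 27, 2018, 11:55 PM
  
  This queries data by shop_id. dt stands for "delete tag", the same thing as is_delete. It is a soft-delete check; you can either remove that condition from the WHERE clause or add a dt column to your table and set it to 0.
  
  1. Max said:
    
    June 28, 2018, 10:04 AM
    
    Thank you very much! I learned a lot!
    I ran into another problem, though.
    def __parse_tmall(self, response, selector):
    …
    src_data = selector.xpath("//script[contains(.,'Tshop.Setup')]").re_first("{\"api[\s\S]*}\n")
    json_data = loads(src_data)
    …
    Running this raises: TypeError: the JSON object must be str, bytes or bytearray, not 'NoneType'
    The Tshop.Setup script block on Tmall has probably changed, so src_data matches nothing here and none of the Tmall products get scraped.
    
    1. carlton said:
      
      June 28, 2018, 10:49 AM
      
      Understanding the principle is what matters; the rules will keep changing. Use the browser's developer tools to see what the new data structure looks like, then adjust the matching rule accordingly.
      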
小海 said:

April 16, 2018, 10:03 AM

The code link is dead. Could you send me a copy of the code?

1. carlton said:
  
  April 16, 2018, 2:25 PM
  
  Link: https://share.weiyun.com/5FOskms Password: wgq5pv
  
Ann said:

April 15, 2018, 4:10 PM

The code does not seem to be downloadable.

1. carlton said:
  
  April 16, 2018, 2:25 PM
  
  Link: https://share.weiyun.com/5FOskms Password: wgq5pv
  
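xu said:

November 16, 2017, 2:33 PM

Hello, the resource does not seem to be downloadable.

1. carlton said:
  
  December 18, 2017, 4:46 PM
  
  Thanks for the feedback. Add me on QQ and I can send you a copy directly. Some files were lost when the site was migrated.
  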

