
Monthly archive: March 2017

Scrapy crawler framework: installing and starting a new project

March 16, 2017 by carlton

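Scrapy is a crawler framework written in pure Python and built on Twisted, an asynchronous networking framework. You only need to customize a few modules to get a working crawler.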

Installation

See the Scrapy website and the official installation guide.
Install it directly with pip:
pip install Scrapy

Note:

Scrapy depends on these Python packages:

lxml
parsel
w3lib
twisted
cryptography and pyOpenSSL

Minimum package versions required by Scrapy:

Twisted 14.0
lxml 3.4
pyOpenSSL 0.14
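
To see whether an existing environment already satisfies these minimums, a small check along the following lines can help. This is only a sketch: the helper names (version_tuple, meets_minimum, check_installed) are made up for this example, it assumes Python 3.8+ for importlib.metadata, and the version parsing is deliberately simplistic (it looks only at the leading digits of each dotted part).

```python
import re
from importlib import metadata

# Minimum versions quoted above (as of the time of writing).
MINIMUMS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}


def version_tuple(version):
    """Turn '3.4.1' into (3, 4, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '14' and '14.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums=MINIMUMS):
    """Map each package name to True, False, or None (not installed)."""
    results = {}
    for name, minimum in minimums.items():
        try:
            results[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "ok", False: "too old", None: "not installed"}
    for name, ok in check_installed().items():
        print(name, labels[ok])
```

In practice pip resolves these dependencies by itself; a check like this is mostly useful when diagnosing an old, hand-assembled environment.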

Starting a new Scrapy project

Create a new Scrapy project with the command:
scrapy startproject <project_name>
For example, to create a project named TaoBao: scrapy startproject TaoBao

Project structure

TaoBao/
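    scrapy.cfg            # Scrapy project configuration file
    TaoBao/               # the project's Python module
        __init__.py
        items.py          # where the project's Items are defined
        pipelines.py      # the project's Pipelines
        settings.py       # the project's settings
        spiders/          # directory for spiders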
            __init__.py
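
For orientation, here is roughly what the core of the generated settings.py looks like for this project. Treat it as an excerpt sketched from memory rather than the exact generated file: the first three settings follow from the project name, while the ROBOTSTXT_OBEY and DOWNLOAD_DELAY values are assumptions chosen for illustration.

```python
# TaoBao/settings.py (excerpt)
BOT_NAME = "TaoBao"

# Where Scrapy looks for spider classes, and where new spiders are created.
SPIDER_MODULES = ["TaoBao.spiders"]
NEWSPIDER_MODULE = "TaoBao.spiders"

# Assumed values for illustration; tune them for your own crawl.
ROBOTSTXT_OBEY = True   # respect robots.txt
DOWNLOAD_DELAY = 1      # seconds to wait between requests to the same site
```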

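Writing a spider

As an example we will scrape data from Taobao and Tmall: we start from Taobao's search results and then scrape the content they return. In the spiders directory, create a new spider named TmallAndTaoBaoSpider: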

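from scrapy import Spider


class TmallAndTaoBaoSpider(Spider):
    name = "tts"  # used on the command line to start the spider
    allowed_domains = ['tmall.com', 'taobao.com']
    start_urls = []  # to be filled in with the pages to start from

    def parse(self, response):
        # Called with each downloaded response; nothing is extracted yet
        pass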

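A few words of explanation. Every spider must inherit from Spider, the base spider class that Scrapy provides. A spider normally defines the following three attributes and a parse callback (strictly speaking, only name is mandatory):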

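name – the spider's name, used later to start the spider from the command line
allowed_domains – restricts the domains the spider may crawl, so it does not wander off to sites we do not care about; here only Taobao and Tmall are crawled. It is a list
start_urls – where the crawl starts. It is a list
def parse(self, response): – the callback invoked with the content fetched by the downloader; the page content is available through response.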
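
To make the effect of allowed_domains concrete, the sketch below shows the kind of host matching it implies: a URL passes if its host is one of the listed domains or a subdomain of one. This is not Scrapy's own code; is_allowed is a hypothetical helper written with the standard library only, and the example URLs are made up.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["tmall.com", "taobao.com"]
print(is_allowed("https://s.taobao.com/search?q=phone", allowed))  # True
print(is_allowed("https://detail.tmall.com/item.htm", allowed))    # True
print(is_allowed("https://example.com/", allowed))                 # False
print(is_allowed("https://nottaobao.com/", allowed))               # False
```

Matching on a leading dot is what keeps a lookalike host such as nottaobao.com from slipping through.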

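That completes a simple spider, which can be started with the command:
scrapy crawl tts
Of course, the code above will not scrape anything yet, because start_urls is empty and parse is not implemented.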
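
One way to fill start_urls is to generate a search URL per keyword. The following is a sketch only: the search endpoint and its q parameter are assumptions made for illustration and should be checked against the real site, and build_start_urls is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urlencode

# Assumption: this endpoint and its "q" parameter are illustrative only.
SEARCH_URL = "https://s.taobao.com/search"


def build_start_urls(keywords):
    """Return one search URL per keyword, with the keyword URL-encoded."""
    return [SEARCH_URL + "?" + urlencode({"q": keyword}) for keyword in keywords]


start_urls = build_start_urls(["iphone", "mechanical keyboard"])
for url in start_urls:
    print(url)
```

The resulting list could be assigned to the spider's start_urls attribute.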

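Summary

We have implemented a simple spider. As it stands it can only scrape static pages; if a site has dynamic content or relies heavily on AJAX requests, it will not capture the complete data. Taobao and Tmall are both dynamic, so the next step is to scrape dynamic pages with selenium.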

ImportError: No module named '_sqlite3'

March 8, 2017 by carlton

After installing Python in certain environments, this error often appears:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python3.6/lib/python3.6/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/usr/local/python3.6/lib/python3.6/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ModuleNotFoundError: No module named '_sqlite3'
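
To check quickly whether a given interpreter is affected, and to re-check after each attempted fix, a short script like this can be run with that interpreter. It is a minimal sketch; sqlite3_status is a hypothetical helper name.

```python
import importlib


def sqlite3_status():
    """Return the SQLite library version, or None if sqlite3 cannot be imported."""
    try:
        sqlite3 = importlib.import_module("sqlite3")
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None
    return sqlite3.sqlite_version


status = sqlite3_status()
if status is None:
    print("sqlite3 is not available in this Python build")
else:
    print("sqlite3 is available, SQLite version", status)
```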

The likely cause is that the sqlite3 library (the .so) could not be found when Python was built.
A simple fix:

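# Install everything related to sqlite. Not all of it is necessarily needed (sqlite-devel is typically the package the Python build looks for); the * wildcard is used just to keep things simple
yum install sqlite*
Then recompile or reinstall Python.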

If that does not solve it, I recommend building sqlite3 from source (download the source from the SQLite website):

./configure --prefix=/usr/local/sqlite3
make && make install

Then open setup.py in the Python source tree and modify it as follows:

sqlite_inc_paths = [ '/usr/include',
                             '/usr/local/sqlite3/include', # add this line
                             '/usr/include/sqlite',
                             '/usr/include/sqlite3',
                             '/usr/local/include',
                             '/usr/local/include/sqlite',

Recompile and install.
In my own environment (CentOS 6.5, Python 3.6) the methods above still did not solve the problem. This is how I finally fixed it:

# When the manual sqlite3 install finishes, it prints this notice:
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the '-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the 'LD_RUN_PATH' environment variable
     during linking
   - use the '-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to '/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

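This notice shows where the sqlite3 libraries were installed: /usr/local/lib.
Pay particular attention to the line "add LIBDIR to the 'LD_LIBRARY_PATH' environment variable": sqlite is recommending that we set this environment variable.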

export LD_LIBRARY_PATH=/usr/local/lib

With this environment variable set, recompiling and installing python3 succeeds.
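
Note that the export above replaces whatever LD_LIBRARY_PATH already held. If other entries need to be kept, the new value can be computed by prepending instead. The snippet below is a sketch of that idea; prepend_libdir is a hypothetical helper, and ":" is assumed as the separator because LD_LIBRARY_PATH is a Linux convention.

```python
import os


def prepend_libdir(libdir, current):
    """Return an LD_LIBRARY_PATH value with libdir first and no duplicate of it."""
    entries = [entry for entry in current.split(":") if entry and entry != libdir]
    return ":".join([libdir] + entries)


current = os.environ.get("LD_LIBRARY_PATH", "")
# Print the export line to run in the shell before rebuilding Python.
print("export LD_LIBRARY_PATH=" + prepend_libdir("/usr/local/lib", current))
```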
