Python pyspider Crawler Tool

xinxi · January 29, 2020 · Last by 冷月醉夕阳 replied at February 06, 2020 · 1728 hits

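Background

pyspider is a powerful web crawling system written by a Chinese developer, and it ships with a capable WebUI. It is written in Python, uses a distributed architecture, and supports multiple database backends. The WebUI provides a script editor, task monitor, project manager, and result viewer. Online demo: http://demo.pyspider.org/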

Installation

GitHub

https://github.com/binux/pyspider

pycurl

pip uninstall pycurl

export PYCURL_SSL_LIBRARY=openssl

pip install pycurl
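
To confirm that the reinstall actually picked up OpenSSL, one option is to look at the version string pycurl reports. The sketch below is only an illustration: `ssl_backend` is a hypothetical helper, the list of backend names is an assumption, and it relies on `pycurl.version` being a space-separated list of `Name/version` tokens.

```python
def ssl_backend(version_string):
    """Return the SSL library named in a pycurl-style version string, or None."""
    # Assumed set of backend names; extend it if your build reports another one.
    known = ("OpenSSL", "LibreSSL", "BoringSSL", "GnuTLS", "NSS", "wolfSSL", "mbedTLS")
    for token in version_string.split():
        # Each token is expected to look like "Name/1.2.3".
        name = token.split("/", 1)[0]
        for candidate in known:
            if name.lower() == candidate.lower():
                return candidate
    return None


if __name__ == "__main__":
    try:
        import pycurl  # third-party; only needed for the live check
        print(ssl_backend(pycurl.version))
    except ImportError as exc:
        print("pycurl could not be imported:", exc)
```

If the helper prints something other than `OpenSSL`, the `PYCURL_SSL_LIBRARY` variable was probably not set in the shell that ran `pip install pycurl`.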

jsmin

pip install jsmin
pip uninstall jsmin

pyspider

pip install pyspider

Start command: pyspider

Error log:

ValueError: Invalid configuration:
- Deprecated option 'domaincontroller': use 'http_authenticator.domain_controller' instead.

Fix:

pipenv install wsgidav==2.4.1
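
The pin suggests the error comes from a wsgidav release newer than the one pyspider expects. A quick way to check what is installed is sketched below. It assumes that any major version above 2 triggers the error, which is inferred from the pin above and not verified against every release; `needs_pin` and `installed_wsgidav_version` are hypothetical helper names.

```python
from importlib import metadata  # standard library in Python 3.8+


def needs_pin(version_string, max_major=2):
    """Return True if the major version is newer than max_major."""
    major = int(version_string.split(".")[0])
    return major > max_major


def installed_wsgidav_version():
    """Return the installed wsgidav version string, or None if it is absent."""
    try:
        return metadata.version("wsgidav")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_wsgidav_version()
    if version is None:
        print("wsgidav is not installed")
    elif needs_pin(version):
        print("wsgidav", version, "is newer than 2.x; consider pinning 2.4.1")
    else:
        print("wsgidav", version, "should be compatible")
```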

https://segmentfault.com/q/1010000015429020?utm_source=tag-newest


Dashboard

[Screenshot: pyspider WebUI dashboard]

Script

Collecting link URLs from Taobao

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-01-28 18:08:45
# Project: testdemo1

"""
Crawl link URLs from Taobao
"""
from pyspider.libs.base_handler import *
from six import itervalues
import MySQLdb
import redis


class SQL():
    # Initialize the database connection
    def __init__(self):
        # Database connection settings
        hosts = '192.168.1.103'
        username = 'root'
        password = '123321'
        database = 'pyspider'
        charsets = 'utf8'

        self.connection = False
        try:
            self.conn = MySQLdb.connect(host=hosts, port=8888, user=username, passwd=password, db=database,
                                        charset=charsets)
            self.cursor = self.conn.cursor()
            self.cursor.execute("set names " + charsets)
            self.connection = True
        except Exception as e:
            print("Cannot Connect To Mysql!\n", e)

    def escape(self, string):
        return '%s' % string

    # Insert one row into the database
    def insert(self, tablename=None, **values):

        if self.connection:
            tablename = self.escape(tablename)
            if values:
                _keys = ",".join(self.escape(k) for k in values)
                _values = ",".join(['%s', ] * len(values))
                sql_query = "insert into %s (%s) values (%s)" % (tablename, _keys, _values)
            else:
                sql_query = "replace into %s default values" % tablename
            try:
                if values:
                    self.cursor.execute(sql_query, list(itervalues(values)))
                else:
                    self.cursor.execute(sql_query)
                self.conn.commit()
                return True
            except Exception as e:
                print("An Error Occurred: ", e)
                return False


class Handler(BaseHandler):
    crawl_config = {
    }

    # Re-run the entry point once a day
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.taobao.com', callback=self.index_page)

    # Treat a fetched page as fresh for 10 days
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute link on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        print("######### response url #########" + str(response.url))
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

    # Store each result in both Redis and MySQL
    def on_result(self, result):
        print("##################")
        if not result or not result['url']:
            return
        print(result)
        r = redis.Redis(host='127.0.0.1', port=6379, db=0)
        r.lpush("url", result['url'])
        SQL().insert('t_pyspider_project', **result)
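
The `a[href^="http"]` selector in `index_page` keeps only anchors whose `href` starts with `http`, so relative links are skipped. To try that filtering rule outside pyspider, here is a rough standard-library sketch. It uses `html.parser` in place of the PyQuery object that `response.doc` returns, and `AbsoluteLinkParser` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser


class AbsoluteLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Same rule as the CSS selector a[href^="http"].
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)
                break


def extract_links(html):
    """Return the absolute links found in an HTML string, in document order."""
    parser = AbsoluteLinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = '<a href="https://a.example/x">A</a><a href="/rel">B</a>'
    print(extract_links(sample))
```

Note that the rule also drops protocol-relative links such as `//example.com/page`, so a crawl written this way never follows them.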

MySQL storage

[Screenshot: crawl results stored in MySQL]
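
The `insert` method in the script builds a parameterized statement from keyword arguments and lets the driver bind the values. The sketch below reproduces that idea so it can run without a MySQL server. `sqlite3` stands in for MySQLdb, so the placeholder is `?` instead of `%s`, and the two-column layout of `t_pyspider_project` is an assumption based on the `url` and `title` keys the script returns. `build_insert` and `demo` are hypothetical names.

```python
import sqlite3


def build_insert(tablename, placeholder="%s", **values):
    """Return (sql, params) for a parameterized insert of the given columns."""
    # Table and column names are interpolated directly, so they must come
    # from trusted code. Only the values are bound by the driver.
    keys = ", ".join(values)
    marks = ", ".join([placeholder] * len(values))
    sql = "insert into {} ({}) values ({})".format(tablename, keys, marks)
    return sql, list(values.values())


def demo():
    """Insert one row into an in-memory table and read it back."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one text column per key in the crawl result.
    conn.execute("create table t_pyspider_project (url text, title text)")
    sql, params = build_insert(
        "t_pyspider_project", placeholder="?",
        url="https://www.taobao.com", title="Taobao",
    )
    conn.execute(sql, params)
    conn.commit()
    rows = conn.execute("select url, title from t_pyspider_project").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

Column names and values stay aligned because keyword arguments keep their insertion order in Python 3.7 and later.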

Redis storage

[Screenshot: crawled URLs stored in Redis]

Command-line commands

--config

pyspider --config config.json

Global configuration

{
    "taskdb": "mysql+taskdb://username:password@host:port/taskdb",
    "projectdb": "mysql+projectdb://username:password@host:port/projectdb",
    "resultdb": "mysql+resultdb://username:password@host:port/resultdb",
    "message_queue": "amqp://username:password@host:port/%2F",
    "webui": {
        "username": "some_name",
        "password": "some_passwd",
        "need-auth": true
    }
}
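
The three database URLs differ only in the scheme suffix and the database name, so the file can be generated rather than typed by hand. This is a minimal sketch under a few assumptions: `build_config` and `dump_config` are hypothetical helpers, the `message_queue` entry is left out, and credentials are inserted as-is, so special characters such as `@` or `:` in a password are not handled.

```python
import json


def build_config(db_user, db_password, db_host, db_port, webui_user, webui_password):
    """Build a config dict with MySQL-backed task, project, and result databases."""
    location = "{}:{}@{}:{}".format(db_user, db_password, db_host, db_port)
    config = {}
    # Each database URL follows the pattern mysql+<name>://user:pass@host:port/<name>.
    for name in ("taskdb", "projectdb", "resultdb"):
        config[name] = "mysql+{0}://{1}/{0}".format(name, location)
    config["webui"] = {
        "username": webui_user,
        "password": webui_password,
        "need-auth": True,
    }
    return config


def dump_config(config):
    """Serialize the config as indented JSON text."""
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    print(dump_config(build_config("username", "password", "host", 3306,
                                   "some_name", "some_passwd")))
```

Saving the printed text as `config.json` gives a file that can be passed to `pyspider --config config.json`.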

pyspider all

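pyspider all

Runs every component (scheduler, fetcher, processor, result worker, and WebUI), each in its own subprocess or thread.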

pyspider one

pyspider one

Runs every component in a single process, which is mainly useful for debugging.

Script code

The scripts have been uploaded to a GitHub repository:

https://github.com/xinxi1990/pyspiderScript.git

References

https://zhuanlan.zhihu.com/p/39199546

https://www.jianshu.com/p/df34d9b2f248

https://www.cntofu.com/book/156/api/api5.md

3 replies

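This looks good. You clearly haven't been idle at home these past few days, haha.

Replying to rhyme:

Coding every day, what could be better?

Awesome, great stuff. Thanks to the OP for sharing.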
