When building a web crawler (spider), in most cases you need to set up many things from scratch. In my view, the steps to build a crawler are: analyze the website, scrape the data, and store the data. A crawler usually also needs supporting functions such as a proxy pool, a user-agent pool, logging, etc. Furthermore, as a crawler maintainer you also need to monitor the crawler, the system, the database, and the target website's status to keep the quality high. Therefore, I would like to centralize these concerns in one package to make it easier to build a spider.
On the other hand, as you may know, the most famous crawler project is Scrapy. It is a good project, so I will also build some plugins for Scrapy in this package.
1. Create a Database() or Database(config) object; if no config is provided, the default from the settings is used
2. The config format follows INIT_DEFAULT in the settings
3. To insert data, use the method: Database().store_data(data) (see the sketch after this list)
4. The data format is a dict, or dicts inside a list
5. Currently supported backends: MySQL, SQLite3, InfluxDB, MongoDB, CSV
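A minimal usage sketch based on the steps above. The Database class and store_data method come from this package, but the import path and the config keys shown are assumptions; check INIT_DEFAULT in the settings for the real schema.

from Amazing_SpiderMan.DBConsole import Database  # import path is an assumption

# Use the default config from the settings (INIT_DEFAULT).
db = Database()

# Or pass a config dict that follows the INIT_DEFAULT format
# (keys shown here are illustrative, not the package's exact schema):
# db = Database({'platform': 'mysql', 'host': 'localhost', 'port': 3306})

# Insert data: a single dict, or dicts inside a list.
db.store_data({'ip': '192.168.0.2', 'port': 800, 'name': 'cb', 'area': 'beijin'})
db.store_data([
    {'ip': '192.168.0.2', 'port': 800, 'name': 'cb', 'area': 'beijin'},
    {'ip': '192.168.0.2', 'port': 800, 'name': 'cb', 'area': 'beijin'},
])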
1. IPProxy
# DOWNLOADER_MIDDLEWARES = {
# 'Amazing_SpiderMan.myscrapy.dowmloadmiddlewares.IPProxy.IPProxyMiddleware': 125,
# }
# IPPROXY_ENABLED = True
2. URLBloomFilter
# DUPEFILTER_CLASS = 'Amazing_SpiderMan.myscrapy.downloadmiddlewares.URLBloomFilter.URLBloomFilter'
3. UserAgent
# DOWNLOADER_MIDDLEWARES = {
# 'Amazing_SpiderMan.myscrapy.dowmloadmiddlewares.UserAgent.UserAgent': 1,
# }
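For reference, a minimal sketch of what a user-agent rotating downloader middleware such as the UserAgent entry above typically looks like. This is illustrative, not the package's actual implementation, and the user-agent strings are placeholders.

import random


class RandomUserAgentMiddleware(object):
    """Illustrative downloader middleware that rotates the User-Agent header."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request,
        # then return None so Scrapy continues processing the request.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)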
Add the extension as below if you want it to be used by every Scrapy project. Because base settings are imported differently from project extensions (a project extension is expected to live inside the project), just putting the extension below into a single project will raise an error. E.g.: scrapy.settings.default_settings ==> EXTENSIONS_BASE = {'Amazing_SpiderMan.myscrapy.extensions.StatsInflux.StatsInfluxdb': 300,}
PS: another reason is that I don't want to change the original import method, to avoid errors.
1. StatsInflux
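A sketch of the change described above, made in Scrapy's own scrapy/settings/default_settings.py (inside the installed Scrapy package); the entry mirrors the example given:

# scrapy/settings/default_settings.py
EXTENSIONS_BASE = {
    # ... keep Scrapy's built-in extension entries as they are ...
    'Amazing_SpiderMan.myscrapy.extensions.StatsInflux.StatsInfluxdb': 300,
}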
1. mysql pipeline (see the sketch after this list)
# ITEM_PIPELINES = {
# 'Amazing_SpiderMan.myscrapy.pipelines.MysqlPipeline.MysqlPipeline': 300,
# }
# # MYSQL host
# MYSQL_URI = 'localhost:3306'
# # MYSQL database
# MYSQL_DATABASE = 'job'
# # MYSQL table
# MYSQL_TABLE = 'jobsearch'
# # MYSQL user
# MYSQL_USER = 'jobsearch'
# # MYSQL password
# MYSQL_PASS = 'jobsearch'
2. mongo pipeline
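For reference, a minimal sketch of how a pipeline like the MysqlPipeline above typically consumes the MYSQL_* settings. This is not the package's actual code; the pymysql dependency and the generated INSERT statement are assumptions.

import pymysql


class MysqlPipelineSketch(object):
    """Illustrative pipeline that reads the MYSQL_* settings and inserts items."""

    def __init__(self, uri, database, table, user, password):
        host, port = uri.split(':')
        self.conn = pymysql.connect(host=host, port=int(port), user=user,
                                    password=password, database=database)
        self.table = table

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(s.get('MYSQL_URI'), s.get('MYSQL_DATABASE'),
                   s.get('MYSQL_TABLE'), s.get('MYSQL_USER'), s.get('MYSQL_PASS'))

    def process_item(self, item, spider):
        # Build an INSERT statement from the item's fields and run it.
        fields = dict(item)
        columns = ', '.join(fields)
        placeholders = ', '.join(['%s'] * len(fields))
        sql = 'INSERT INTO {} ({}) VALUES ({})'.format(self.table, columns, placeholders)
        with self.conn.cursor() as cursor:
            cursor.execute(sql, list(fields.values()))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()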
1. Scrapy_redis + BloomFilter
This feature keeps the request queue and the duplicate filter in Redis, controlled by the scheduler.
# Use the scrapy_redis scheduler
# SCHEDULER = "Amazing_SpiderMan.myscrapy.scrapy_redis.scheduler.Scheduler"
# SCHEDULER_QUEUE_CLASS = 'Amazing_SpiderMan.myscrapy.scrapy_redis.queue.SpiderPriorityQueue'
# Keep the queues used by scrapy-redis in Redis, which allows pausing and resuming the crawl
# SCHEDULER_PERSIST = True
# REDIS_HOST = '127.0.0.1'
# REDIS_PORT = 6379
# # Deduplication queue settings
# FILTER_URL = None
# FILTER_HOST = 'localhost'
# FILTER_PORT = 6379
# FILTER_DB = 0
-----------------------------2018-10-31----------------------------
Add new pipelines in myscrapy
New update:
1. CSVPipeline
2. JSONPipeline
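These can presumably be enabled like the MysqlPipeline above; a settings sketch, assuming the module paths follow the same naming convention (check the actual paths under myscrapy/pipelines):

# ITEM_PIPELINES = {
#     'Amazing_SpiderMan.myscrapy.pipelines.CSVPipeline.CSVPipeline': 300,
#     'Amazing_SpiderMan.myscrapy.pipelines.JSONPipeline.JSONPipeline': 310,
# }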
-----------------------------2018-10-29----------------------------
Update SQL insert object attribute
New update:
1. Add an attribute '__table_args__ = ' to class datafram(BaseModel).
The data encoding was incorrect, which caused Tableau to fail to import the data.
-----------------------------2018-10-28----------------------------
Add myscrapy project
New update:
1. Centralize download middlewares/extensions/pipelines/functions here if they can be used by every Scrapy project
2. To use them, refer to the remark at the top of each object's file
PS: if a component contains more than one file, the remark is in its __init__ file
Detailed remarks will be added later
-----------------------------2018-10-27----------------------------
Build up project 'Amazing Spider Man'
New update:
1. Add the InfluxDB platform to DBConsole
2. Upgrade the data insert format to accept multiple shapes, e.g. a list of dicts or a dict of dicts
sample:
"[{'ip':'192.168.0.2', 'port': 800, 'name': 'cb', 'area':'beijin'},
{'ip':'192.168.0.2', 'port': 800, 'name': 'cb', 'area':'beijin'}]"
"{'a': {'ip':'192.168.0.2', 'port': 800, 'name': 'cb', 'area':'beijin'},
'b': {'ip':'192.168.0.2', 'port': 800, 'name': 'cb', 'area':'beijin'}}"
Originally, only a single dict was allowed.
3. InfluxDB columns can be initialized in INIT_DEFAULT['col'] by specifying 'tags'/'fields' (see the sketch below).
If not specified, the columns are classified automatically: numeric values default to 'fields' and the rest become 'tags'.
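A hedged sketch of such a column spec; the exact structure of INIT_DEFAULT is not documented here, so the nesting and column names below are assumptions based on the description above:

# Assumed shape of the InfluxDB column spec inside INIT_DEFAULT (illustrative only):
# INIT_DEFAULT = {
#     # ... other connection settings ...
#     'col': {
#         'tags': ['name', 'area'],   # string-like dimensions
#         'fields': ['port'],         # numeric values
#     },
# }
# If 'col' is not provided, columns are classified automatically:
# numeric values become 'fields', everything else becomes 'tags'.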