Crawler automation #8

Open
zmy opened this issue Feb 10, 2014 · 5 comments

Comments

@zmy
Member

zmy commented Feb 10, 2014

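Right now, crawling 尹福 still needs a manual refresh every time, and the code is pretty dirty. Options at this point:

  • Refactor the crawler to cut down on manual intervention
  • Get in touch with 吴天际, the author of 尹福, and pull the data directly from their server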
@TerrorJack

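I suggest rewriting the crawler:

  • Switch the target to 3g.renren.com. Its pages are easy to parse, which makes crawling very large albums convenient.
  • Keep only the cookie-based login. The crawler can poll on a schedule; when it hits a re-login that requires solving a CAPTCHA, it stops polling and automatically e-mails the admin. The admin logs in manually, pushes the new cookies to the server remotely, and restarts the polling.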
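
A rough sketch of that polling loop, in case it helps the discussion. Everything named here is hypothetical: `LoginRequired`, `fetch_once`, `build_alert`, the sender address, and the 10-minute interval are placeholders, and sending through a local SMTP server is only an assumption about the deployment.

```python
import smtplib
import time
from email.message import EmailMessage
from typing import Callable, Dict, Optional


class LoginRequired(Exception):
    """Hypothetical signal: the site wants a fresh login (e.g. a CAPTCHA)."""


def build_alert(admin_addr: str, reason: str) -> EmailMessage:
    """Build the notification mail that tells the admin to log in again."""
    msg = EmailMessage()
    msg["Subject"] = "Crawler stopped: re-login required"
    msg["From"] = "crawler@localhost"  # assumed sender address
    msg["To"] = admin_addr
    msg.set_content(
        "Polling stopped: " + reason + "\n"
        "Log in manually, push the new cookies to the server, "
        "then restart polling."
    )
    return msg


def send_via_local_smtp(msg: EmailMessage) -> None:
    """One possible notifier; assumes an SMTP server listens on localhost."""
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def poll(
    fetch_once: Callable[[Dict[str, str]], None],
    cookies: Dict[str, str],
    notify: Callable[[EmailMessage], None],
    admin_addr: str,
    interval_s: float = 600.0,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch_once on a schedule until the cookies stop working.

    Returns the number of successful rounds. On LoginRequired the loop
    notifies the admin once and stops; it never tries to log in itself.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            fetch_once(cookies)
        except LoginRequired as exc:
            notify(build_alert(admin_addr, str(exc)))
            break
        rounds += 1
        sleep(interval_s)
    return rounds
```

Restarting after the admin pushes fresh cookies is then just another call to `poll` with the new cookie dict, e.g. `poll(fetch_once, new_cookies, send_via_local_smtp, "admin@example.com")`.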

@zmy
Member Author

zmy commented Mar 29, 2014

I'll contact the 尹福 author first and see whether we can access the data directly O_O

@zmy
Member Author

zmy commented Mar 29, 2014

@TerrorJack Does 3g.renren.com return HTML pages?

The current crawler only goes through HTML pages when fetching albums; everything else it gets directly as JSON.

@TerrorJack

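Crawling 3g.renren.com won't give you any JSON; everything goes through HTML. It is still somewhat easier to handle than www.renren.com, though: you only need to submit the cookies and don't have to deal with tokens and the like.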
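
To make the "cookies only, parse the HTML" idea concrete, here is a minimal standard-library sketch. The cookie names in the usage note are made up, and pulling every `<img src>` is only an assumption about what an album page looks like; the real 3g.renren.com markup would need its own selectors.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.request import Request


def build_request(url: str, cookies: Dict[str, str]) -> Request:
    """Attach the saved cookies as a single Cookie header; no token handling."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)


def extract_image_urls(html: str) -> List[str]:
    """Return the image URLs found in an HTML string, in document order."""
    parser = ImageLinkParser()
    parser.feed(html)
    parser.close()
    return parser.image_urls
```

Fetching a page would then be something like `urllib.request.urlopen(build_request("http://3g.renren.com/", {"t": "..."}))`, with the response body decoded and passed to `extract_image_urls`.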

@cgcgbcbc
Member

@zmy @TerrorJack
close or update this issue?

cgcgbcbc added this to the v0.1.0 milestone May 15, 2014
cgcgbcbc modified the milestones: v1.0.0, v0.1.0 May 22, 2014
cgcgbcbc added the data label May 22, 2014
3 participants