-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crawler_valor #85
base: master
Are you sure you want to change the base?
crawler_valor #85
Conversation
…to repeat this code in every crawler.
index = requests.get(INDEX_URL).content | ||
soup = BeautifulSoup(index, "lxml") | ||
news_index = soup.find(id="block-valor_capa_automatica-central_automatico").find_all('h2') | ||
news_urls = news_urls + ['http://www.valor.com.br' + BeautifulSoup( art.encode('utf8') , "lxml" ).find('a').attrs['href'] for art in news_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a good idea to break this line to make it more readable. PEP8 suggests 72 with a maximum of 79. I like to follow that whenever possible.
I think that, apart from the really small issues I pointed out in line comments, the code looks good and we should merge it. |
Also, another really important thing: please use 4 spaces instead of tab. I have no real problem with tabs, but mixing spaces and tabs are a bad idea, and our entire code base is using spaces. If you use vim you can use spaces by adding
To your |
from bs4 import BeautifulSoup | ||
import requests | ||
import re | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to import re
nor pandas
. It's a good idea to remove these imports.
I've created this crawler for Valor. Please check if it's ok. I try to take in count the coments on @lucasmachadorj pull request.