[web scraper] verified source #262
Comments
I did some research on this topic and I think we can use
@sultaniman good points. Could you please do a proof of concept of this?
I will do it once I check out dlt-hub/dlt#811
So I created a draft prototype, outlined in the flowchart below:
```mermaid
flowchart LR
  queue[[queue]]
  pipeline[[dlt pipeline]]
  exit{{scraping done}}
  save([exit & save data])
  nodata{scraping done?}

  spider-- push results -->queue
  spider-- no more data -->exit
  queue-->pipeline
  pipeline-->nodata
  nodata-- NO -->queue
  nodata-- DONE -->save
  exit-. no data .->queue
```
**Pipeline and scaffolding**

```python
from queue import Queue
import threading

import dlt
from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider

result_queue = Queue(maxsize=1000)


class SpiderResultHandler(threading.Thread):
    def __init__(self, queue: Queue):
        super().__init__(daemon=True)
        self.result_queue = queue

    def run(self):
        @dlt.resource(name="quotes")
        def get_results():
            # keep pulling items from the queue
            # until we get a "done" message
            while True:
                result = self.result_queue.get()
                if "done" in result:
                    break
                yield result

        pipeline = dlt.pipeline(
            pipeline_name="issue_262",
            destination="postgres",
        )
        load_info = pipeline.run(
            get_results,
            table_name="fam_quotes",
            write_disposition="replace",
        )
        print(load_info)


process = CrawlerProcess()
process.crawl(QuotesSpider, queue=result_queue)

handler = SpiderResultHandler(queue=result_queue)
handler.start()
process.start()
handler.join()
```

**Spider source**

```python
from queue import Queue
from typing import Any

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]
    custom_settings = {"LOG_LEVEL": "INFO"}

    def __init__(
        self,
        name: str | None = None,
        queue: Queue | None = None,
        **kwargs: Any,
    ):
        super().__init__(name, **kwargs)
        self.queue = queue

    def parse(self, response):
        for quote in response.css("div.quote"):
            data = {
                "headers": dict(response.headers.to_unicode_dict()),
                "quote": {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                },
            }
            # here we push each result to the queue
            self.queue.put(data)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        else:
            # finally, if there are no more results, send "done"
            self.queue.put({"done": True})
```
@sultaniman thanks for the POC, this looks great. Please go ahead and make a verified source from this POC. As you mentioned, you'd need to devise a nice way to wrap it into a source definition that hides some complexities while giving enough ways to configure the source. Please take a look at the other verified sources in this repo for inspiration. Please submit a draft PR and we'll iterate on the source interface.
Quick source info
Current Status
What source does/will do
The idea is to base the source on scrapy. In theory, scrapy can be used with dlt directly because you can get the scraped data as a generator; in practice, however, it is typically wrapped in an opaque process from which there is no way to get the data out. scrapy has its own framework, so we can fit dlt into scrapy (i.e. as an export option). We must investigate whether to use scrapy or switch to Beautiful Soup and write our own spider.
The requirements [for scrapy]
A dlt.resource that, when given a scrapy Spider, will yield all the data that the spider yields. A rough sketch of such a wrapper is shown below.
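A minimal sketch of that idea, reusing the queue-and-thread pattern from the POC above. The name run_scrapy_pipeline and its parameters are hypothetical and only illustrate the shape of the interface, not an existing dlt or scrapy API:

```python
from queue import Queue
import threading
from typing import Any, Iterator, Optional, Type

import dlt
import scrapy
from scrapy.crawler import CrawlerProcess


def run_scrapy_pipeline(
    pipeline: dlt.Pipeline,
    spider: Type[scrapy.Spider],
    resource_name: str = "scrapy",
    table_name: Optional[str] = None,
    queue_size: int = 1000,
) -> None:
    """Run the spider and load everything it scrapes with the given pipeline."""
    result_queue: Queue = Queue(maxsize=queue_size)

    @dlt.resource(name=resource_name)
    def results() -> Iterator[Any]:
        # consume scraped items until the crawler signals completion
        while True:
            item = result_queue.get()
            if "done" in item:
                break
            yield item

    def load() -> None:
        pipeline.run(results, table_name=table_name)

    # dlt consumes the queue in a background thread while the crawler
    # keeps the main thread, as in the POC above
    loader = threading.Thread(target=load, daemon=True)
    loader.start()

    process = CrawlerProcess()
    process.crawl(spider, queue=result_queue)
    process.start()  # blocks until the crawl is finished
    result_queue.put({"done": True})  # unblock the consumer thread
    loader.join()
```

With something like this, a user would only write the spider and then call run_scrapy_pipeline(pipeline, QuotesSpider, resource_name="quotes", table_name="quotes").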
Test account / test data
Looks like we'll have plenty of websites to test against
Additional context
Please provide one demo where we scrape PDFs and parse them in a transformer, as in
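A minimal sketch of what such a demo could look like, assuming the PDFs are already available locally and are parsed with pypdf; the file paths and names below are hypothetical:

```python
import dlt
from pypdf import PdfReader


@dlt.resource
def pdf_files():
    # hypothetical list of locally downloaded PDF files
    yield from ["docs/report_1.pdf", "docs/report_2.pdf"]


@dlt.transformer()
def pdf_pages(path: str):
    # parse each PDF and emit one row per page
    reader = PdfReader(path)
    for page_number, page in enumerate(reader.pages):
        yield {"path": path, "page": page_number, "text": page.extract_text()}


pipeline = dlt.pipeline(pipeline_name="pdf_demo", destination="duckdb")
# pipe the resource into the transformer and load the parsed pages
print(pipeline.run(pdf_files | pdf_pages, table_name="pdf_pages"))
```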