I originally started this to throw together a skeleton app for scraping data. After deciding to go a different route, I implemented NLP to parse content from sites and single-page applications, and it seemed natural to feed that data into Apache Solr and work on an algorithm that finds content on the web quickly. This was a bit of a pet project to see how easily it could be done and launched with `docker compose`.
This could still be used to scrape content from sites, or as a starting point for an algorithm that organizes web data into some meaningful structure. Google can provide search results for your own website, but if your documentation sits behind a paywall or is rendered entirely within a SPA, Google may not be able to index it.
Apache Nutch is a great solution if you're willing to deal with the learning curve or need to write a plugin; this is some simple, free code to throw at the same problem.
My goal here is to provide a solution you can clone or fork, run `docker compose up` against, and have it just work. The seed files are static, though, so you'll surely want to adjust those. If you want the crawler to stay on your site, you'll need to modify the code a bit; a sketch of that check follows below.
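For illustration, here's a minimal sketch of the kind of check you could add wherever the crawler queues discovered links. `ALLOWED_HOSTS` and `should_follow` are hypothetical names for this example, not identifiers from this repo:

```python
from urllib.parse import urlparse

# Hypothetical allow-list; replace with the hosts from your seed file.
ALLOWED_HOSTS = {"docs.example.com"}

def should_follow(url: str) -> bool:
    """Return True only for links on an allowed host, so the
    crawler never wanders off-site."""
    host = urlparse(url).netloc.lower()
    # Accept the host itself or any subdomain of it.
    return any(host == h or host.endswith("." + h) for h in ALLOWED_HOSTS)
```

Filtering each outgoing link through a predicate like this keeps the crawl frontier restricted to your own domains without touching the rest of the pipeline.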
Otherwise, note that I launch this from an Ubuntu WSL environment, but it may work from Windows as well. Make sure the volumes are configured correctly in docker-compose.yml, rename dot.env to .env and fix up the configuration settings, and likewise rename dot.docker-compose.env to .docker-compose.env.
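In short, the setup boils down to something like this (edit the values in both env files for your environment before starting):

```sh
cp dot.env .env                                # then adjust the settings inside
cp dot.docker-compose.env .docker-compose.env  # same for the compose env file
docker compose up
```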