Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go.
Execute the `./scripts/docker/start.sh` script and wait for all containers to start.

You can start the crawler in detached mode by passing `--detach` to `start.sh`.
Ensure that the image `dperson/torproxy:latest` is used in `docker-compose.yml` in `deployments/docker`:
```yaml
# torproxy:
#   image: torproxy:Dockerfile
#   logging:
#     driver: none
torproxy:
  image: dperson/torproxy:latest
  logging:
    driver: none
```
Alternatively, you can build the Tor proxy image locally with Tor bridges enabled. `cd build/tor-proxy/`, then edit the `torrc` file to add Tor bridges. Bridge configurations can be found in `tor-browser_en-US/Browser/TorBrowser/Data/Tor/torrc`.
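For illustration, a minimal bridge configuration in `torrc` could look like the sketch below. The addresses and fingerprints are placeholders, not working bridges; substitute real bridge lines taken from the Tor Browser `torrc` mentioned above.

```
# Sketch only: placeholder bridge lines, replace with real ones
# from your Tor Browser torrc
UseBridges 1
Bridge 192.0.2.1:443 0123456789ABCDEF0123456789ABCDEF01234567
Bridge 192.0.2.2:9001 89ABCDEF0123456789ABCDEF0123456789ABCDEF
```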
Execute `docker build -t "torproxy:Dockerfile" .` to build the image locally.
Then modify `docker-compose.yml` in `deployments/docker`:
```yaml
# replace dperson/torproxy with torproxy built locally from niruix/tor
torproxy:
  image: torproxy:Dockerfile
  logging:
    driver: none
# torproxy:
#   image: dperson/torproxy:latest
#   logging:
#     driver: none
```
```sh
./scripts/docker/start.sh
```
- You can start the crawler in detached mode by passing `--detach` to `start.sh`.
- Ensure you have at least 3 GB of memory, as the Elasticsearch container alone requires 2 GB.
Modify the `docker-compose.yml` file and replace the named volume with the path to the target folder:
```yaml
elasticsearch:
  image: elasticsearch:7.5.1
  logging:
    driver: none
  environment:
    - discovery.type=single-node
    - ES_JAVA_OPTS=-Xms2g -Xmx2g
  volumes:
    - /mnt/NAStor-universe/esdata:/usr/share/elasticsearch/data
```
One can use the RabbitMQ dashboard (available at `http://{hostname}:15672`) and publish a new JSON object to the `crawlingQueue`. The object should look like this:

```json
{"url": "http://torlinkbgs6aabns.onion/"}
```
Multiple URLs can be published automatically using `rabbitmqadmin`. Go to `http://{hostname}:15672/cli/rabbitmqadmin` to download `rabbitmqadmin`, then run `sudo chmod +x rabbitmqadmin` and `sudo cp rabbitmqadmin /usr/local/bin`.
Finally, run `./publish.sh` to publish the seed URLs.
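For reference, here is a minimal sketch of what such a publish script could look like. The `seeds.txt` file name and the default `guest`/`guest` credentials are assumptions for illustration, not part of the repository.

```sh
#!/bin/bash
# Sketch: publish each URL from seeds.txt (assumed file, one URL per
# line) to the crawlingQueue via the default exchange.
while read -r url; do
  rabbitmqadmin --username=guest --password=guest \
    publish exchange=amq.default routing_key=crawlingQueue \
    payload="{\"url\": \"${url}\"}"
done < seeds.txt
```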
If you want to speed up crawling, you can scale the crawling components to increase performance. This may be done by issuing the following command after the crawler is started:

```sh
./scripts/docker/start.sh --scale crawler=10 --scale indexer-es=2 --scale scheduler=4
```
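Since the components exchange work through RabbitMQ queues, each service can be scaled independently of the others; the counts above are only an example and should be tuned to the CPU and memory you have available.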
You can explore the indexed data with the Kibana dashboard. You will need to create an index pattern named `resources`; when asked for the time field, choose `time`.
To open a shell inside a running container:

```sh
docker exec -it <docker container name> bash
```

To kill all running containers:

```sh
docker container kill $(docker ps -q)
```
Install `elasticdump`.
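`elasticdump` is distributed as an npm package, so assuming Node.js and npm are available it can be installed globally:

```sh
# Requires Node.js; installs the elasticdump CLI globally
npm install -g elasticdump
```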
Then dump the `resources` index:

```sh
elasticdump --input=http://[elasticsearch-url]:9200/resources --output=[file_path]/universe.json --limit 500 --concurrency 20 --concurrencyInterval 1 --type=data --max-old-space-size=16384
```

For example:

```sh
elasticdump --input=http://172.18.0.3:9200/resources --output=/home/justin/Public/universe_data/universe-mar-26.json --limit 500 --concurrency 20 --concurrencyInterval 1 --type=data --max-old-space-size=16384
```
If you've made a change to one of the crawler components and wish to use the updated version when running `start.sh`, just issue the following command:

```sh
goreleaser --snapshot --skip-publish --rm-dist
```

This will rebuild all images using the local changes. After that, run `start.sh` again to have the updated version running.
If Elasticsearch puts the index into read-only mode (e.g. after running low on disk space), clear the `read_only_allow_delete` block. Example:

```
PUT _settings
{
  "index": {
    "blocks": {
      "read_only_allow_delete": "false"
    }
  }
}
```
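The same request can be sent with curl, assuming Elasticsearch is reachable at `localhost:9200` (adjust the host for your setup):

```sh
# Clear the read-only block on all indices
curl -X PUT "http://localhost:9200/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"blocks": {"read_only_allow_delete": "false"}}}'
```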
Run `universe-mining.ipynb` for general analysis and `classification.ipynb` for domain classification.
Install the `py-xgboost` dependency:

```sh
conda install -c anaconda py-xgboost
```
First, download the labelled darknet addresses provided in `DUTA_10K.xls` by GVIS.
```sh
cd page-downloader/
python3 downloader.py
```
The downloaded web pages are stored in `data/universe-labelled`.
To delete indexed resources matching a given URL, use `_delete_by_query`:

```
POST http://172.23.0.3:9200/v1/resources/_delete_by_query
{
  "query": {
    "match": {
      "url": "http://torlinkbgs6aabns.onion"
    }
  }
}
```

Or, equivalently, from the Kibana console:

```
POST /resources/_delete_by_query
{
  "query": {
    "match": {
      "url": "http://torlinkbgs6aabns.onion"
    }
  }
}
```
All classifiers are in the `classification` folder.