
Crawler failed to start crawling #169

Open
Amirthi opened this issue Aug 8, 2018 · 9 comments

Comments

@Amirthi

Amirthi commented Aug 8, 2018

I'm using docker-compose on Windows 10. When I run `docker-compose up`, everything works fine: Elasticsearch works, the DDT tool works, but the crawler doesn't. When I use deep crawling with a text file, it loads the URLs, but when I click Start Crawl I get "Failed to start the crawler". The log shows:

ache_focused_crawl | [2018-08-08 15:32:05,228] INFO [qtp1776560893-13] (MatcherFilter.java:153) - The requested route [/] has not been mapped in Spark for Accept: [text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8]
ache_focused_crawl | [2018-08-08 15:32:05,297] INFO [qtp1776560893-15] (MatcherFilter.java:153) - The requested route [/static/css/main.b679dd0a.css] has not been mapped in Spark for Accept: [text/css,*/*;q=0.1]
ache_focused_crawl | [2018-08-08 15:32:05,337] INFO [qtp1776560893-16] (MatcherFilter.java:153) - The requested route [/static/js/main.ce18de7a.js] has not been mapped in Spark for Accept: [*/*]
ache_focused_crawl | [2018-08-08 15:32:05,868] INFO [qtp1776560893-10] (MatcherFilter.java:153) - The requested route [/static/media/ache-logo.eb3a2cca.png] has not been mapped in Spark for Accept: [image/webp,image/apng,image/*,*/*;q=0.8]
ache_focused_crawl | [2018-08-08 15:32:07,622] INFO [qtp1776560893-13] (MatcherFilter.java:153) - The requested route [/static/media/glyphicons-halflings-regular.448c34a5.woff2] has not been mapped in Spark for Accept: [*/*]
ache_focused_crawl | [2018-08-08 15:32:13,898]ERROR [qtp1776560893-17] (CrawlerResource.java:110) - Failed to start crawler.
ache_focused_crawl | java.lang.RuntimeException: Failed to open database at /data/data_url/dir
ache_focused_crawl |    at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable.<init>(RocksDBHashtable.java:45)
ache_focused_crawl |    at focusedCrawler.util.persistence.PersistentHashtable.<init>(PersistentHashtable.java:75)
ache_focused_crawl |    at focusedCrawler.link.frontier.Frontier.<init>(Frontier.java:21)
ache_focused_crawl |    at focusedCrawler.link.frontier.FrontierManagerFactory.create(FrontierManagerFactory.java:30)
ache_focused_crawl |    at focusedCrawler.link.LinkStorage.create(LinkStorage.java:177)
ache_focused_crawl |    at focusedCrawler.crawler.async.AsyncCrawler.create(AsyncCrawler.java:111)
ache_focused_crawl |    at focusedCrawler.rest.resources.CrawlerResource.lambda$new$2(CrawlerResource.java:98)
ache_focused_crawl |    at focusedCrawler.rest.Transformers.lambda$json$1(Transformers.java:61)
ache_focused_crawl |    at spark.RouteImpl$1.handle(RouteImpl.java:61)
ache_focused_crawl |    at spark.http.matching.Routes.execute(Routes.java:61)
ache_focused_crawl |    at spark.http.matching.MatcherFilter.doFilter(MatcherFilter.java:127)
ache_focused_crawl |    at spark.embeddedserver.jetty.JettyHandler.doHandle(JettyHandler.java:50)
ache_focused_crawl |    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:189)
ache_focused_crawl |    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
ache_focused_crawl |    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
ache_focused_crawl |    at org.eclipse.jetty.server.Server.handle(Server.java:517)
ache_focused_crawl |    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
ache_focused_crawl |    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242)
ache_focused_crawl |    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:261)
ache_focused_crawl |    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
ache_focused_crawl |    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
ache_focused_crawl |    at java.lang.Thread.run(Thread.java:748)
ache_focused_crawl | Caused by: org.rocksdb.RocksDBException: IO error: directory: Invalid argument
ache_focused_crawl |    at org.rocksdb.RocksDB.open(Native Method)
ache_focused_crawl |    at org.rocksdb.RocksDB.open(RocksDB.java:184)
ache_focused_crawl |    at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable.<init>(RocksDBHashtable.java:43)
ache_focused_crawl |    ... 25 common frames omitted
@Amirthi
Author

Amirthi commented Aug 9, 2018

I tried to run the crawler from your Docker image with:
docker run -v e:\dev\crawler\ddt:/config -v e:\dev\crawler\ddt\data:/data -p 8080:8080 vidanyu/ache startCrawl -c /config/ -s /config/seeds.txt -o /data/

and again I get this error:


ACHE Crawler 0.12.0-SNAPSHOT

[2018-08-09 08:31:03,925]ERROR [main] (Main.java:260) - Crawler execution failed: Failed to open database at /data/default/data_url/dir

java.lang.RuntimeException: Failed to open database at /data/default/data_url/dir
at focusedCrawler.util.persistence.rocksdb.AbstractRocksDbHashtable.&lt;init&gt;(AbstractRocksDbHashtable.java:35)
at focusedCrawler.util.persistence.rocksdb.StringObjectHashtable.&lt;init&gt;(StringObjectHashtable.java:15)
at focusedCrawler.util.persistence.PersistentHashtable.&lt;init&gt;(PersistentHashtable.java:49)
at focusedCrawler.link.frontier.Frontier.&lt;init&gt;(Frontier.java:21)
at focusedCrawler.link.frontier.FrontierManagerFactory.create(FrontierManagerFactory.java:31)
at focusedCrawler.link.LinkStorage.create(LinkStorage.java:177)
at focusedCrawler.crawler.async.AsyncCrawler.create(AsyncCrawler.java:114)
at focusedCrawler.crawler.CrawlersManager.createCrawler(CrawlersManager.java:104)
at focusedCrawler.Main$StartCrawl.run(Main.java:246)
at focusedCrawler.Main.main(Main.java:59)
Caused by: org.rocksdb.RocksDBException: While fsync: a directory: Invalid argument
at org.rocksdb.RocksDB.open(Native Method)
at org.rocksdb.RocksDB.open(RocksDB.java:231)
at focusedCrawler.util.persistence.rocksdb.AbstractRocksDbHashtable.&lt;init&gt;(AbstractRocksDbHashtable.java:33)
... 9 common frames omitted
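One workaround worth trying (a sketch only, not verified on this setup; the compose service name is an assumption) is to keep /data on a Docker named volume instead of a Windows bind mount. The `Caused by: ... While fsync: a directory: Invalid argument` above suggests RocksDB cannot fsync a directory on a Windows-mounted filesystem, whereas named volumes live inside Docker's Linux VM, where this operation works:

```yaml
# docker-compose.yml sketch: /config stays a Windows bind mount (plain
# text files are unaffected), but /data moves to a named volume stored
# inside Docker's Linux VM, where RocksDB's directory fsync succeeds.
version: '2'
services:
  ache:
    image: vidanyu/ache
    command: startCrawl -c /config/ -s /config/seeds.txt -o /data/
    ports:
      - "8080:8080"
    volumes:
      - e:/dev/crawler/ddt:/config   # Windows path, forward slashes
      - ache-data:/data              # named volume, not a bind mount
volumes:
  ache-data:
```

Note that the crawl data then lives inside the volume rather than on the Windows filesystem; if needed, it can be copied out of the running container with `docker cp`.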

@Amirthi
Author

Amirthi commented Aug 11, 2018

After 2 days I tried some workarounds. Docker is working fine and the container is running, but the error I get from the crawler is that it can't access the database in the container itself:
root@58e8c682cdf3:/data/data_url/dir# ls
000036.log LOG LOG.old.1533800634667721 LOG.old.1533802779848934 MANIFEST-000035
CURRENT LOG.old.1533796928836627 LOG.old.1533800720793448 LOG.old.1533802912574826
IDENTITY LOG.old.1533797098378344 LOG.old.1533801326613958 LOG.old.1533964710469742
LOCK LOG.old.1533797216544528 LOG.old.1533802676289078 LOG.old.1533965187477155

These are the files inside the ACHE container, and I don't know why it can't access the database. The error is:
[2018-08-11 05:26:27,606]ERROR [qtp1356930131-12] (CrawlerResource.java:110) - Failed to start crawler.
ache_deep_crawl | java.lang.RuntimeException: Failed to open database at /data/data_url/dir

@aecio aecio added the windows label Aug 28, 2018
@aecio
Member

aecio commented Aug 28, 2018

Sorry for the long delay in responding. I have just seen other people running into this same problem on Windows. It is related to the underlying RocksDB database engine that the crawler uses; apparently, it doesn't work with Windows filesystems mounted into Docker.
I'm not sure if there is anything we can do to fix this right now.

@aecio aecio added the bug label Aug 28, 2018
@suvarnajadhav

Hello,
I tried Docker with Elasticsearch on Windows 10, but I get the error "Failed to start crawler. java.lang.RuntimeException: Failed to check whether index already exists in Elasticsearch." On Ubuntu it works fine.

ache_1 | [2019-06-25 08:54:37,282]ERROR [qtp1550068122-12] (CrawlerResource.java:85) - Failed to start crawler.
ache_1 | java.lang.RuntimeException: Failed to check whether index already exists in Elasticsearch.
ache_1 | at focusedCrawler.target.repository.ElasticSearchRestTargetRepository.createIndexMapping(ElasticSearchRestTargetRepository.java:65)
ache_1 | at focusedCrawler.target.repository.ElasticSearchRestTargetRepository.&lt;init&gt;(ElasticSearchRestTargetRepository.java:54)
ache_1 | at focusedCrawler.target.TargetRepositoryFactory.createRepository(TargetRepositoryFactory.java:87)
ache_1 | at focusedCrawler.target.TargetRepositoryFactory.create(TargetRepositoryFactory.java:34)
ache_1 | at focusedCrawler.target.TargetStorage.create(TargetStorage.java:131)
ache_1 | at focusedCrawler.crawler.async.AsyncCrawler.create(AsyncCrawler.java:117)
ache_1 | at focusedCrawler.crawler.CrawlersManager.createCrawler(CrawlersManager.java:104)
ache_1 | at focusedCrawler.crawler.CrawlersManager.createCrawler(CrawlersManager.java:89)

Can you please help me?

@aecio
Member

aecio commented Jun 26, 2019 via email

@suvarnajadhav

Thank you for the reply.
On Windows I installed the elasticsearch-7.1.1 MSI, and in ache.yml I connect to http://127.0.0.1:9200.
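If ACHE is running inside Docker while Elasticsearch runs directly on Windows, this address may itself be the problem: inside the container, 127.0.0.1 refers to the container, not to the Windows host. Docker for Windows exposes the host machine under the special name host.docker.internal. A sketch of the change (the exact ache.yml key is an assumption; mirror whatever key currently holds the URL in your file):

```yaml
# ache.yml sketch: point the Elasticsearch REST host at the Docker
# host instead of the container's own loopback address.
target_storage.data_format.type: ELASTICSEARCH
target_storage.data_format.elasticsearch.rest.hosts:
  - http://host.docker.internal:9200
```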

@suvarnajadhav

On Windows, if I point it at an Ubuntu Elasticsearch instance it works fine, but the problem occurs with a Windows Elasticsearch instance. Can you please tell me what the issue could be?

@aecio
Member

aecio commented Jun 30, 2019

We usually don't test or support running the crawler on Windows. Also, without more detailed error logs, it is hard to know what is happening.

@suvarnajadhav

The crawler is supported on Windows, but the problem occurs when connecting to a Windows Elasticsearch instance. It currently works on Windows when connected to an Ubuntu Elasticsearch instance, but I want Elasticsearch on Windows.
The error log says: "Failed to check whether index already exists in Elasticsearch."
