
Crawler failed to start crawling #169

Open
Amirthi opened this issue Aug 8, 2018 · 9 comments

Comments

@Amirthi

Amirthi commented Aug 8, 2018

I'm using docker-compose on Windows 10. When I run `docker-compose up`, everything works fine: Elasticsearch works, the DDT tool works, but the crawler doesn't. When I use deep crawling with a text file, it loads the URLs, but when I click Start Crawl I get "Failed to start the crawler". The log shows:

ache_focused_crawl | [2018-08-08 15:32:05,228] INFO [qtp1776560893-13] (MatcherFilter.java:153) - The requested route [/] has not been mapped in Spark for Accept: [text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8]
ache_focused_crawl | [2018-08-08 15:32:05,297] INFO [qtp1776560893-15] (MatcherFilter.java:153) - The requested route [/static/css/main.b679dd0a.css] has not been mapped in Spark for Accept: [text/css,*/*;q=0.1]
ache_focused_crawl | [2018-08-08 15:32:05,337] INFO [qtp1776560893-16] (MatcherFilter.java:153) - The requested route [/static/js/main.ce18de7a.js] has not been mapped in Spark for Accept: [*/*]
ache_focused_crawl | [2018-08-08 15:32:05,868] INFO [qtp1776560893-10] (MatcherFilter.java:153) - The requested route [/static/media/ache-logo.eb3a2cca.png] has not been mapped in Spark for Accept: [image/webp,image/apng,image/*,*/*;q=0.8]
ache_focused_crawl | [2018-08-08 15:32:07,622] INFO [qtp1776560893-13] (MatcherFilter.java:153) - The requested route [/static/media/glyphicons-halflings-regular.448c34a5.woff2] has not been mapped in Spark for Accept: [*/*]
ache_focused_crawl | [2018-08-08 15:32:13,898]ERROR [qtp1776560893-17] (CrawlerResource.java:110) - Failed to start crawler.
ache_focused_crawl | java.lang.RuntimeException: Failed to open database at /data/data_url/dir
ache_focused_crawl |    at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable.<init>(RocksDBHashtable.java:45)
ache_focused_crawl |    at focusedCrawler.util.persistence.PersistentHashtable.<init>(PersistentHashtable.java:75)
ache_focused_crawl |    at focusedCrawler.link.frontier.Frontier.<init>(Frontier.java:21)
ache_focused_crawl |    at focusedCrawler.link.frontier.FrontierManagerFactory.create(FrontierManagerFactory.java:30)
ache_focused_crawl |    at focusedCrawler.link.LinkStorage.create(LinkStorage.java:177)
ache_focused_crawl |    at focusedCrawler.crawler.async.AsyncCrawler.create(AsyncCrawler.java:111)
ache_focused_crawl |    at focusedCrawler.rest.resources.CrawlerResource.lambda$new$2(CrawlerResource.java:98)
ache_focused_crawl |    at focusedCrawler.rest.Transformers.lambda$json$1(Transformers.java:61)
ache_focused_crawl |    at spark.RouteImpl$1.handle(RouteImpl.java:61)
ache_focused_crawl |    at spark.http.matching.Routes.execute(Routes.java:61)
ache_focused_crawl |    at spark.http.matching.MatcherFilter.doFilter(MatcherFilter.java:127)
ache_focused_crawl |    at spark.embeddedserver.jetty.JettyHandler.doHandle(JettyHandler.java:50)
ache_focused_crawl |    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:189)
ache_focused_crawl |    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
ache_focused_crawl |    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
ache_focused_crawl |    at org.eclipse.jetty.server.Server.handle(Server.java:517)
ache_focused_crawl |    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
ache_focused_crawl |    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242)
ache_focused_crawl |    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:261)
ache_focused_crawl |    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
ache_focused_crawl |    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
ache_focused_crawl |    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
ache_focused_crawl |    at java.lang.Thread.run(Thread.java:748)
ache_focused_crawl | Caused by: org.rocksdb.RocksDBException: IO error: directory: Invalid argument
ache_focused_crawl |    at org.rocksdb.RocksDB.open(Native Method)
ache_focused_crawl |    at org.rocksdb.RocksDB.open(RocksDB.java:184)
ache_focused_crawl |    at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable.<init>(RocksDBHashtable.java:43)
ache_focused_crawl |    ... 25 common frames omitted
@Amirthi
Author

Amirthi commented Aug 9, 2018

I tried to run the crawler from your Docker image with:
docker run -v e:\dev\crawler\ddt:/config -v e:\dev\crawler\ddt\data:/data -p 8080:8080 vidanyu/ache startCrawl -c /config/ -s /config/seeds.txt -o /data/

and again I get this error:


ACHE Crawler 0.12.0-SNAPSHOT

[2018-08-09 08:31:03,925]ERROR [main] (Main.java:260) - Crawler execution failed: Failed to open database at /data/default/data_url/dir

java.lang.RuntimeException: Failed to open database at /data/default/data_url/dir
at focusedCrawler.util.persistence.rocksdb.AbstractRocksDbHashtable.&lt;init&gt;(AbstractRocksDbHashtable.java:35)
at focusedCrawler.util.persistence.rocksdb.StringObjectHashtable.&lt;init&gt;(StringObjectHashtable.java:15)
at focusedCrawler.util.persistence.PersistentHashtable.&lt;init&gt;(PersistentHashtable.java:49)
at focusedCrawler.link.frontier.Frontier.&lt;init&gt;(Frontier.java:21)
at focusedCrawler.link.frontier.FrontierManagerFactory.create(FrontierManagerFactory.java:31)
at focusedCrawler.link.LinkStorage.create(LinkStorage.java:177)
at focusedCrawler.crawler.async.AsyncCrawler.create(AsyncCrawler.java:114)
at focusedCrawler.crawler.CrawlersManager.createCrawler(CrawlersManager.java:104)
at focusedCrawler.Main$StartCrawl.run(Main.java:246)
at focusedCrawler.Main.main(Main.java:59)
Caused by: org.rocksdb.RocksDBException: While fsync: a directory: Invalid argument
at org.rocksdb.RocksDB.open(Native Method)
at org.rocksdb.RocksDB.open(RocksDB.java:231)
at focusedCrawler.util.persistence.rocksdb.AbstractRocksDbHashtable.&lt;init&gt;(AbstractRocksDbHashtable.java:33)
... 9 common frames omitted
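One workaround worth trying (a sketch only, not verified on this setup; the compose service name is an assumption) is to keep /data on a Docker named volume instead of a Windows bind mount. The `Caused by: ... While fsync: a directory: Invalid argument` above suggests RocksDB cannot fsync a directory on a Windows-mounted filesystem, whereas named volumes live inside Docker's Linux VM, where this operation works:

```yaml
# docker-compose.yml sketch: /config stays a Windows bind mount (plain
# text files are unaffected), but /data moves to a named volume stored
# inside Docker's Linux VM, where RocksDB's directory fsync succeeds.
version: '2'
services:
  ache:
    image: vidanyu/ache
    command: startCrawl -c /config/ -s /config/seeds.txt -o /data/
    ports:
      - "8080:8080"
    volumes:
      - e:/dev/crawler/ddt:/config   # Windows path, forward slashes
      - ache-data:/data              # named volume, not a bind mount
volumes:
  ache-data:
```

Note that the crawl data then lives inside the volume rather than on the Windows filesystem; if needed, it can be copied out of the running container with `docker cp`.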

@Amirthi
Author

Amirthi commented Aug 11, 2018

After 2 days I tried some workarounds. Docker is working fine and the container is running, but the error I get from the crawler is that it can't access the database in the container itself:
root@58e8c682cdf3:/data/data_url/dir# ls
000036.log LOG LOG.old.1533800634667721 LOG.old.1533802779848934 MANIFEST-000035
CURRENT LOG.old.1533796928836627 LOG.old.1533800720793448 LOG.old.1533802912574826
IDENTITY LOG.old.1533797098378344 LOG.old.1533801326613958 LOG.old.1533964710469742
LOCK LOG.old.1533797216544528 LOG.old.1533802676289078 LOG.old.1533965187477155

These are the files inside the ACHE container, and I don't know why it can't access the database. The error is:
[2018-08-11 05:26:27,606]ERROR [qtp1356930131-12] (CrawlerResource.java:110) - Failed to start crawler.
ache_deep_crawl | java.lang.RuntimeException: Failed to open database at /data/data_url/dir

@aecio aecio added the windows label Aug 28, 2018
@aecio
Member

aecio commented Aug 28, 2018

Sorry for the long delay in responding. I have just seen other people running into this same problem on Windows. It is related to the underlying RocksDB database engine that the crawler uses; apparently, it doesn't work with Windows filesystems mounted into Docker.
I'm not sure if there is anything we can do to fix this right now.

@aecio aecio added the bug label Aug 28, 2018
@suvarnajadhav

Hello,
I tried Docker with Elasticsearch on Windows 10, but I get the error "Failed to start crawler. java.lang.RuntimeException: Failed to check whether index already exists in Elasticsearch." On Ubuntu it works fine.

ache_1 | [2019-06-25 08:54:37,282]ERROR [qtp1550068122-12] (CrawlerResource.java:85) - Failed to start crawler.
ache_1 | java.lang.RuntimeException: Failed to check whether index already exists in Elasticsearch.
ache_1 | at focusedCrawler.target.repository.ElasticSearchRestTargetRepository.createIndexMapping(ElasticSearchRestTargetRepository.java:65)
ache_1 | at focusedCrawler.target.repository.ElasticSearchRestTargetRepository.&lt;init&gt;(ElasticSearchRestTargetRepository.java:54)
ache_1 | at focusedCrawler.target.TargetRepositoryFactory.createRepository(TargetRepositoryFactory.java:87)
ache_1 | at focusedCrawler.target.TargetRepositoryFactory.create(TargetRepositoryFactory.java:34)
ache_1 | at focusedCrawler.target.TargetStorage.create(TargetStorage.java:131)
ache_1 | at focusedCrawler.crawler.async.AsyncCrawler.create(AsyncCrawler.java:117)
ache_1 | at focusedCrawler.crawler.CrawlersManager.createCrawler(CrawlersManager.java:104)
ache_1 | at focusedCrawler.crawler.CrawlersManager.createCrawler(CrawlersManager.java:89)

Can you please help me?

@aecio
Member

aecio commented Jun 26, 2019 via email

@suvarnajadhav

Thank you for the reply.
On Windows I installed the elasticsearch-7.1.1 MSI, and in ache.yml I connect to http://127.0.0.1:9200.
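If ACHE is running inside Docker while Elasticsearch runs directly on Windows, this address may itself be the problem: inside the container, 127.0.0.1 refers to the container, not to the Windows host. Docker for Windows exposes the host machine under the special name host.docker.internal. A sketch of the change (the exact ache.yml key is an assumption; mirror whatever key currently holds the URL in your file):

```yaml
# ache.yml sketch: point the Elasticsearch REST host at the Docker
# host instead of the container's own loopback address.
target_storage.data_format.type: ELASTICSEARCH
target_storage.data_format.elasticsearch.rest.hosts:
  - http://host.docker.internal:9200
```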

@suvarnajadhav

On Windows, if I point it at an Ubuntu Elasticsearch instance it works fine, but the problem occurs with a Windows Elasticsearch instance. Can you please tell me what the issue could be?

@aecio
Member

aecio commented Jun 30, 2019

We usually don't test or support running the crawler on Windows. Also, without more detailed error logs, it is hard to know what is happening.

@suvarnajadhav

The crawler is supported on Windows, but the problem occurs when connecting to a Windows Elasticsearch instance. It currently works on Windows when connected to an Ubuntu Elasticsearch instance, but I want Elasticsearch on Windows.
The error log says: "Failed to check whether index already exists in Elasticsearch."
