The Problem:
The PathFinder/PathTracker components responsible for building the "path" navigation across web links from page to page starting from the "root site URL" (rootPath) have two issues:
Redundant path entries are sometimes formed (which causes over-consumption of memory and CPU cycles)
Empty path entries are sometimes formed (which causes exceptions like the following):
Fri Jun 05 13:47:30 UTC 2020:Site crawling failed unknown https://blog.wechat.com/category/news/ java.lang.ArrayIndexOutOfBoundsException: 0,:0
java.lang.ArrayIndexOutOfBoundsException: 0
at net.webstructor.al.Set.get(Set.java:35)
at net.webstructor.self.PathTracker.run(PathTracker.java:136)
at net.webstructor.self.PathTracker.run(PathTracker.java:110)
at net.webstructor.self.PathTracker.run(PathTracker.java:96)
at net.webstructor.self.PathTracker.run(PathTracker.java:58)
at net.webstructor.self.WebCrawler.crawl(WebCrawler.java:66)
at net.webstructor.self.Siter.read(Siter.java:171)
at net.webstructor.self.Spider$1.call(Spider.java:191)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
We need to solve both.
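One possible direction for both issues is to normalize the path entries before they are stored or replayed: drop empty entries and collapse exact duplicates. The sketch below is only an illustration under stated assumptions. `PathCleaner` and `clean` are hypothetical names that do not exist in the `net.webstructor` code base, and a path entry is modeled here as a plain list of link labels, which may differ from how `PathTracker` actually represents it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper for illustration; not part of net.webstructor.
final class PathCleaner {
    private PathCleaner() {}

    /**
     * Returns a copy of the given path entries with null/empty entries
     * removed and exact duplicates collapsed, keeping first-seen order.
     * Assumption: an entry is a list of link labels.
     */
    static List<List<String>> clean(List<List<String>> entries) {
        // LinkedHashSet keeps insertion order and relies on List.equals,
        // which compares lists element by element.
        LinkedHashSet<List<String>> unique = new LinkedHashSet<>();
        if (entries == null) {
            return new ArrayList<>();
        }
        for (List<String> entry : entries) {
            if (entry == null || entry.isEmpty()) {
                continue; // an empty entry would later fail on get(0)
            }
            unique.add(new ArrayList<>(entry)); // defensive copy
        }
        return new ArrayList<>(unique);
    }
}
```

With such a filter in place, code that reads the first element of an entry would never see an empty one, and repeated entries would be stored only once.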
Extra:
In addition, each of the "sites" configured for crawling may have a "crawl mode" option (SMART|FIND|TRACK) set to something other than the default "SMART". In "TRACK" mode, the "path" cannot be modified and is always re-used as configured manually; in "FIND" mode, the "path" is never used, so the exhaustive crawl applies every time.
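The three modes could be sketched as a small enum. This is a hypothetical illustration, not the project's actual configuration API: the `CrawlMode` type and its methods `parse`, `usesStoredPath` and `mayUpdatePath` are assumed names. Whether "FIND" mode should still write a newly discovered path is not stated above, so the sketch assumes that only "SMART" may modify it.

```java
import java.util.Locale;

// Hypothetical enum for illustration; names are assumptions.
enum CrawlMode {
    SMART, FIND, TRACK;

    /**
     * Parses a configured value. Missing or blank input falls back to the
     * default SMART; an unknown value makes valueOf throw
     * IllegalArgumentException.
     */
    static CrawlMode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return SMART;
        }
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** FIND never uses the stored path and always crawls exhaustively. */
    boolean usesStoredPath() {
        return this != FIND;
    }

    /**
     * TRACK keeps the manually configured path untouched.
     * Assumption: FIND does not write the path either.
     */
    boolean mayUpdatePath() {
        return this == SMART;
    }
}
```

A crawler could then branch on `mode.usesStoredPath()` before replaying a path and on `mode.mayUpdatePath()` before saving one.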