Web paths formation improvements #23

akolonin · 2020-06-06T12:38:01Z

The Problem:
The PathFinder/PathTracker components, which build the navigation "path" across web links from page to page starting from the root site URL (rootPath), have two issues:

1. Redundant path entries are sometimes formed (which wastes memory and CPU cycles).
2. Empty path entries are sometimes formed (which causes exceptions like the following):

Fri Jun 05 13:47:30 UTC 2020:Site crawling failed unknown https://blog.wechat.com/category/news/ java.lang.ArrayIndexOutOfBoundsException: 0,:0
java.lang.ArrayIndexOutOfBoundsException: 0
        at net.webstructor.al.Set.get(Set.java:35)
        at net.webstructor.self.PathTracker.run(PathTracker.java:136)
        at net.webstructor.self.PathTracker.run(PathTracker.java:110)
        at net.webstructor.self.PathTracker.run(PathTracker.java:96)
        at net.webstructor.self.PathTracker.run(PathTracker.java:58)
        at net.webstructor.self.WebCrawler.crawl(WebCrawler.java:66)
        at net.webstructor.self.Siter.read(Siter.java:171)
        at net.webstructor.self.Spider$1.call(Spider.java:191)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

We need to solve both.
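
Both problems can be illustrated with a minimal sketch. It assumes, purely for illustration, that each path entry can be represented as a list of strings; the real PathFinder/PathTracker data structures (such as `net.webstructor.al.Set`) are not shown in this issue, and `PathEntries.clean` is a hypothetical helper, not the actual fix. The idea is to drop empty entries before anything calls `get(0)` on them and to collapse exact duplicates while preserving order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the actual PathFinder/PathTracker code.
final class PathEntries {
    private PathEntries() {}

    /** Drops null/empty entries and exact duplicates, keeping first-seen order. */
    static List<List<String>> clean(List<List<String>> candidates) {
        // LinkedHashSet removes duplicates while preserving insertion order.
        Set<List<String>> seen = new LinkedHashSet<>();
        for (List<String> entry : candidates) {
            if (entry == null || entry.isEmpty()) {
                continue; // an empty entry would later fail on get(0)
            }
            // List.equals/hashCode compare element-wise, so equal entries collapse.
            seen.add(new ArrayList<>(entry));
        }
        return new ArrayList<>(seen);
    }
}
```

Filtering at the point where entries are formed, rather than guarding every later read, would keep the consuming code free of emptiness checks, though where exactly the real fix belongs depends on code not shown here.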

Extra:
In addition, each "site" configured for crawling may have a "crawl mode" option (SMART|FIND|TRACK) set to something other than the default "SMART": in "TRACK" mode the manually configured "path" cannot be modified and is always reused, while in "FIND" mode the "path" is never used, so an exhaustive crawl is performed every time.
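
The three modes could be modeled roughly as follows. This is a sketch under assumptions: only the names SMART, FIND and TRACK come from this issue; the `CrawlMode` enum, its methods, and the fallback to SMART for missing or unknown values are hypothetical. The issue also does not say whether FIND mode still records a path, so the sketch assumes it does not.

```java
import java.util.Locale;

// Hypothetical sketch; only the names SMART, FIND and TRACK come from the issue.
enum CrawlMode {
    SMART, // default: reuse the stored path and allow it to be updated
    FIND,  // never use the stored path; crawl exhaustively every time
    TRACK; // always reuse the manually configured path; never modify it

    /** Parses a configured value, falling back to SMART (an assumption). */
    static CrawlMode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return SMART;
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return SMART; // unknown value: use the default mode
        }
    }

    /** True when the crawler should follow the stored path. */
    boolean usesStoredPath() {
        return this != FIND;
    }

    /** True when the crawler may rewrite the stored path (assumed SMART only). */
    boolean mayModifyPath() {
        return this == SMART;
    }
}
```

For example, `CrawlMode.parse("track").mayModifyPath()` is false, so a manually configured path would be left untouched.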



akolonin · 2020-06-08T03:54:59Z

Issues 1 & 2 are assumed fixed; keep testing...

akolonin self-assigned this Jun 6, 2020

akolonin added the bug, enhancement, and progress labels and removed the progress label Jun 6, 2020

akolonin changed the title ~~Web paths formation improvments~~ Web paths formation improvements Jun 6, 2020

akolonin added a commit that referenced this issue Jun 8, 2020

2.8.3 Fix crawl path formation #23; post-Siter-refactoring fixes and …

f595a5d

…debugging

akolonin added the testing label and removed the progress label Jun 8, 2020
