
null byte issue #474

Open
kijung-iM opened this issue Jun 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

kijung-iM commented Jun 17, 2024

Description
Null byte characters are inserted into HTML pages generated by Docusaurus when the site language is CJK. This problem is also registered as an issue in the Docusaurus repository.

When I scrape such a page with docs-scraper, nothing gets scraped at all. Logic to strip the null byte characters is required.

Example site (scraper output):

Docs-Scraper: https://docs.whatap.io/java/agent-load-amount 0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-dbsql 0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-apdex 0 records)
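
For context, this failure mode is consistent with lxml (which Scrapy/parsel use to build the DOM) rejecting unicode input that contains NULL bytes. A minimal sketch, assuming lxml is installed:

from lxml import etree

html = '<p>hello\u0000world</p>'
try:
    etree.fromstring(html)
except ValueError as exc:
    # lxml refuses unicode strings containing NULL bytes
    print(exc)

# Once the NULL byte is stripped, the same document parses fine.
print(etree.tostring(etree.fromstring(html.replace('\u0000', ''))))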

I worked around the problem by modifying the files as shown below. Please use this as a reference and fix it properly.

documentation_spider.py:162

def parse_from_sitemap(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)
    
    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if (not self.force_sitemap_urls_crawling) and (
            not self.is_rules_compliant(response)):
        print("\033[94m> Ignored from sitemap:\033[0m " + response.url)
    else:
        # self.add_records(response, from_sitemap=True)
        self.add_records(response.replace(body=response_text), from_sitemap=True)
        # We don't return self.parse(response) in order to avoid crawling those web pages

def parse_from_start_url(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)

    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if self.is_rules_compliant(response):
        # self.add_records(response, from_sitemap=False)
        self.add_records(response.replace(body=response_text), from_sitemap=False)
    else:
        print("\033[94m> Ignored: from start url\033[0m " + response.url)

    # return self.parse(response)
    return self.parse(response.replace(body=response_text))
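
As a possible refactor (just a sketch, not tested against the codebase), the duplicated cleanup could live in a single helper; Scrapy's TextResponse.replace accepts a str body and re-encodes it with the response's declared encoding:

def _strip_null_bytes(response):
    # Return a copy of the response whose body no longer contains NULL bytes.
    return response.replace(body=response.text.replace('\u0000', ''))

Both callbacks could then pass _strip_null_bytes(response) to add_records and parse.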

custom_downloader_middleware.py:37

# body = self.driver.page_source.encode('utf-8')
# remove null byte
body = self.driver.page_source.replace('\u0000', '')
body = body.encode('utf-8')  # UTF-8 encoding
url = self.driver.current_url
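
Note that the order matters here: the NULL character survives UTF-8 encoding as the byte 0x00, so it has to be removed from the page source string before encoding (or from the bytes afterwards via bytes.replace). A quick illustration:

page_source = '<p>hello\u0000</p>'
assert '\u0000'.encode('utf-8') == b'\x00'
body = page_source.replace('\u0000', '').encode('utf-8')  # strip first, then encode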

default_strategy.py:37

if self._body_contains_stop_content(response):
    return []

# remove null byte
cleaned_body = response.text.replace('\u0000', '')

self.dom = self.get_dom(response.replace(body=cleaned_body.encode('utf-8')))
self.dom = self.remove_from_dom(self.dom, self.config.selectors_exclude)

records = self.get_records_from_dom(response.url)
return records
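
A quick way to sanity-check the fix (a sketch assuming parsel, which Scrapy depends on, is importable): once the NULL byte is stripped, selectors extract text again:

from parsel import Selector

raw = '<html><body><p>안녕\u0000하세요</p></body></html>'
clean = raw.replace('\u0000', '')
assert Selector(text=clean).css('p::text').get() == '안녕하세요'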

curquiza added the bug label Aug 8, 2024

tats-u commented Oct 1, 2024

Issue in Docusaurus: facebook/docusaurus#9985


tats-u commented Oct 1, 2024

Possibly related to scrapy/parsel#123
