Crawler does not seem to work on websites that use shadowDOM #552
Comments
Hi, it indeed doesn't seem possible to access the shadow DOM via query selectors. As long as you can run a query selector for something from the console, our scraper will be able to get it, so you will be able to use DocSearch!
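To see why this matters for the crawler, here is a minimal sketch (assuming an open shadow root on the `gem-book` element mentioned in this thread) of how document-level selectors stop at the shadow boundary:

```js
// Elements inside a shadow root are invisible to document-level queries.
document.querySelector('gem-book');          // found: the host element is in the light DOM
document.querySelector('gem-book a[href]');  // null: selectors do not pierce the shadow boundary
document.querySelector('gem-book')           // open shadow roots can still be queried directly
  ?.shadowRoot?.querySelector('a[href]');
```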
The content of a shadow DOM cannot be selected through a CSS selector or XPath. To select shadow DOM content with something like a CSS selector, the selector syntax needs to be extended, for example by using `>>` to cross shadow boundaries:

```js
'body gem-book >> gem-book-sidebar >> gem-active-link >> a[href]'.split('>>').reduce(
  (p, c, index, arr) => {
    const isLastSelector = index === arr.length - 1;
    return p.map((e) => [...e.querySelectorAll(c)].map((ce) => (isLastSelector ? ce : ce.shadowRoot))).flat();
  },
  [document],
);
```

This is also an example that can be run directly in the browser.
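To make the `>>` pattern above reusable, here is a hedged sketch of a helper (the name `deepQuerySelectorAll` is hypothetical, not part of DocSearch) that resolves each segment against the previous matches' open shadow roots:

```js
function deepQuerySelectorAll(selector, root = document) {
  return selector.split('>>').reduce(
    (scopes, segment, index, segments) => {
      const isLast = index === segments.length - 1;
      return scopes
        .filter(Boolean) // skip hosts whose shadow root is closed or missing
        .flatMap((scope) => [...scope.querySelectorAll(segment.trim())]
          .map((el) => (isLast ? el : el.shadowRoot)));
    },
    [root],
  );
}

// Usage with the selector from the comment above:
const links = deepQuerySelectorAll(
  'body gem-book >> gem-book-sidebar >> gem-active-link >> a[href]',
);
```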
Hi, I viewed the source code today and found that only a small update is needed to support shadow DOM. A custom downloader can be used to pull the whole DOM:

```python
# pseudocode (Selenium)
driver.execute_script("return document.documentElement.getInnerHTML();")
```

This will return a result like:

```html
<head>...</head>
<body>
  <gem-book>
    <template shadowroot="open">
      ... content
    </template>
  </gem-book>
</body>
```
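For context on the snippet above: `getInnerHTML()` was an experimental Chromium API, and serializing open shadow roots into declarative `<template shadowroot>` markup requires passing an option. A minimal browser-side sketch, assuming a Chromium build that still ships the API:

```js
// Serialize the page including open shadow roots as declarative templates.
const html = document.documentElement.getInnerHTML({ includeShadowRoots: true });
// `html` now contains <template shadowroot="open"> blocks like the output above,
// so a crawler can parse the shadow content with an ordinary HTML parser.
```

Newer Chromium versions replace this with `getHTML({ serializableShadowRoots: true })`.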
Hello Algolia Devs,
I tried to add a search function to my website, but I got the reply that no content could be crawled. Is it because my website uses shadow DOM?
Thank you.