-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Select inside <noscript> #123
Comments
The parsed result is: Html {
errors: [],
quirks_mode: NoQuirks,
tree: Tree { Fragment => { Element(<html>) => { Element(<noscript>) => { Text(Tendril<UTF8>(owned: "<h1>Hello, world!</h1>")) } } } },
} It seems that the content in the script will be treated as |
Might this be an issue in how we configure html5ever? |
https://docs.rs/html5ever/0.26.0/html5ever/tree_builder/struct.TreeBuilderOpts.html#structfield.scripting_enabled, probably. The parse_* functions would need to take an extra argument though. |
use scraper::{Html, Selector};
fn main() {
// Parse html
let fragment = Html::parse_fragment("<noscript><h1>Hello, world!</h1></noscript>");
// Extract noscript inner html
let noscript_sel = Selector::parse("noscript").unwrap();
let noscripts: Vec<Html> = fragment
.select(&noscript_sel)
.filter_map(|elem| {
elem.html()
.strip_prefix("<noscript>")
.and_then(|str| str.strip_suffix("</noscript>"))
.map(String::from)
})
.map(|frag| Html::parse_fragment(&frag))
.collect();
let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");
// Extract h1 from it
let h1_sel = Selector::parse("h1").unwrap();
let h1 = noscript.select(&h1_sel).next().unwrap();
// Prove that it works
assert_eq!("<h1>Hello, world!</h1>", h1.html());
} EDIT: For this solution to work, it is needed to have scripting enabled, otherwise the |
Disabling the scripting_enabled flag definitely fixes the given example (just tested), causing a tree to be built for the contents of the noscript node instead of treating it as text. Also, your example doesn't seem to work, as it escapes the html. Instead of trying to serialize the node to html and extract the inner text, you should just probably just extract the text directly. I still think it would be nice to have a config option for disabling scripting, as I'd imagine there would be some cases where parse_fragment would fail while parsing the elements in the context of the page would succeed. |
Interesting, I must have done something wrong, since it didn't work when I first tried it. Thanks for the contribution.
My example does work if you enable scripting, but if you say that by doing this, you can directly parse the content inside the |
With your example, I get:
Cargo.toml: [package]
name = "scraper-test"
version = "0.1.0"
edition = "2021"
[dependencies]
scraper = "0.20.0" Changing
The html function appears to escape the inner text. I think you meant to access the inner text directly: fn main() {
let fragment = Html::parse_document("<noscript><h1>Hello, world!</h1>");
let noscript_sel = Selector::parse("noscript").unwrap();
let noscripts: Vec<Html> = fragment
.select(&noscript_sel)
.filter_map(|elem| {
elem.text().next()
})
.map(|frag| Html::parse_fragment(&frag))
.collect();
let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");
// Extract h1 from it
let h1_sel = Selector::parse("h1").unwrap();
let h1 = noscript.select(&h1_sel).next().unwrap();
// Prove that it works
assert_eq!("<h1>Hello, world!</h1>", h1.html());
} But yeah, I agree that adding a way to change the |
I want to restate that my code works only if scripting is enabled, which is not the case if you used the default version of scraper which you provided in the Cargo.toml. I agree that you code works even without this hypothesis, since it extracts raw text. I agree that enabling scripting allows to handle this case better. When I tried, it did not parse the html inside the |
That shouldn't be the case. html5ever sets this to true by default, and scraper just uses the default settings. In fact, all I had to do to get this working is disable the flag: pub fn parse_document(document: &str) -> Self {
let parser = driver::parse_document(
Self::new_document(),
driver::ParseOpts {
tree_builder: TreeBuilderOpts {
scripting_enabled: false,
..Default::default()
},
..Default::default()
},
);
parser.one(document)
}
/// Parses a string of HTML as a fragment.
pub fn parse_fragment(fragment: &str) -> Self {
let parser = driver::parse_fragment(
Self::new_fragment(),
driver::ParseOpts {
tree_builder: TreeBuilderOpts {
scripting_enabled: false,
..Default::default()
},
..Default::default()
},
QualName::new(None, ns!(html), local_name!("body")),
Vec::new(),
);
parser.one(fragment)
} Furthermore, I don't really understand how enabling scripting changes the output format of the html method.
Yeah, I guess that might be easier than adding an option to control it. Is there ever a scenario where someone would want to treat the interior of a |
Elements within a
<noscript>
tag seem to be ignored by default. Is there a workaround?Example
current output:
The text was updated successfully, but these errors were encountered: