The nodecontent function changes the encoding format of wide characters when they are processed, resulting in a garbled display. #184

Open
deahhh opened this issue Oct 10, 2023 · 2 comments

deahhh commented Oct 10, 2023

using EzXML

doc = EzXML.parsehtml("<body><p>hello</p><p>中国</p><p>深圳</p></body>")

primates = root(doc)

for p in eachelement(primates)
    println(nodecontent(p))
end
julia draft.jl

Output:
hello中åæ·±å³
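The `æ·±` fragment in that output is what the three UTF-8 bytes of `深` (E6 B7 B1) look like when each byte is read as a separate Latin-1 character, so the garbling is at least consistent with UTF-8 input being decoded as Latin-1 somewhere along the way. That is an inference from the output, not something confirmed in this thread. A minimal sketch that reproduces the effect with Base Julia only (`misread_as_latin1` and `undo_misread` are hypothetical helper names, not EzXML functions):

```julia
# Sketch: reproduce the garbling without EzXML or libxml2.
# Both helpers are hypothetical and exist only for illustration.

# Treat every UTF-8 byte of `s` as one Latin-1 code point and re-encode as UTF-8.
misread_as_latin1(s::AbstractString) = join(Char(b) for b in codeunits(String(s)))

# Inverse: map each character back to a single byte and reinterpret the bytes as UTF-8.
# This errors if a character is above U+00FF, i.e. if the input is not this kind of mojibake.
undo_misread(s::AbstractString) = String(UInt8[UInt8(c) for c in s])

garbled = misread_as_latin1("深")
println(garbled)                 # æ·±
println(undo_misread(garbled))   # 深
println(misread_as_latin1("ü"))  # Ã¼
```

Running `undo_misread` on the full garbled string may not work on text copied from a terminal or web page, since the invisible control characters in the C1 range (U+0080 to U+009F) are often lost when copying.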

deahhh commented Oct 10, 2023

The bug can be roughly fixed in EzXML's Julia code by passing "utf-8" instead of `encoding` to `htmlReadMemory`:

function parsehtml(htmlstring::AbstractString)
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
    url = C_NULL
    encoding = C_NULL  # unused after this change; "utf-8" is passed directly below
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        htmlstring, sizeof(htmlstring), url, "utf-8", options) != C_NULL
    show_warnings()
    return Document(doc_ptr)
end

noxthot commented Oct 31, 2023

We just had the same problem using Genie.jl and traced the root cause to the new version of XML2_jll.jl, v2.11.5. Pinning that package to the previously released version, v2.10.4, makes the problem disappear.

Note that versions 2.11.0 to 2.11.4 were not provided by XML2_jll.jl, so these cannot be tested right away.
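The exact commands used for pinning are not shown here. One way it could be done with the standard Pkg API is sketched below; `pin_xml2_jll` is a hypothetical helper name, and the sketch assumes it is acceptable to add XML2_jll as a direct dependency of the active environment:

```julia
using Pkg

# Hypothetical helper, not part of Pkg, EzXML, or Genie.
# Installs a specific XML2_jll release and pins it so later
# `Pkg.update()` calls do not move it to a newer version.
function pin_xml2_jll(version::AbstractString = "2.10.4")
    Pkg.add(name = "XML2_jll", version = version)  # install the requested release
    Pkg.pin("XML2_jll")                            # freeze it at the installed version
end
```

Calling `pin_xml2_jll()` modifies the active environment and needs registry access; `Pkg.free("XML2_jll")` should undo the pin once a fixed release is available.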

Then:

julia> using EzXML

julia> doc = EzXML.parsehtml("<body><p>hello</p><p>中国</p><p>深圳</p></body>")
EzXML.Document(EzXML.Node(<HTML_DOCUMENT_NODE@0x0000000001afee70>))

julia> primates = root(doc)
EzXML.Node(<ELEMENT_NODE[html]@0x0000000001c9f680>)

julia> for p in eachelement(primates)
           println(nodecontent(p))
       end
hello中国深圳

Of course this is also a problem when using umlauts.

Not sure whether this is already (or should be) covered by an issue at https://gitlab.gnome.org/GNOME/libxml2/-/issues
