HTML parse error #16

Open
ciciaip opened this issue Oct 21, 2014 · 0 comments

ciciaip commented Oct 21, 2014

Hi, I am using wikiprep-esa to process a Wikiprep dump in Zemanta format, but I ran into an error in scanData.py when it reaches the page "Tying (commerce)":
  File "scanData.py", line 372, in <module>
    recordArticle(doc)
  File "scanData.py", line 318, in recordArticle
    t = html.fromstring(text)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 723, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 616, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty

I checked the "text" of this page carefully and found that it really does contain Unicode content, so it appears that lxml cannot parse the "text" of this page. Part of the "text" is listed below:

<!-- Part of WikiProject Law. Most of this is ripped off from [[Template:Intellectual prop
- collusion is the sale of hot nuts - Eoin "balls" Devlin
! style="padding: 0 7px 0 7px; background:#00FA9A" align="center"
-
! style=" font-size: 95%; padding: 0 7px 0 7px; background:#98FB98" align="center"
-
style=" font-size: 90%; padding: 0 5px 0 5px; text-align: left;"
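
For anyone hitting the same traceback, here is a minimal sketch of a possible workaround, not a confirmed fix. lxml raises `ParserError: Document is empty` when parsing produces no root element, which may be what happens with text like the excerpt above. The helper name `parse_or_none` is hypothetical and is not part of wikiprep-esa or lxml; it only wraps the `html.fromstring` call shown in the traceback.

```python
from lxml import etree, html


def parse_or_none(text):
    """Parse an HTML fragment, returning None when lxml finds no document.

    Hypothetical helper for illustration; it is not part of
    wikiprep-esa or lxml.
    """
    # Skip input that is missing or only whitespace before calling lxml.
    if text is None or not text.strip():
        return None
    try:
        return html.fromstring(text)
    except (etree.ParserError, etree.XMLSyntaxError):
        # lxml.html raises ParserError("Document is empty") when the
        # parser produces no root element to build a tree from.
        return None
```

Assuming scanData.py can afford to drop such pages, `recordArticle` could call this helper instead of `html.fromstring(text)` and skip the article when it returns `None`. Whether skipping is acceptable for the ESA index is a separate question.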