HTML parse error #16

ciciaip · 2014-10-21T03:37:52Z

Hi, I am using wikiprep-esa to process a wikiprep dump in Zemanta format, but scanData.py fails with the following error when it reaches the page "Tying (commerce)":
  File "scanData.py", line 372, in <module>
    recordArticle(doc)
  File "scanData.py", line 318, in recordArticle
    t = html.fromstring(text)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 723, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 616, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty

I checked the "text" of this page carefully and it really does have Unicode content, so the document is not actually empty; it seems that lxml cannot parse the "text" of this page. Part of the "text" is listed below:

<!-- Part of WikiProject Law. Most of this is ripped off from [[Template:Intellectual prop

- collusion is the sale of hot nuts - Eoin "balls" Devlin
! style="padding: 0 7px 0 7px; background:#00FA9A" align="center"
-
! style=" font-size: 95%; padding: 0 7px 0 7px; background:#98FB98" align="center"
-
style=" font-size: 90%; padding: 0 5px 0 5px; text-align: left;"

History of competition law
Monopoly
** Coercive monopoly
** Natural monopoly
......
It contains a lot of anchor text in the first part, which is a bit different from the usual page text. I can't tell whether this is the cause of the problem. Have you ever met this kind of problem before? I'm really confused about it.
Best regards!
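
A minimal sketch for narrowing this down, assuming (unverified) that the leading `<!--` comment in the page text has something to do with the failure. The helper name `diagnose_markup` and its return labels are hypothetical and are not part of scanData.py or lxml. It uses only the standard library to check whether anything would remain once comments are set aside:

```python
import re

# Matches a complete HTML comment, including multi-line ones.
_CLOSED_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def diagnose_markup(text):
    """Return a short label describing why a parser might see no content.

    Hypothetical helper for debugging; not part of scanData.py or lxml.
    """
    if not text or not text.strip():
        return "empty"
    # Drop every properly closed comment first.
    remainder = _CLOSED_COMMENT.sub("", text)
    # A parser may treat everything after a dangling "<!--" as comment text.
    start = remainder.find("<!--")
    if start != -1:
        visible = remainder[:start]
        if not visible.strip():
            return "unclosed-comment-hides-all"
        return "unclosed-comment"
    if not remainder.strip():
        return "only-comments"
    return "ok"
```

Calling `diagnose_markup(text)` just before `html.fromstring(text)` in `recordArticle` and logging any result other than `"ok"` would show whether the failing page falls into one of these cases. It does not prove that this is what lxml does internally.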
