Hi, I am using wikiprep-esa to process a Wikiprep dump in Zemanta format, but I ran into an error in scanData.py when it reached the page "Tying (commerce)":
File "scanData.py", line 372, in
recordArticle(doc)
File "scanData.py", line 318, in recordArticle
t = html.fromstring(text)
File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 723, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 616, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
I checked the "text" of this page carefully and found that it does contain Unicode content, so it appears that lxml simply cannot parse it. Part of the "text" is listed below:
<!-- Part of WikiProject Law. Most of this is ripped off from [[Template:Intellectual prop
- collusion is the sale of hot nuts - Eoin "balls" Devlin
Monopoly
** Coercive monopoly
** Natural monopoly
......
It contains a lot of anchor text in the first part, which is a bit different from the usual page text. I can't tell whether this is the cause of the problem. Have you ever run into this kind of problem before? I'm really confused by it.
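As a possible stopgap, the call to html.fromstring could be wrapped so that one unparsable page does not abort the whole scan. This is only a minimal sketch: parse_or_none and plain_text are hypothetical helper names that do not exist in scanData.py, and returning an empty string for such pages is an assumption about what the rest of the script can tolerate.

```python
from lxml import etree, html


def parse_or_none(text):
    """Return the parsed HTML tree, or None if lxml cannot build one."""
    # Empty or whitespace-only input never yields a root element.
    if not text or not text.strip():
        return None
    try:
        return html.fromstring(text)
    except (etree.ParserError, etree.XMLSyntaxError):
        # Covers "Document is empty": the parser produced no root element.
        return None


def plain_text(text):
    """Hypothetical helper: text content of a page, or '' if unparsable."""
    tree = parse_or_none(text)
    return tree.text_content() if tree is not None else ""
```

With something like this, recordArticle could skip or log a page whose tree comes back as None instead of crashing, although that would hide the underlying cause rather than explain it.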
Best regards!