Fully rewrite the HTML parser #80

Open
9 of 13 tasks
aecio opened this issue May 23, 2017 · 0 comments
aecio commented May 23, 2017

The current implementation is messy and very hard to maintain or change. The new implementation should be compatible with the current one and add new features:

  • Should normalize relative links
  • Should validate links and discard invalid ones
  • Should extract deep web .onion links
  • Should extract anchor text
  • Should extract text around links
  • Should extract meta-tags (description, keywords, etc)
  • Should decode HTML entities to regular characters (turn &amp; into &) in links
  • Should decode HTML entities to regular characters (turn &amp; into &) in text
  • Should remove the fragment portion of the URL (anything after the character #)
  • Should do basic link normalization (lowercase domain, reorder query parameters, etc)
  • NEW: Extract links to images and regular links separately
  • NEW: Allow for easy extension, e.g. extraction of additional meta tags such as og:description, og:title, etc
  • etc
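
To make several of the requirements above concrete, here is a minimal sketch using only the Python standard library. It is not the project's implementation, and the language choice is only for illustration. The names `normalize_link` and `LinkExtractor` are hypothetical. It assumes that only http and https links are considered valid, drops any user info from the URL, and re-encodes the query string when sorting parameters. Extraction of text around links is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def normalize_link(base_url, href):
    """Return a normalized absolute URL, or None if the link is invalid.

    Hypothetical helper: resolves relative links, lowercases the host,
    sorts query parameters, and removes the fragment.
    """
    try:
        parts = urlsplit(urljoin(base_url, href.strip()))
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return None  # assumption: only http(s) links with a host are valid
        host = parts.hostname  # urlsplit already lowercases the hostname
        if ":" in host:
            host = f"[{host}]"  # restore brackets around IPv6 literals
        if parts.port is not None:
            host = f"{host}:{parts.port}"
    except ValueError:
        return None  # malformed URL (e.g. bad port or IPv6 literal)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The empty last component removes everything after '#'.
    return urlunsplit((parts.scheme, host, parts.path or "/", query, ""))


class LinkExtractor(HTMLParser):
    """Collects regular links (with anchor text), image links, and meta tags."""

    def __init__(self, base_url):
        # convert_charrefs=True decodes entities such as &amp; in text;
        # HTMLParser also decodes entities inside attribute values.
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []   # (url, anchor_text) pairs
        self.images = []  # image URLs, kept separate from regular links
        self.meta = {}    # meta name/property -> content
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = normalize_link(self.base_url, attrs["href"])
            self._text = []
        elif tag == "img" and attrs.get("src"):
            url = normalize_link(self.base_url, attrs["src"])
            if url:
                self.images.append(url)
        elif tag == "meta":
            # 'property' covers Open Graph tags such as og:title.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, anchor))
            self._href = None


parser = LinkExtractor("http://Example.COM/dir/sub/page.html")
parser.feed('<a href="../b?z=1&amp;a=2#top">Fish &amp; chips</a>')
parser.close()
print(parser.links)
```

Keeping `normalize_link` as a standalone function means the same rules apply to both regular links and image links, and new tag handlers can be added in `handle_starttag` without touching the normalization logic.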