grablinks.py is a simple and streamlined Python 3 script to extract and filter links from a remote HTML resource.
grablinks.py requires an installation of Python 3 (any version above 3.5 should do fine). Additionally, the third-party Python modules requests and beautifulsoup4 are required. Both modules can be easily installed with Python's package manager pip, e.g.:
pip install requests --user
pip install beautifulsoup4 --user
usage: grablinks.py [-h] [-V] [--insecure] [-f FORMATSTR] [--fix-links]
                    [--images] [-c CLASS] [-s SEARCH] [-x REGEX]
                    URL

Extracts, and optionally filters, all links (`<a href=""/>') from a remote
HTML document.

positional arguments:
  URL                   a fully qualified URL to the source HTML document

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version number and exit
  --insecure            disable verification of SSL/TLS certificates (e.g. to
                        allow self-signed certificates)
  -f FORMATSTR, --format FORMATSTR
                        a format string to wrap in the output: %url% is
                        replaced by found URL entries; %text% is replaced with
                        the text content of the link; other supported
                        placeholders for generated values: %id%, %guid%, and
                        %hash%
  --fix-links           try to convert relative and fragmental URLs to
                        absolute URLs (after filtering)
  --images              extract `<img src=""/>' instead `<a href=""/>'.

filter options:
  -c CLASS, --class CLASS
                        only extract URLs from href attributes of <a>nchor
                        elements with the specified class attribute content.
                        Multiple classes, separated by space, are evaluated
                        with an logical OR, so any <a>nchor that has at least
                        one of the classes will match.
  -s SEARCH, --search SEARCH
                        only output entries from the extracted result set, if
                        the search string occurs in the URL
  -x REGEX, --regex REGEX
                        only output entries from the extracted result set, if
                        the URL matches the regular expression

Report bugs, request features, or provide suggestions via
https://github.com/the-real-tokai/grablinks/issues
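The options above describe a small pipeline: extract links, optionally filter them by class, search string, or regular expression, and then optionally convert relative URLs to absolute ones. The following is a minimal sketch of that flow, not the actual implementation of grablinks.py. To stay self-contained it works on an HTML string instead of a remote request and uses the standard library's html.parser in place of beautifulsoup4. The names LinkCollector and grab are hypothetical, and applying the regular expression with re.search is an assumption.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects [url, text] entries from <a href> (or <img src>) elements."""

    def __init__(self, classes=None, images=False):
        super().__init__()
        # Space-separated class names, as passed to --class.
        self.wanted_classes = set(classes.split()) if classes else None
        self.wanted_tag, self.wanted_attr = ("img", "src") if images else ("a", "href")
        self.links = []
        self._current = None  # entry of the <a> element currently being read

    def handle_starttag(self, tag, attrs):
        if tag != self.wanted_tag:
            return
        attrs = dict(attrs)
        url = attrs.get(self.wanted_attr)
        if not url:
            return
        found = set((attrs.get("class") or "").split())
        # Logical OR: a single shared class is enough for a match.
        if self.wanted_classes is not None and not (found & self.wanted_classes):
            return
        entry = [url, ""]
        self.links.append(entry)
        if tag == "a":
            self._current = entry

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None


def grab(html, base_url=None, classes=None, search=None, regex=None, images=False):
    """Extract, filter, and (if base_url is given) absolutize links."""
    parser = LinkCollector(classes, images)
    parser.feed(html)
    parser.close()
    links = [(url, text.strip()) for url, text in parser.links]
    if search is not None:
        links = [link for link in links if search in link[0]]
    if regex is not None:
        pattern = re.compile(regex)  # assumption: unanchored search
        links = [link for link in links if pattern.search(link[0])]
    if base_url is not None:
        # Mirrors --fix-links: applied after filtering.
        links = [(urljoin(base_url, url), text) for url, text in links]
    return links


SAMPLE = """
<a class="ext big" href="https://ja.wikipedia.org/wiki/Example">Wiki</a>
<a class="dl" href="/files/a.zip">a.zip</a>
<a href="#top">Top</a>
<img src="logo.png">
"""

if __name__ == "__main__":
    for url, text in grab(SAMPLE, base_url="https://www.example.com/page"):
        print(url, text)
```

Because the filters run before the URL conversion in this sketch, a search string or pattern is compared against the URL exactly as it appears in the document, which matches the "(after filtering)" remark in the help text for --fix-links.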
# extract wikipedia links from 'www.example.com':
$ grablinks.py 'https://www.example.com/' --search 'wikipedia'
https://ja.wikipedia.org/wiki/仲間由紀恵
https://ja.wikipedia.org/wiki/黒木華
https://ja.wikipedia.org/wiki/清野菜名
…
# extract download links from 'www.example.com', create a shell script
# on-the-fly and pass it along to sh to fetch things with wget:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' --format 'wget "%url%"' | sh
# Note: Do not do that at home. It is dangerous! 😱
# alternatively just pass to wget directly:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' | sort -u | wget -i-
# extract/ handle links like
# <a href="https://example.com/a-cryptic-ID">proper-filename.ext</a>
$ grablinks.py 'https://www.example.com/' --format 'wget '\''%url%'\'' -O '\''%text%'\' > fetchfiles.sh
$ sh fetchfiles.sh
# Note: %text% is not sanitized by grablinks.py for safe shell usage. It is
# recommended to review the generated script before executing it.
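Since %text% is not sanitized, one way to build such commands more defensively is to do the substitution in a small helper that shell-quotes every inserted value. The sketch below is an illustration only and not part of grablinks.py. The function name render_command is hypothetical, and only the %url% and %text% placeholders are handled, because the help text does not specify how %id%, %guid%, and %hash% are generated.

```python
import re
import shlex


def render_command(format_str, url, text):
    """Replace %url% and %text% with shell-quoted values in a single pass."""
    values = {"url": url, "text": text}
    # A single regex pass avoids re-expanding placeholders that happen
    # to occur inside an already substituted value.
    return re.sub(
        r"%(url|text)%",
        lambda match: shlex.quote(values[match.group(1)]),
        format_str,
    )


if __name__ == "__main__":
    # The placeholders are not wrapped in quotes here: shlex.quote adds
    # quoting itself whenever a value needs it.
    print(render_command(
        "wget %url% -O %text%",
        "https://example.com/a-cryptic-ID",
        "proper-filename.ext",
    ))
```

Quoting only protects against shell injection. A link text that contains a path such as `../` would still be passed through as a file name, so reviewing the generated script remains advisable.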
1.9 | 28-Dec-2024 | Identify with proper user agents for remote requests; --fix-links: Update input/response URL in case of redirections; --fix-links: Improved handling of some path edge-cases; Avoid unnecessary (re-)encoding (assume all loaded data as bytes); Added basic support for 'file://' URIs. |
1.8 | 21-Nov-2024 | Added support for "<img src="">" via '--images'. |
1.7 | 21-Jan-2024 | Disable urllib3 warnings when '--insecure' is used. |
1.6 | 2-Dec-2023 | Added '--insecure' argument to disable SSL/TLS certificate verification; Added support for '%text%' placeholder in format string (<a>text</a>). |
1.5 | 24-Nov-2022 | Added a (fixed) timeout to the remote request. |
1.4 | 30-May-2022 | Improved handling of passing multiple classes to '--class'. |
1.3 | 6-Feb-2021 | Fix: handling of common edge cases when '--fix-links' is used. |
1.2 | 16-Aug-2020 | Fix: in some cases links from "<a>" tags without a 'class' attribute were not part of the result. |
1.1 | 7-Jun-2020 | Initial public source code release |