functional design #1
Comments
My thoughts below:
1) The package & CLI options are a good way of exposing Twint's APIs.
2) Output should be a file or stdout.
3) The CLI options all make sense.
I will set up the project scaffold this weekend to get us started.
Yeah, second that, that would be really neat.
Totally agree 🎉
Scaffolding done in a separate branch - let me know what you guys think of the tooling choices. How much is portable from the current Twint package? I would assume the scraper can be moved across.
I think you forgot to push the branch. During tests I did this weekend, I did not find any reason to do things like rotate user agents. So we should keep the code very simple, only adding bells and whistles when we really need them.
I would like to start off by proposing (and agreeing on) a functional design of the new tool. I have already written something up and will post it in this issue over the next day or so.
I think the combination of a package and a command line tool is something we absolutely want to keep.
Below is my braindump. Please respond in the comments with what you think!
Output
Output will be written to stdout (default) or to a file.
Command line arguments:
-o <filename> output to file
--csv Write output in CSV format
--json Write output in JSON Lines format (one independent JSON object per line)
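
To make the output options concrete, here is a minimal sketch of how they could be wired up with Python's argparse and the standard csv/json modules. The flag names come from the proposal above; the prog name, function names, and the tweet-dict shape are assumptions for illustration only.

```python
import argparse
import csv
import json
import sys

def build_parser():
    # Hypothetical wiring for the output flags proposed above;
    # the prog name "twint" is an assumption.
    parser = argparse.ArgumentParser(prog="twint")
    parser.add_argument("-o", metavar="FILENAME",
                        help="output to file instead of stdout")
    fmt = parser.add_mutually_exclusive_group()
    fmt.add_argument("--csv", action="store_true",
                     help="write output in CSV format")
    fmt.add_argument("--json", action="store_true",
                     help="write output as JSON Lines (one JSON object per line)")
    return parser

def write_tweets(tweets, args):
    # Default sink is stdout; -o redirects to a file.
    out = open(args.o, "w", encoding="utf-8") if args.o else sys.stdout
    try:
        if args.csv and tweets:
            # Assumes each tweet is a flat dict; field set taken from the first one.
            writer = csv.DictWriter(out, fieldnames=sorted(tweets[0]))
            writer.writeheader()
            writer.writerows(tweets)
        else:
            # JSON Lines: one independent JSON object per line.
            for tweet in tweets:
                out.write(json.dumps(tweet) + "\n")
    finally:
        if out is not sys.stdout:
            out.close()
```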
Error messages and debugging
Errors and informational messages will be written to stderr (default) or to a file.
Command line arguments:
-v enable verbose output (loglevel info)
-vv enable debug logging (loglevel debug)
-q disable error output completely (loglevel none)
-l <filename> logfile instead of stderr
--count Display the number of Tweets scraped at the end of the session.
--stats Show number of replies, retweets, and likes.
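
A possible mapping of these flags onto Python's standard logging module, as a sketch only; the dest names and the exact level mapping are assumptions, not settled design:

```python
import logging
import sys

def add_logging_args(parser):
    # Flag wiring for the proposal above; dest names are assumptions.
    parser.add_argument("-v", dest="verbose", action="count", default=0,
                        help="-v for info, -vv for debug")
    parser.add_argument("-q", dest="quiet", action="store_true",
                        help="disable error output completely")
    parser.add_argument("-l", dest="logfile", metavar="FILENAME",
                        help="log to file instead of stderr")

def configure_logging(args):
    if args.quiet:
        # "loglevel none": suppress every record, including errors.
        logging.disable(logging.CRITICAL)
        return
    # Default is warnings/errors only; -v raises to info, -vv to debug.
    level = {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
    handler = (logging.FileHandler(args.logfile) if args.logfile
               else logging.StreamHandler(sys.stderr))
    logging.basicConfig(level=level, handlers=[handler],
                        format="%(levelname)s %(message)s")
```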
Cloaking and rate limiting options
The tool might need to be able to circumvent most measures taken by Twitter.
Configurable user agent (not for now).
Command line arguments:
-ua <user agent>
-uafile <filename> (with UA strings, one per line; the tool will rotate through them)
-proxy <proxyurl>
-proxyfile <filename> (with proxy URLs, one per line; the tool will rotate through them)
TBD: rate limiting, for instance backoff exponent and min/max/random wait times.
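
For the rotation flags and the TBD rate limiting, here is a minimal sketch of the usual techniques: file-based round-robin rotation, and exponential backoff with random jitter. All names and default values are placeholders, not agreed design:

```python
import itertools
import random
import time

def load_rotation(path):
    # -uafile / -proxyfile semantics: one entry per line, cycled endlessly.
    with open(path, encoding="utf-8") as f:
        entries = [line.strip() for line in f if line.strip()]
    return itertools.cycle(entries)

def fetch_with_backoff(fetch, max_retries=5, base_wait=1.0, max_wait=60.0):
    # Retry a zero-argument `fetch` callable that raises when rate limited.
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (base * 2^attempt), capped, with jitter.
            wait = min(base_wait * (2 ** attempt), max_wait)
            time.sleep(wait * random.uniform(0.5, 1.5))
```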
Searching and filtering
I consolidated all command line args that have to do with searching and filtering. I think we need to keep the search params (i.e. those that send a different request to Twitter) and remove the filters (i.e. those that only remove things from the output). Filtering can be done by an external program (see the sketch after the list below).
Can someone with more internal knowledge split these args into those two groups?
--to USERNAME Search Tweets to a user.
--all USERNAME Search all Tweets associated with a user.
--favorites Scrape Tweets a user has liked.
-nr, --native-retweets
Filter the results for retweets only.
--min-likes MIN_LIKES
Filter the tweets by minimum number of likes.
--min-retweets MIN_RETWEETS
Filter the tweets by minimum number of retweets.
--min-replies MIN_REPLIES
Filter the tweets by minimum number of replies.
--links LINKS Include or exclude tweets containing one or more links.
If not specified, you will get tweets both with and
without links.
--source SOURCE Filter the tweets for specific source client.
--members-list MEMBERS_LIST
Filter the tweets sent by users in a given list.
-fr, --filter-retweets
Exclude retweets from the results.
--videos Display only Tweets with videos.
--images Display only Tweets with images.
--media Display Tweets with only images or videos.
--retweets Include user's Retweets (Warning: limited).
--email Filter Tweets that might have email addresses.
--phone Filter Tweets that might have phone numbers.
--verified Display Tweets only from verified users (Use with -s).
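
To illustrate the "filtering by an external program" point from the intro of this section: with --json output, a pure output filter like --min-likes could be replaced by a few lines of downstream Python. The "likes" field name is an assumption about the eventual output schema:

```python
import json
import sys

# Keep only tweets with at least 100 likes; reads JSON Lines on stdin.
for line in sys.stdin:
    tweet = json.loads(line)
    if tweet.get("likes", 0) >= 100:
        sys.stdout.write(line)
```

This would run as, e.g., twint ... --json | python min_likes.py (the command name is hypothetical). It keeps the scraper itself simple, in line with the earlier comment about only adding bells and whistles when we really need them.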