[ERROR] CRITICAL:twint.run:Twint:Feed:noDataExpecting ~ Inconsistent results [High Severity] #604
Twint does not cache results, queries, or anything else; every single piece of data comes from Twitter. It can make sense to run multiple searches and compare them, for example to monitor specific hashtags and see which users delete more tweets than others. On the other hand, you will most probably get duplicates if you don't filter the data. If the hashtag is really popular, I'd split the searches into months if not weeks.
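A minimal sketch of that kind of splitting, reusing the Config fields that appear later in this thread (the hashtag, dates, and file names are placeholders):
from datetime import date, timedelta
import twint

def search_by_week(hashtag, start, end):
    # One twint search per week, each written to its own CSV, so a chunk that
    # fails or comes back short can be retried on its own.
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=7), end)
        c = twint.Config()
        c.Search = hashtag
        c.Since = cur.isoformat()    # e.g. "2019-10-01"
        c.Until = nxt.isoformat()
        c.Store_csv = True
        c.Output = f"{hashtag.lstrip('#')}_{cur.isoformat()}.csv"
        twint.run.Search(c)
        cur = nxt

search_by_week("#nfl", date(2019, 10, 1), date(2019, 12, 19))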
Thanks. So, is it normal that if I run the exact same query 12 hours apart I get very different numbers of tweets back? When I ran it yesterday I got ~6,800 results, and when I ran it again this morning I only got ~4,700. (When I ran it a third time it returned 0 tweets??) It's not really a big deal; I'm just hoping to understand the expected behaviour so I can explain the data to analysts. If I'm only getting a subset, or a random subset, of the data, that's fine, but I need to understand that before they ask me about it. Thanks.
If you run the same script over time, with the same date-time ranges and other parameters, the dataset should be "constant", since it does not depend on the time at which you run the script. So if we see a variation, and what can vary is either the time interval or the dataset itself, I guess that (in this case) it is the dataset that's changing, most probably due to deleted tweets, accounts that go from open to closed, and accounts that get deleted or suspended. Let's say that tweets with id = 1,2,3 are sent in the time interval (A,B). If we run the script at C with C > B, and the tweets don't get deleted/shadowed et similia, we'll always get tweets with id = 1,2,3. The interval is closed and nothing can get in or out by itself. The only way a tweet can drop out of that interval is by being deleted, shadowed, and so on.
Thank you. I guess my strange results were a glitch; I mean, it's unlikely that 2,000 tweets got deleted for that hashtag overnight. I'll keep experimenting. Much appreciated.
I can guarantee you that what Twint returns is what Twitter gives (and this can be proven). You can run the script with your own handle as the target: if you always get the same tweets, everything is fine; otherwise it needs to be investigated. Best of luck!
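For example, a minimal self-check could look like this; using Config.Username as the target is an assumption on my part, the rest mirrors the examples later in this thread, and the handle and file name are placeholders:
import twint

c = twint.Config()
c.Username = "your_handle"          # placeholder: put your own handle here
c.Store_csv = True
c.Output = "my_timeline_check.csv"
twint.run.Search(c)
Run it a few times and diff the CSVs; identical output means the scraping side is behaving.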
Thanks, I'm sure it's something I'm doing. I will keep at it.
Sorry to come back to this, but I just wanted you to be aware of the results of my testing (running with Python 3.6 and a fully updated twint).
I just ran that exact code 4 times in a row, and the number of tweets it returned differed between runs. I'm really not sure what to do with that at this point; am I doing something wrong? On the three runs with the lower counts, it also printed the noDataExpecting error from the title.
Any advice would be welcome. Edit: For what it's worth, it's definitely something to do with the feed being disrupted. That error is being thrown in the Feed method of Twint, and it seems to happen most of the time. I re-ran that script in a loop a bunch of times, and 13,207 tweets seems to be the actual correct number, but it doesn't come back with that very often.
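A loop of roughly this shape (not the exact script; the query, dates, and file names are placeholders) is enough to see the variation:
import csv
import twint

counts = []
for i in range(4):
    out = f"nfl_run_{i}.csv"
    c = twint.Config()
    c.Search = "#nfl"
    c.Since = "2019-12-18"
    c.Until = "2019-12-19"
    c.Store_csv = True
    c.Output = out
    twint.run.Search(c)
    try:
        with open(out, newline="", encoding="utf-8") as f:
            rows = sum(1 for _ in csv.reader(f)) - 1   # minus the header row
    except FileNotFoundError:                          # no file if nothing came back
        rows = 0
    counts.append(rows)
print(counts)   # four identical numbers would mean the runs were consistent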
I've tried your query and I can confirm that the results are not consistent. That's really strange and needs to be investigated. I'll keep you updated.
Thanks!
So it seems that using HTTPS or not does not always have an effect. For now my findings are these (I've run the script 3 times):
So, as we can see, Twint always starts at the same point, which is good. Now we have to see where it stops.
If we take a look at the response Twitter returns, we see:
<div id="main_content">
<div class="system">
<div class="blue">
<table class="content">
<tr>
<td>
<div class="title">Sorry, that page doesn't exist</div>
<div class="subtitle"><a href="/">Back to home</a></div>
</td>
</tr>
</table>
</div>
</div>
</div>
If we take a look at the latest scraped tweet in each run, we can see where each run stopped. A note about the time zone: if we run the same search at two different local times, we will most probably get different results, since my start (and end) of the day is different from yours. That said, our aim is not to make sure that one of us gets the other's results; our aim is to get the same results each time we ask for them, comparing each person's runs individually. (FYI, I got 25,282 tweets.) Reasons why the issue might be related to the HTTP(S) switch:
Reasons why the issue might not be related to that switch:
What happens when Twint gets those error messages:
Sometimes, luckily, the error messages are not printed, even when running the same query. In that case, the only thing Twint does differently from the other runs is the user agent. So maybe Twitter behaves differently depending on the user agent specified. Updates soon.
It seems that using a fixed user agent helps. So I suggest you edit lines 158 and 159 in the relevant source file.
Hmmm. I looked through my installed copy of that file and couldn't find any reference to the user agent; I found it in several places in another file, though.
Sorry, I mean in the other file: lines 155 to 161 at commit 3a4f778.
Sorry, finally had a chance to try this. Yes, I made that change with that user agent and ran the code 4 times, and got back 13,011 tweets each time.
Just to update: running with that user agent usually seems to bring back a fairly complete list of tweets, but it does randomly fail and return a smaller subset. I am playing around with a workaround that looks at the last tweet returned to see whether its timestamp is close to 00:00:00 and, if not, redoes the query. Not sure if there's a more effective way to detect that the scrape finished early.
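A rough sketch of that check, assuming the CSV export carries separate date and time columns in the local time zone used for Since/Until (the file name and the 30-minute slack are arbitrary):
import pandas as pd

def stopped_early(csv_path, since_day, slack_minutes=30):
    # If the oldest collected tweet is well after the start of the requested day,
    # the scrape most likely terminated before reaching 00:00:00.
    df = pd.read_csv(csv_path)
    oldest = pd.to_datetime(df["date"] + " " + df["time"]).min()
    return oldest > pd.Timestamp(since_day) + pd.Timedelta(minutes=slack_minutes)

if stopped_early("nfl_day.csv", "2019-12-18"):
    print("Scrape looks incomplete; re-run or resume the query.")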
It would help a lot if there were some way to know whether any errors occurred during the search or profile request. Right now, the only indication that something went badly is a message on stderr, if any.
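One workaround, assuming twint emits those messages through Python's standard logging module (the CRITICAL:root: prefix in the reported error suggests it does), is to attach a handler and inspect it after the run; the query below is a placeholder:
import logging
import twint

class FeedErrorCatcher(logging.Handler):
    # Collects CRITICAL records so the caller can tell whether the feed broke mid-run.
    def __init__(self):
        super().__init__(level=logging.CRITICAL)
        self.messages = []
    def emit(self, record):
        self.messages.append(record.getMessage())

catcher = FeedErrorCatcher()
logging.getLogger().addHandler(catcher)   # root logger, so propagated records are seen too

c = twint.Config()
c.Search = "#nfl"
c.Since = "2019-12-18"
c.Until = "2019-12-19"
c.Store_csv = True
c.Output = "nfl_checked.csv"
twint.run.Search(c)

if any("Twint:Feed" in m for m in catcher.messages):
    print("Feed errors occurred; the results are probably incomplete.")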
I thought I'd just add that after a lot more experimenting, results continue to be inconsistent regardless of the user agent, though quite randomly so. Sometimes I can run the same code 3 times and get the exact same number of tweets, and other times it will return a much smaller number, or even zero tweets.
And does it happen regardless of the query?
Huh, oddly it does seem to vary with the query. Some queries I tried seem to always return the same number; others vary a lot. Really strange, that. That football hashtag (#nfl) is always highly variable, but something like #france or #germany seems to be consistent.
That makes the debugging even harder; as of now I'd rule out a flaw in Twint, so I guess the issue is somehow related to Twitter. It'd be interesting to check whether it returns fewer tweets even when it reaches the end of the day, because if it stops earlier for unknown reasons, you could just resume from that point. To try this out, just run something like:
import twint
c = twint.Config()
c.Search = "#nfl"
c.Debug = True
c.Resume = "test_1.session"  # session file that lets a later run pick up where this one stopped
c.Since = "2019-12-18"
c.Until = "2019-12-19"
c.Store_csv = True
c.Output = "test_1.csv"
twint.run.Search(c)
If it does not stop at the end of the day, you can just resume from where it stopped using the session file.
Oh, that is perfect, I will try that. I was trying to do something like that myself by checking whether the last tweet collected was close to 00:00:00, but when it wasn't I had to rerun the whole script instead of just restarting from where it stopped. So if I understand correctly, as long as the script hasn't terminated, re-running twint.run.Search(c) will restart from the last tweet collected (assuming it hadn't reached 00:00:00), so a simple loop with a check on the latest time collected should do the trick. Thanks, you've been very helpful with this.
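Something along these lines, assuming that re-running with the same Resume session file really does pick up where the previous attempt stopped, as described above (file names and the retry cap are placeholders):
import os
import twint

def run_until_complete(max_attempts=10):
    # Keep resuming the same one-day search until an attempt adds no new tweets,
    # which is taken as a rough sign that the scrape reached the end of the day.
    prev_size = -1
    for _ in range(max_attempts):
        c = twint.Config()
        c.Search = "#nfl"
        c.Since = "2019-12-18"
        c.Until = "2019-12-19"
        c.Resume = "test_1.session"    # same session file on every attempt
        c.Store_csv = True
        c.Output = "test_1.csv"
        twint.run.Search(c)
        size = os.path.getsize("test_1.csv") if os.path.exists("test_1.csv") else 0
        if size == prev_size:          # nothing new collected on this attempt
            break
        prev_size = size

run_until_complete()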
Going to echo what others have said: this probably isn't anything to do with twint. It seems to me that, coupled with the seemingly "new" policies Twitter has been enforcing recently, like forcing number verification and banning random accounts, they are simply starting to limit the kind of full access we've all become used to; a general crackdown of sorts. I've been having luck with the solution @yunusemrecatalcam posted in issue #888: https://github.com/twintproject/twint/issues/888#issuecomment-693977671. So a big thank you to him and to the dev of the project mentioned there. It works nicely with a rotating pool, and speeds are not bad at all. And of course a big thank you to @pielco11 and everyone involved for your awesome work on Twint. It really is appreciated!
I have also faced the same issue. Twint is not working anymore and gives the error shown in the question. How do I get rid of this? Is there any alternative?
Looks like it started again -_-
Update for all interested people: it looks like this issue was already resolved some time ago. If you install the current version of twint from the GitHub repository (master branch), the error no longer appears. 🎉
When will these changes be released on pip?
See twintproject/twint#604 (comment) for more information.
@aliabdmahdi
It's because this method first clones twint from GitHub, which requires you to have git installed on your system.
First start the conda terminal, like you previously did. Then change the directory using the cd command; Google how to change directory in cmd if you're unsure.
As per the error: if you read it, it asks you a question. Do you have git installed? Have you installed git? If not....
Don't take this the wrong way, but I think you should really consider getting to know the things behind the tools you're trying to use before using them. That error indicates that you're probably not using a correct or compatible version of Python, or that you've only just done the initial install of conda and not updated it prior to using it. A quick Google search of the error you're getting would help you more than waiting for replies here, and you'll probably learn a bit along the way, which will only be good. Again, I'm not trying to be mean, but you're going to find yourself running into many issues like this when you start out, and people aren't going to want to help you if you don't appear to really try to fix your problem yourself first, like by just googling your error. That error points to countless posts on Stack Exchange, for example, and they always offer up great, detailed information that'll really help you. Just some advice. Hope it lands and comes across as intended.
I'm afraid the "CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1" error is back. Uninstalling and then running pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint doesn't work now. Am I the only one for whom this issue is back?
Same here.
I am getting this same error!
Same here; I uninstalled it to try this trick and ran into this issue while reinstalling.
UPDATE: for MacOS 11.0 & Python 3.7
everything works like a charm on my end now! I'd be more than happy to submit a PR, but decided to hold off until I can get some confirmation that this is a valid solution, or whether other use cases bring up more of these errors.
Thanks! I got this error when I tried your solution: "Error: No such keg: /usr/local/Cellar/c-ares".
@aurigandrea Interesting, what error message do you get when you run the pip install command?
FWIW, we are running twint on an hourly basis (https://github.com/Museum-Barberini-gGmbH/Barberini-Analytics) and haven't seen any major issues during the last few months. We install twint via pip.
Is anyone else still having this issue? I keep getting the same error.
Same here, what am I doing wrong? I'm getting error messages like the ones described above.
@charliehawco A fork has been working for me as of this AM; not sure what's been going on with the core repo though. Uninstall twint and reinstall it from that fork. See here for more details.
Python 3.6
twint 2.1.7 updated from master
Have searched issues without finding anything
Running on Ubuntu 18.04, anaconda, jupyter notebook
Commands run:
Hi, thanks for writing this package, it's very useful. I'm clearly not using it right though. I ran the commands above as a test, using "#nfl" as a query because it's innocuous and guaranteed to have a lot of results over the course of one day, but I am getting inconsistent results.
First, when I run it I get a lot of these warnings (which I saw from another issue are probably related to http/https?):
CRITICAL:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
That's fine though, the script still runs. The problem is the results are inconsistent. I ran it last night and got back 6,832 tweets, then ran it again this morning as part of testing some other code and got 4,710 tweets. When I saw that I ran it again and got 0 tweets.
I have a couple of questions if that's okay. Is twint caching the results of queries somewhere, and if so, how do I clear the cache? Is this inconsistent behaviour expected (is it a Twitter search page thing?), and if so, does it make sense to run the same search multiple times and concatenate the results? Finally, is there a suggested best practice for searching date ranges? (i.e., if you want all the tweets for a hashtag for the past 3 months, is it better to do one big search or to break it into daily or weekly ranges?)
Again, thanks for this package. Great work.