Regarding the preprocessed dataset doubt #6
Comments
I have the same problem: how do I get data.TD_RvNN.vol_5000.txt from the original Twitter15 and Twitter16 datasets? |
They are TF-IDF vectors computed from the Twitter15 and Twitter16 datasets. However, I find that the entries in the RvNN file and the ones in the original dataset don't match. For example, the post with ID 624298742162845696 is not included in the Twitter15 and Twitter16 datasets. |
The paper uses TF-IDF values to represent node features, but I don’t know how to extract the retweet or response node features from the original Twitter 15&16 dataset. |
You first have to use the Twitter API to crawl the original tweet text via the tweet IDs provided in the original Twitter15&16 dataset. However, the Twitter API enforces a rate limit, and many of the original posts are missing, so getting this data is really complicated. Moreover, as I mentioned above, the entries in data.TD_RvNN.vol_5000.txt and the ones in the original Twitter15&16 don't match exactly. I found two approaches to solve this problem. |
Thank you so much, I will look at https://github.com/serenaklm/rumor_detection. |
Excuse me, in https://github.com/serenaklm/rumor_detection I can't find '../data/controversy/raw_data/'. In addition, I have trouble constructing the propagation structure of fake news. Like most papers, I use the retweet or response nodes as propagation nodes. If the textual content of a retweet node is used as its feature, all retweet nodes would have the same features; wouldn't that affect the detection result? I checked the author's data.TD_RvNN.vol_5000.txt file, and the vector of each node is basically different, so maybe they used only the response nodes. But I still don't know how to represent the retweet nodes. Do you have any suggestions? |
I just use this RvNN file directly. If most retweets have no meaningful text, you may need to crawl other metadata, such as user information, and use that as the node feature. |
2111105031607336 None 1 1:1 2:2 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:3 15:1 16:1 17:1 18:1 |
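For anyone trying to read a row like the one above, here is a minimal sketch of a parser. It only assumes that whitespace-separated tokens containing a colon are "index:count" pairs and that everything else is a leading metadata field. The thread does not say what those leading fields mean, nor whether the indices are 0-based or 1-based, so the names `parse_row` and `to_dense` and the direct index-to-position mapping are illustrative assumptions.

```python
def parse_row(line):
    """Split one row into leading metadata fields and a sparse {index: count} dict."""
    meta, features = [], {}
    for token in line.split():
        if ":" in token:
            # Assumption: a token with a colon is an "index:count" pair.
            index, count = token.split(":", 1)
            features[int(index)] = float(count)
        else:
            # Anything else is kept as an uninterpreted metadata field.
            meta.append(token)
    return meta, features


def to_dense(features, size):
    """Expand a sparse dict into a fixed-length list (index used as position)."""
    vec = [0.0] * size
    for index, value in features.items():
        if 0 <= index < size:
            vec[index] = value
    return vec


row = "2111105031607336 None 1 1:1 2:2 3:1 14:3"
meta, features = parse_row(row)
dense = to_dense(features, 20)
```

If the indices in your file turn out to be 1-based, subtract one before indexing into the dense vector.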
These TF-IDF vectors are computed from the text contents of Twitter posts and user responses by setting the number of corpus to 5000. If you'd like to compute these vectors, you need to crawl these text contents. |
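As a rough illustration of the computation described above, the sketch below builds a vocabulary capped at a maximum size and computes sparse TF-IDF weights in plain Python. It is not the authors' preprocessing code: the whitespace tokenizer, the smoothed IDF formula, the reading of "5000" as a vocabulary cap, and the function names `build_vocab` and `tfidf_vectors` are all assumptions made for the example.

```python
import math
from collections import Counter


def build_vocab(docs, max_size=5000):
    """Map the max_size most frequent words to integer indices."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_size))}


def tfidf_vectors(docs, vocab):
    """Return one sparse {index: weight} dict per document."""
    tokenized = [[w for w in doc.lower().split() if w in vocab] for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for words in tokenized for w in set(words))
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        vectors.append({
            # Assumption: raw term count times a smoothed IDF.
            vocab[w]: count * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, count in tf.items()
        })
    return vectors


docs = ["rumor spreads fast", "rumor debunked", "fast reply"]
vocab = build_vocab(docs, max_size=5000)
vectors = tfidf_vectors(docs, vocab)
```

Words that occur in fewer documents get a larger weight, so in the first document "spreads" is weighted higher than "rumor". Without the authors' vocabulary, indices produced this way will not line up with those in data.TD_RvNN.vol_5000.txt.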
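Hello, may I ask what the specific steps are? First you crawl the text content and the user responses; what information do the user responses include, and what do you do after that? Can I generate this file just by downloading TF-IDF code from GitHub? |
Received. |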
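No. You don't have his vocabulary, and the root nodes here also differ somewhat from those in Twitter15 and Twitter16, so it would amount to constructing a new dataset. You could consider using the preprocessed dataset he provides; it just loses the word-order information. |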
Understood. One more question: if I want to reproduce this on my own dataset, how can I process my data into this format? |
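One possible way to get your own data into a similar shape is sketched below. It assumes you already have a word-to-index vocabulary and a reply tree, and that each row consists of an event ID, a parent index, a node index, and then "index:count" pairs, mirroring the sample row quoted in this thread. That field layout and the helper names `to_sparse_row` and `format_node` are assumptions, not a confirmed specification of the authors' format.

```python
from collections import Counter


def to_sparse_row(text, vocab):
    """Encode text as space-separated 'index:count' pairs over a fixed vocabulary."""
    counts = Counter(vocab[w] for w in text.lower().split() if w in vocab)
    return " ".join(f"{index}:{count}" for index, count in sorted(counts.items()))


def format_node(event_id, parent_index, node_index, text, vocab):
    """Build one row: event ID, parent index, node index, then the sparse pairs."""
    # Assumption: the root node has no parent and is written as "None".
    return f"{event_id} {parent_index} {node_index} {to_sparse_row(text, vocab)}"


# Hypothetical toy vocabulary and reply tree, for illustration only.
vocab = {"rumor": 0, "fast": 1, "reply": 2}
nodes = [
    ("e1", None, 1, "rumor spreads fast"),
    ("e1", 1, 2, "fast reply to rumor rumor"),
]
rows = [format_node(*node, vocab) for node in nodes]
```

Words outside the vocabulary are dropped, and word order is lost, as noted earlier in the thread.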
Thank you very much. |
Hello, have you managed to process your own dataset into this data format? |
Each row of the preprocessed data file contains "index:count" pairs. What do they signify? Is it that the tweet text has been tokenized, a vocabulary/dictionary has been built from all the tokens, and each row lists the vocabulary index and count of each token in that post? Could you please provide more details on how you preprocessed the dataset?