This is the implementation of a Twitter data crawler and its data-analysis code. It uses TF-IDF and LSH (locality-sensitive hashing) to group tweets and to assign geo-tags to non-geotagged tweets (a sketch of the grouping pipeline is given at the end of this README).
University of Glasgow
The tweet-clustering result: https://github.com/Ten000hours/WebScience/blob/master/TwitterCluster.pdf
We also analysed another platform, Google+, using the same method; its clustering result: https://github.com/Ten000hours/WebScience/blob/master/GoogleCluster.pdf
Using "pip install --" command to install following libs:
tweepy
pymongo
bson.son
matplotlib.pylab
datetime
dateutil.parser
google-api-python-client
httplib2
regex
sklearn.feature_extraction.text
nltk.corpus
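For example, everything above can be installed with one command (adjust to pip3 or a virtual environment as needed):
pip install tweepy pymongo matplotlib python-dateutil google-api-python-client httplib2 regex scikit-learn nltk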
Note that you may also need to download the corresponding NLTK data, so before running the scripts, run the following code ("corpus" is not a valid NLTK package identifier; the scripts only use the stopwords corpus, so downloading "stopwords" is sufficient):
import nltk
nltk.download("stopwords")
You also need to install MongoDB in advance and change the database directory path used in the code.
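If MongoDB should store its data somewhere other than the default location, start the server with an explicit data directory (the path below is a placeholder):
mongod --dbpath /path/to/your/db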
Before running the Python code, you need to import the JSON data (data.rar) into MongoDB using the following commands at a command prompt:
mongoimport -d WEBSCIENCE -c Twitter_REST_search_geo --file Twitter_REST_search_geo.json --type json
mongoimport -d WEBSCIENCE -c Twitter_location_without_tag --file Twitter_location_without_tag.json --type json
mongoimport -d WEBSCIENCE -c Twitter_location_with_tag --file Twitter_location_with_tag.json --type json
mongoimport -d WEBSCIENCE -c GooglePlus_text_glasgow --file GooglePlus_text_glasgow.json --type json
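To check that the import worked, you can count the documents in each collection with pymongo (a minimal sketch, assuming MongoDB is running on the default localhost:27017):
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["WEBSCIENCE"]
for name in ["Twitter_REST_search_geo", "Twitter_location_without_tag",
             "Twitter_location_with_tag", "GooglePlus_text_glasgow"]:
    print(name, db[name].count_documents({}))  # a non-zero count means the import succeeded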
This project uses the following third-party code:
LSH python version: https://github.com/totalgood/LSHash
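As a rough illustration of the grouping pipeline (a minimal sketch, not the project's exact code; the tweet texts, hash size, and number of query results are placeholders, and the query result format follows the LSHash README):
from sklearn.feature_extraction.text import TfidfVectorizer
from lshash import LSHash

tweets = ["rainy day in glasgow", "sunny morning in glasgow", "match day at hampden"]  # placeholder texts

# Build TF-IDF vectors for the tweet texts, dropping English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(tweets).toarray()

# Index each dense TF-IDF vector; similar tweets hash into nearby buckets.
lsh = LSHash(hash_size=10, input_dim=vectors.shape[1])
for i, vec in enumerate(vectors):
    lsh.index(vec, extra_data=i)  # store the tweet index alongside its vector

# Query with one tweet's vector to find its nearest neighbours; a geo-tagged
# neighbour's location can then be assigned to a non-geotagged tweet.
for (vec, idx), dist in lsh.query(vectors[0], num_results=2):
    print(tweets[idx], dist)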