
stccenter/Covid-19SentimentAnalysis


COVID-19 Sentiment Analysis

Project Background

The project implementation consists of the following methods:

  1. A data pre-processing methodology to get rid of symbols and stopwords.
  2. A convolutional neural network (CNN) to train the model to classify social media posts based on categories ("medical research", "social events", "pandemic data", "administrative policy").
  3. Accuracy plots for each epoch
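The pre-processing step in item 1 can be illustrated with a minimal sketch. It assumes a tiny hand-written stopword set and a regex-based symbol filter; the repository's script may use a different stopword list and different cleaning rules. The names `STOPWORDS` and `preprocess` are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and"}

def preprocess(text):
    """Lowercase the text, replace symbols with spaces, and drop stopwords."""
    # Replace everything except lowercase letters, digits, and whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

print(preprocess("The virus is spreading in Wuhan!!! #COVID19"))
# -> ['virus', 'spreading', 'wuhan', 'covid19']
```

The resulting token list is the kind of input that would then be mapped to word vectors before being passed to the CNN.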

Project Goal

In December 2019, the COVID-19 outbreak began in Wuhan, China, and quickly spread to nearly every country in the world. Since then, many news stories and reports have been published every day. Reading and categorizing all of this news by hand for research purposes is very time-consuming. Therefore, an automatic classification method is needed to categorize news efficiently.

This project began in December of 2019, and ended in May 2020.

Please contact Zifu Wang with questions about this project.

Please see the detailed implementation guidance provided below or download it as a pdf.

1. Implementation Guidance

1.1 Data Understanding

Data Understanding section on how the data was obtained.

1.2 Overall Training Dataset

One training dataset is “Coronavirus: A timeline of how the deadly COVID-19 outbreak is evolving” (https://www.pharmaceutical-technology.com/), which records the timeline of major COVID-19 events since December 2019.

Another data source is China Data Lab, 2020, "Global News" (https://dataverse.harvard.edu/), hosted on Harvard Dataverse. It is a CSV file of crawled web links to January news items related to COVID-19 and contains 81892 lines of data. However, some of the links are no longer valid because considerable time has passed since the news was published.
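Because some links in the CSV no longer resolve, it can help to separate live links from dead ones before downloading articles. The sketch below is one way to do that, not the project's own method. The column name `url`, the helper names, and the HEAD-request check are all assumptions; the real file may use a different column name.

```python
import csv
import io
import urllib.request

def link_is_alive(url, timeout=5):
    """Return True if a HEAD request to the URL succeeds (assumed liveness check)."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        # Any network, HTTP, or URL-format error counts as a dead link here.
        return False

def split_links(csv_file, column="url", check=link_is_alive):
    """Partition the URLs of an open CSV file into (alive, dead) lists."""
    alive, dead = [], []
    for row in csv.DictReader(csv_file):
        (alive if check(row[column]) else dead).append(row[column])
    return alive, dead

# Offline demonstration with an in-memory CSV and a fake checker.
sample = io.StringIO("url\nhttp://a.example/1\nhttp://b.example/2\n")
print(split_links(sample, check=lambda u: "a.example" in u))
```

Passing the checker as a parameter keeps the partitioning logic testable without network access.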

1.3 Computing Platforms

The tweet classification was tested on five computing platforms:
  (A) Windows Laptop with Graphics Card
  (B) AWS Windows g4dn instance with 4 NVIDIA Tesla T4
  (C) Windows Laptop with Intel Core i7-8550U CPU
  (D) Ubuntu on Windows 10
  (E) AWS Deep Learning Ubuntu 18.04 LTS

Overview of Software Requirements

Open-source package              Version
Python                           3.8
NVIDIA driver (Windows)          496.49
NVIDIA driver (Ubuntu Server)    450.51.06
CUDA                             11.0
TensorFlow                       2.4.1
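Before running the scripts, the installed versions can be compared with the table above. This is a minimal sketch under stated assumptions: the function name `check_versions` is hypothetical, and it assumes TensorFlow was installed under the distribution name `tensorflow` (a differently named build would be reported as missing).

```python
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 8)
REQUIRED_TENSORFLOW = "2.4.1"

def check_versions():
    """Compare the running interpreter and installed TensorFlow with the table."""
    try:
        tf_version = metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        tf_version = None  # TensorFlow is not installed in this environment
    return {
        "python_ok": sys.version_info[:2] == REQUIRED_PYTHON,
        "tensorflow_installed": tf_version,
        "tensorflow_ok": tf_version == REQUIRED_TENSORFLOW,
    }

print(check_versions())
```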

A: 1.3.0 Windows Laptop with Graphics Card

A: 1.3.0.0 Check Graphic Card Details

Go to the Windows Start menu and type "device manager". Then expand Display adapters to see your graphics cards.

Caption: Windows Device Manager

A: 1.3.0.1 Download and Install the NVIDIA driver

  • Step 1: Use the link to download the NVIDIA driver.

  • Step 2: Download the driver.

  • Step 3: Install downloaded NVIDIA driver.

  • Step 4: You will find the CUDA subfolder inside the “NVIDIA GPU Computing Toolkit” folder, which is in the “Program Files” folder on the C drive (C:\Program Files\NVIDIA GPU Computing Toolkit).

A: 1.3.0.2 CUDA Toolkit Installation

  • Step 1: Go to CUDA Toolkit Archive.

  • Step 2: Find the CUDA Toolkit 11.0 version.

    Caption: Latest Release of CUDA

  • Step 3: Select the Operating System (Linux or Windows), architecture, and version based on your machine preference.

    Caption: Match your Architecture

  • Step 4: Click download.

  • Step 5: Double click the downloaded exe file (Example: cuda_11.0.2_451.48_win10.exe) and follow the on-screen prompts.

A: 1.3.0.3 Download cuDNN

  • Step 1: Go to the cuDNN link.

  • Step 2: Click Download cuDNN. If you are a first-time user, you need to create a user account and consent to the cuDNN Software License Agreement.

  • Step 3: Select the right version of cuDNN. Please note that the versions of CUDA and cuDNN must match. This guide uses cuDNN version 8.0.5 for CUDA 11.0.

  • Step 4: It will download as a compressed folder. Extract the compressed folder.

  • Step 5: The extracted folder has a “cuda” subfolder that matches the “CUDA” subfolder in C:\Program Files\NVIDIA GPU Computing Toolkit.

  • Step 6: Now, copy cudnn64_8.dll from the bin folder of the extracted folder (C:\Users\anush\Downloads\cudnn-11.0-windows-x64-v8.0.5.39\cuda\bin) and paste it into the bin folder inside the CUDA folder of the NVIDIA GPU Computing Toolkit (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin).

  • Step 7: Copy the cudnn.h file from the include folder of the extracted folder (C:\Users\anush\Downloads\cudnn-11.0-windows-x64-v8.0.5.39\cuda\include) and paste it into the include folder inside the CUDA folder of the NVIDIA GPU Computing Toolkit (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include).

  • Step 8: Copy the cudnn.lib file from the lib\x64 folder inside the extracted folder (C:\Users\anush\Downloads\cudnn-11.0-windows-x64-v8.0.5.39\cuda\lib\x64) and paste it into the matching folder of the NVIDIA GPU Computing Toolkit (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\lib\x64).
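The three copy operations in Steps 6 to 8 can also be scripted. The following is a sketch rather than part of the original guide: the helper name `copy_cudnn` is hypothetical, both folder paths are passed in by the caller, and writing into C:\Program Files typically requires an elevated (administrator) prompt.

```python
import shutil
from pathlib import Path

# Files named in Steps 6-8, relative to the extracted "cuda" folder.
CUDNN_FILES = ["bin/cudnn64_8.dll", "include/cudnn.h", "lib/x64/cudnn.lib"]

def copy_cudnn(extracted_cuda, toolkit_root):
    """Copy each cuDNN file into the matching subfolder of the CUDA toolkit."""
    copied = []
    for rel in CUDNN_FILES:
        src = Path(extracted_cuda) / rel
        dst = Path(toolkit_root) / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # raises FileNotFoundError if the source is missing
        copied.append(dst)
    return copied

# Example call (paths follow the guide; adjust them to your machine):
# copy_cudnn(r"C:\Users\anush\Downloads\cudnn-11.0-windows-x64-v8.0.5.39\cuda",
#            r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0")
```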

A: 1.3.0.4 Clone the Covid-19 sentiment analysis repository and download training data

  • Step 1: Go to the Start menu and open Command Prompt. Use the commands below to create a new directory for the project (named Covid19SentimentAnalysis here; you may choose another name) and navigate to it.
mkdir Covid19SentimentAnalysis
cd Covid19SentimentAnalysis
  • Step 2: Clone or download the repository.

  • Step 3: From the cloned repository, copy the requirements.txt, tweets_classification.py, tweets_classification_multi_GPU.py, and training_data_set_unduplicate.csv files and paste them inside the Covid19SentimentAnalysis folder.

  • Step 4: Download GoogleNews-vectors-negative300.bin. This will need to be unzipped using 7-Zip. Then place the unzipped .bin file inside the Covid19SentimentAnalysis folder.
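As an alternative to 7-Zip in Step 4, the archive can be decompressed with Python's standard library. This sketch assumes the download is a gzip archive named GoogleNews-vectors-negative300.bin.gz; if your download uses a different compression format, this approach does not apply. The helper name `gunzip` is hypothetical.

```python
import gzip
import shutil

def gunzip(src_path, dst_path):
    """Stream-decompress a .gz archive to dst_path without loading it into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

# Example call (the archive name is an assumption about the download):
# gunzip("GoogleNews-vectors-negative300.bin.gz", "GoogleNews-vectors-negative300.bin")
```

Streaming with `shutil.copyfileobj` matters here because the decompressed vector file is large.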

A: 1.3.0.5 Create conda virtual environment

Note: Please make sure you have Python 3.8 and Anaconda installed on your system.

  • Step 1: In your terminal, create a conda virtual environment using the command below:
conda create --name sentimentanalysis-env python=3.8
  • Step 2: Activate the conda environment:
conda activate sentimentanalysis-env

A: 1.3.0.6 Install Python packages

  • Step 1: In the command prompt, use the below command to install the required packages:
pip install -r requirements.txt

Caption: Python packages installation

A: 1.3.0.7 Verify GPU

To verify that your GPU is set up correctly, type python in your terminal:

Caption: Go to python terminal

Type the commands below in the Python shell. They print the devices, including any GPUs, that TensorFlow can currently see on the machine.

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

Caption: List of GPU devices

Type exit() to exit the python shell.
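A shorter check that lists only GPUs is sketched below. It uses `tf.config.list_physical_devices`, which is part of the public TensorFlow 2 API; the wrapper name `list_gpus` is hypothetical, and the function returns an empty list when TensorFlow is not installed so that it can be run in any environment.

```python
def list_gpus():
    """Return the names of GPUs visible to TensorFlow, or [] if none are found."""
    try:
        import tensorflow as tf
    except ImportError:
        return []  # TensorFlow is not installed in this environment
    return [device.name for device in tf.config.list_physical_devices("GPU")]

print(list_gpus())
```

An empty list on a machine with a GPU usually points to a driver, CUDA, or cuDNN version mismatch.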

A: 1.3.0.8 Run the script

set CUDA_VISIBLE_DEVICES=0 & python tweets_classification_multi_GPU.py
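The same GPU selection can be done from Python in a way that works on both Windows and Linux. This is a sketch, not the guide's method: `build_env` and `run_script` are hypothetical helpers that set CUDA_VISIBLE_DEVICES for a child process only, leaving the parent environment unchanged.

```python
import os
import subprocess
import sys

def build_env(gpu_ids):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env

def run_script(script, gpu_ids):
    """Run the script with the current interpreter, limited to the given GPUs."""
    return subprocess.run([sys.executable, script], env=build_env(gpu_ids), check=True)

# Example: run_script("tweets_classification_multi_GPU.py", [0])
```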

A: 1.3.0.9 Execution time

The total time for execution is 46.34 seconds.

Caption: GPU output

It should also display a model accuracy graph similar to this:

Caption: Model graph

B: 1.3.1 AWS g4dn instance with NVIDIA Tesla T4

B: 1.3.1.0 Download and Install the NVIDIA driver

The following steps are based on the AWS help document for installing the NVIDIA driver on a g4dn instance.

  • Step 1: Connect to the Windows instance. Then go to the Start menu and open a PowerShell window. Copy and paste the commands below, which download the driver to a folder on the Desktop.
  $Bucket = "ec2-windows-nvidia-drivers"
  # Assumption: $KeyPrefix was not defined in the original commands;
  # "grid-13.0" is inferred from the folder named in Step 2.
  $KeyPrefix = "grid-13.0"
  $LocalPath = "$home\Desktop\NVIDIA"
  $Objects = Get-S3Object -BucketName $Bucket -KeyPrefix $KeyPrefix -Region us-east-1
  foreach ($Object in $Objects) {
      $LocalFileName = $Object.Key
      # Skip folder placeholders and empty objects
      if ($LocalFileName -ne '' -and $Object.Size -ne 0) {
          $LocalFilePath = Join-Path $LocalPath $LocalFileName
          Copy-S3Object -BucketName $Bucket -Key $Object.Key -LocalFile $LocalFilePath -Region us-east-1
      }
  }
  • Step 2: Navigate to the desktop -> NVIDIA (folder) -> grid-13.0.

  • Step 3: Double-click the installation file 471.68_grid_win10_server2016_server2019_server-aws-swl.exe to launch the installation process. Follow the instructions to successfully install the driver.

  • Step 4: Reboot your instance.

  • Step 5: Verify the GPU is working properly by checking the Device Manager. Go to the Start menu and search for the Device Manager. Then, expand the section "Display adapters".

Caption: Verify GPU Installation

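  • Step 6 (optional): Disable the licensing page in the NVIDIA control panel so that users cannot accidentally change the product type:
New-ItemProperty -Path "HKLM:\SOFTWARE\NVIDIA Corporation\Global\GridLicensing" -Name "NvCplDisableManageLicensePage" -PropertyType "DWord" -Value "1"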

B: 1.3.1.1 Install CUDA 11.0

  • Step 1: Go to CUDA Toolkit Archive.

  • Step 2: Find the CUDA Toolkit 11.0 version.

Caption: Latest Release of CUDA

  • Step 3: Select the target platform. For operating system, select Windows; for architecture, select x86_64; for version, select Server 2019; and for installer type, select exe (local).

  • Step 4: Click download.

  • Step 5: Double click the downloaded exe file and follow the on-screen prompts.

  • Step 6: Reboot the instance.

B: 1.3.1.2 Download cuDNN library

  • Step 1: Go to cuDNN link.

  • Step 2: Click Download cuDNN. If you are a first-time user, you need to create a user account and consent to the cuDNN Software License Agreement.

  • Step 3: Select Archived cuDNN Releases link. Select cuDNN v8.0.5 (November 9th, 2020), for CUDA 11.0.

  • Step 4: Select cuDNN Library for Windows (x86)

  • Step 5: Download the compressed folder and extract it.

  • Step 6: The extracted folder has a “cuda” subfolder that matches the “CUDA” subfolder in C:\Program Files\NVIDIA GPU Computing Toolkit.

  • Step 7: Now, copy all dll files from the bin folder of the extracted folder (C:\Users\Administrator\Downloads\cudnn-11.0-windows-x64-v8.0.5.39\cuda\bin) and paste them into the bin folder inside the CUDA folder of the NVIDIA GPU Computing Toolkit (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin).

  • Step 8: Copy the cudnn.h file from the include folder of the extracted folder (C:\Users\Administrator\Downloads\cudnn-11.0-windows-x64-v8.0.5.39\cuda\include) and paste it into the include folder inside the CUDA folder of the NVIDIA GPU Computing Toolkit (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include).

  • Step 9: Copy the cudnn.lib file from the lib\x64 folder inside the extracted folder (C:\Users\Administrator\Downloads\cudnn-11.0-windows-x64-v8.0.5.39\cuda\lib\x64) and paste it into the matching folder of the NVIDIA GPU Computing Toolkit (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\lib\x64).

B: 1.3.1.3 Set up the Virtual Environment

  • Step 1: Create a new project folder and name it as you like. This guide uses "sentiment-analysis" as an example.

  • Step 2: Go to the Start menu and search for "command prompt". Then navigate to your project folder, changing the path as needed.

cd sentiment-analysis
  • Step 3: Create and then activate your virtual environment using the code below. As per our example in Step 1, our environment is named sentiment-analysis.

Creation

  python -m venv sentiment-analysis

Activation

  sentiment-analysis\Scripts\activate.bat

B: 1.3.1.4 Clone the project repository and download training data

  • Step 1: Clone or download the repository.

  • Step 2: From the cloned repository, copy and paste the requirements.txt, tweets_classification_multi_GPU.py, and tweets_classification.py inside the sentiment-analysis directory.

  • Step 3: Download GoogleNews-vectors-negative300.bin. This will need to be unzipped using 7-Zip. Then place the unzipped .bin file inside the sentiment-analysis folder.

  • Step 4: Download the training_data_set_unduplicate.csv from the cloned repository and place it inside the sentiment-analysis folder.

B: 1.3.1.5 Install Python Packages

In the command prompt, paste the below code to install the necessary Python packages:

pip install -r requirements.txt

B: 1.3.1.6 Run the script

a. For multi-GPU, use the command below to run the script.

set CUDA_VISIBLE_DEVICES=0,1,2,3 & python tweets_classification_multi_GPU.py

NVIDIA-SMI screenshot:

Caption: NVIDIA SMI

b. For single-GPU, use the command below to run the script.

set CUDA_VISIBLE_DEVICES=0 & python tweets_classification_multi_GPU.py

Note: You may get the following error while running the script:

No OpKernel was registered to support Op 'NcclAllReduce'

If so, change line #86 to:

strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
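To avoid editing the script by hand on each platform, the all-reduce implementation could be chosen at run time. The sketch below rests on the assumption that the NcclAllReduce error appears because NCCL kernels are not available in Windows builds of TensorFlow; the helper names are hypothetical and this is not the repository's code. Both `tf.distribute.NcclAllReduce` and `tf.distribute.HierarchicalCopyAllReduce` are public TensorFlow 2 classes.

```python
import platform

def cross_device_ops_name(system=None):
    """Pick an all-reduce implementation name based on the operating system."""
    system = system or platform.system()
    # Assumption: NCCL kernels are unavailable in Windows builds of TensorFlow.
    return "HierarchicalCopyAllReduce" if system == "Windows" else "NcclAllReduce"

def make_strategy():
    """Build a MirroredStrategy with the chosen all-reduce (requires TensorFlow)."""
    import tensorflow as tf
    ops_class = getattr(tf.distribute, cross_device_ops_name())
    return tf.distribute.MirroredStrategy(cross_device_ops=ops_class())

print(cross_device_ops_name())
```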

B: 1.3.1.7 Execution time

a. For 4 GPUs: The total time for execution is 83.02 seconds.

b. For single GPU: The total time for execution is 48.17 seconds.

C: 1.3.2 Windows Laptop with Intel Core i7-8550U CPU

  • Step 1: Create a new folder and name it CovidSentimentAnalysis.

  • Step 2: Download the zip file of the Covid-19SentimentAnalysis repository.

  • Step 3: Once the zip file is downloaded, extract the folder and copy each file into the CovidSentimentAnalysis folder.

  • Step 4: Download GoogleNews-vectors-negative300.bin. Once it is downloaded, you will need to install 7-Zip in order to unzip the file.

  • Step 5: After installing 7-Zip, right-click the downloaded GoogleNews-vectors-negative300.bin archive and extract the .bin file. Once it is extracted, add the .bin file to the CovidSentimentAnalysis folder.

  • Step 6: Open Command Prompt

  • Step 7: Create virtual environment by typing:

python -m venv sentimentanalysis-env
  • Step 8: Activate the virtual environment by typing:
sentimentanalysis-env\Scripts\activate.bat
  • Step 9: Begin installing the packages for the requirements.txt file by typing:
pip install -r requirements.txt
  • Step 10: Run the script for the tweets_classification.py by typing:
python tweets_classification.py

The script took approximately 30 seconds to run.

Caption: Time taken to train the model

It should also display a model accuracy graph similar to this:

Caption: Model accuracy

D: 1.3.3 Ubuntu on Windows 10

  • Step 1: Create Conda Environment

Enter this command to create the conda environment:

conda create -n sentiment-env python=3.8
  • Step 2: Activate the Conda Environment

To activate the conda environment you created, enter this command:

conda activate sentiment-env

Followed by this command to install pip to the conda environment:

conda install pip
  • Step 3: Download the Training Data and Required python packages

Download the training data at this link and place it in a folder named sentimentpackage on your desktop.

Download the zip containing your script and place the contents of the zip in the sentimentpackage folder.

To change to the desktop directory, enter the command:

cd Desktop

To change to your folder, enter the command:

cd sentimentpackage

Install the required packages using the command:

pip install -r requirements.txt
  • Step 4: Run the Script

To run the script, enter the command:

python tweets_classification.py 

E: 1.3.4 AWS Deep Learning Ubuntu 18.04 LTS

  • Step 1: The AWS Deep Learning AMI comes with different versions of CUDA. Please switch to the correct CUDA version, 11.0, by using the following commands:
sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-11.0 /usr/local/cuda

After running both commands, verify your CUDA version by running NVIDIA's nvcc program:

nvcc --version
  • Step 2: Create a folder inside your preferred directory in which you will store all the necessary files for the Tweets classification program. For the purposes of this guide, I will refer to this folder as 'Covid-19SentimentAnalysis'.

  • Step 3: Download GoogleNews-vectors-negative300.bin. This will need to be unzipped using 7-Zip. Then place the unzipped .bin file inside the Covid-19SentimentAnalysis folder.

  • Step 4: Download the training_data_set_unduplicate.csv from the cloned repository and place it inside the Covid-19SentimentAnalysis folder.

  • Step 5: Clone the Tweets classification repository. In your terminal type the below command:

git clone https://github.com/stccenter/Covid-19SentimentAnalysis

Copy requirements.txt, tweets_classification.py, and tweets_classification_multi_GPU.py files from the cloned folder and paste them inside Covid-19SentimentAnalysis folder.

  • Step 6: Create a conda environment with the required Python version, 3.8, using the following command (you may name your conda environment however you like; for the purposes of this guide, I will use the name 'tweetclassify-env'):
conda create -n tweetclassify-env python=3.8
  • Step 7: After you create your conda environment, activate it using the following command:
conda activate tweetclassify-env
  • Step 8: Use the following command to install all of the necessary packages:
pip install -r requirements.txt
  • Step 9: Run the tweet classification program using the following command:

For single GPU

CUDA_VISIBLE_DEVICES=0 python tweets_classification_multi_GPU.py

For multi GPU

CUDA_VISIBLE_DEVICES=0,1,2,3 python tweets_classification_multi_GPU.py
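The CUDA version check from Step 1 of this section can be automated by parsing the text that nvcc prints. The sketch below assumes the output contains a phrase of the form "release 11.0"; the sample line and its build number are illustrative, not captured from a real machine, and the helper name `parse_nvcc_release` is hypothetical.

```python
import re

def parse_nvcc_release(output):
    """Extract the CUDA release string (e.g. '11.0') from nvcc --version output."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None

# Sample text in the format nvcc is assumed to print; the build number is illustrative.
sample = "Cuda compilation tools, release 11.0, V11.0.221"
print(parse_nvcc_release(sample))
```

In practice the input would come from something like `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`, and a result other than "11.0" would mean the symbolic link from Step 1 still points at a different CUDA version.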

2. Demystifying the Code

What scripts will be gone over? Overview of all items.

2.1 Script

Script overview, a few sentences on what the program is meant to do.

2.1.0 Script Segment

CODE

Explain code.

2.1.1 Model Output

Describe the model output.

2.1.2 Interpreting the Model Output

Interpret the model output.

Videos

Walkthrough video for AWS Ubuntu Server multi-GPU instance

Walkthrough Video for AWS Windows g4dn instance - Multi GPU

Walkthrough Video for Windows Laptop with Intel Core i7-8550U CPU
