Blueplanet project

Video Links

Short description

This web application addresses the difficulty of finding review posts/threads to use as information when planning trips. When planning a trip, Thai people often search for interesting travel reviews, especially the review threads on Pantip.com, which are very popular in Thailand. However, not every post matches their needs; sometimes they have to read up to 10 threads to find a single useful one. Our web application therefore classifies threads so that users can filter for only the threads they want. It also provides other services, such as suggesting posts and creating your own favorite triplist.

This project consists of 3 repositories:

  1. Front-end: https://gitlab.mikelab.net:65443/blueplanet/fontend is the user interface.
  2. Back-end: https://gitlab.mikelab.net:65443/blueplanet/backend (this repository) is the bridge between the front-end and the database. It sends data from the database to the front-end and receives data from the front-end to store in the database.
  3. Analytics: https://gitlab.mikelab.net:65443/blueplanet/analytics prepares/classifies data and pushes it into the database. Please read the directory and file explanation below for the purpose of each file.

After finishing the installation, start the back-end first by running nodemon server.js in the back-end directory. Then go to the front-end directory and run yarn start. You do not need to run the Analytics repository to use the web application.

Installation

0) Prerequisites

Front-end: NodeJS
Back-end: NodeJS
Analytics: Python 3.6+, MongoDB

useful link

1) Download database

The initial data is loaded via the mongo console, so start the mongo server, open the mongo console, and run the command below. The data files can be downloaded from the drive link.

> load("/path/to/parent/directory/mongo_js/initialize/[filename].js")

Here are the commands to load all of the data files.

> load("import_classified_thread_300.js")
> load("import_classified_thread_20200425.js")
> load("import_countries_list.js")
> load("import_favorites.js")
> load("import_recently_viewed.js")
> load("import_triplists.js")

2) Clone each project

Go to your preferred directory on your computer

$ git clone https://gitlab.mikelab.net:65443/blueplanet/fontend
$ git clone https://gitlab.mikelab.net:65443/blueplanet/backend
$ git clone https://gitlab.mikelab.net:65443/blueplanet/analytics

3) Back-end installation (both on local and production)

from parent directory

$ cd backend/

install package using command yarn or npm install

$ yarn

download serviceKeyAccount.json and .env via link using a .ku.th account and place them in the parent directory.

start server

$ nodemon server.js

4) Front-end installation

from parent directory

$ cd fontend/

install package using command yarn or npm install

$ yarn

start the server; on local, use

$ yarn start

but on production you have to set an environment variable first

$ export REACT_APP_BACKEND_URL=mars.mikelab.net:30010

and then start the production server

$ yarn start:production

5) Analytics installation

from parent directory

$ cd analytics/

install the required packages (pip version must be >10.0.0)

$ pip install --upgrade pip
$ pip install -r package_requirements.txt
$ pip install pythainlp

use the python or python3 command to run the files. Read the explanation below for the purpose of each file and folder.

$ python3 [filename.py]

directory and files explanation

1. clicksteam/
contains ranking.py, which ranks the top countries by the number of threads related to each country. Threads are counted per day (currently this does not run automatically) from each collection in the clicksteam database.
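
The ranking idea can be sketched roughly as below. This is not the code in ranking.py; the `countries` and `topic_id` fields and the `rank_countries` helper are hypothetical names chosen for illustration, and the input is a plain list instead of a MongoDB collection.

```python
from collections import Counter


def rank_countries(threads, top_n=3):
    """Return the top_n (country, thread_count) pairs for one day's threads.

    `threads` is a list of dicts; the "countries" key is a hypothetical
    field listing the countries a thread relates to.
    """
    counts = Counter()
    for thread in threads:
        # Count each country once per thread, even if it is listed twice.
        counts.update(set(thread.get("countries", [])))
    # Sort by count (descending), then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# One day's worth of (made-up) threads.
day_threads = [
    {"topic_id": 1, "countries": ["Japan", "Korea"]},
    {"topic_id": 2, "countries": ["Japan"]},
    {"topic_id": 3, "countries": ["Korea", "Japan", "Taiwan"]},
]
ranking = rank_countries(day_threads)
```

With the sample data, `ranking` lists Japan first (3 threads), then Korea (2), then Taiwan (1).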

2. config/
config files for the database and URLs

3. utils/
contains many Python files that provide necessary helper functions

4. naiveBayes-mmscale-interval-090320/ (one of the failed versions)

  • all steps from create text -> clean text -> TF-IDF -> Naive Bayes Classification -> prediction
  • The training data consists only of threads that have exactly one theme.
  • In the TF-IDF step, it uses the tf-idf score to cut words that have a low score.
  • In Naive Bayes classification, a single model predicts the multiple themes of each thread.
  • But the results are unacceptable.
  • contains train.py, which is the main file for execution
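
The word-cutting step described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in train.py: the tokenized input, the `threshold` value, and the helper names `idf_scores` and `cut_low_tfidf` are all hypothetical, and idf is taken as log(N / document frequency) without smoothing.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Map each word to log(N / document_frequency) over tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # each document counts a word once
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def cut_low_tfidf(documents, threshold):
    """Drop, per document, the words whose tf-idf score is below threshold."""
    idf = idf_scores(documents)
    kept = []
    for tokens in documents:
        tf = Counter(tokens)
        total = len(tokens)
        kept.append([w for w in tokens if (tf[w] / total) * idf[w] >= threshold])
    return kept


# Already-tokenized (made-up) thread texts.
docs = [
    ["trip", "mountain", "hiking"],
    ["trip", "sea", "diving"],
    ["trip", "mountain", "camping"],
]
filtered = cut_low_tfidf(docs, threshold=0.1)
```

Here "trip" appears in every document, so its idf (and tf-idf) is 0 and it is cut from all three documents, while the rarer words are kept.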

5. nb-mmscale-interval-yesno-230320/ (the latest version)

  • all steps from create text -> clean text -> TF-IDF -> Naive Bayes Classification -> prediction results -> measurements.
  • The training data consists only of threads that have exactly one theme.
  • In the TF-IDF step, it uses the idf score to cut words that have a low score.
  • In Naive Bayes classification, there is one model per theme, and each model predicts whether that theme is 'yes' or 'no'. To build a model, training threads (data) whose single theme is the theme under consideration are labeled 'yes' and all others are labeled 'no'. For example, if the theme under consideration is 'Mountain', threads whose single theme is 'Mountain' form the 'yes' class and threads whose single theme is not 'Mountain' form the 'no' class. These make up the X_train and Y_train data.
  • Uses the Jaccard Similarity Index to measure the similarity between the training dataset and the predicted dataset.
  • contains train.py, which is the main file for execution
  • contains many folders named [theme]-idf-new, which hold the model files.
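
The yes/no labeling and the Jaccard measurement can be sketched as below. This is a minimal illustration, not the code in train.py: the `text` and `themes` fields and both helper names are hypothetical, the Naive Bayes training itself is omitted, and comparing theme sets per thread is an assumption about how the index is applied.

```python
def yes_no_labels(threads, theme):
    """Build (texts, labels) for one theme from single-theme threads only."""
    texts, labels = [], []
    for thread in threads:
        if len(thread["themes"]) != 1:
            continue  # training uses only threads with exactly one theme
        texts.append(thread["text"])
        labels.append("yes" if thread["themes"][0] == theme else "no")
    return texts, labels


def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|; defined here as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up threads; the third has two themes and is skipped for training.
threads = [
    {"text": "hiking up Doi Inthanon", "themes": ["Mountain"]},
    {"text": "snorkelling at Koh Tao", "themes": ["Sea"]},
    {"text": "temples and beaches", "themes": ["Sea", "Religion"]},
]
x_train, y_train = yes_no_labels(threads, "Mountain")

# Similarity between a labeled theme set and a predicted theme set.
score = jaccard_index(["Mountain", "Sea"], ["Sea", "Religion"])
```

For the 'Mountain' model, `y_train` is `["yes", "no"]`, and the two theme sets share one of three distinct themes, so `score` is 1/3.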

6. classification-accuracy.py
calculates, for 304 threads, the similarity between manual classification and classification by the program. This file writes checked_300threads.json and accuracy_300threads.json.

7. checked_300threads.json
the result of checking and calculating, for 304 threads, the similarity between manual classification and classification by the program.

8. accuracy_300threads.json
summary measurements for the 304 threads that were classified manually.

9. co-occurrenceTest.py
an experiment for learning and trying out co-occurrence.

10. countriesListSorted.json
A list containing information on all countries around the world, sorted by country code.

11. labeledThreadsbyHand_v2.csv
The thread data classified manually (by the project's members).

12. classificationByPattern.py
The main file of threads classification.

13. scheduleClassify.py
automatically calls the classification function every day to update the data.
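
A daily trigger of this kind can be sketched with the standard library as below. This is not the code in scheduleClassify.py: the run time (midnight by default), the helper names, and the injected `sleep` and `clock` parameters are assumptions made so the sketch is self-contained and easy to try out.

```python
from datetime import datetime, timedelta


def seconds_until_next_run(now, hour=0, minute=0):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has already passed
    return (target - now).total_seconds()


def run_daily(job, sleep, clock=datetime.now, hour=0, minute=0, max_runs=None):
    """Call job() once per day at hour:minute.

    `sleep` and `clock` are injected (e.g. time.sleep and datetime.now);
    `max_runs` stops the loop after that many calls, None means run forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        sleep(seconds_until_next_run(clock(), hour, minute))
        job()
        runs += 1
```

In real use this would be started as `run_daily(classify, sleep=time.sleep)`, where `classify` stands in for the project's classification function.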