Skip to content

Latest commit

 

History

History
248 lines (170 loc) · 12.7 KB

File metadata and controls

248 lines (170 loc) · 12.7 KB

Spyder Python NumPy Pandas scikit-learn Tensorflow Keras

Covid19_New_Cases_Prediction_Project 💬

The year 2020 was a catastrophic year for humanity. Pneumonia of unknown aetiology was first reported in December 2019., since then, COVID-19 spread to the whole world and became a global pandemic. More than 200 countries were affected due to pandemic and many countries were trying to save precious lives of their people by imposing travel restrictions, quarantines, social distances, event postponements and lockdowns to prevent the spread of the virus. However, due to lackadaisical attitude, efforts attempted by the governments were jeopardised, thus, predisposing to the wide spread of virus and lost of lives.

The scientists believed that the absence of AI assisted automated tracking and predicting system is the cause of the wide spread of COVID-19 pandemic. Hence, the scientist proposed the usage of deep learning model to predict the daily COVID cases to determine if travel bans should be imposed or rescinded.

Project Descriptions 📝

The aim of this project is to create a deep learning model with only LSTM, Dense and Dropout to predict new cases for Covid19.

Project Organization 📁

├── Datasets                                    : Contains file about the project
├── Statics                                     : Contains all save image (graphs/heatmap/tensorboard)
├── logs                                        : logs file for tensorboard
├── models                                      : .pkl for mms_train & test
├──.gitattributes                               : .gitattributes
├── README.md                                   : Project Descriptions
├── classes.py                                  : Module file in python format
└── main_covid19.py                             : Main code file in python format

Requirements 💻

This project is created using Spyder as the main IDE. The main frameworks used in this project are Pandas, Matplotlib, Seaborn, Scikit-learn, Tensorflow and Tensorboard.

Methodology 🏃

This project contains two .py files. The training file and the module file is main_covid19.py, classes.py. The flow of the projects are as follows:

  • Step 1 - Loading the data:

    • Data preparation is the primary step for any deep learning problem. Raw Data is retrieved from

    dataset. This raw data consists of training and test dataset.

       CSV_PATH_TRAIN = os.path.join(os.getcwd(),'Datasets','cases_malaysia_train.csv')
       CSV_PATH_TEST = os.path.join(os.getcwd(),'Datasets','cases_malaysia_test.csv')
       
       df_train = pd.read_csv(CSV_PATH_TRAIN)
       df_test = pd.read_csv(CSV_PATH_TEST)
    
    • From the dataset, the are a few rows in df[cases_new] have empty space and ?. To change cases_new into numeric and change ? and empty space into NaNs

      df_train['cases_new'] = pd.to_numeric(df_train['cases_new'], errors='coerce')
      
  • Step 2) Data Inspection:

    • Train dataset

          df_train.head()
          df_train.info()
          df_train.describe().T
      
          # Time Series Data
          con_col_train = df_train.columns[(df_train.dtypes=='float64') | 
                                           (df_train.dtypes=='int64')]
          print(con_col_train)
      
          # time Series Data Visualization (Line Plot)
          eda = EDA()
          eda.lineplot_graph(con_col_train,df_train)
      
          # To check NaNs value
          df_train.isna().sum()
      
      • From this train dataset there are 12 in cases_new and 342 NaNs in cluster_import,cluster_religious,cluster_community, cluster_highRisk,cluster_education,cluster_detentionCentre, and cluster_workplace. This is because there are not yet this clusters in the begining of Covid-19.

    • Test dataset

          df_test.head()
          df_test.info()
          df_test.describe().T
      
           # Time Series Data
           con_col_test = df_test.columns[(df_test.dtypes=='float64') | 
                                          (df_test.dtypes=='int64')]
           print(con_col_test)
      
           # time Series Data Visualization (Line Plot)
      
           eda = EDA()
           eda.lineplot_graph(con_col_test,df_test) 
      
           # To check NaNs value
           df_test.isna().sum()
      
      • From this test dataset there are 1 NaNs in cases_new

  • Step 3) Data Cleaning:

    • Do not clean time series data unless REALLY necessary. But this data have NaNs, so we need to used intepolate. Both train and test dataset have NaNs in cases_new column.

         # Train Dataset
         df_train['cases_new'] = df_train['cases_new'].interpolate()
         # To check the NaNs
         df_train.isna().sum()
         # NaNs in cases_new in train dataset have been interpolate
      
         # Test Dataset
         df_test['cases_new'] = df_test['cases_new'].interpolate()
         # To check the NaNs
         df_test.isna().sum()
         # NaNs in cases_new in test dataset have been interpolate
      
  • Step 4) Features Selection

    • Train dataset

         # cases_new - Only 1 feature 
         X = df_train['cases_new']
      
         # Method 1: MinMaxScaler
         mms = MinMaxScaler()
         X = mms.fit_transform(np.expand_dims(X,axis=-1))
      
         # Save Train Min Max Scaler
         with open(MMS_SAVE_PATH, 'wb') as file:
             pickle.dump(mms.transform,file)
      
         win_size = 30
         X_train = []
         y_train = []
      
         # produce list
         for i in range(win_size,len(X)):
           X_train.append(X[i-win_size:i])
           y_train.append(X[i])
      
         # change list to rank 3 array by using np.array
         X_train = np.array(X_train) 
         y_train = np.array(y_train)
      
    • Test dataset

      • We need 30 days of data to predict the number of covid19 cases. The test data only have 100 sample so we need to concatenate together with train data

        #Must to concatenate
        dataset_cat = pd.concat((df_train['cases_new'],df_test['cases_new']))
        
        # Method 1
        length_days = len(dataset_cat)-len(df_test)-win_size
        tot_input = dataset_cat[length_days:] 
        
        Xtest = mms.transform(np.expand_dims(tot_input,axis=-1))
        
        # Save Test Min Max Scaler
        with open(MMS_SAVE_PATH, 'wb') as file:
            pickle.dump(mms.transform,file)
        
        X_test = []
        y_test = []
        
        # Produce list
        for i in range(win_size,len(Xtest)):
          X_test.append(Xtest[i-win_size:i])
          y_test.append(Xtest[i])
        
        # Change list to rank 3 array by using np.array
        X_test = np.array(X_test) 
        y_test = np.array(y_test)
        
  • Model Development

    • By using the model Sequential, LSTM, dropout, and Dense.

    The model can be view in classes.py file.

         md = ModelDevelopment()
    
         input_shape = np.shape(X_train)[1:]
         nb_class = 1
         nb_node = 32
         dropout_rate = 0.2
         activation = 'relu'
    
         model = md.simple_dl_model(input_shape,nb_class,nb_node,dropout_rate,activation)
    
         model.compile(optimizer='adam',loss='mse',metrics=['mean_absolute_percentage_error','mse'])
    
  • Model Training

    • This model include tensorboard callbacks training the model. This model used 500 epoch to train.

        tensorboard_callback = TensorBoard(log_dir=LOGS_PATH,histogram_freq=1)
      
        hist = model.fit(X_train,y_train,
                          epochs=500,
                          validation_data=(X_test,y_test),
                          callbacks=(tensorboard_callback))
      
    • The visualization of our model architecture can be presented in the figure below:

    model

  • Model Evaluation

    • From this sections, line graph and inverse graph were plotted and mean absolute percentage error were recorded.

          me = ModelEvaluation()
          me.plot_hist_graphy(hist)
      
          predicted_new_cases= model.predict(X_test)
      
          # Plot line graph
          me.plot_line_graph(y_test, predicted_new_cases)
      
          # Plot inverse graph
          actual_cases = mms.inverse_transform(y_test)
          predicted_cases = mms.inverse_transform(predicted_new_cases)
          me.inverse_line_graph(actual_cases, predicted_cases)
      
          # Mean Absolute Percentage Error
          print(mean_absolute_percentage_error(actual_cases,predicted_cases))
          # Mean Absolute Percentage Error in percentage
          mp = MAPE()
          print("Mean_Absolute_Percentage_Error: {}".format(mp.mape(actual_cases,
                                                                 predicted_cases)))
      

Results and Discussion 📝

  • Plotting the graph

    • Graph from tensorboard are shown below

    epoch_mape epoch_loss

  • Performance of the model and the reports as follows:

    • The mean_absolute_percentage_error is recorded at 8.5%. The details results is shown below.

    MAPE

    • The plot line graph are shown below.

    line

    • The plot inverse graph are shown below.

    line

  • The MAPE recorded at 8.5%, the graph shows low loss, low mse which indicates model is good and the graph of predicted vs Actual able to show a good-fit to the training dataset of Covid19. As conclusion, this model able to predict the new cases of Covid19.

Credits 📂

This project is made possible by the data provided from this MoH-Malaysia