Malware Classification using classical Machine Learning and Deep Learning

goheesheng/FYPJ_AI_MALWARE

Created by Goh Ee Sheng 2022, NYP FYPJ 2022 P4

Supervisor: Dr Brandon Ooi

Abstract

This project is in collaboration with CSIT to develop AI models for predicting malware attribution. Students will collect and process malware samples, extract features, build and test AI models, and produce a detailed study of the use of AI models in the malware attribution problem.

Quick Notes:

  • Implementation uses scikit-learn, NumPy, pandas, and TensorFlow.
  • MS Windows executable binary files are used as data.
  • Features:
      * Classical ML-based approaches: PE file features are extracted and used.
      * Deep Learning-based approaches: (1) opcodes, (2) executables converted into grayscale images.

Packages requirements

Option 1:

  • Install the pefile Python package, e.g. conda install pefile
  • Install PyTorch and other libraries, e.g. conda install -c pytorch torchtext. All other common dependencies should be covered by the Anaconda distribution.
  • objdump on Ubuntu. (This code is developed and tested in an Ubuntu-based development environment.)

Option 2:

  • Create a virtual environment
  • Install requirements.txt: pip install -r requirements.txt
  • objdump on Ubuntu. (This code is developed and tested in an Ubuntu-based development environment.)

Malware samples

 * Copy the malware samples to <project_dir>/data/exec_files/org_dataset.

├── config.py
├── data
│   ├── exec_files
│   │   └── org_dataset  # create this folder
│   │       ├── malware_directory_1
│   │       └── malware_directory_2
├── data_preprocess.py
├── data_utils
.
.

Data preprocessing

Execute data_preprocess.py with the options below to preprocess the data.

python data_preprocess.py --extract_pe_features

python data_preprocess.py --bin_to_img

python data_preprocess.py --extract_opcodes

python data_preprocess.py --split_opcodes
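The four commands above suggest that data_preprocess.py dispatches on independent argparse flags. A minimal sketch of such an entry point (the print placeholders stand in for the real preprocessing routines and are illustrative assumptions, not the repo's actual code):

```python
import argparse

def build_parser():
    # One switch per preprocessing step; each is an independent store_true flag.
    parser = argparse.ArgumentParser(description="Malware data preprocessing")
    for flag, help_text in [
        ("--extract_pe_features", "extract PE header features into a CSV file"),
        ("--bin_to_img", "convert binaries into grayscale images"),
        ("--extract_opcodes", "disassemble binaries and dump opcode sequences"),
        ("--split_opcodes", "split opcode sequences into train/test sets"),
    ]:
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.extract_pe_features:
        print("extracting PE features...")      # placeholder for the real routine
    if args.bin_to_img:
        print("converting binaries to images...")
    if args.extract_opcodes:
        print("extracting opcodes...")
    if args.split_opcodes:
        print("splitting opcode sequences...")
```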

Week 1

  • Research on malware using AI and Sandbox Technology.
  • Read and analyse what previous NYP researchers did.

Challenge

- Fix Kaelan's unzip.py
    * Now able to unzip .7z, .rar, and .zip archives regardless of OS.

Week 2

Official Work starts

Preprocess malware Portable Executables

  • Research on malware using AI and Sandbox Technology.
  • Read and analyse what previous NYP researchers did.
  • Use pre-trained VGG19 model to train with pre-processed data.

Challenge

Preprocess malware images collected from the internet, including polymorphic malware
- Create Python scripts.
    1. convert_pdf_doc.py
        - Convert PDFs and Word documents into grayscale images.

    2. convert_bin_to_img.py
        - Convert compiled malware (e.g., .msi, .exe, .jar) into grayscale and RGB images.

    3. resize.py
        - Resize original images in a directory to a specific width and height.

    4. train_test_split.py
        - Split datasets into train and test folders.
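A folder-level split like the one train_test_split.py performs can be sketched as follows (the function name, fixed seed, and 8:2 default are illustrative assumptions; the actual script may differ):

```python
import os
import random
import shutil

def train_test_split_dirs(src_dir, dst_dir, train_ratio=0.8, seed=42):
    """Copy files from src_dir/<class>/ into dst_dir/train/<class>/ and
    dst_dir/test/<class>/, preserving the class folder structure."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    for cls in sorted(os.listdir(src_dir)):
        files = sorted(os.listdir(os.path.join(src_dir, cls)))
        rng.shuffle(files)
        cut = int(len(files) * train_ratio)
        for split, names in (("train", files[:cut]), ("test", files[cut:])):
            out_dir = os.path.join(dst_dir, split, cls)
            os.makedirs(out_dir, exist_ok=True)
            for name in names:
                shutil.copy(os.path.join(src_dir, cls, name), out_dir)
```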

Week 3

  • Use pre-trained VGG16 model to train with pre-processed data.

Challenge

- Merge the Python scripts into a single file using Python's argument-parser module.
    1. bin_to_img.py
        - Convert any file, including malware, into a grayscale image.
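The byte-to-pixel conversion a script like bin_to_img.py performs can be sketched as follows (a minimal illustration assuming one byte per grayscale pixel and a fixed image width; the actual script may differ):

```python
import numpy as np

def bin_to_grayscale(path, width=256):
    """Map each byte of a binary file to one grayscale pixel (0-255) and
    reshape into rows of `width` pixels, zero-padding the last row."""
    data = np.fromfile(path, dtype=np.uint8)
    pad = (-len(data)) % width        # bytes needed to complete the last row
    data = np.pad(data, (0, pad))
    return data.reshape(-1, width)
```

The resulting 2-D array can then be written out as an image, e.g. with cv2.imwrite or Pillow's Image.fromarray(...).save(...).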

Troubleshoot

- Edit the notebook to accept grayscale images as input data

Lesson Learnt

There is another way to preprocess the data: read the training datasets, already grouped into class folders, and convert them to NumPy arrays immediately. There is no need to waste time and disk space on storing the pre-processed images.

* Use this on the original images instead of resizing every image and storing it in another folder.

    import os
    import cv2

    def imagearray(path, size):
        data = []
        for folder in sorted(os.listdir(path)):      # loop over the train/test class folders
            sub_path = os.path.join(path, folder)    # subfolder = one class
            for img in sorted(os.listdir(sub_path)): # loop over the images
                img_arr = cv2.imread(os.path.join(sub_path, img))
                img_arr = cv2.resize(img_arr, size)
                data.append(img_arr)
        return data
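Note that imagearray() returns only pixel data; for supervised training you also need one label per image. A minimal companion sketch (class_labels is an illustrative helper, assuming the class folders are walked in sorted order):

```python
import os
import numpy as np

def class_labels(path):
    """Build one integer label per image, matching the order in which
    the class subfolders are walked (sorted order assumed here)."""
    labels, names = [], sorted(os.listdir(path))
    for idx, folder in enumerate(names):
        count = len(os.listdir(os.path.join(path, folder)))
        labels.extend([idx] * count)  # same label for every image in the class
    return np.array(labels), names
```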

To bring grayscale images to the same width and height, you can use the resize method from a library like OpenCV or Pillow, specifying the target width and height. Note that resize scales the image to those dimensions; it does not pad with zeros. If you need genuine zero-padding that preserves the original pixels, place the image on a black canvas (e.g. with cv2.copyMakeBorder).

    Here is an example of how you might use OpenCV's resize method to bring a grayscale image to a fixed size:
    # Import the necessary libraries
    import cv2

    # Load the grayscale image
    img = cv2.imread('grayscale_image.png', cv2.IMREAD_GRAYSCALE)

    # Resize (scale) the image to the desired width and height
    img = cv2.resize(img, (1024, 1024))

    # Save the resized image
    cv2.imwrite('resized_image.png', img)

In this example, the grayscale image is first loaded from a file using OpenCV's imread function. cv2.resize then scales the image to the desired width and height, and the result is saved to a new file with imwrite.

Keep in mind that this is just an example, and you may need to adjust the code to fit your specific use case. The code also assumes that your grayscale images are in PNG format and that you want to save the resized images in the same format; modify it to handle or emit other image formats as needed.
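If true zero-padding (rather than scaling) is what you need, one framework-free approach is to center the image on a black canvas, equivalent to what cv2.copyMakeBorder does. A sketch, with pad_to as an illustrative helper name:

```python
import numpy as np

def pad_to(img, height=1024, width=1024):
    """Center a grayscale image on a zero (black) canvas of fixed size.
    Assumes img is a 2-D array no larger than the target size."""
    canvas = np.zeros((height, width), dtype=img.dtype)
    top = (height - img.shape[0]) // 2    # vertical margin
    left = (width - img.shape[1]) // 2    # horizontal margin
    canvas[top:top + img.shape[0], left:left + img.shape[1]] = img
    return canvas
```

Unlike scaling, this keeps every original byte-pixel intact, which matters when the pixel values are the malware's raw bytes.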

Week 4

  • Create own CNN models to train with pre-processed data using 8:2 ratio train and test dataset.
  • Use pre-trained VGG16 model to train with pre-processed data using 8:2 ratio train and test dataset.

Challenge

    1. resize_recursively_pad.py
        - Resize original images in a directory **recursively** to a specific width and height
        - Pad all images to 1024x1024 pixels

    2. Set up CUDA and cuDNN on my laptop
        - NVIDIA GeForce MX130 GPU - 6 years old
        - Hardware too outdated, so this will not work

    3. Set up PlaidML on my desktop
        - The desktop uses an AMD GPU
        - PlaidML uses the ROCm stack to train DL models
        - **DO NOT DO THIS**
        - **Corrupted the Windows OS: white underscore-cursor error at the boot screen**
        - Fix: requires a Windows recovery drive

    4. Create my own CNN models
        - Models trained with RGB images
        - Models with different image sizes
        - Models with myriad learning rates

    5. Use a pre-trained VGG-16 model to train on greyscale images
        - Configure the whole VGG16 architecture for greyscale-image training

    6. Obtained datasets of malware categories (e.g. trojans, ransomware, obfuscated viruses)

    7. Carry on with my Sandbox Project

Troubleshoot

- Fix my corrupted desktop Windows OS

- Understand the VGG16 architecture and how to modify it to suit greyscale images

- Not enough greyscale data: search for malware from additional threat groups
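An alternative to modifying VGG16's architecture for single-channel input is to replicate the grayscale channel three times so images fit the network's expected (H, W, 3) shape. A NumPy sketch (gray_to_rgb is an illustrative helper, not part of the repo):

```python
import numpy as np

def gray_to_rgb(batch):
    """Replicate a single grayscale channel three times so a batch of
    images matches VGG16's expected (N, H, W, 3) input shape.
    Accepts arrays of shape (N, H, W) or (N, H, W, 1)."""
    if batch.ndim == 3:
        batch = batch[..., np.newaxis]   # add a channel axis if missing
    return np.repeat(batch, 3, axis=-1)  # copy the channel into R, G, B
```

This trades a little memory for keeping the pretrained RGB weights usable unchanged; the project notes above took the other route and reconfigured the architecture instead.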

Lesson Learnt

How to create my own CNN model, how to configure a pre-trained VGG16 model, and everything stated above.

Week 5

  • Create own CNN models to train with pre-processed data using 8:2 ratio train and test dataset.
  • Use pre-trained VGG16 model to train with pre-processed data using 8:2 ratio train and test dataset.

Challenge

- Merge the Python scripts into a single file using Python's argument-parser module.
    1. extract_pe_features.py
        - Bulk-process binary files grouped in their classes, extract PE file features, and save them into a CSV file.
        - File entropy measures the randomness of the data in a file and is used to determine whether a file contains hidden data or suspicious scripts. The scale runs from 0 (not random at all) to 8 (totally random, such as an encrypted file).
        - The following array lists the column names of the headers extracted into the CSV files:
        features_columns = [
                                    "Name",
                                    "md5",
                                    "Machine",
                                    "SizeOfOptionalHeader",
                                    "Characteristics",
                                    "MajorLinkerVersion",
                                    "MinorLinkerVersion",
                                    "SizeOfCode",
                                    "SizeOfInitializedData",
                                    "SizeOfUninitializedData",
                                    "AddressOfEntryPoint",
                                    "BaseOfCode",
                                    "BaseOfData",
                                    "ImageBase",
                                    "SectionAlignment",
                                    "FileAlignment",
                                    "MajorOperatingSystemVersion",
                                    "MinorOperatingSystemVersion",
                                    "MajorImageVersion",
                                    "MinorImageVersion",
                                    "MajorSubsystemVersion",
                                    "MinorSubsystemVersion",
                                    "SizeOfImage",
                                    "SizeOfHeaders",
                                    "CheckSum",
                                    "Subsystem",
                                    "DllCharacteristics",
                                    "SizeOfStackReserve",
                                    "SizeOfStackCommit",
                                    "SizeOfHeapReserve",
                                    "SizeOfHeapCommit",
                                    "LoaderFlags",
                                    "NumberOfRvaAndSizes",
                                    "SectionsNb",
                                    "SectionsMeanEntropy",
                                    "SectionsMinEntropy",
                                    "SectionsMaxEntropy",
                                    "SectionsMeanRawsize",
                                    "SectionsMinRawsize",
                                    "SectionMaxRawsize",
                                    "SectionsMeanVirtualsize",
                                    "SectionsMinVirtualsize",
                                    "SectionMaxVirtualsize",
                                    "ImportsNbDLL",
                                    "ImportsNb",
                                    "ImportsNbOrdinal",
                                    "ExportNb",
                                    "ResourcesNb",
                                    "ResourcesMeanEntropy",
                                    "ResourcesMinEntropy",
                                    "ResourcesMaxEntropy",
                                    "ResourcesMeanSize",
                                    "ResourcesMinSize",
                                    "ResourcesMaxSize",
                                    "LoadConfigurationSize",
                                    "VersionInformationSize",
                                    "Malware_ClassID",
                                    "Malware_ClassName"
                                ]
    2. extract_opcode.py
        - Bulk-process and extract opcodes from malware binary files grouped in their classes
       
    3. Use a pre-trained VGG-16 model to train on RGB images

    4. Create my own CNN models
        - Models trained with RGB images
        - Models with different image sizes
        - Models with myriad learning rates
        - Models trained with a 6:4 dataset ratio

    Carry on with my Sandbox Project
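The file-entropy scale described above (0 = constant data, 8 = uniformly random) is Shannon entropy over byte values, measured in bits per byte. A stdlib-only sketch (file_entropy is an illustrative helper, not necessarily how extract_pe_features.py computes it):

```python
import math
from collections import Counter

def file_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte:
    0 for constant data, 8 for uniformly random bytes
    (e.g. well-encrypted or compressed sections)."""
    if not data:
        return 0.0
    counts = Counter(data)           # frequency of each byte value
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Applied per PE section, this gives the SectionsMeanEntropy / SectionsMinEntropy / SectionsMaxEntropy columns in the feature table above.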

Troubleshoot

- Creating the extract_pe_features.py and extract_opcode.py python scripts.

Lesson Learnt

Life is great!

Week 6

  • Create own CNN models to train with pre-processed data using 6:4 ratio train and test dataset.
  • Use pre-trained VGG16 model to train with pre-processed data using 6:4 ratio train and test dataset.

Challenge

- Create **GOOD** CNN models.

Troubleshoot

- How to ensure CNN models are neither overfitted nor underfitted.
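One standard guard against overfitting is early stopping on validation loss (Keras ships this as the EarlyStopping callback). A framework-agnostic sketch of the underlying logic:

```python
class EarlyStopping:
    """Signal that training should stop once validation loss has not
    improved by at least min_delta for `patience` consecutive epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs since the last improvement

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Watching the gap between training and validation loss with a rule like this catches overfitting early; underfitting shows up as both losses staying high.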

Lesson Learnt

I love AI!
