Malware-Detection-in-PE-files-using-ML

Everything about this project is explained in detail in the FER (Final Evaluation Report).

Project Description

This project aims to detect malware in PE (Portable Executable) files using Machine Learning techniques. We have developed a model that analyzes PE files and predicts whether they contain malware, using hybrid static malware analysis (a combination of PE header, byte n-gram and opcode n-gram features). This project can be a valuable tool for strengthening cybersecurity measures and protecting systems against malicious software and files.

Dataset(s):

PE files CSV dataset, containing metadata and header information.
Raw byte and asm files from the Kaggle Microsoft Malware Classification Challenge (BIG 2015) dataset.

Preprocessing/Feature Extraction:

For Dataset-1:

The dataset already provides values for the features.
Created a script to extract the same features and header information from given PE files such as .exe and .dll files.
Used an Extra-Trees classifier for feature selection, keeping the important features from all the available information.
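As a rough illustration of the header-extraction step, the sketch below reads a few COFF file-header fields directly with Python's standard struct module. The function name read_coff_header and the choice of fields are assumptions made for illustration; the project's own script (PE_Header(exe, dll files)/malware_test.py) may use a different library and a much larger feature set.

```python
import struct


def read_coff_header(data: bytes) -> dict:
    """Return basic COFF header fields from the raw bytes of a PE file."""
    # A PE file starts with the DOS header, whose magic is b"MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    fields = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    names = ("Machine", "NumberOfSections", "TimeDateStamp",
             "PointerToSymbolTable", "NumberOfSymbols",
             "SizeOfOptionalHeader", "Characteristics")
    return dict(zip(names, fields))


# Example with a hypothetical path:
# with open("sample.exe", "rb") as f:
#     print(read_coff_header(f.read()))
```

Each returned value is a plain integer, so the dictionary can be written straight into one row of a feature table.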

For Dataset-2:

The dataset has raw byte and asm files. Created separate directories for each type and extracted the file size as a feature for each file.
Extracted n-grams as features from each file: byte n-grams (n = 1, 2) from the byte files, and prefixes, keywords, registers and opcode n-grams (n = 1, 2, 3, 4) from the asm files.
Converted each asm file to an image and extracted the 200 top-performing image pixels as features.
Used Random Forest to select the important features from each of the above feature sets separately, then merged the selected features.
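The n-gram extraction described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the helper names count_ngrams and byte_tokens are hypothetical, and the input format (an address column followed by hex byte tokens on each line) is an assumption about the .bytes files.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def byte_tokens(lines):
    """Return the hex byte tokens of a byte dump, skipping the address column."""
    tokens = []
    for line in lines:
        parts = line.split()
        tokens.extend(parts[1:])  # parts[0] is assumed to be the address
    return tokens


# Tiny made-up inputs, only to show the shape of the result.
lines = ["00401000 56 8D 44 24", "00401004 56 8D FF ??"]
tokens = byte_tokens(lines)
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)

opcodes = ["push", "mov", "call", "push", "mov", "call"]
trigrams = count_ngrams(opcodes, 3)
```

The counts for one file become one row of the feature table, with one column per n-gram.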

The final dataset contains the following features:

PE Header dataset
Byte unigrams
Opcode unigrams
Top 300 Byte bigrams
Top 200 Opcode bigrams
Top 200 Opcode trigrams
Top 200 Opcode tetragrams
Top 200 Image Pixels
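The "Top N" entries in this list imply a ranking step. One possible way to do it, sketched here with scikit-learn's RandomForestClassifier on synthetic data, is to rank columns by feature_importances_ and keep the first k. The function name top_k_features, the column names and the hyperparameters are assumptions for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_k_features(X, y, names, k, seed=0):
    """Return the k feature names with the highest Random Forest importance."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    # Sort column indices from most to least important.
    order = np.argsort(forest.feature_importances_)[::-1]
    return [names[i] for i in order[:k]]


# Synthetic stand-in data: only the first two columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
names = [f"bigram_{i}" for i in range(6)]

selected = top_k_features(X, y, names, k=2)
print(selected)
```

Running the selection separately for each feature set, as described above, keeps one large set (for example byte bigrams) from crowding out the others before the sets are merged.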

Training (build/train/evaluate)

Trained various ML models on the final dataset above to classify files as malware or benign.
The evaluation metrics used are accuracy, F1 score and the confusion matrix.
The Random Forest model performed best among the models tried, which included Gradient Boosting and SVM.
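The training and evaluation loop can be sketched as follows. This is a minimal example on synthetic data, assuming scikit-learn; the split ratio, hyperparameters and label encoding (1 = malware, 0 = benign) are illustrative assumptions and do not reproduce the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged feature table (1 = malware, 0 = benign).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 10))
y = (X[:, 0] - X[:, 3] > 0).astype(int)

# Hold out a stratified test split so both classes appear in evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted

print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
print(cm)
```

Swapping RandomForestClassifier for another estimator such as GradientBoostingClassifier or SVC leaves the rest of the loop unchanged, which makes a side-by-side comparison straightforward.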

You can download the trained Random Forest model here.

Setup

Clone the repository to your local machine:

git clone https://github.com/DasariJayanth/Malware-Detection-in-PE-files-using-Machine-Learning.git

Once you have cloned the repository, create a virtual environment:

python3 -m venv .venv

On Windows PowerShell, you might need to set the execution policy to allow activation of the environment:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

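Activate the environment (this command is for Linux/macOS shells; in Windows PowerShell the activation script is .venv\Scripts\Activate.ps1):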

source .venv/bin/activate

Next, install the required libraries:

pip install -r requirements.txt

Perform feature extraction on your data as done in PE_Header(exe, dll files)/malware_test.py and Ngrams(byte, asm files)/N-grams.ipynb. Also refer to Malware Detection Model.ipynb for merging both feature sets before predicting with the model.

Load models/RF_model.pkl and run the loaded model on the extracted features to get predictions.
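A minimal sketch of this loading step, assuming the model was saved with Python's standard pickle module (if it was saved with joblib, joblib.load would be the equivalent call). The helper names load_model and classify are hypothetical. Unpickling a scikit-learn model also requires a compatible scikit-learn version to be installed, and the feature columns must be in the same order as during training.

```python
import pickle


def load_model(path):
    """Load a pickled model. Only unpickle files from a source you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def classify(model, feature_rows):
    """Run the model on rows of extracted features and return its labels."""
    return list(model.predict(feature_rows))


# Example (not run here; feature_rows comes from your own feature extraction):
# model = load_model("models/RF_model.pkl")
# labels = classify(model, feature_rows)
```

Because pickle can execute arbitrary code while loading, the model file should be treated with the same care as any other executable input.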

When you are done, deactivate the virtual environment:

deactivate
