This project is in collaboration with CSIT to develop AI models for predicting malware attribution. Students will collect and process malware samples, extract features, build and test AI models, and produce a detailed study of the use of AI models in the malware attribution problem.
- Implementation uses scikit-learn, NumPy, pandas, and TensorFlow.
- MS Windows executable binary files are used as data.
- Features
  * Classic ML-based approaches: PE file features are extracted and used
  * Deep learning-based approaches: (1) opcodes, (2) executables converted into grayscale images
- Install the pefile Python package, e.g.
conda install pefile
- Install PyTorch and other libs e.g.
conda install -c pytorch torchtext
All other common dependencies should be covered by the Anaconda distro. `objdump` is also required and is available by default on Ubuntu. (This code is developed and tested in an Ubuntu-based development environment.)
- Create virtual environment
- Install the requirements: pip install -r requirements.txt
* Copy the malware samples into <project_dir>/data/exec_files/org_dataset.
├── config.py
├── data
│ ├── exec_files
│ │ └── org_dataset #create folder
│ │ ├── malware_directory_1
│ │ ├── malware_directory_2
├── data_preprocess.py
├── data_utils
.
.
Execute data_preprocess.py with the options below to preprocess the data.
python data_preprocess.py --extract_pe_features
python data_preprocess.py --bin_to_img
python data_preprocess.py --extract_opcodes
python data_preprocess.py --split_opcodes
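The four options above can be driven by a single argparse front end. A minimal sketch of how that option handling might look (hypothetical; the real internals of data_preprocess.py may differ):

```python
import argparse

def build_parser():
    # Hypothetical sketch of data_preprocess.py's option handling;
    # the real script's internals may differ.
    parser = argparse.ArgumentParser(description="Preprocess malware samples")
    parser.add_argument("--extract_pe_features", action="store_true",
                        help="Extract PE header features into a CSV file")
    parser.add_argument("--bin_to_img", action="store_true",
                        help="Convert binaries into grayscale images")
    parser.add_argument("--extract_opcodes", action="store_true",
                        help="Disassemble binaries and extract opcode sequences")
    parser.add_argument("--split_opcodes", action="store_true",
                        help="Split opcode files into train/test sets")
    return parser

# Usage: args = build_parser().parse_args(), then dispatch on args.bin_to_img, etc.
```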
- Research on malware using AI and Sandbox Technology.
- Read and analyse what previous NYP researchers did.
- Fix Kaelan's unzip.py
* Now able to unzip .7z, .rar, and .zip archives regardless of OS.
- Research on malware using AI and Sandbox Technology.
- Read and analyse what previous NYP researchers did.
- Use pre-trained VGG19 model to train with pre-processed data.
Preprocess malware images collected from the internet, including polymorphic malware
- Create Python scripts.
1. convert_pdf_doc.py
- Convert PDFs and Word Documents into grayscale images.
2. convert_bin_to_img.py
- Convert compiled malware (e.g., .msi, .exe, .jar) into grayscale and RGB images.
3. resize.py
- Able to resize original images in a directory to a specific width and height
4. train_test_split.py
- Split datasets into train and test folders.
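A minimal sketch of the split logic behind train_test_split.py (hypothetical helper names; the real script moves the files into train/ and test/ folders rather than returning lists):

```python
import random

def split_files(paths, train_ratio=0.8, seed=42):
    """Shuffle and split a list of file paths into train/test subsets.

    Sketch of what train_test_split.py might do internally; the seed
    keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = paths[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```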
- Use pre-trained VGG16 model to train with pre-processed data.
- Merge the Python scripts into a single file with Python's argument-parser (argparse) module.
1. bin_to_img.py
- Convert any file, including malware, into a grayscale image.
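The core of the binary-to-grayscale conversion can be sketched with NumPy alone: each byte of the file becomes one pixel, laid out at a fixed row width (the width of 256 is an assumption; writing the image out with cv2.imwrite or PIL is left as a comment):

```python
import math
import numpy as np

def bytes_to_grayscale(raw, width=256):
    """Map a file's raw bytes onto a 2-D uint8 array (one byte = one pixel).

    The last row is zero-padded so the byte stream fits a rectangle.
    """
    buf = np.frombuffer(raw, dtype=np.uint8)
    height = math.ceil(len(buf) / width)
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(buf)] = buf
    return padded.reshape(height, width)

# Usage:
# img = bytes_to_grayscale(open("sample.exe", "rb").read())
# cv2.imwrite("sample.png", img)   # save with OpenCV (or PIL)
```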
- Edit the notebook to allow grayscale images as input_data
There is another way to preprocess the data: take the training datasets that are grouped into classes and convert them to NumPy arrays directly. This avoids wasting time and disk space on intermediate pre-processed images.
* Use this on the original images, instead of resizing every image and storing it in another folder.
import os
import cv2

def imagearray(path, size):
    data = []
    for folder in os.listdir(path):              # Loop over the train/test folders
        sub_path = os.path.join(path, folder)    # Subfolder - classes
        for img in os.listdir(sub_path):         # Loop over the images
            image_path = os.path.join(sub_path, img)
            img_arr = cv2.imread(image_path)     # Read the image as a BGR array
            img_arr = cv2.resize(img_arr, size)  # size = (width, height)
            data.append(img_arr)
    return data
To bring grayscale images to the same width and height, you can use a library like OpenCV or Pillow. Note that the resize method rescales the image content to the target dimensions; it does not pad. To pad with zeros instead, so the content is not distorted, use OpenCV's copyMakeBorder method.
Here is an example of how you might pad a grayscale image to 1024x1024 with OpenCV:
# Import the necessary libraries
import cv2
# Load the grayscale image
img = cv2.imread('grayscale_image.png', cv2.IMREAD_GRAYSCALE)
# Pad the image with zeros (black) to 1024x1024
# (assumes the image is no larger than 1024 pixels in either dimension)
h, w = img.shape
img = cv2.copyMakeBorder(img, 0, 1024 - h, 0, 1024 - w, cv2.BORDER_CONSTANT, value=0)
# Save the padded image
cv2.imwrite('padded_image.png', img)
In this example, the grayscale image is first loaded from a file using the imread method from OpenCV. The copyMakeBorder method then adds zero-valued borders on the bottom and right so the image reaches the desired width and height, and the resulting image is saved to a new file using the imwrite method.
Keep in mind that this is just an example, and you may need to adjust the code to fit your specific use case. Additionally, this code assumes that your grayscale images are in the PNG format and that you want to save the padded images in the same format. You may need to modify it to handle other image formats or to save the images in a different format.
- Create own CNN models to train with pre-processed data using 8:2 ratio train and test dataset.
- Use pre-trained VGG16 model to train with pre-processed data using 8:2 ratio train and test dataset.
1. resize_recursively_pad.py
- Able to resize original images in a directory **recursively** to a specific width and height
- Pad all images to 1024x1024 pixels
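The two pieces resize_recursively_pad.py needs can be sketched as follows: zero-padding an array to 1024x1024 and walking a directory tree recursively (hypothetical helper names; reading and writing the images with OpenCV is omitted here):

```python
import os
import numpy as np

def pad_to_square(img, target=1024):
    """Zero-pad a grayscale array to target x target.

    Assumes the image is no larger than `target` in either dimension;
    padding is added on the bottom and right.
    """
    h, w = img.shape
    return np.pad(img, ((0, target - h), (0, target - w)), constant_values=0)

def collect_images(root):
    """Recursively collect image paths under `root` using os.walk."""
    paths = []
    for dirpath, _dirs, files in os.walk(root):   # visits subfolders recursively
        for name in files:
            if name.lower().endswith((".png", ".jpg")):
                paths.append(os.path.join(dirpath, name))
    return paths

# Usage (with OpenCV, not run here):
# for p in collect_images("data/images"):
#     cv2.imwrite(p, pad_to_square(cv2.imread(p, cv2.IMREAD_GRAYSCALE)))
```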
2. Setup Cuda and cuDNN on laptop
- NVIDIA GeForce MX130 GPU - 6 years old
- Outdated hardware, so it will not work
3. Setup PlaidML on Desktop
- Desktop uses AMD GPU
- PlaidML uses the ROCm architecture to train DL models
- **DO NOT DO THIS**
- **Corrupted the Windows OS (white underscore-cursor error at the boot screen)**
- To fix: Requires Windows recovery drive
4. Create own CNN models
- Models trained with RGB images
- Models with different image sizes
- Models with myriad learning rates
5. Use pre-trained VGG-16 model to train greyscale images
- Configure whole VGG16 architecture for greyscale images training
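One common way to feed greyscale images to an ImageNet-pretrained VGG16, whose first conv layer expects 3 channels, is to replicate the single channel three times; the alternative is rebuilding the first layer for 1-channel input. A sketch of the channel-replication approach (the Keras part is an assumed setup, shown as comments):

```python
import numpy as np

def gray_to_rgb(batch):
    """Replicate a grayscale channel three times so (N, H, W) or
    (N, H, W, 1) batches fit VGG16's expected (N, H, W, 3) input."""
    if batch.ndim == 3:                 # (N, H, W) -> (N, H, W, 1)
        batch = batch[..., np.newaxis]
    return np.repeat(batch, 3, axis=-1)

# With Keras (assumed TensorFlow setup, not run here):
# from tensorflow.keras.applications import VGG16
# base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
# features = base.predict(gray_to_rgb(x_gray))
```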
6. Obtained datasets of malware categories (e.g., Trojan, ransomware, obfuscated virus)
7. Carry on with my Sandbox Project
- Fix my desktop's corrupted Windows OS
- Understanding VGG16 architecture and how to modify it to suit greyscale images
- Not enough greyscale data: search for malware from additional threat groups
How to create your own CNN model and a VGG16-configured pre-trained model, plus everything stated above
- Create own CNN models to train with pre-processed data using 8:2 ratio train and test dataset.
- Use pre-trained VGG16 model to train with pre-processed data using 8:2 ratio train and test dataset.
- Merge the Python scripts into a single file with Python's argument-parser (argparse) module.
1. extract_pe_features.py
- Bulk process and extract file features from binary files grouped in their classes, and save them into a CSV file.
- File entropy measures the randomness of the data in a file and is used to determine whether a file contains hidden data or suspicious scripts. The scale runs from 0 (not random) to 8 (totally random, such as an encrypted file).
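The entropy figure can be computed as the Shannon entropy of the file's byte distribution, in bits per byte:

```python
import math
from collections import Counter

def file_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    counts = Counter(data)                      # frequency of each byte value
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A constant file scores 0; a file where every byte value is equally likely (e.g., well-encrypted data) scores the maximum of 8.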
- The following array lists the column names of the headers extracted into the CSV files:
features_columns = [
"Name",
"md5",
"Machine",
"SizeOfOptionalHeader",
"Characteristics",
"MajorLinkerVersion",
"MinorLinkerVersion",
"SizeOfCode",
"SizeOfInitializedData",
"SizeOfUninitializedData",
"AddressOfEntryPoint",
"BaseOfCode",
"BaseOfData",
"ImageBase",
"SectionAlignment",
"FileAlignment",
"MajorOperatingSystemVersion",
"MinorOperatingSystemVersion",
"MajorImageVersion",
"MinorImageVersion",
"MajorSubsystemVersion",
"MinorSubsystemVersion",
"SizeOfImage",
"SizeOfHeaders",
"CheckSum",
"Subsystem",
"DllCharacteristics",
"SizeOfStackReserve",
"SizeOfStackCommit",
"SizeOfHeapReserve",
"SizeOfHeapCommit",
"LoaderFlags",
"NumberOfRvaAndSizes",
"SectionsNb",
"SectionsMeanEntropy",
"SectionsMinEntropy",
"SectionsMaxEntropy",
"SectionsMeanRawsize",
"SectionsMinRawsize",
"SectionMaxRawsize",
"SectionsMeanVirtualsize",
"SectionsMinVirtualsize",
"SectionMaxVirtualsize",
"ImportsNbDLL",
"ImportsNb",
"ImportsNbOrdinal",
"ExportNb",
"ResourcesNb",
"ResourcesMeanEntropy",
"ResourcesMinEntropy",
"ResourcesMaxEntropy",
"ResourcesMeanSize",
"ResourcesMinSize",
"ResourcesMaxSize",
"LoadConfigurationSize",
"VersionInformationSize",
"Malware_ClassID",
"Malware_ClassName"
]
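A sketch of how a few of these columns can be read from a parsed PE object with the pefile package (only a handful of illustrative fields are shown here, not the full column list; the usage lines are the assumed pefile API):

```python
# Columns read straight off the PE headers (illustrative subset)
HEADER_FIELDS = ["Machine", "SizeOfOptionalHeader", "Characteristics"]
OPTIONAL_FIELDS = ["AddressOfEntryPoint", "ImageBase", "SizeOfImage", "CheckSum"]

def pe_feature_row(pe):
    """Pull a subset of the feature columns from a parsed PE object.

    `pe` is expected to look like a pefile.PE instance (attribute access
    via pe.FILE_HEADER and pe.OPTIONAL_HEADER).
    """
    row = {f: getattr(pe.FILE_HEADER, f) for f in HEADER_FIELDS}
    row.update({f: getattr(pe.OPTIONAL_HEADER, f) for f in OPTIONAL_FIELDS})
    return row

# Usage with the real library:
# import pefile
# pe = pefile.PE("sample.exe")
# row = pe_feature_row(pe)
```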
Carry on with my Sandbox Project
2. extract_opcode.py
- Bulk process and extract opcodes in malware binary files grouped in their classes
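Opcode extraction can lean on objdump -d plus a small parser that keeps only the mnemonic from each instruction line (a sketch; the real extract_opcode.py may differ). objdump instruction lines have three tab-separated fields: address, raw bytes, and "mnemonic operands".

```python
def parse_opcodes(disasm):
    """Extract opcode mnemonics from `objdump -d` output.

    Keeps the first token of the third tab-separated field on each
    instruction line; header/section lines have no tabs and are skipped.
    """
    opcodes = []
    for line in disasm.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[2].strip():
            opcodes.append(parts[2].split()[0])
    return opcodes

# Usage (Ubuntu, with objdump installed):
# import subprocess
# out = subprocess.run(["objdump", "-d", "sample.exe"],
#                      capture_output=True, text=True).stdout
# print(parse_opcodes(out))
```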
3. Use pre-trained VGG-16 model to train RGB images
4. Create own CNN models
- Models trained with RGB images
- Models with different image sizes
- Models with myriad learning rates
- Models trained with 6:4 dataset ratio
Carry on with my Sandbox Project
- Creating the extract_pe_features.py and extract_opcode.py python scripts.
Life is great!
- Create own CNN models to train with pre-processed data using 6:4 ratio train and test dataset.
- Use pre-trained VGG16 model to train with pre-processed data using 6:4 ratio train and test dataset.
- Create **GOOD** CNN models.
- How to ensure CNN models are neither overfitted nor underfitted.
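One standard guard against overfitting is early stopping on the validation loss. A sketch of the rule, plus the equivalent Keras callback (the Keras lines assume the TensorFlow setup used elsewhere in this project):

```python
def should_stop(val_losses, patience=3):
    """Simple early-stopping rule: stop when the validation loss has not
    improved for `patience` consecutive epochs, a common sign that the
    model has started overfitting the training data."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])          # best loss before the window
    return all(v >= best for v in val_losses[-patience:])

# With Keras, the equivalent is (not run here):
# from tensorflow.keras.callbacks import EarlyStopping
# model.fit(..., callbacks=[EarlyStopping(monitor="val_loss", patience=3)])
```

Underfitting shows up differently: both training and validation loss stay high, which calls for a bigger model or longer training rather than stopping earlier.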
I love AI!