# Malicious URL Detection Using Deep Learning

Deep Learning 046211 – Winter 2021

Technion – Israel Institute of Technology

Edan Kinderman, Lev Panov




-----------------------------------------------------

## 📖 Summary

In this project, we implemented and tested deep-learning models that classify URLs as malicious or benign, based solely on lexical analysis of the addresses. We used an embedding layer that maps characters from our dataset to vector representations, and integrated it with two different models (an LSTM and a Transformer). Comparing their results, we concluded that the Transformer is better suited to our task, so we built a larger version of it, which reached 94.8% accuracy on the test set.


-----------------------------------------------------

## 🧑‍🏫 Introduction

Malicious URLs, or malicious websites, are a serious cybersecurity threat. They host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, malware installation), and cause losses of billions of dollars every year.

Our goal in this project is to develop deep-learning models that accurately classify URLs as malicious or benign, and alert users when given URLs are potentially harmful. Classification is based solely on lexical analysis of the addresses; in other words, it is a binary time-series classification problem (many-to-one).


-----------------------------------------------------

## 💭 Method

URL addresses have several characteristics that make traditional NLP methods yield unsatisfactory results for address-text analysis. To begin with, most URL addresses, unlike regular sentences, are not constructed from dictionary words.

We could define certain character delimiters (e.g. '/', '-') and tokenize the addresses, creating word-separated "sentences". However, the tokens (words) this method generates will not necessarily be meaningful or frequent words that represent the data well. Moreover, after splitting the addresses, we find that many of these tokens are exceptionally long, extremely rare, and sometimes unique in our dataset.
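The problem with delimiter-based tokenization can be seen in a small sketch (the example URLs and delimiter set are illustrative, not from the actual dataset):

```python
import re
from collections import Counter

# Two made-up URLs: one ordinary, one of the machine-generated kind
# that is common among malicious addresses.
urls = [
    "http://example.com/login",
    "http://xkjq1-verify-acc0unt-secure.biz/a8f3kd92jx0",
]

tokens = []
for url in urls:
    # Split on typical URL delimiters ('/', '-', '.', ':')
    tokens.extend(t for t in re.split(r"[/\-.:]+", url) if t)

counts = Counter(tokens)
# Most resulting tokens are not dictionary words and occur only once
rare = [t for t, c in counts.items() if c == 1]
print(rare)
```

Tokens such as `a8f3kd92jx0` appear exactly once and carry no reusable vocabulary, which is why a word-level representation fits this data poorly.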




We therefore conclude that our task requires a method that examines URLs character by character rather than word by word. Hence, we used an embedding layer to map the valid characters in our dataset to vector representations.
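A minimal sketch of this character-level embedding, using PyTorch's `nn.Embedding` (the character set and embedding dimension here are illustrative; the real vocabulary is built from the dataset):

```python
import torch
import torch.nn as nn

# Hypothetical set of valid URL characters; index 0 is reserved for padding
chars = "abcdefghijklmnopqrstuvwxyz0123456789./-:_"
char2idx = {c: i + 1 for i, c in enumerate(chars)}

embedding = nn.Embedding(num_embeddings=len(chars) + 1,
                         embedding_dim=16,
                         padding_idx=0)

url = "example.com/login"
indices = torch.tensor([[char2idx[c] for c in url]])  # shape: (1, 17)
vectors = embedding(indices)                          # shape: (1, 17, 16)
print(vectors.shape)
```

Each character becomes a learned 16-dimensional vector, so the embedding is trained jointly with the downstream classifier instead of relying on a fixed word vocabulary.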




-----------------------------------------------------

## 💻 The Models

We implemented, trained, and evaluated two models: an LSTM classifier (fully connected and softmax layers applied to the LSTM network's last hidden state), and a Transformer classifier that uses the same output layers for classification.
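The LSTM variant can be sketched as follows (a minimal sketch; the vocabulary size and layer dimensions are illustrative, not the project's actual hyper-parameters):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Char-embedding -> LSTM -> FC + softmax on the last hidden state."""

    def __init__(self, vocab_size=42, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, seq_len) of char indices
        emb = self.embedding(x)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)      # h_n: (1, batch, hidden_dim)
        logits = self.fc(h_n[-1])         # classify from the last hidden state
        return torch.softmax(logits, dim=-1)

model = LSTMClassifier()
probs = model(torch.randint(1, 42, (4, 50)))  # a batch of 4 URLs, 50 chars each
print(probs.shape)
```

In practice one would usually train on the raw logits with `nn.CrossEntropyLoss` and apply softmax only at inference; the softmax output layer is shown here because that is the architecture described above.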




-----------------------------------------------------

## 📈 Results

Here we can see the loss, train accuracy, and validation accuracy during training. The third graph shows the validation accuracy at increased resolution over the epochs where performance improved the most (the "focused epochs").


*(training graphs)*


| Model | Test accuracy | Number of parameters | Number of epochs | Train time | Inference time |
|---|---|---|---|---|---|
| LSTM | 84.25% | 14,474 | 21 | 3 hours 11 minutes | 0.01 seconds |
| Transformer | 93.39% | 11,890 | 10 | 51 minutes | 0.003 seconds |

From these measurements, we conclude that the Transformer is better suited to our classification task, so we continued working with it.
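For reference, the Transformer variant can be sketched in the same style (sizes and the mean-pooling choice are illustrative; the project's own positional-encoding layer, implemented in `embedding_and_positional_encoding.py`, is omitted here for brevity):

```python
import torch
import torch.nn as nn

class TransformerClassifier(nn.Module):
    """Char-embedding -> Transformer encoder -> pooling -> FC + softmax."""

    def __init__(self, vocab_size=42, embed_dim=16, nhead=4,
                 num_layers=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                     # x: (batch, seq_len)
        h = self.encoder(self.embedding(x))   # (batch, seq_len, embed_dim)
        pooled = h.mean(dim=1)                # pool over sequence positions
        return torch.softmax(self.fc(pooled), dim=-1)

model = TransformerClassifier()
probs = model(torch.randint(1, 42, (4, 50)))
print(probs.shape)
```

Unlike the LSTM, the encoder attends over all character positions at once, which also explains the shorter train and inference times in the table above.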

We changed the model's hyper-parameters so that it contains more learned parameters (27,618 in total), and trained it for more epochs with additional URLs. The final model reached 94.8% test accuracy. Here are its loss, train accuracy, and validation accuracy graphs:

*(training graphs)*


-----------------------------------------------------

## 👨‍💻 Files and Usage

| File name | Description |
|---|---|
| `preprocessing.py` | Loads the data, preprocesses it, and creates batches |
| `training_and_evaluating.py` | Implementation of our training and evaluation loops |
| `embedding_and_positional_encoding.py` | Implementation of our embedding and positional-encoding layers |
| `LSTM.py` | Implementation of our LSTM classifier model |
| `transformer.py` | Implementation of our Transformer classifier model |
| `transformer_optuna.py` | Optuna trials for our Transformer |
| `malicious_phish.csv` | The URL dataset |

To run the models with your own hyper-parameters, run `transformer.py` or `LSTM.py`.


### Prerequisites

| Library | Version |
|---|---|
| Python | 3.5.5 (Anaconda) |
| torch | 1.10.1 |
| numpy | 1.19.5 |
| matplotlib | 3.3.4 |

-----------------------------------------------------

## 🙌 References and credits

- [1] The dataset `malicious_phish.csv` was taken from Kaggle.
- [2] We used some explanations and code from the "Deep Learning 046211" course tutorials.
- [3] For more on classifying URLs with lexical methods, see: M. Mamun, M. Rathore, A. Lashkari, N. Stakhanova, and A. Ghorbani, "Detecting Malicious URLs Using Lexical Analysis", International Conference on Network and System Security, 2016.