In this project, we implemented and tested deep learning (DL) models that classify URLs as malicious or benign based solely on lexical analysis of the addresses. We used an embedding layer that maps the characters in our dataset to vector representations, and integrated it with two different models (an LSTM and a Transformer). Comparing their results, we concluded that the Transformer is better suited to our task, so we built a larger version of it, which reached 94.8% accuracy on the test set.
Malicious URLs and the websites behind them are a serious cybersecurity threat. They host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into scams (monetary loss, theft of private information, and malware installation), and cause billions of dollars in losses every year.
Our goal in this project is to develop DL models that accurately classify URLs as malicious or benign, so that users can be alerted when a given URL is potentially harmful. Classification is based solely on lexical analysis of the addresses. In other words, this is a binary sequence classification problem (many-to-one).
URL addresses have several characteristics that make traditional NLP methods yield unsatisfactory results when analyzing address text. To begin with, most URL addresses are not composed of dictionary words, unlike regular sentences.
We could define certain delimiter characters (e.g. ‘/’, ‘-’) and tokenize the addresses, creating word-separated “sentences”. However, the tokens (words) produced this way are not necessarily meaningful or frequent enough to represent the data well. Moreover, after splitting the addresses, many of these words turn out to be exceptionally long, extremely rare, and sometimes unique in our dataset.
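To illustrate the problem with delimiter-based tokenization, here is a minimal sketch. The example URLs, the delimiter set, and the `tokenize` helper are all hypothetical and are not taken from our dataset or code.

```python
import re
from collections import Counter

# Hypothetical example URLs (not taken from the dataset).
urls = [
    "http://example.com/login-update/a8f3kz91qq.php",
    "http://example.org/docs/index.html",
    "http://secure-example.net/x7Gh2pLq9/verify-account",
]


def tokenize(url):
    """Split a URL on an assumed set of delimiter characters, dropping empty tokens."""
    return [tok for tok in re.split(r"[/\-.:?=&_]", url) if tok]


# Count how often each token appears across all URLs.
token_counts = Counter(tok for url in urls for tok in tokenize(url))

# Tokens that appear exactly once carry little statistical signal.
rare_tokens = [tok for tok, count in token_counts.items() if count == 1]

print(tokenize(urls[0]))
print(rare_tokens)
```

Even in this tiny sample, random-looking tokens such as `a8f3kz91qq` appear only once, so a word-level vocabulary would be dominated by entries the model sees too rarely to learn from.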
Therefore, we concluded that our project requires a method that examines URLs character by character rather than word by word. Hence, we used an embedding layer to map the valid characters in our dataset to vector representations.
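Before an embedding layer can be applied, each URL has to be turned into a fixed-length sequence of integer character indices. The sketch below shows one way this could be done. The reserved padding and unknown indices, the `build_vocab` and `encode` helpers, and the `max_len` value are assumptions for illustration and may differ from what `preprocessing.py` does.

```python
PAD_IDX = 0  # assumed index reserved for padding
UNK_IDX = 1  # assumed index for characters missing from the vocabulary


def build_vocab(urls):
    """Map every character seen in the URLs to an integer index starting at 2."""
    chars = sorted(set("".join(urls)))
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(url, vocab, max_len):
    """Convert a URL to a list of indices, truncated or right-padded to max_len."""
    ids = [vocab.get(ch, UNK_IDX) for ch in url[:max_len]]
    return ids + [PAD_IDX] * (max_len - len(ids))


# Hypothetical example URLs (not taken from the dataset).
urls = ["http://a.com", "http://b.org/x"]
vocab = build_vocab(urls)
encoded = [encode(u, vocab, max_len=16) for u in urls]

print(encoded[0])
```

Each index then selects one row of the embedding matrix, so a URL becomes a sequence of learned vectors, one per character.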
We implemented, trained, and evaluated two models: an LSTM classifier (fully connected (FC) and softmax layers applied to the last hidden state of an LSTM network) and a Transformer classifier, which uses the same FC and softmax layers at its output for classification.
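The LSTM classifier described above can be sketched roughly as follows in PyTorch. This is a simplified illustration rather than the code in `LSTM.py`: the class name `CharLSTMClassifier` and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn


class CharLSTMClassifier(nn.Module):
    """Embedding -> LSTM -> FC -> softmax, with hypothetical layer sizes."""

    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        # Index 0 is assumed to be the padding index.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, char_ids):
        embedded = self.embedding(char_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (num_layers, batch, hidden_dim)
        logits = self.fc(h_n[-1])             # last layer's final hidden state
        return torch.softmax(logits, dim=1)   # class probabilities per URL


model = CharLSTMClassifier(vocab_size=100)
batch = torch.randint(1, 100, (4, 50))  # 4 random "URLs" of 50 character indices
probs = model(batch)
print(probs.shape)
```

Note that with right-padded batches the final hidden state is computed after the padding positions, so a real implementation would likely need to account for sequence lengths (for example with packed sequences).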
Here we can see the loss, training accuracy, and validation accuracy during training. The third graph shows the validation accuracy at higher resolution, over the epochs where performance improved the most (the "focused epochs").
Model | Test accuracy | Number of parameters | Number of epochs | Training time | Inference time |
---|---|---|---|---|---|
LSTM | 84.25% | 14,474 | 21 | 3 hours and 11 minutes | 0.01 seconds |
Transformer | 93.39% | 11,890 | 10 | 51 minutes | 0.003 seconds |
From these measurements, we concluded that the Transformer is better suited to our classification task, so we continued working with it.
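To make the comparison concrete, the short script below derives a few ratios from the table values. The numbers are copied from the table; the dictionary layout and variable names are just for illustration.

```python
# Values copied from the comparison table above.
lstm = {"acc": 84.25, "params": 14474, "train_min": 3 * 60 + 11, "infer_s": 0.01}
transformer = {"acc": 93.39, "params": 11890, "train_min": 51, "infer_s": 0.003}

# How many times faster the Transformer trains and runs inference.
train_speedup = lstm["train_min"] / transformer["train_min"]
infer_speedup = lstm["infer_s"] / transformer["infer_s"]

# Relative reduction in test error (error = 100% - accuracy).
error_reduction = 1 - (100 - transformer["acc"]) / (100 - lstm["acc"])

# Transformer parameter count as a fraction of the LSTM's.
param_ratio = transformer["params"] / lstm["params"]

print("Training speedup:  {:.2f}x".format(train_speedup))
print("Inference speedup: {:.2f}x".format(infer_speedup))
print("Error reduction:   {:.1%}".format(error_reduction))
print("Parameter ratio:   {:.2f}".format(param_ratio))
```

In other words, the Transformer trained roughly 3.7 times faster and cut the test error by more than half while using fewer parameters.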
We changed the model’s hyper-parameters so that it has more learned parameters (27,618 in total), and trained it for more epochs on additional URLs. The final model reached 94.8% test accuracy. Here are its loss, training accuracy, and validation accuracy graphs:
File Name | Description |
---|---|
preprocessing.py | Loads the data, preprocesses it, and creates batches |
training_and_evaluating.py | Implementation of our training and evaluation loops |
embedding_and_positional_encoding.py | Implementation of our embedding and positional encoding layers |
LSTM.py | Implementation of our LSTM-classifier model |
transformer.py | Implementation of our Transformer-classifier model |
transformer_optuna.py | Optuna trials for our transformer |
malicious_phish_CSV | The URL dataset |
To run the models, run transformer.py or LSTM.py with your own hyper-parameters.
Prerequisites
Library | Version |
---|---|
Python | 3.5.5 (Anaconda) |
torch | 1.10.1 |
numpy | 1.19.5 |
matplotlib | 3.3.4 |
- [1] The dataset "malicious_phish.csv" was taken from Kaggle.
- [2] We used some of the explanations and code from the "deep learning - 046211" course tutorials.
- [3] For more information on classification of URLs using lexical methods, see: “Detecting Malicious URLs Using Lexical Analysis”, M. Mamun, M. Rathore, A. Lashkari, N. Stakhanova, A. Ghorbani, International Conference on Network and System Security, 2016.