# Malicious URL Detection Using Deep Learning

Deep Learning 046211 – Winter 2021

Technion – Israel Institute of Technology

Edan Kinderman, Lev Panov




-----------------------------------------------------

## 📖 Summary

In this project, we implemented and tested deep-learning models that classify URLs as malicious or benign, based solely on lexical analysis of the addresses. We used an embedding layer that maps characters from our dataset to vector representations, and integrated it with two different models (an LSTM and a Transformer). Comparing their results, we concluded that the Transformer is better suited to our task, so we built a larger version of it, which reached 94.8% accuracy on the test set.


-----------------------------------------------------

## 🧑‍🏫 Introduction

Malicious URLs, or malicious websites, are a serious cybersecurity threat. They host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, malware installation), and cause losses of billions of dollars every year.

Our goal in this project is to develop deep-learning models that accurately classify URLs as malicious or benign, and alert users when given URLs are potentially harmful. Classification is based solely on lexical analysis of the addresses; in other words, it is a binary time-series classification problem (many-to-one).


-----------------------------------------------------

## 💭 Method

URL addresses have several characteristics that make traditional NLP methods yield unsatisfactory results for address-text analysis. To begin with, most URL addresses, unlike regular sentences, are not constructed from dictionary words.

We could define certain character delimiters (e.g. '/', '-') and tokenize the addresses, creating word-separated "sentences". However, the tokens (words) this method generates will not necessarily be meaningful or frequent words that represent the data well. Moreover, after splitting the addresses, we find that many of these tokens are exceptionally long, extremely rare, and sometimes unique in our dataset.
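The problem with delimiter-based tokenization can be seen in a small sketch (the example URLs and delimiter set are illustrative, not from the actual dataset):

```python
import re
from collections import Counter

# Two made-up URLs: one ordinary, one of the machine-generated kind
# that is common among malicious addresses.
urls = [
    "http://example.com/login",
    "http://xkjq1-verify-acc0unt-secure.biz/a8f3kd92jx0",
]

tokens = []
for url in urls:
    # Split on typical URL delimiters ('/', '-', '.', ':')
    tokens.extend(t for t in re.split(r"[/\-.:]+", url) if t)

counts = Counter(tokens)
# Most resulting tokens are not dictionary words and occur only once
rare = [t for t, c in counts.items() if c == 1]
print(rare)
```

Tokens such as `a8f3kd92jx0` appear exactly once and carry no reusable vocabulary, which is why a word-level representation fits this data poorly.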




We therefore conclude that our task requires a method that examines URLs character by character rather than word by word. Hence, we used an embedding layer to map the valid characters in our dataset to vector representations.
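A minimal sketch of this character-level embedding, using PyTorch's `nn.Embedding` (the character set and embedding dimension here are illustrative; the real vocabulary is built from the dataset):

```python
import torch
import torch.nn as nn

# Hypothetical set of valid URL characters; index 0 is reserved for padding
chars = "abcdefghijklmnopqrstuvwxyz0123456789./-:_"
char2idx = {c: i + 1 for i, c in enumerate(chars)}

embedding = nn.Embedding(num_embeddings=len(chars) + 1,
                         embedding_dim=16,
                         padding_idx=0)

url = "example.com/login"
indices = torch.tensor([[char2idx[c] for c in url]])  # shape: (1, 17)
vectors = embedding(indices)                          # shape: (1, 17, 16)
print(vectors.shape)
```

Each character becomes a learned 16-dimensional vector, so the embedding is trained jointly with the downstream classifier instead of relying on a fixed word vocabulary.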




-----------------------------------------------------

## 💻 The Models

We implemented, trained, and evaluated two models: an LSTM classifier (fully connected and softmax layers applied to the LSTM network's last hidden state), and a Transformer classifier that uses the same output layers for classification.
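The LSTM variant can be sketched as follows (a minimal sketch; the vocabulary size and layer dimensions are illustrative, not the project's actual hyper-parameters):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Char-embedding -> LSTM -> FC + softmax on the last hidden state."""

    def __init__(self, vocab_size=42, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, seq_len) of char indices
        emb = self.embedding(x)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)      # h_n: (1, batch, hidden_dim)
        logits = self.fc(h_n[-1])         # classify from the last hidden state
        return torch.softmax(logits, dim=-1)

model = LSTMClassifier()
probs = model(torch.randint(1, 42, (4, 50)))  # a batch of 4 URLs, 50 chars each
print(probs.shape)
```

In practice one would usually train on the raw logits with `nn.CrossEntropyLoss` and apply softmax only at inference; the softmax output layer is shown here because that is the architecture described above.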




-----------------------------------------------------

## 📈 Results

Here we can see the loss, train accuracy, and validation accuracy during training. The third graph shows the validation accuracy at increased resolution over the epochs where performance improved the most (the "focused epochs").


*(training graphs)*


| Model | Test accuracy | Number of parameters | Number of epochs | Train time | Inference time |
|---|---|---|---|---|---|
| LSTM | 84.25% | 14,474 | 21 | 3 hours 11 minutes | 0.01 seconds |
| Transformer | 93.39% | 11,890 | 10 | 51 minutes | 0.003 seconds |

From these measurements, we conclude that the Transformer is better suited to our classification task, so we continued working with it.
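For reference, the Transformer variant can be sketched in the same style (sizes and the mean-pooling choice are illustrative; the project's own positional-encoding layer, implemented in `embedding_and_positional_encoding.py`, is omitted here for brevity):

```python
import torch
import torch.nn as nn

class TransformerClassifier(nn.Module):
    """Char-embedding -> Transformer encoder -> pooling -> FC + softmax."""

    def __init__(self, vocab_size=42, embed_dim=16, nhead=4,
                 num_layers=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                     # x: (batch, seq_len)
        h = self.encoder(self.embedding(x))   # (batch, seq_len, embed_dim)
        pooled = h.mean(dim=1)                # pool over sequence positions
        return torch.softmax(self.fc(pooled), dim=-1)

model = TransformerClassifier()
probs = model(torch.randint(1, 42, (4, 50)))
print(probs.shape)
```

Unlike the LSTM, the encoder attends over all character positions at once, which also explains the shorter train and inference times in the table above.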

We changed the model's hyper-parameters so that it contains more learned parameters (27,618 in total), and trained it for more epochs with additional URLs. The final model reached 94.8% test accuracy. Here are its loss, train accuracy, and validation accuracy graphs:

*(training graphs)*


-----------------------------------------------------

## 👨‍💻 Files and Usage

| File name | Description |
|---|---|
| `preprocessing.py` | Loads the data, preprocesses it, and creates batches |
| `training_and_evaluating.py` | Implementation of our training and evaluation loops |
| `embedding_and_positional_encoding.py` | Implementation of our embedding and positional-encoding layers |
| `LSTM.py` | Implementation of our LSTM classifier model |
| `transformer.py` | Implementation of our Transformer classifier model |
| `transformer_optuna.py` | Optuna trials for our Transformer |
| `malicious_phish.csv` | The URL dataset |

To run the models with your own hyper-parameters, run `transformer.py` or `LSTM.py`.


### Prerequisites

| Library | Version |
|---|---|
| Python | 3.5.5 (Anaconda) |
| torch | 1.10.1 |
| numpy | 1.19.5 |
| matplotlib | 3.3.4 |

-----------------------------------------------------

## 🙌 References and credits

- [1] The dataset `malicious_phish.csv` was taken from Kaggle.
- [2] We used some explanations and code from the "Deep Learning 046211" course tutorials.
- [3] For more on classifying URLs with lexical methods, see: M. Mamun, M. Rathore, A. Lashkari, N. Stakhanova, and A. Ghorbani, "Detecting Malicious URLs Using Lexical Analysis", International Conference on Network and System Security, 2016.