Welcome to the SOC Analyst Level 1 Replacement using RAG LLM project! This repository presents a small research-oriented Proof of Concept (POC) aimed at exploring the feasibility of using a Retrieval-Augmented Generation (RAG) Large Language Model (LLM) to replace or assist a Level 1 SOC (Security Operations Center) Analyst.
Security Operations Centers are the backbone of cybersecurity in organizations, continuously monitoring and analyzing data to detect potential threats. However, the increasing volume of security logs and alerts can overwhelm human analysts, particularly those at Level 1, who are responsible for initial triage and response.
This project explores the potential of using an LLM, combined with a retrieval system, to automate some of the tasks typically performed by a Level 1 SOC analyst. By leveraging advanced natural language processing (NLP) techniques, the system can answer queries related to server logs and provide actionable insights.
- LangChain: Utilized for orchestrating the retrieval-augmented generation (RAG) pipeline.
- Ollama LLM: The LLM backbone, capable of understanding and processing natural language queries.
- FAISS: A vector store for efficient retrieval of relevant log information.
- Python: The core language used for implementation.
- Pandas & Matplotlib (Optional): For potential future extensions involving data analysis and visualization.
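To make the components concrete, here is a toy, dependency-free sketch of what the vector store contributes: embed each chunk, then return the stored chunk nearest to a query. The bag-of-words "embedding" and helper names below are purely illustrative; the project itself uses `OllamaEmbeddings` and FAISS for this step.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest(query, chunks):
    """Return the stored chunk most similar to the query (FAISS does this at scale)."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

chunks = [
    "2024-05-01 10:02 sshd: Failed password for root from 203.0.113.7",
    "2024-05-01 10:05 nginx: GET /index.html 200",
]
print(nearest("failed login attempts", chunks))
```

A real embedding model captures semantic similarity rather than word overlap, but the retrieval mechanics are the same.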
- Log Ingestion: The system loads and processes server logs stored in a Markdown file (`logs1.md`). The logs are split into manageable chunks for efficient processing.
- Vectorization: Each chunk of log data is embedded into a vector space using the `OllamaEmbeddings` model, which enables efficient similarity searches.
- Query Processing: Users can input natural language queries, such as "What are the suspicious activities in the logs?" The system retrieves the relevant log information and uses the LLM to generate a concise, contextually accurate response.
- Response Generation: The system provides a response based on the retrieved context, simulating the role of a Level 1 SOC analyst answering queries about the logs.
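The ingestion step above can be sketched in plain Python. The project presumably uses a LangChain text splitter, but the underlying idea is fixed-size chunks with a small overlap so that log entries straddling a boundary are not lost; `split_into_chunks` is a hypothetical helper, not code from `main.py`:

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split raw log text into overlapping character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

# Example: a small synthetic block of auth logs
logs = "\n".join(
    f"2024-05-01 10:{i:02d} sshd: Failed password for root" for i in range(20)
)
for chunk in split_into_chunks(logs):
    print(repr(chunk[:40]), "...")
```

Each chunk is then embedded and indexed, so a query only needs to compare against chunk vectors rather than the full log file.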
```
├── logs/            # Directory containing log files for analysis
│   ├── logs1.md     # Sample log data file 1
│   └── logs2.md     # Sample log data file 2
├── main.py          # Main Python script implementing the POC
├── unit_testing.py  # Unit tests for the POC
├── README.md        # Project documentation (you are here!)
├── requirements.txt # Python dependencies
└── LICENSE          # License information for the project
```
Before you start, ensure you have the following installed:
- Python 3.8+
- Virtual environment tools (optional but recommended)
- Clone the repository:

  ```bash
  git clone https://github.com/clab60917/RAG-LLM-SOC_analyst.git
  cd RAG-LLM-SOC_analyst
  ```
- Create a virtual environment (optional):

  ```bash
  python -m venv env
  source env/bin/activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Run the POC:

  ```bash
  python main.py
  ```
- Querying the Logs: Once the script is running, you can start querying the logs. Type your query and press Enter. For example:

  ```
  Query: What are the most recent suspicious activities?
  Query: Summarize the failed login attempts.
  ```
- Exit: To exit the script, simply type `exit`.
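The interactive behaviour described above (answer queries until the user types `exit`) can be sketched as a small loop. `answer_query` is a stand-in for the real RAG chain in `main.py`, not the actual implementation:

```python
def answer_query(query):
    """Stub for the RAG chain: retrieve context, then ask the LLM (hypothetical)."""
    return f"[analysis for: {query}]"

def repl(input_fn=input, output_fn=print):
    """Read queries until the user types 'exit' (case-insensitive)."""
    while True:
        query = input_fn("Query: ").strip()
        if query.lower() == "exit":
            break
        output_fn(answer_query(query))

# In main.py this would be invoked as: repl()
```

Accepting `input_fn`/`output_fn` as parameters keeps the loop unit-testable, which fits the repository's `unit_testing.py` approach.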
This POC lays the groundwork for a more comprehensive system capable of fully automating Level 1 SOC operations. Future enhancements might include:
- Real-time Log Streaming: Integrate with live data sources for real-time analysis.
- Advanced Analytics: Implement graph-based and statistical analysis of log data.
- Actionable Responses: Automate responses such as blocking IP addresses or triggering alerts.
This project is part of an ongoing small research initiative. The ultimate goal is to evaluate whether RAG-based LLMs can efficiently scale the capabilities of SOC teams, reducing the workload on human analysts and enabling faster, more accurate incident response.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! If you have suggestions or improvements, feel free to fork the repository and submit a pull request.
Special thanks to the creators of LangChain, Ollama, and the open-source community for providing the tools and frameworks that made this project possible.
👤 Author: Clab60917