This repository has been archived by the owner on Aug 17, 2021. It is now read-only.

Commit

Merge pull request #80 from CSIRT-MU/on-guest-provisioning
On guest provisioning
cermmik authored May 22, 2018
2 parents bd889d2 + 74eaae2 commit 056aa80
Showing 18 changed files with 408 additions and 290 deletions.
51 changes: 28 additions & 23 deletions README.md
@@ -4,21 +4,21 @@ A framework for the real-time IP flow data analysis built on Apache Spark Streaming

## About Stream4Flow

The basis of the Stream4Flow framework is formed by the IPFIXCol collector, Kafka messaging system, Apache Spark, and Elastic Stack. IPFIXCol is able to receive IP flows from majority of network Netflow/IPFIX probes (e.g. Flowmon Probe, softflowd, etc.). IPFIXCol enables incoming IP flow records to be transformed into the JSON format provided to the Kafka messaging system. The selection of Kafka was based on its scalability and partitioning possibilities, which provide sufficient data throughput. Apache Spark was selected as the data stream processing framework for its quick IP flow data throughput, available programming languages (Scala, Java, or Python) and MapReduce programming model. The analysis results are stored in Elastic Stack containing Logstash, Elasticsearch, and Kibana, which enable storage, querying, and visualizing the results. The Stream4Flow framework also contains the additional web interface in order to make administration easier and visualize complex results of the analysis.
The basis of the Stream4Flow framework is formed by the IPFIXCol collector, the Kafka messaging system, Apache Spark, and the Elastic Stack. IPFIXCol can receive IP flows from the majority of NetFlow/IPFIX network probes (e.g., Flowmon Probe, softflowd). IPFIXCol transforms incoming IP flow records into JSON and passes them to the Kafka messaging system. Kafka was selected for its scalability and partitioning capabilities, which provide sufficient data throughput. Apache Spark was selected as the data stream processing framework for its fast IP flow data throughput, choice of programming languages (Scala, Java, or Python), and MapReduce programming model. The analysis results are stored in the Elastic Stack (Logstash, Elasticsearch, and Kibana), which enables storing, querying, and visualizing the results. The Stream4Flow framework also provides an additional web interface to ease administration and visualize complex analysis results.

![architecture](https://stream4flow.ics.muni.cz/images/architecture.png?)



### Framework Features
- **Full Stack Solution**: The framework provides full stack solution for IP flow analysis prototyping. It is possible to connect to majority of IP flow network probes. The framework integrates tools for data collection, data processing, manipulation, storage, and presentation. It is compatible with common network probes for IP flow monitoring and export.
- **Easy Deployment**: The deployment of the framework is fully automated for cloud deployment using cutting edge technologies for software orchestration. The deployment comes with example prototype applications and initial tests to further ease the prototype development.
- **Full Stack Solution**: The framework provides a full-stack solution for IP flow analysis prototyping. It can connect to the majority of IP flow network probes. The framework integrates tools for data collection, processing, manipulation, storage, and presentation. It is compatible with common network probes for IP flow monitoring and export.
- **Easy Deployment**: The deployment of the framework is fully automated for cloud deployment using cutting-edge technologies for software orchestration. The deployment comes with example prototype applications and initial tests to further ease the prototype development.
- **High Performance**: Thanks to its scalability, the framework is fit for processing network traffic in a wide range of networks, from small company networks to the large-scale, high-speed networks of ISPs. Its distributed nature enables computationally intensive analyses.
- **Real-time Analysis**: The stream-based approach provides results of an IP flow analysis prototype with only a few seconds' delay. The results can be explored in various ways in a user interface in real time. An IP flow analysis prototype can be immediately improved according to the provided results.

### Use-cases
- **Stream-Based Network Monitoring**: The framework enables running analyses on data streams. It is suitable for various data pre-processing tasks and continuous queries.
- **Real-time Attack Detection**: The stream-based approach enables to detect attacks with only a few seconds delay. An instant attack detection provides time to set up a defense and lowers harms caused by an attack. A sample detections are provided, and you can easily create custom detection method.
- **Real-time Attack Detection**: The stream-based approach enables detection of attacks with only a few seconds' delay. Instant attack detection provides time to set up a defense and lowers the harm caused by an attack. Sample detections are provided, and you can easily create a custom detection method.
- **Host Profiling**: Apart from monitoring the whole network, monitoring can be focused on individual hosts. No host agents are needed, and the monitoring is transparent. You can get a long-term profile for each host connected to your network and explore its behavior.
- **Situational Awareness**: Data from network monitoring, attack detection, and host profiling can be combined to provide complex situational awareness of your network. Advanced analysis of the collected data can reveal information important from both security and business perspectives.

@@ -44,27 +44,31 @@ We have it all prepared for you. Everything is preconfigured. You have to only c

#### Requirements
- [Vagrant](https://www.vagrantup.com/) >= 1.9.7
- [ansible](https://www.ansible.com/) >= 2.1.0
- [Ansible](https://www.ansible.com/) >= 2.1.0
- Python 2.7
- Internet connection

### We support two types of deployment:
- **Standalone deployment:** Stream4Flow will be deployed into virtual machines on your physical machine using [Vagrant](https://www.vagrantup.com/)
- **Cluster deployment:** you can deploy Stream4Flow on your own cluster using [ansible](https://www.ansible.com/)
- **Cluster deployment:** you can deploy Stream4Flow on your cluster using [Ansible](https://www.ansible.com/)
- requirement: Debian-based OS

### Standalone deployment

_Note: The minimum hardware requirement is 16GB of RAM_
_Note: The minimum hardware requirement is 12GB of RAM_

1. download repository
2. go to folder **provisioning/**
3. run vagrant provisioning: `vagrant up`
3. (optional) update the guest configuration in **configuration.yml**
4. run vagrant provisioning: `vagrant up`, or start guests separately: `vagrant up <guest-name>`
5. upload your SSH key to the guests or allow password-based SSH login (use `vagrant ssh <guest-name>`)
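The steps above can be sketched as a shell session. This is a dry-run sketch: each command is only printed, not executed, so `vagrant` need not be installed to inspect the sequence; the repository URL and guest names are taken from this README and may need adjusting for your setup.

```shell
# Dry-run sketch of the standalone deployment; each command is printed,
# not executed. Adapt 'plan' to actually run the commands for a real deployment.
PLAN=""
plan() {
  PLAN="$PLAN $*;"
  printf '+ %s\n' "$*"
}

plan git clone https://github.com/CSIRT-MU/Stream4Flow.git  # 1. download repository
plan cd Stream4Flow/provisioning                            # 2. go to folder provisioning/
plan vagrant up                                             # 3. provision all guests at once
for guest in producer sparkMaster sparkSlave101 consumer; do
  plan vagrant up "$guest"                                  #    ... or each guest separately
done
plan vagrant ssh consumer                                   # 4. log in to upload an SSH key
```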

See [provisioning/README.md](./provisioning/README.md) for additional information about provisioning and Vagrant usage.


### Cluster deployment

_Note: machines in cluster must run Debian OS with systemd_
_Note: machines in the cluster must run Debian OS with systemd_

1. download repository
2. go to folder **provisioning/ansible**
@@ -82,8 +86,7 @@

#### Run an example application `protocols_statistics`

1. login to Spark Master machine via ssh
`ssh spark@<spark-master-ip>`
1. login to the Spark Master machine via SSH: `ssh spark@<spark-master-ip>`
2. go to application directory
`cd /home/spark/applications/`
3. run example application
@@ -111,16 +114,16 @@ Stream4Flow is compatible with any Netflow v5/9 or IPFIX network probe. To measu

````bibtex
@ARTICLE{jirsik-2017-toward,
author={T. Jirsik and M. Cermak and D. Tovarnak and P. Celeda},
journal={IEEE Communications Magazine},
title={Toward Stream-Based IP Flow Analysis},
year={2017},
volume={55},
number={7},
pages={70-76},
doi={10.1109/MCOM.2017.1600972},
ISSN={0163-6804},
month={},}
author={Jirsik, Tomas and Cermak, Milan and Tovarnak, Daniel and Celeda, Pavel},
journal={IEEE Communications Magazine},
title={Toward Stream-Based IP Flow Analysis},
year={2017},
volume={55},
number={7},
pages={70-76},
doi={10.1109/MCOM.2017.1600972},
ISSN={0163-6804},
}
````

**Plain text**
@@ -129,9 +132,11 @@ T. Jirsik, M. Cermak, D. Tovarnak and P. Celeda, "Toward Stream-Based IP Flow An
doi: 10.1109/MCOM.2017.1600972
```

**Related Publication**
**Related Publications**

[Toward Stream-Based IP Flow Analysis](https://doi.org/10.1109/MCOM.2017.1600972)
- [Toward Stream-Based IP Flow Analysis](https://doi.org/10.1109/MCOM.2017.1600972)
- [A Performance Benchmark for NetFlow Data Analysis on Distributed Stream Processing Systems](https://doi.org/10.1109/NOMS.2016.7502926)
- [Real-time analysis of NetFlow data for generating network traffic statistics using Apache Spark](https://doi.org/10.1109/NOMS.2016.7502952)

# Acknowledgement

47 changes: 33 additions & 14 deletions provisioning/README.md
@@ -2,22 +2,41 @@

## Vagrant Provisioning

Configuration of vagrant provisioning is in Vagrantfile.

Configurable options:
- IP CONFIGURATION: This section configures the IP addresses of the virtual machines
- NUMBER OF SLAVES: You can choose the number of slaves to build (max. 155); Spark Slave IP addresses start at sparkSlave_prefix.101 and increment by one.
- VM PROPERTIES CONFIGURATION: Options for the virtual machines (memory in MB and the number of CPUs can be set)
All configuration of the guests deployed by Vagrant provisioning is in [configuration.yml](./configuration.yml).

**Configurable options:**
- *common* – settings common to all guests
  - *box, box_url* – the Vagrant boxes to use (different boxes can cause the provisioning to malfunction)
  - *provision_on_guest* – true provisions each guest separately; false provisions all guests at once, which is faster
- *producer* – producer guest settings
  - *ip* – IP address of the guest
  - *memory* – reserved memory (decreasing it can cause the framework to malfunction)
  - *cpu* – number of virtual CPUs (decreasing it can cause the framework to malfunction)
- *sparkMaster* – Spark master guest settings
  - *ip* – IP address of the guest
  - *memory* – reserved memory (decreasing it can cause the framework to malfunction)
  - *cpu* – number of virtual CPUs (decreasing it can cause the framework to malfunction)
- *sparkSlave* – slave guest settings (each slave has the same configuration)
  - *count* – number of slaves to provision (max. 155)
  - *ip_prefix* – IP address prefix of the slave guests (suffixes start at 101)
  - *memory* – reserved memory (decreasing it can cause the framework to malfunction)
  - *cpu* – number of virtual CPUs (decreasing it can cause the framework to malfunction)
- *consumer* – consumer guest settings
  - *ip* – IP address of the guest
  - *memory* – reserved memory (decreasing it can cause the framework to malfunction)
  - *cpu* – number of virtual CPUs (decreasing it can cause the framework to malfunction)

### Vagrant commands:
- `vagrant up [<VM_name>]` – Brings up virtual machine(s)
- `vagrant halt [<VM_name>]` – Shuts down virtual machine(s)
- `vagrant destroy [<VM_name>]` – Completely deletes virtual machine(s) and their associated resources (virtual hard drives, ...)

## Ansible Provisioning
- `vagrant up` – Brings up the whole framework
- `vagrant up <guest_name>` – Brings up the given guest
- `vagrant halt <guest_name>` – Shuts down the given guest
- `vagrant destroy <guest_name>` – Completely deletes the given guest and its associated resources (virtual hard drives, ...)
- `vagrant provision <guest_name>` – Runs Ansible provisioning on the guest
- `vagrant ssh <guest_name>` – Connects to the guest via SSH

Available guest names: *producer*, *sparkMaster*, *sparkSlave101* ... *sparkSlave156*, *consumer*.
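Since only these guest names are valid, it can be convenient to check a name before handing it to Vagrant. The `valid_guest` and `s4f` helpers below are hypothetical, not part of the repository; `s4f` only prints the command it would run.

```shell
# Hypothetical helpers (not shipped with Stream4Flow): validate a guest name
# against the list above before handing it to Vagrant.
valid_guest() {
  case "$1" in
    producer|sparkMaster|consumer) return 0 ;;
    sparkSlave10[1-9]|sparkSlave1[1-4][0-9]|sparkSlave15[0-6]) return 0 ;;  # sparkSlave101..156
    *) return 1 ;;
  esac
}

s4f() {  # usage: s4f <vagrant-subcommand> <guest_name>; prints instead of executing
  if valid_guest "$2"; then
    echo "vagrant $1 $2"
  else
    echo "unknown guest: $2" >&2
    return 1
  fi
}

s4f up sparkSlave101
```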

All configurable variables are stored in `ansible/roles/globas_vars/*`
Templates of configuration files are stored in `ansible/roles/<name_of_role>/templates/*`

## Ansible Provisioning

Stream4Flow framework configuration and variables are available in [ansible/group_vars/*](./ansible/group_vars/).
- Templates of configuration files are stored in `ansible/roles/<name_of_role>/templates/*`
