Core Services Failover Campaign 2014

rahimbouchra edited this page Dec 12, 2014 · 2 revisions

December 2014 Core Services Failover Campaign - #AAROCCoreAvaill2014

Campaign overview

  • Leader: Bouchra Rahim (CNRST, MAGRID)
  • Start: 15/12/2014
  • End: 23/12/2014

Motivation

The Regional Operations Centre (ROC) operates certain so-called "Core Services", which are described in the Resource Infrastructure Provider MoU signed between CSIR Meraka and EGI.eu. These need to be 100 % available (the actual thresholds are defined in the MoU on a per-service basis), so a fail-over capability is needed.

Currently, we run a next version of the services to provide continuous integration and rolling updates. However, these instances are at the same site as the actual production services, so when that site suffers a network or power outage, we lose everything.

The main impact is that the availability/reliability (A/R) figures for sites are degraded, which is not the sites' problem but the fault of the ROC.

Description of work to be done

  • ROC : Definition of services to be replicated and DevOps code for deployment
  • MA-01-CNRST : Provision VMs with relevant IP and config on which to deploy the services.

We would like the machines to be available in both "regions" of the ROC - north and south. For this reason, we plan to put the failover services in Morocco. Further backup instances can be considered later.

Top-BDII

Requirements:

  • VM Resources :
    • 2 cores, > 4 GB RAM
    • 50 GB disk

  • Network :
    • public IP
    • BDII ports open
    • ssh port to Ansible control machine open
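
The network requirements above can be verified from the Ansible control machine with a plain TCP reachability check. The following is a minimal sketch, not part of the campaign's DevOps code: the function names are hypothetical, and the port numbers in the usage note (2170 as the conventional BDII LDAP port, 22 as the default ssh port) are assumptions rather than values stated on this page.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port number to its reachability from this machine."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For example, `check_ports("<failover-vm-hostname>", [2170, 22])` run on the control machine returns a dictionary showing which of the assumed ports accept connections. A `False` value only indicates that the port is closed or filtered on the path from the control machine, so it is worth running the check from outside the site as well for the public BDII port.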

Procedure: see https://wiki.egi.eu/wiki/MAN05_top-BDII_and_site-BDII_High_Availability

SAM-NAGIOS

Requirements:

  • VM Resources :
    • 4 cores, > 6 GB RAM
    • 50 GB disk

  • Network :
    • public IP
    • SAM-NAGIOS ports open
    • ssh port to Ansible control machine open
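
Before deploying, the VM resources offered by the hosting site can be compared against the minimums listed on this page. The sketch below is illustrative only: `VmSpec`, `REQUIREMENTS` and `unmet_requirements` are hypothetical names, and the figures are copied from the Top-BDII and SAM-NAGIOS requirement lists, with RAM treated as a strict lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    cores: int
    ram_gb: float
    disk_gb: float


# Minimums taken from the requirement lists on this page.
# RAM is a strict lower bound there ("> 4 GB", "> 6 GB").
REQUIREMENTS = {
    "top-bdii": VmSpec(cores=2, ram_gb=4, disk_gb=50),
    "sam-nagios": VmSpec(cores=4, ram_gb=6, disk_gb=50),
}


def unmet_requirements(service, offered):
    """Return a list of shortfalls; an empty list means the VM fits."""
    required = REQUIREMENTS[service]
    problems = []
    if offered.cores < required.cores:
        problems.append(f"cores: {offered.cores} < {required.cores}")
    if offered.ram_gb <= required.ram_gb:
        problems.append(f"RAM: {offered.ram_gb} GB is not > {required.ram_gb} GB")
    if offered.disk_gb < required.disk_gb:
        problems.append(f"disk: {offered.disk_gb} GB < {required.disk_gb} GB")
    return problems
```

For example, `unmet_requirements("top-bdii", VmSpec(cores=2, ram_gb=8, disk_gb=50))` returns an empty list, while a VM with exactly 6 GB of RAM is reported as insufficient for SAM-NAGIOS because the requirement is "more than 6 GB".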

Procedure ... todo ...