CoreOS Distributed Setup
- Go to your AWS console, select CloudFormation, and create a new stack
- Use the coreos-stable-hvm-codeocean.template file as the template for your cluster
- Configure your cluster size and machine size, and provide a new etcd discovery URL (https://discovery.etcd.io/new)
- Once the stack has been created, you should see your running instances in EC2
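The discovery URL step can be sketched as follows. This is a minimal, illustrative helper, not part of the original setup: the function name `discovery_request_url` is hypothetical, the `size` query parameter is taken from the description of the `DiscoveryURL` template parameter, and the 3-12 bounds mirror the `ClusterSize` parameter. It only builds the request URL; opening that URL (in a browser or with an HTTP client) should return the actual discovery URL to paste into the stack parameters.

```python
from urllib.parse import urlencode

DISCOVERY_ENDPOINT = "https://discovery.etcd.io/new"
MIN_NODES, MAX_NODES = 3, 12  # bounds of the ClusterSize template parameter


def discovery_request_url(cluster_size: int) -> str:
    """Build the URL that requests a new discovery token for cluster_size nodes."""
    if not MIN_NODES <= cluster_size <= MAX_NODES:
        raise ValueError(
            f"cluster size must be between {MIN_NODES} and {MAX_NODES}"
        )
    return f"{DISCOVERY_ENDPOINT}?{urlencode({'size': cluster_size})}"


print(discovery_request_url(3))  # prints https://discovery.etcd.io/new?size=3
```

Each new cluster needs its own token, so a URL obtained this way should not be reused for a second stack.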
coreos-stable-hvm-codeocean.template:
{
"AWSTemplateFormatVersion": "2010-09-09",
"Description": "CoreOS on EC2: http://coreos.com/docs/running-coreos/cloud-providers/ec2/",
"Mappings" : {
"RegionMap" : {
"eu-central-1" : {
"AMI" : "ami-02211b1f"
},
"ap-northeast-1" : {
"AMI" : "ami-22d27b22"
},
"us-gov-west-1" : {
"AMI" : "ami-e53a59c6"
},
"sa-east-1" : {
"AMI" : "ami-45a62a58"
},
"ap-southeast-2" : {
"AMI" : "ami-2b2e6911"
},
"ap-southeast-1" : {
"AMI" : "ami-0ef1f15c"
},
"us-east-1" : {
"AMI" : "ami-6b1cd400"
},
"us-west-2" : {
"AMI" : "ami-f5a5a5c5"
},
"us-west-1" : {
"AMI" : "ami-bf8477fb"
},
"eu-west-1" : {
"AMI" : "ami-50f4b927"
}
}
},
"Parameters": {
"InstanceType" : {
"Description" : "EC2 HVM instance type (m3.medium, etc).",
"Type" : "String",
"Default" : "m3.medium",
"ConstraintDescription" : "Must be a valid EC2 HVM instance type."
},
"ClusterSize": {
"Default": "3",
"MinValue": "3",
"MaxValue": "12",
"Description": "Number of nodes in cluster (3-12).",
"Type": "Number"
},
"DiscoveryURL": {
"Description": "A unique etcd cluster discovery URL. Grab a new token from https://discovery.etcd.io/new?size=<your cluster size>",
"Type": "String"
},
"AdvertisedIPAddress": {
"Description": "Use 'private' if your etcd cluster is within one region or 'public' if it spans regions or cloud providers.",
"Default": "private",
"AllowedValues": ["private", "public"],
"Type": "String"
},
"AllowSSHFrom": {
"Description": "The net block (CIDR) that SSH is available to.",
"Default": "0.0.0.0/0",
"Type": "String"
},
"KeyPair" : {
"Description" : "The name of an EC2 Key Pair to allow SSH access to the instance.",
"Type" : "String"
}
},
"Resources": {
"CoreOSSecurityGroup": {
"Type": "AWS::EC2::SecurityGroup",
"Properties": {
"GroupDescription": "CoreOS SecurityGroup",
"SecurityGroupIngress": [
{"IpProtocol": "tcp", "FromPort": "22", "ToPort": "22", "CidrIp": {"Ref": "AllowSSHFrom"}}
]
}
},
"Ingress4001": {
"Type": "AWS::EC2::SecurityGroupIngress",
"Properties": {
"GroupName": {"Ref": "CoreOSSecurityGroup"}, "IpProtocol": "tcp", "FromPort": "4001", "ToPort": "4001", "SourceSecurityGroupId": {
"Fn::GetAtt" : [ "CoreOSSecurityGroup", "GroupId" ]
}
}
},
"Ingress2379": {
"Type": "AWS::EC2::SecurityGroupIngress",
"Properties": {
"GroupName": {"Ref": "CoreOSSecurityGroup"}, "IpProtocol": "tcp", "FromPort": "2379", "ToPort": "2379", "SourceSecurityGroupId": {
"Fn::GetAtt" : [ "CoreOSSecurityGroup", "GroupId" ]
}
}
},
"Ingress2380": {
"Type": "AWS::EC2::SecurityGroupIngress",
"Properties": {
"GroupName": {"Ref": "CoreOSSecurityGroup"}, "IpProtocol": "tcp", "FromPort": "2380", "ToPort": "2380", "SourceSecurityGroupId": {
"Fn::GetAtt" : [ "CoreOSSecurityGroup", "GroupId" ]
}
}
},
"CoreOSServerAutoScale": {
"Type": "AWS::AutoScaling::AutoScalingGroup",
"Properties": {
"AvailabilityZones": {"Fn::GetAZs": ""},
"LaunchConfigurationName": {"Ref": "CoreOSServerLaunchConfig"},
"MinSize": "3",
"MaxSize": "12",
"DesiredCapacity": {"Ref": "ClusterSize"},
"Tags": [
{"Key": "Name", "Value": { "Ref" : "AWS::StackName" }, "PropagateAtLaunch": true}
]
}
},
"CoreOSServerLaunchConfig": {
"Type": "AWS::AutoScaling::LaunchConfiguration",
"Properties": {
"ImageId" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AMI" ]},
"InstanceType": {"Ref": "InstanceType"},
"KeyName": {"Ref": "KeyPair"},
"SecurityGroups": [{"Ref": "CoreOSSecurityGroup"}],
"UserData" : { "Fn::Base64":
{ "Fn::Join": [ "", [
"#cloud-config\n\n",
"coreos:\n",
" etcd2:\n",
" discovery: ", { "Ref": "DiscoveryURL" }, "\n",
" advertise-client-urls: http://$", { "Ref": "AdvertisedIPAddress" }, "_ipv4:2379\n",
" initial-advertise-peer-urls: http://$", { "Ref": "AdvertisedIPAddress" }, "_ipv4:2380\n",
" listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001\n",
" listen-peer-urls: http://$", { "Ref": "AdvertisedIPAddress" }, "_ipv4:2380\n",
" units:\n",
" - name: etcd2.service\n",
" command: start\n",
" - name: fleet.service\n",
" command: start\n",
" - name: flanneld.service\n",
" drop-ins:\n",
" - name: 50-network-config.conf\n",
" content: |\n",
" [Service]\n",
" ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ \"Network\": \"10.1.0.0/16\" }'\n",
" command: start\n",
" - name: docker.service\n",
" drop-ins:\n",
" - name: 60-docker-config.conf\n",
" content: |\n",
" [Service]\n",
" ExecStart=\n",
" ExecStart=/usr/lib/coreos/dockerd --daemon --host=fd:// --host=tcp://$", { "Ref": "AdvertisedIPAddress" }, "_ipv4:2376 $DOCKER_OPTS $DOCKER_OPT_BIP $DOCKER_OPT_MTU $DOCKER_OPT_IPMASQ\n"
] ]
}
}
}
}
}
}
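To make the UserData section of the template easier to read, the following sketch reproduces in Python what the "Fn::Join" does for the etcd2 part: it concatenates the fixed strings with the DiscoveryURL and AdvertisedIPAddress parameters. The function name `render_etcd2_config` is hypothetical, the YAML indentation is an assumption (it is not visible in the template as shown), and the discovery URL used in the example is a placeholder. The resulting `$private_ipv4` or `$public_ipv4` variable is left in the text on purpose, since it is substituted on the instance when the cloud-config is processed.

```python
def render_etcd2_config(discovery_url: str, advertised: str = "private") -> str:
    """Render the etcd2 part of the cloud-config the way the template joins it."""
    if advertised not in ("private", "public"):
        raise ValueError("advertised must be 'private' or 'public'")
    ip_var = f"${advertised}_ipv4"  # e.g. $private_ipv4, resolved on the node
    lines = [
        "#cloud-config",
        "",
        "coreos:",
        "  etcd2:",
        f"    discovery: {discovery_url}",
        f"    advertise-client-urls: http://{ip_var}:2379",
        f"    initial-advertise-peer-urls: http://{ip_var}:2380",
        "    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001",
        f"    listen-peer-urls: http://{ip_var}:2380",
    ]
    return "\n".join(lines) + "\n"


# Placeholder token: use the URL returned by the discovery service instead.
print(render_etcd2_config("https://discovery.etcd.io/<token>"))
```

Rendering the text locally like this is only a way to check what the instances will receive; the stack itself still gets its UserData from the template.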
In order for all services to be reachable within the cluster, you have to allow inbound traffic on the following ports from within the cluster's private network:
- TCP 4001
- TCP 2379
- TCP 2380
Additionally, to make the codeocean web app publicly reachable, you have to open the following port for all networks:
- TCP 3000 0.0.0.0/0
You can configure these rules in the cluster's security group, which can be found under NETWORK & SECURITY in your EC2 Dashboard.
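The rules listed above can also be written down as data. The sketch below builds them in the shape of the IpPermissions structure used by the EC2 AuthorizeSecurityGroupIngress API (for example through an AWS SDK), without calling AWS itself. The helper names and the security group ID are hypothetical placeholders. It assumes the cluster-internal ports are restricted by referencing the cluster's own security group as the source, as the template does.

```python
CLUSTER_PORTS = (4001, 2379, 2380)  # etcd ports, cluster-internal only
PUBLIC_PORTS = (3000,)              # codeocean web app, open to all networks


def cluster_rule(port: int, group_id: str) -> dict:
    """Allow TCP traffic on port from members of the given security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


def public_rule(port: int, cidr: str = "0.0.0.0/0") -> dict:
    """Allow TCP traffic on port from the given CIDR block."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


def ingress_rules(group_id: str) -> list:
    """Collect all inbound rules described for the cluster."""
    rules = [cluster_rule(port, group_id) for port in CLUSTER_PORTS]
    rules += [public_rule(port) for port in PUBLIC_PORTS]
    return rules


# "sg-0123456789abcdef0" is a made-up ID; use your cluster's security group.
for rule in ingress_rules("sg-0123456789abcdef0"):
    print(rule)
```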
All applications are managed by fleet and run in Docker containers. Use fleetctl to manage services. Before a service can run on an arbitrary node, you first have to submit its service file to the cluster. fleetctl can be run on any node within the cluster.
- Copy the co-postgres.service file to a node
- Run fleetctl submit co-postgres.service to load the file into the cluster
- To start the postgres service, run fleetctl start co-postgres.service
The service kills and deletes existing postgres containers, pulls the newest image, and saves the IP of the postgres container into the etcd store under /postgres-ip. Currently, the postgres data is not being persisted, so removing/restarting the container resets the database.
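As an illustration of how another service might look up the stored address, the following sketch reads a key through etcd's v2 HTTP keys API, which returns the value under `node.value` in a JSON document. The helper names are hypothetical, the endpoint assumes etcd is listening locally on port 2379 as configured in the template, and the sample response is made up for demonstration; it is an assumption that the service writes the key in a form readable through this API.

```python
import json


def etcd_key_url(key: str, endpoint: str = "http://127.0.0.1:2379") -> str:
    """Return the etcd v2 keys API URL for the given key."""
    return f"{endpoint}/v2/keys/{key.lstrip('/')}"


def parse_value(body: str) -> str:
    """Extract the stored value from an etcd v2 GET response body."""
    return json.loads(body)["node"]["value"]


# Made-up response body in the v2 API format; the IP is illustrative only.
sample = (
    '{"action": "get", "node": {"key": "/postgres-ip", '
    '"value": "10.1.42.2", "modifiedIndex": 7, "createdIndex": 7}}'
)

print(etcd_key_url("/postgres-ip"))
print(parse_value(sample))
```

On a cluster node, the same lookup is usually simpler with etcdctl get /postgres-ip.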
- Copy the pythondev@.service file to a node
- Run fleetctl submit pythondev@.service to load the template into the cluster
- To start the python execution environment service, run fleetctl start pythondev@i.service, where i is an arbitrary but unique number. You can use bash syntax to start multiple containers in a single line (fleetctl start pythondev@{1..10}.service)
- Copy the codeocean@.service file to a node
- Run fleetctl submit codeocean@.service to load the template into the cluster
- To start the codeocean service, run fleetctl start codeocean@i.service, where i is an arbitrary but unique number. You can use bash syntax to start multiple containers in a single line (fleetctl start codeocean@{1..2}.service)
The codeocean service kills and removes containers with the same name, pulls the newest version of codeocean from github, starts the rails application and initializes the database by running rake db:setup. It also reads the address of the postgres container and enables communication between the rails app and the postgres server even when these containers are running on different nodes.
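The bash brace expansion used in the start commands above (for example pythondev@{1..10}.service) simply produces one unit name per number. The sketch below shows the equivalent expansion in Python, which can be handy when the instance numbers come from a script. The helper names are hypothetical, and the command is only built as an argument list, not executed.

```python
from typing import List


def unit_names(template: str, first: int, last: int) -> List[str]:
    """Expand template@{first..last}.service into individual unit names."""
    return [f"{template}@{i}.service" for i in range(first, last + 1)]


def fleetctl_start_command(template: str, first: int, last: int) -> List[str]:
    """Build the argument list for starting all instances in one call."""
    return ["fleetctl", "start", *unit_names(template, first, last)]


print(" ".join(fleetctl_start_command("codeocean", 1, 2)))
# prints fleetctl start codeocean@1.service codeocean@2.service
```

On a cluster node, the resulting list could be handed to subprocess.run to execute the command.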
Things to consider before migrating:
- Submissions are currently stored in etcd and execution containers download these files from there before execution. This requires these containers to have access to the etcd store. Currently, there is no way to limit access to the etcd store (e.g. via ACLs) so the user submitted program can access and modify everything in the etcd store. This can be avoided by switching to websockets for file transfer.
- Currently, the postgres container does not persist data. If the postgres container is killed and/or removed, all data is lost. You could use a data container or other storage options like S3 to persist the database.
- The codeocean container always runs rake db:setup on launch. To avoid losing data, you should remove this step once the postgres data is persisted.
- Because we use container pooling, the containers should basically never stop and thus never get deleted. Currently, the submission files are not being removed from the containers, which could result in a lot of unnecessary disk usage.
- The postgres container IP address is written to etcd after the postgres container is launched. When the codeocean container starts, it uses this IP to connect to the database. If the postgres container is restarted and its IP changes, codeocean does not automatically update its config for the new database address. In this case, you have to restart the codeocean container. To avoid this, you could use an internal DNS server so that the postgres address does not change.
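The last point, the stale database address, could be handled by a small check until a DNS-based solution is in place. The sketch below is only one possible approach under stated assumptions: the function name `restart_commands` is hypothetical, the caller is assumed to know which IP the codeocean units were started with and which IP is currently stored under /postgres-ip, and the returned fleetctl stop and start commands are built as argument lists rather than executed.

```python
from typing import List


def restart_commands(ip_at_start: str, current_ip: str,
                     units: List[str]) -> List[List[str]]:
    """Return the fleetctl commands needed if the postgres IP has changed."""
    if ip_at_start == current_ip:
        return []  # codeocean still points at the right database address
    return [
        ["fleetctl", "stop", *units],
        ["fleetctl", "start", *units],
    ]


# Illustrative values only; the IPs and unit names are made up.
units = ["codeocean@1.service", "codeocean@2.service"]
for command in restart_commands("10.1.42.2", "10.1.57.3", units):
    print(" ".join(command))
```

Because the codeocean service re-reads the postgres address on start, restarting the units is enough to pick up the new IP. Note that it also runs rake db:setup on launch, as described above.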