Add support for operating without pre-installed cloud image
samkumar committed Jun 28, 2021
1 parent bfa8267 commit 11c6ed6
Showing 10 changed files with 167 additions and 27 deletions.
10 changes: 8 additions & 2 deletions README.md
@@ -81,11 +81,13 @@ We use the term _cluster_ to mean a group of (virtual) machines that are used to

One can use `magebench.py` to spawn a cluster, with a particular configuration passed to the cluster on the command line. This exercise will help you get familiar with this arrangement.

If you are using your own Google Cloud account, you'll need to specify the name of your Google Cloud project on the command line via the `-p` flag to `./magebench.py spawn`. For example, if your Google Cloud project is `myproject`, you should run the command below as `./magebench.py spawn -a 1 -g oregon -p myproject`. By default, the project name used is `rise-mage`, so you may prefer to create a Google Cloud project with that name; that way, you can run the commands below unmodified.

Run the following command:
```
$ ./magebench.py spawn -a 1 -g oregon
```
This command will spawn one virtual machine instance on Microsoft Azure and one virtual machine instance on Google Cloud. Microsoft Azure instances are always in US West 2 (Oregon); the Google Cloud instance is in `us-west1` (as indicated by the `oregon` CLI argument). Then, it will wait for a few minutes for the virtual machines to boot. After that, it will run scripts (called _provisioning_) to establish shared CKKS secrets across the machines (so that they can work together to perform a computation using CKKS) and generate configuration files for experiments using the machines' IP addresses. There is also a `./magebench.py provision` command, but you do not normally need to run it because `magebench.py spawn` will provision the machines already.
This command will spawn one virtual machine instance on Microsoft Azure and one virtual machine instance on Google Cloud. Microsoft Azure instances are always in US West 2 (Oregon); the Google Cloud instance is in `us-west1` (as indicated by the `oregon` CLI argument). Then, it will wait for a few minutes for the virtual machines to boot. After that, it will run scripts (called _provisioning_) to install MAGE, establish shared CKKS secrets across the machines (so that they can work together to perform a computation using CKKS), and generate configuration files for experiments using the machines' IP addresses. There is also a `./magebench.py provision` command, but you do not normally need to run it because `magebench.py spawn` already provisions the machines.
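If provisioning fails partway through (for example, due to a transient network error), you can re-run it on the existing cluster without re-spawning the machines. A minimal sketch, assuming `provision` accepts the same repository and checkout flags that `spawn` does (check `./magebench.py provision --help` to confirm):
```
$ ./magebench.py provision -r https://github.com/ucbrise/mage -c main
```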

Once you've spawned the cluster, you'll notice that a new file, `cluster.json`, has been created. This file allows `magebench.py` to keep track of the resources it has allocated, including the IP addresses of the virtual machines, so that it can interact with them. If you're curious, you can use Python to pretty-print `cluster.json`, which will produce output similar to this:
```
@@ -134,14 +136,18 @@ If you want to take a break, or if you're done for the day, you should deallocat
```
$ ./magebench.py deallocate
```
This will free all of the resources associated with the cluster and delete the `cluster.json` file. _You should make sure not to move, rename, or delete the `cluster.json` file before running `./magebench.py deallocate`._ If you do, `magebench.py` won't know how to contact the machines in the cluster. A copy of `cluster.json` is placed in the home directory of the user `mage` of each machine in the cluster. If you accidentally lose the `cluster.json` file, but still know the IP address of one of the machines, you can recover `cluster.json` by using `scp`. Barring that, you can delete the cluster and start over by running `./magebench.py purge`, passing the same command line arguments that were passed to `./magebench.py spawn`. For example, if you accidentally lost the `cluster.json` file in the above example, you still could delete the cluster by running `./magebench.py purge -a 1 -g oregon`.
This will free all of the resources associated with the instances and delete the `cluster.json` file. _You should make sure not to move, rename, or delete the `cluster.json` file before running `./magebench.py deallocate`._ If you do, `magebench.py` won't know how to contact the machines in the cluster. A copy of `cluster.json` is placed in the home directory of the user `mage` of each machine in the cluster. If you accidentally lose the `cluster.json` file, but still know the IP address of one of the machines, you can recover `cluster.json` by using `scp`. Barring that, you can delete the cluster and start over by running `./magebench.py purge`, passing the same command line arguments that were passed to `./magebench.py spawn`. For example, if you accidentally lost the `cluster.json` file in the above example, you still could delete the cluster by running `./magebench.py purge -a 1 -g oregon`.
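To recover `cluster.json` with `scp` as described above, assuming one of the machines has public IP address 203.0.113.5 (a placeholder used here for illustration):
```
$ scp mage@203.0.113.5:~/cluster.json .
```
This works because a copy of `cluster.json` is placed in the `mage` user's home directory on each machine.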

The one resource that isn't freed is a firewall rule called `mage-wan` that is created for the Google Cloud instances, but isn't associated with any single instance. It's free of charge (according to https://cloud.google.com/vpc/pricing#firewall-rules), so there's no harm in keeping it alive. If you want to delete it, you'll need to do so manually.
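For example, using the `gcloud` CLI (assuming your active configuration points at the project containing the rule), deleting it would look like this:
```
$ gcloud compute firewall-rules delete mage-wan
```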

When you run benchmarks using the `magebench.py` tool, the log files containing the measurements are stored on the virtual machines themselves. **Thus, you should copy the log files to the machine where you are running `./magebench.py` before deallocating the cluster.** The following command will copy the logs from each node in the cluster to a directory called `logs` on the local machine:
```
$ ./magebench.py fetch-logs
```
Once you have the logs locally, you can use an IPython notebook to generate figures in the same form as the ones in the OSDI paper. Run `jupyter notebook` and open `graphs.ipynb` to do this.
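For example, from the repository directory where the `logs` folder was created:
```
$ jupyter notebook graphs.ipynb
```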

**Note on advanced usage:** The above command installs MAGE from scratch on each node, which is robust but takes a long time. The `-i` option (as in, `./magebench.py spawn -a 1 -g oregon -i`) uses a pre-installed image instead, which is faster. Unfortunately, there isn't an easy way to make an image public in Azure, so this optimization won't work for you unless I've shared the corresponding images with you. If you're interested, there are additional command-line flags you can use, which you can find at the bottom of `magebench.py`.
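For example, if the images have been shared with you and your Google Cloud project is `myproject` (a placeholder), you could combine the flags as follows:
```
$ ./magebench.py spawn -a 1 -g oregon -i -p myproject
```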

A Simple, Guided Example (15 minutes working, 20 minutes waiting)
-----------------------------------------------------------------
Now that you're able to spawn and deallocate a cluster, let's walk through a simple example in which you run an experiment and generate a plot using the provided scripts.
21 changes: 16 additions & 5 deletions azure_cloud.py
@@ -9,7 +9,7 @@

SUBSCRIPTION_ID = "a8bdae60-f431-4620-bf0a-fad96eb36ca4"
LOCATION = "westus2"
IMAGE_ID = "/subscriptions/a8bdae60-f431-4620-bf0a-fad96eb36ca4/resourceGroups/MAGE-2/providers/Microsoft.Compute/images/mage-deps-v7"
MAGE_IMAGE_ID = "/subscriptions/a8bdae60-f431-4620-bf0a-fad96eb36ca4/resourceGroups/MAGE-2/providers/Microsoft.Compute/images/mage-deps-v7"

credential = DefaultAzureCredential()

@@ -22,7 +22,7 @@
ip_name = lambda cluster_name, instance_id: vm_name(cluster_name, instance_id) + "-ip"
nic_name = lambda cluster_name, instance_id: vm_name(cluster_name, instance_id) + "-nic"

def spawn_cluster(c, name, count, disk_layout_name, use_large_work_disk = False, subscription_id = SUBSCRIPTION_ID, location = LOCATION, image_id = IMAGE_ID):
def spawn_cluster(c, name, count, image_name, disk_layout_name, use_large_work_disk = False, subscription_id = SUBSCRIPTION_ID, location = LOCATION):
cloud_init_file = "cloud-init-azure.yaml"
if disk_layout_name == "paired-noswap":
cloud_init_file = "cloud-init-azure-paired.yaml"
@@ -147,6 +147,18 @@ def spawn_vm(_, id):
}
})

if image_name == "mage":
image_reference = {
"id": MAGE_IMAGE_ID
}
else:
image_reference = {
"publisher": "canonical",
"offer": "0001-com-ubuntu-server-focal",
"sku": "20_04-lts",
"version": "latest"
}

poller = compute_client.virtual_machines.begin_create_or_update(resource_group, vm_name(name, id),
{
"location": location,
@@ -155,9 +167,7 @@ def spawn_vm(_, id):
"vm_size": "Standard_D16d_v4"
},
"storage_profile": {
"image_reference": {
"id": IMAGE_ID
},
"image_reference": image_reference,
"data_disks": data_disks
},
"os_profile": {
@@ -189,6 +199,7 @@ def spawn_vm(_, id):
c.machines[id].vm_name = vm_result.name
c.machines[id].disk_name = vm_result.storage_profile.os_disk.name
c.machines[id].provider = "azure"
c.machines[id].image_name = image_name

c.for_each_concurrently(spawn_vm, range(count))

8 changes: 4 additions & 4 deletions cloud.py
@@ -3,7 +3,7 @@
import azure_cloud
import google_cloud

def spawn_cluster(name, num_lan_machines, use_large_work_disk, setup, *wan_machine_locations):
def spawn_cluster(name, num_lan_machines, image_name, use_large_work_disk, setup, project_gcloud, *wan_machine_locations):
num_wan_machines = len(set(wan_machine_locations))
if len(wan_machine_locations) != num_wan_machines:
print("Some WAN locations are repeated")
@@ -31,7 +31,7 @@ def spawn_cluster(name, num_lan_machines, use_large_work_disk, setup, *wan_machi
def init(_, id):
if id == 0 and num_lan_machines > 0:
# Initializes all machines from indices 0 to num_lan_machines - 1
azure_cloud.spawn_cluster(c, name, num_lan_machines, setup, use_large_work_disk)
azure_cloud.spawn_cluster(c, name, num_lan_machines, image_name, setup, use_large_work_disk)
elif id >= num_lan_machines:
if setup in ("paired-swap", "paired-noswap"):
wan_index = (id // num_lan_machines) - 1
@@ -42,15 +42,15 @@ def init(_, id):
gcp_instance_name = "{0}-{1}-{2}".format(name, wan_location, location_id)
if (id % num_lan_machines) == 0:
c.location_to_id[wan_location] = id
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highmem-4", 2, setup, *region_zone)
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highmem-4", 2, image_name, setup, *region_zone, project_gcloud)
else:
wan_index = id - num_lan_machines
wan_location = wan_machine_locations[wan_index]
region_zone = gcp_locations[wan_index]
with gcp_lock:
gcp_instance_name = "{0}-{1}".format(name, wan_location)
c.location_to_id[wan_location] = id
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highcpu-2", 1, setup, *region_zone)
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highcpu-2", 1, image_name, setup, *region_zone, project_gcloud)

c.for_each_concurrently(init)
c.num_lan_machines = num_lan_machines
2 changes: 2 additions & 0 deletions cluster.py
@@ -15,6 +15,8 @@ def __init__(self):
self.wdisk_name = None
self.gcp_zone = None
self.provider = None
self.image_name = None
self.gcp_firewall_rule = None

def as_dict(self):
return dict(self.__dict__)
26 changes: 22 additions & 4 deletions google_cloud.py
@@ -3,6 +3,7 @@
import cluster

GCP_PROJECT = "rise-mage"
GCP_FIREWALL_RULE = "mage-wan"

oregon = ("us-west1", "b")
iowa = ("us-central1", "b")
@@ -36,7 +37,7 @@ def wait_for_operation(compute, project, zone, operation):
time.sleep(1)

# Reference: https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, region, zone_letter, project_name = GCP_PROJECT):
def spawn_instance(m, name, instance_type, num_local_ssds, image_name, disk_layout_name, region, zone_letter, project_name = GCP_PROJECT):
cloud_init_file = "cloud-init-gcp.yaml"
if disk_layout_name == "paired-noswap":
cloud_init_file = "cloud-init-gcp-paired.yaml"
@@ -49,7 +50,23 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
target_zone = "{0}-{1}".format(region, zone_letter)
# Based on https://cloud.google.com/compute/docs/tutorials/python-guide

image_response = compute.images().getFromFamily(project = project_name, family = "mage-deps").execute()
if image_name == "mage":
image_response = compute.images().getFromFamily(project = project_name, family = "mage-deps").execute()
image_link = image_response["selfLink"]
else:
image_response = compute.images().getFromFamily(project = "ubuntu-os-cloud", family = "ubuntu-2004-lts").execute()
image_link = image_response["selfLink"]
#image_link = "projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210623"

# Create firewall rule for inbound WAN traffic if it does not already exist
prev_firewalls = compute.firewalls().list(project = project_name, filter = "name = {0}".format(GCP_FIREWALL_RULE)).execute()
if "items" not in prev_firewalls or len(prev_firewalls["items"]) == 0:
compute.firewalls().insert(project = project_name, body = {
"name": GCP_FIREWALL_RULE,
"description": "Allow inbound TCP connections for wide-are network experiments to benchmark MAGE.",
"targetTags": [GCP_FIREWALL_RULE],
"allowed": [{"IPProtocol": "tcp", "ports": ["57000-57999"]}]
}).execute()

local_ssds = []
for i in range(num_local_ssds):
@@ -88,7 +105,7 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
},
"tags": {
"items": [
"mage-wan" # Associates this instance with a firewall rule that allows inbound TCP connections on the relevant ports
GCP_FIREWALL_RULE # Associates this instance with the firewall rule created above
]
},
"disks": [
@@ -100,7 +117,7 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
"autoDelete": True,
"deviceName": name,
"initializeParams": {
"sourceImage": image_response["selfLink"],
"sourceImage": image_link,
"diskType": "projects/{0}/zones/{1}/diskTypes/pd-standard".format(project_name, target_zone),
"diskSizeGb": "10",
},
@@ -171,6 +188,7 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
m.disk_name = info["disks"][0]["deviceName"]
m.gcp_zone = target_zone
m.provider = "gcloud"
m.image_name = image_name

def deallocate_instance(m, project_name = GCP_PROJECT):
deallocate_instance_by_info(m.gcp_zone, m.vm_name, project_name)
9 changes: 7 additions & 2 deletions magebench.py
@@ -39,7 +39,10 @@ def generate_ckks_keys(c):

def provision_cluster(c, repository, checkout):
def provision_machine(machine, id):
remote.exec_script(machine.public_ip_address, "./scripts/provision.sh", "{0} {1} {2} {3}".format(machine.provider, c.setup, repository, checkout))
if machine.image_name != "mage":
remote.exec_script(machine.public_ip_address, "./scripts/install_deps.sh", "--install-mage-deps --install-utils --setup-wan-tcp")
remote.exec_script(machine.public_ip_address, "./scripts/provision.sh", "{0} {1}".format(machine.provider, c.setup))
remote.exec_script(machine.public_ip_address, "./scripts/setup_code.sh", "{0} {1} {2}".format(machine.image_name, repository, checkout))
remote.copy_to(machine.public_ip_address, False, "./cluster.json", "~")
if id < c.num_lan_machines:
remote.exec_script(machine.public_ip_address, "./scripts/generate_configs.py", "~/cluster.json {0} lan ~/config {1}".format(id, "true" if c.setup.startswith("paired") else "false"))
@@ -70,7 +73,7 @@ def spawn(args):
if args.name == "":
args.name = "mage-{0}".format(socket.gethostname())
print("Spawning cluster...")
c = cloud.spawn_cluster(args.name, args.azure_machine_count, args.large_work_disk, args.wan_setup, *args.gcloud_machine_locations)
c = cloud.spawn_cluster(args.name, args.azure_machine_count, "mage" if args.image else "ubuntu", args.large_work_disk, args.wan_setup, args.project_gcloud, *args.gcloud_machine_locations)
c.save_to_file("cluster.json")
print("Waiting three minutes for the machines to start up...")
time.sleep(180)
@@ -272,6 +275,8 @@ def fetch_logs_from(machine, id):
parser_spawn.add_argument("-s", "--wan-setup", default = "regular")
parser_spawn.add_argument("-r", "--repository", default = "https://github.com/ucbrise/mage")
parser_spawn.add_argument("-c", "--checkout", default = "main")
parser_spawn.add_argument("-i", "--image", action = "store_true")
parser_spawn.add_argument("-p", "--project-gcloud", default = "rise-mage")
parser_spawn.set_defaults(func = spawn)

parser_provision = subparsers.add_parser("provision")
2 changes: 2 additions & 0 deletions scripts/generate_configs.py
@@ -184,6 +184,8 @@ def generate_paired_wan_config_dict(protocol, scenario, num_workers_per_party, i
for ot_pipeline_depth in tuple(2 ** i for i in range(9)):
for ot_num_daemons in tuple(2 ** i for i in range(9)):
ot_params = (ot_pipeline_depth, ot_num_daemons)
if scenario == "max":
continue # Max not needed here, but we can add support for it if needed
config_dict = generate_wan_config_dict(protocol, scenario, party_size, azure_id, gcloud_id, cluster, *ot_params)
output_path = os.path.join(output_dir_path, "config_halfgates_{0}_{1}_{2}.yaml".format(party_size, ot_pipeline_depth, ot_num_daemons))
with open(output_path, "w") as f: