Add support for operating without pre-installed cloud image
samkumar committed Jun 28, 2021
1 parent bfa8267 commit 11c6ed6
Showing 10 changed files with 167 additions and 27 deletions.
10 changes: 8 additions & 2 deletions README.md
@@ -81,11 +81,13 @@ We use the term _cluster_ to mean a group of (virtual) machines that are used to

One can use `magebench.py` to spawn a cluster, with a particular configuration passed to the cluster on the command line. This exercise will help you get familiar with this arrangement.

If you are using your own Google Cloud account, you'll need to specify the name of your Google Cloud project on the command line via the `-p` flag to `./magebench.py spawn`. For example, if your Google Cloud project is `myproject`, you should run the command below as `./magebench.py spawn -a 1 -g oregon -p myproject`. By default, the project name used is `rise-mage`, so you may prefer to create a Google Cloud project with that name; that way, you can run the commands below unmodified.

Run the following command:
```
$ ./magebench.py spawn -a 1 -g oregon
```
This command will spawn one virtual machine instance on Microsoft Azure and one virtual machine instance on Google Cloud. Microsoft Azure instances are always in US West 2 (Oregon); the Google Cloud instance is in `us-west1` (as indicated by the `oregon` CLI argument). Then, it will wait for a few minutes for the virtual machines to boot. After that, it will run scripts (called _provisioning_) to establish shared CKKS secrets across the machines (so that they can work together to perform a computation using CKKS) and generate configuration files for experiments using the machines' IP addresses. There is also a `./magebench.py provision` command, but you do not normally need to run it because `magebench.py spawn` will provision the machines already.
This command will spawn one virtual machine instance on Microsoft Azure and one virtual machine instance on Google Cloud. Microsoft Azure instances are always in US West 2 (Oregon); the Google Cloud instance is in `us-west1` (as indicated by the `oregon` CLI argument). Then, it will wait for a few minutes for the virtual machines to boot. After that, it will run scripts (called _provisioning_) to install MAGE, establish shared CKKS secrets across the machines (so that they can work together to perform a computation using CKKS), and generate configuration files for experiments using the machines' IP addresses. There is also a `./magebench.py provision` command, but you do not normally need to run it because `magebench.py spawn` already provisions the machines.
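If provisioning fails partway through (for example, due to a transient network error), you can re-run it on the existing cluster without re-spawning the machines. A minimal sketch, assuming `provision` accepts the same repository and checkout flags that `spawn` does (check `./magebench.py provision --help` to confirm):
```
$ ./magebench.py provision -r https://github.com/ucbrise/mage -c main
```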

Once you've spawned the cluster, you'll notice that a new file, `cluster.json`, has been created. This file allows `magebench.py` to keep track of the resources it has allocated, including the IP addresses of the virtual machines, so that it can interact with them. If you're curious, you can use Python to pretty-print `cluster.json`, which will produce output similar to this:
```
@@ -134,14 +136,18 @@ If you want to take a break, or if you're done for the day, you should deallocat
```
$ ./magebench.py deallocate
```
This will free all of the resources associated with the cluster and delete the `cluster.json` file. _You should make sure not to move, rename, or delete the `cluster.json` file before running `./magebench.py deallocate`._ If you do, `magebench.py` won't know how to contact the machines in the cluster. A copy of `cluster.json` is placed in the home directory of the user `mage` of each machine in the cluster. If you accidentally lose the `cluster.json` file, but still know the IP address of one of the machines, you can recover `cluster.json` by using `scp`. Barring that, you can delete the cluster and start over by running `./magebench.py purge`, passing the same command line arguments that were passed to `./magebench.py spawn`. For example, if you accidentally lost the `cluster.json` file in the above example, you still could delete the cluster by running `./magebench.py purge -a 1 -g oregon`.
This will free all of the resources associated with the instances and delete the `cluster.json` file. _You should make sure not to move, rename, or delete the `cluster.json` file before running `./magebench.py deallocate`._ If you do, `magebench.py` won't know how to contact the machines in the cluster. A copy of `cluster.json` is placed in the home directory of the user `mage` of each machine in the cluster. If you accidentally lose the `cluster.json` file, but still know the IP address of one of the machines, you can recover `cluster.json` by using `scp`. Barring that, you can delete the cluster and start over by running `./magebench.py purge`, passing the same command line arguments that were passed to `./magebench.py spawn`. For example, if you accidentally lost the `cluster.json` file in the above example, you still could delete the cluster by running `./magebench.py purge -a 1 -g oregon`.
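To recover `cluster.json` with `scp` as described above, assuming one of the machines has public IP address 203.0.113.5 (a placeholder used here for illustration):
```
$ scp mage@203.0.113.5:~/cluster.json .
```
This works because a copy of `cluster.json` is placed in the `mage` user's home directory on each machine.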

The one resource that isn't freed is a firewall rule called `mage-wan` that is created for the Google Cloud instances, but isn't associated with any single instance. It's free of charge (according to https://cloud.google.com/vpc/pricing#firewall-rules), so there's no harm in keeping it alive. If you want to delete it, you'll need to do so manually.
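For example, using the `gcloud` CLI (assuming your active configuration points at the project containing the rule), deleting it would look like this:
```
$ gcloud compute firewall-rules delete mage-wan
```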

When you run benchmarks using the `magebench.py` tool, the log files containing the measurements are stored on the virtual machines themselves. **Thus, you should copy the log files to the machine where you are running `./magebench.py` before deallocating the cluster.** The following command will copy the logs from each node in the cluster to a directory called `logs` on the local machine:
```
$ ./magebench.py fetch-logs
```
Once you have the logs locally, you can use an IPython notebook to generate figures in the same form as the ones in the OSDI paper. Run `jupyter notebook` and open `graphs.ipynb` to do this.
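For example, from the repository directory where the `logs` folder was created:
```
$ jupyter notebook graphs.ipynb
```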

**Note on advanced usage:** The above command installs MAGE from scratch on each node, which is robust but takes a long time. The `-i` option (as in, `./magebench.py spawn -a 1 -g oregon -i`) uses a pre-installed image instead, which is faster. Unfortunately, there isn't an easy way to make an image public in Azure, so this optimization won't work for you unless I've shared the corresponding images with you. If you're interested, there are additional command-line flags you can use, which you can find at the bottom of `magebench.py`.
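For example, if the images have been shared with you and your Google Cloud project is `myproject` (a placeholder), you could combine the flags as follows:
```
$ ./magebench.py spawn -a 1 -g oregon -i -p myproject
```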

A Simple, Guided Example (15 minutes working, 20 minutes waiting)
-----------------------------------------------------------------
Now that you're able to spawn and deallocate a cluster, let's walk through a simple example in which you run an experiment and generate a plot using the provided scripts.
21 changes: 16 additions & 5 deletions azure_cloud.py
@@ -9,7 +9,7 @@

SUBSCRIPTION_ID = "a8bdae60-f431-4620-bf0a-fad96eb36ca4"
LOCATION = "westus2"
IMAGE_ID = "/subscriptions/a8bdae60-f431-4620-bf0a-fad96eb36ca4/resourceGroups/MAGE-2/providers/Microsoft.Compute/images/mage-deps-v7"
MAGE_IMAGE_ID = "/subscriptions/a8bdae60-f431-4620-bf0a-fad96eb36ca4/resourceGroups/MAGE-2/providers/Microsoft.Compute/images/mage-deps-v7"

credential = DefaultAzureCredential()

@@ -22,7 +22,7 @@
ip_name = lambda cluster_name, instance_id: vm_name(cluster_name, instance_id) + "-ip"
nic_name = lambda cluster_name, instance_id: vm_name(cluster_name, instance_id) + "-nic"

def spawn_cluster(c, name, count, disk_layout_name, use_large_work_disk = False, subscription_id = SUBSCRIPTION_ID, location = LOCATION, image_id = IMAGE_ID):
def spawn_cluster(c, name, count, image_name, disk_layout_name, use_large_work_disk = False, subscription_id = SUBSCRIPTION_ID, location = LOCATION):
cloud_init_file = "cloud-init-azure.yaml"
if disk_layout_name == "paired-noswap":
cloud_init_file = "cloud-init-azure-paired.yaml"
@@ -147,6 +147,18 @@ def spawn_vm(_, id):
}
})

if image_name == "mage":
image_reference = {
"id": MAGE_IMAGE_ID
}
else:
image_reference = {
"publisher": "canonical",
"offer": "0001-com-ubuntu-server-focal",
"sku": "20_04-lts",
"version": "latest"
}

poller = compute_client.virtual_machines.begin_create_or_update(resource_group, vm_name(name, id),
{
"location": location,
@@ -155,9 +167,7 @@ def spawn_vm(_, id):
"vm_size": "Standard_D16d_v4"
},
"storage_profile": {
"image_reference": {
"id": IMAGE_ID
},
"image_reference": image_reference,
"data_disks": data_disks
},
"os_profile": {
@@ -189,6 +199,7 @@ def spawn_vm(_, id):
c.machines[id].vm_name = vm_result.name
c.machines[id].disk_name = vm_result.storage_profile.os_disk.name
c.machines[id].provider = "azure"
c.machines[id].image_name = image_name

c.for_each_concurrently(spawn_vm, range(count))

8 changes: 4 additions & 4 deletions cloud.py
@@ -3,7 +3,7 @@
import azure_cloud
import google_cloud

def spawn_cluster(name, num_lan_machines, use_large_work_disk, setup, *wan_machine_locations):
def spawn_cluster(name, num_lan_machines, image_name, use_large_work_disk, setup, project_gcloud, *wan_machine_locations):
num_wan_machines = len(set(wan_machine_locations))
if len(wan_machine_locations) != num_wan_machines:
print("Some WAN locations are repeated")
@@ -31,7 +31,7 @@ def spawn_cluster(name, num_lan_machines, use_large_work_disk, setup, *wan_machi
def init(_, id):
if id == 0 and num_lan_machines > 0:
# Initializes all machines from indices 0 to num_lan_machines - 1
azure_cloud.spawn_cluster(c, name, num_lan_machines, setup, use_large_work_disk)
azure_cloud.spawn_cluster(c, name, num_lan_machines, image_name, setup, use_large_work_disk)
elif id >= num_lan_machines:
if setup in ("paired-swap", "paired-noswap"):
wan_index = (id // num_lan_machines) - 1
@@ -42,15 +42,15 @@ def init(_, id):
gcp_instance_name = "{0}-{1}-{2}".format(name, wan_location, location_id)
if (id % num_lan_machines) == 0:
c.location_to_id[wan_location] = id
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highmem-4", 2, setup, *region_zone)
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highmem-4", 2, image_name, setup, *region_zone, project_gcloud)
else:
wan_index = id - num_lan_machines
wan_location = wan_machine_locations[wan_index]
region_zone = gcp_locations[wan_index]
with gcp_lock:
gcp_instance_name = "{0}-{1}".format(name, wan_location)
c.location_to_id[wan_location] = id
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highcpu-2", 1, setup, *region_zone)
google_cloud.spawn_instance(c.machines[id], gcp_instance_name, "n2-highcpu-2", 1, image_name, setup, *region_zone, project_gcloud)

c.for_each_concurrently(init)
c.num_lan_machines = num_lan_machines
2 changes: 2 additions & 0 deletions cluster.py
@@ -15,6 +15,8 @@ def __init__(self):
self.wdisk_name = None
self.gcp_zone = None
self.provider = None
self.image_name = None
self.gcp_firewall_rule = None

def as_dict(self):
return dict(self.__dict__)
26 changes: 22 additions & 4 deletions google_cloud.py
@@ -3,6 +3,7 @@
import cluster

GCP_PROJECT = "rise-mage"
GCP_FIREWALL_RULE = "mage-wan"

oregon = ("us-west1", "b")
iowa = ("us-central1", "b")
@@ -36,7 +37,7 @@ def wait_for_operation(compute, project, zone, operation):
time.sleep(1)

# Reference: https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, region, zone_letter, project_name = GCP_PROJECT):
def spawn_instance(m, name, instance_type, num_local_ssds, image_name, disk_layout_name, region, zone_letter, project_name = GCP_PROJECT):
cloud_init_file = "cloud-init-gcp.yaml"
if disk_layout_name == "paired-noswap":
cloud_init_file = "cloud-init-gcp-paired.yaml"
@@ -49,7 +50,23 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
target_zone = "{0}-{1}".format(region, zone_letter)
# Based on https://cloud.google.com/compute/docs/tutorials/python-guide

image_response = compute.images().getFromFamily(project = project_name, family = "mage-deps").execute()
if image_name == "mage":
image_response = compute.images().getFromFamily(project = project_name, family = "mage-deps").execute()
image_link = image_response["selfLink"]
else:
image_response = compute.images().getFromFamily(project = "ubuntu-os-cloud", family = "ubuntu-2004-lts").execute()
image_link = image_response["selfLink"]
#image_link = "projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210623"

# Create firewall rule for inbound WAN traffic if it does not already exist
prev_firewalls = compute.firewalls().list(project = project_name, filter = "name = {0}".format(GCP_FIREWALL_RULE)).execute()
if "items" not in prev_firewalls or len(prev_firewalls["items"]) == 0:
compute.firewalls().insert(project = project_name, body = {
"name": GCP_FIREWALL_RULE,
"description": "Allow inbound TCP connections for wide-are network experiments to benchmark MAGE.",
"targetTags": [GCP_FIREWALL_RULE],
"allowed": [{"IPProtocol": "tcp", "ports": ["57000-57999"]}]
}).execute()

local_ssds = []
for i in range(num_local_ssds):
@@ -88,7 +105,7 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
},
"tags": {
"items": [
"mage-wan" # Associates this instance with a firewall rule that allows inbound TCP connections on the relevant ports
GCP_FIREWALL_RULE # Associates this instance with the firewall rule created above
]
},
"disks": [
@@ -100,7 +117,7 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
"autoDelete": True,
"deviceName": name,
"initializeParams": {
"sourceImage": image_response["selfLink"],
"sourceImage": image_link,
"diskType": "projects/{0}/zones/{1}/diskTypes/pd-standard".format(project_name, target_zone),
"diskSizeGb": "10",
},
@@ -171,6 +188,7 @@ def spawn_instance(m, name, instance_type, num_local_ssds, disk_layout_name, reg
m.disk_name = info["disks"][0]["deviceName"]
m.gcp_zone = target_zone
m.provider = "gcloud"
m.image_name = image_name

def deallocate_instance(m, project_name = GCP_PROJECT):
deallocate_instance_by_info(m.gcp_zone, m.vm_name, project_name)
9 changes: 7 additions & 2 deletions magebench.py
@@ -39,7 +39,10 @@ def generate_ckks_keys(c):

def provision_cluster(c, repository, checkout):
def provision_machine(machine, id):
remote.exec_script(machine.public_ip_address, "./scripts/provision.sh", "{0} {1} {2} {3}".format(machine.provider, c.setup, repository, checkout))
if machine.image_name != "mage":
remote.exec_script(machine.public_ip_address, "./scripts/install_deps.sh", "--install-mage-deps --install-utils --setup-wan-tcp")
remote.exec_script(machine.public_ip_address, "./scripts/provision.sh", "{0} {1}".format(machine.provider, c.setup))
remote.exec_script(machine.public_ip_address, "./scripts/setup_code.sh", "{0} {1} {2}".format(machine.image_name, repository, checkout))
remote.copy_to(machine.public_ip_address, False, "./cluster.json", "~")
if id < c.num_lan_machines:
remote.exec_script(machine.public_ip_address, "./scripts/generate_configs.py", "~/cluster.json {0} lan ~/config {1}".format(id, "true" if c.setup.startswith("paired") else "false"))
@@ -70,7 +73,7 @@ def spawn(args):
if args.name == "":
args.name = "mage-{0}".format(socket.gethostname())
print("Spawning cluster...")
c = cloud.spawn_cluster(args.name, args.azure_machine_count, args.large_work_disk, args.wan_setup, *args.gcloud_machine_locations)
c = cloud.spawn_cluster(args.name, args.azure_machine_count, "mage" if args.image else "ubuntu", args.large_work_disk, args.wan_setup, args.project_gcloud, *args.gcloud_machine_locations)
c.save_to_file("cluster.json")
print("Waiting three minutes for the machines to start up...")
time.sleep(180)
@@ -272,6 +275,8 @@ def fetch_logs_from(machine, id):
parser_spawn.add_argument("-s", "--wan-setup", default = "regular")
parser_spawn.add_argument("-r", "--repository", default = "https://github.com/ucbrise/mage")
parser_spawn.add_argument("-c", "--checkout", default = "main")
parser_spawn.add_argument("-i", "--image", action = "store_true")
parser_spawn.add_argument("-p", "--project-gcloud", default = "rise-mage")
parser_spawn.set_defaults(func = spawn)

parser_provision = subparsers.add_parser("provision")
2 changes: 2 additions & 0 deletions scripts/generate_configs.py
@@ -184,6 +184,8 @@ def generate_paired_wan_config_dict(protocol, scenario, num_workers_per_party, i
for ot_pipeline_depth in tuple(2 ** i for i in range(9)):
for ot_num_daemons in tuple(2 ** i for i in range(9)):
ot_params = (ot_pipeline_depth, ot_num_daemons)
if scenario == "max":
continue # Max not needed here, but we can add support for it if needed
config_dict = generate_wan_config_dict(protocol, scenario, party_size, azure_id, gcloud_id, cluster, *ot_params)
output_path = os.path.join(output_dir_path, "config_halfgates_{0}_{1}_{2}.yaml".format(party_size, ot_pipeline_depth, ot_num_daemons))
with open(output_path, "w") as f: