Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing start, stop with ec2.py, adding validations #1191

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

abuabraham-ttd
Copy link
Contributor

@abuabraham-ttd abuabraham-ttd commented Dec 9, 2024

Changes enclave start script to Python, to improve readability, remove duplications, and add more validations while starting enclave.

  • Removed unused configurations
    HOSTNAME, IDENTITY_SCOPE, CORE + OPTOUT_URLS fetched from userdata
  • Removes duplications in user-data parsing to get secret manger key name
  • Removes unused default cpu, memory and added values from nitro-enclave config to get right value
  • Remove read_allocation, update_allocation as it is unused
  • Adds typing to config_server, to improve readability
  • Kills only valid processes, added flask as it was missing earlier
  • Requires core, optout urls to be specified in secret manager

Testing

Code is tested by creating an EC2 instance using "ami-0df2894e3f4971bda" (the recently published AMI) and copying the python files onto the instance and starting with python files.

python3 -m venv /opt/uid2operator/init
nano requirements.txt and copy the requirements.txt there 
/opt/uid2operator/init/bin/pip install -r requirements.txt 
nano /opt/uid2operator/confidential_compute.py copy and save contents 
nano /opt/uid2operator/ec2.py copy and save contents
nano /etc/systemd/system/uid2operator.service update contents and save
sudo systemctl daemon-reload
sudo systemctl restart uid2operator

and

curl http://127.0.0.1:80/ops/healthcheck

You can change the values in secret manager to see errors thrown based on validations added

Sample run

Running in us-east-1
Fetched configs from uid2-config-stack-tjm-unvalidate-eif-test1
Validated https://core-integ.uidapi.com/ matches other config parameters
Validated https://optout-integ.uidapi.com/ matches other config parameters
Validated operator key matches environment
Validated connectivity to https://core-integ.uidapi.com/
Validated connectivity to https://optout-integ.uidapi.com/
Completed static validation of confidential compute config values
Running command: /usr/bin/vsockpx -c /etc/uid2operator/proxy.yaml --workers 5 --log-level 3 --daemon
Running command: ./bin/flask run --host 127.0.0.1 --port 27015
Running command: sockd -D
Connecting to config server, attempt 1 failed with ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=27015): Max retries exceeded with url: /getConfig (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efebaadfca0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Config server is reachable
Connectivity check to config server passes
Running in debug_mode
Running command: nitro-cli run-enclave --eif-path /opt/uid2operator/uid2operator.eif --memory 24576 --cpu-count 6 --enclave-cid 42 --enclave-name uid2operator --debug-mode --attach-console
Start allocating memory...

@abuabraham-ttd abuabraham-ttd changed the title Removing start, stop with ec2.py, adding validations DRAFT: Removing start, stop with ec2.py, adding validations Dec 9, 2024
@abuabraham-ttd abuabraham-ttd marked this pull request as draft December 9, 2024 22:25
@abuabraham-ttd abuabraham-ttd changed the title DRAFT: Removing start, stop with ec2.py, adding validations Removing start, stop with ec2.py, adding validations Dec 9, 2024
scripts/aws/ec2.py Outdated Show resolved Hide resolved
@abuabraham-ttd abuabraham-ttd force-pushed the abu-UID2-4555-EC2-improvements branch from 8f297b5 to 2542232 Compare December 10, 2024 19:42
@abuabraham-ttd abuabraham-ttd marked this pull request as ready for review December 10, 2024 20:59
scripts/aws/ec2.py Show resolved Hide resolved
scripts/aws/ec2.py Show resolved Hide resolved
scripts/aws/ec2.py Outdated Show resolved Hide resolved
self.__setup_vsockproxy(log_level)
self.__run_config_server()
self.__run_socks_proxy()
time.sleep(5) #TODO: Change to while loop if required.
Copy link
Contributor

@sunnywu sunnywu Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to sleep here? doesn't subprocess.run already run the command synchronously and will wait for its completion? For config server, it would be run as separate process so will never need to wait for it anyway?

or is this to wait for config server to startup so you could do validations? if so, you might wanna consider making it more robust (are we 100% sure 5 seconds wait is enough?), or at the very least, logs something during this 5 second wait to inform the customer this script is still running but waiting for something.

or maybe it should be in a loop and every 5 seconds the validation script will try to connect to the server until it's successful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is this to wait for config server to startup so you could do validations? i

Yes. we can "Change to while loop if required. " as mentioned in comment.

scripts/aws/ec2.py Show resolved Hide resolved
"""Sets up the necessary auxiliary services and configuration."""
self.configs = self._get_secret(self.__get_secret_name_from_userdata())
self.validate_configuration()
log_level = 3 if self.configs["debug_mode"] else 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this right translation from the old bash script:

VSOCK_LOG_LEVEL=${VSOCK_LOG_LEVEL:-3}
?
(granted don't quite understand this bash script)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if VSOCK_LOG_LEVEL ? then VSOCK_LOG_LEVEL else 3

there is no way to set VSOCK_LOG_LEVEL though.

So I added if debug_mode ? then set to 3 otherwise 1.

In future ticket, we will be adding more tracing on debug mode.

Plan is to evaluate adding

echo 1 > /sys/kernel/debug/tracing/events/vsock/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on

scripts/aws/ec2.py Outdated Show resolved Hide resolved
setup_vsockproxy
setup_dante
run_config_server
run_enclave
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought this is very clear in terms of what are the actual steps and now most of this is grouped under _setup_auxiliaries in py script which isn't necessarily as just easy to read what are the steps comparing to this.

i don't have any suggestion how to improve but food for thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abstracted to _setup_auxiliaries which has

self.__setup_vsockproxy(log_level)
self.__run_config_server()
self.__run_socks_proxy()

and run_compute to run compute.

I went on that route, as developers can easily compare and understand differences between Cloud provider implementation.

As in, _setup_auxiliaries will setup all required processes for starting enclave.

If you need to know what Azure needs, just look into _setup_auxiliaries in Azure.py

This also enables the consistently phased rollout

scripts/aws/ec2.py Outdated Show resolved Hide resolved
self._validate_auxiliaries()
command = [
"nitro-cli", "run-enclave",
"--eif-path", "/opt/uid2operator/uid2operator.eif",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be helpful for debugging to have the ability to pass this in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

We need to build an EIF and register the EIF in core. Why not just use the ami build along with EIF? otherwise we have to copy the registered EIF and run that by SSH ing into the host right?

scripts/aws/ec2.py Outdated Show resolved Hide resolved
scripts/aws/ec2.py Outdated Show resolved Hide resolved
scripts/aws/ec2.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants