Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ephemeral runners #302

Closed
wants to merge 11 commits into from
Closed

add ephemeral runners #302

wants to merge 11 commits into from

Conversation

DaMandal0rian
Copy link
Member

@DaMandal0rian DaMandal0rian commented Apr 5, 2024

User description

Terraform module Self-Hosted Scalable GitHub Actions runners on AWS based on the official Philips AWS Module. https://github.com/philips-labs/terraform-aws-github-runner under MIT License.

This PR removes some features and modules not required for our infra use case and makes use of more granular modules.

Scale up and down based on GitHub events
Scale down to zero when no jobs are running
Runners are created on-demand and terminated after use (ephemeral runners)
Runners are created on spot instances
use custom AMI, define the instance types and subnets to use.
OS support: Linux (x64/arm64) and Windows
CI workflow for deployment


Type

enhancement


Description

  • Added comprehensive variables for AWS configuration, GitHub app parameters, runner settings, and lambda functions.
  • Configured AWS resources including SQS queues, SSM parameters, webhook module, runners module, and logging.
  • Setup AWS launch template, security groups, IAM roles, and policies for runners.
  • Introduced configuration for runner binaries syncer lambda function and trigger.
  • Defined lambda function for managing runner pools with IAM roles, policies, and cloudwatch event rules.
  • Added variables and configuration for termination watcher lambda function.

Changes walkthrough

Relevant files
Configuration changes
variables.tf
Add Variables for Configuring AWS, GitHub App, Runners, and Lambdas

github-runners/terraform/autoscaling/variables.tf

  • Added a comprehensive list of variables for configuring AWS region,
    VPC, subnets, tags, GitHub app parameters, runner settings, and more.
  • Introduced variables for configuring ephemeral runners, including
    enabling ephemeral runners, job queued check, and managed runner
    security group.
  • Added variables for configuring lambda functions, including memory
    size, timeout, and S3 bucket details for lambda functions.
  • Included variables for AMI configuration, instance types, and runner
    OS support.
  • +886/-0 
    main.tf
    Configure AWS Resources, Runners, and Lambda Functions     

    github-runners/terraform/autoscaling/main.tf

  • Setup for AWS resources including SQS queues, SSM parameters, webhook
    module, and runners module.
  • Configuration for ephemeral runners and pool management.
  • Defined resources for logging, IAM roles, and lambda functions.
  • +394/-0 
    main.tf
    Setup AWS Launch Template and Security Groups for Runners

    github-runners/terraform/autoscaling/modules/runners/main.tf

  • Defined AWS launch template for runners with configurations for AMI,
    instance type, security groups, and user data.
  • Setup security groups for runners with egress rules.
  • Configuration for IAM roles and policies for runners.
  • +229/-0 
    variables.tf
    Add Variables for Runner Binaries Syncer Lambda Configuration

    github-runners/terraform/autoscaling/modules/runner-binaries-syncer/variables.tf

  • Introduced variables for configuring the runner binaries syncer lambda
    function.
  • Variables include configurations for S3 bucket, lambda function
    details, and logging.
  • +238/-0 
    runner-binaries-syncer.tf
    Configure Runner Binaries Syncer Lambda and Trigger           

    github-runners/terraform/autoscaling/modules/runner-binaries-syncer/runner-binaries-syncer.tf

  • Setup for runner binaries syncer lambda function including IAM role,
    policies, and cloudwatch event rule.
  • Configuration for S3 object trigger to execute the lambda after
    deployment.
  • +191/-0 
    main.tf
    Setup Lambda Function for Runner Pool Management                 

    github-runners/terraform/autoscaling/modules/runners/pool/main.tf

  • Defined AWS lambda function for managing runner pools.
  • Setup IAM roles and policies for the lambda function.
  • Configuration for cloudwatch event rules to trigger the lambda based
    on schedule.
  • +179/-0 
    variables.tf
    Add Variables for Termination Watcher Lambda Configuration

    github-runners/terraform/autoscaling/modules/termination-watcher/variables.tf

  • Introduced variables for configuring the termination watcher lambda
    function.
  • Variables include configurations for lambda function details, logging,
    and environment variables.
  • +65/-0   

    PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    @github-actions github-actions bot added the enhancement New feature or request label Apr 5, 2024
    Copy link

    github-actions bot commented Apr 5, 2024

    PR Description updated to latest commit (2656d95)

    Copy link

    github-actions bot commented Apr 5, 2024

    PR Review

    ⏱️ Estimated effort to review [1-5]

    5, due to the extensive number of files and changes involved in this PR, including multiple Terraform modules, Packer configurations, and Lambda function implementations. The complexity and interdependencies between these components significantly increase the review effort required to ensure correctness, security, and adherence to best practices.

    🧪 Relevant tests

    No

    🔍 Possible issues

    Possible Security Concern: The PR introduces numerous AWS resources and Lambda functions. Without proper IAM policies and security group configurations, there might be potential security risks, such as overly permissive access or unintended exposure of resources.

    Possible Performance Issue: The use of ephemeral runners and on-demand scaling based on GitHub events can lead to unpredictable costs and performance implications under high load or misconfiguration.

    Dependency Management: The PR includes external dependencies (e.g., GitHub Actions runner binaries, Docker, AWS CLI). Changes in these dependencies could impact the stability and functionality of the setup.

    🔒 Security concerns

    - IAM Policies: Ensure that IAM roles and policies assigned to Lambda functions and EC2 instances follow the principle of least privilege to minimize security risks.

    • Public Exposure: Verify that security groups and network access controls are correctly configured to prevent unauthorized access to resources.
    Code feedback:
    relevant filegithub-runners/terraform/autoscaling/modules/runners/main.tf
    suggestion      

    Consider adding lifecycle hooks to the aws_launch_template resource to manage updates and replacements more gracefully. This can help in ensuring that instances are only replaced according to the defined policies, minimizing disruptions. [important]

    relevant lineresource "aws_launch_template" "runner" {

    relevant filegithub-runners/terraform/autoscaling/modules/runner-binaries-syncer/runner-binaries-syncer.tf
    suggestion      

    Implement error handling and retry mechanisms in the Lambda function responsible for syncing runner binaries. This will improve the robustness of the setup by handling transient errors that may occur during the download or update process. [important]

    relevant lineresource "aws_lambda_function" "syncer" {

    relevant filegithub-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy/github_agent.ubuntu.pkr.hcl
    suggestion      

    Optimize the Packer build by using a more specific source AMI filter to reduce build times and potential costs. Narrowing down the search to a more recent base AMI can also ensure that the runners are built on a more secure and updated OS. [medium]

    relevant linesource_ami_filter {

    relevant filegithub-runners/terraform/autoscaling/modules/runners/pool/main.tf
    suggestion      

    Add monitoring and alerting for the Lambda function that adjusts the runner pool size. Monitoring metrics such as invocation errors, duration, and throttles can help in identifying issues early and maintaining the desired state of runner pools. [medium]

    relevant lineresource "aws_lambda_function" "pool" {


    ✨ Review tool usage guide:

    Overview:
    The review tool scans the PR code changes, and generates a PR review which includes several types of feedbacks, such as possible PR issues, security threats and relevant test in the PR. More feedbacks can be added by configuring the tool.

    The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR.

    • When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:
    /review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=...
    
    [pr_reviewer]
    some_config1=...
    some_config2=...
    

    See the review usage page for a comprehensive guide on using this tool.

    Copy link

    github-actions bot commented Apr 5, 2024

    PR Code Suggestions

    CategorySuggestions                                                                                                                                                       
    Enhancement
    Use dynamic blocks for tags to improve code maintainability.

    Consider using a dynamic block for tags to improve maintainability and readability. This
    approach allows you to easily add or remove tags without modifying the structure of each
    resource definition.

    github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [25-27]

    -tags = {
    -  Project = "subspace-scale-runners"
    +dynamic "tags" {
    +  for_each = {
    +    Project = "subspace-scale-runners"
    +    # Add more tags here
    +  }
    +  content {
    +    key = tags.key
    +    value = tags.value
    +  }
     }
     
    Change associate_public_ip_address variable type to boolean for clarity.

    For the associate_public_ip_address variable, it's more appropriate to use a boolean type
    instead of a string since the description suggests a true/false value. This change
    improves clarity and enforces correct input values.

    github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [33-37]

     variable "associate_public_ip_address" {
       description = "If using a non-default VPC, there is no public IP address assigned to the EC2 instance. If you specified a public subnet, you probably want to set this to true. Otherwise the EC2 instance won't have access to the internet"
    -  type        = string
    +  type        = bool
       default     = null
     }
     
    Use multiple subnets for Lambda functions for higher availability.

    To ensure high availability and fault tolerance, consider specifying multiple subnet IDs
    for the Lambda functions by using a list of subnets rather than a single subnet. This
    allows the Lambda functions to be deployed across multiple Availability Zones.

    github-runners/terraform/autoscaling/main.tf [245]

    -lambda_subnet_ids = var.lambda_subnet_ids
    +lambda_subnet_ids = var.lambda_subnet_ids_list
     
    Add a condition to ensure s3_bucket, s3_key, and s3_object_version are only set when necessary.

    To ensure that the lambda function only uses the S3 bucket when necessary, consider adding
    a condition to check both var.lambda_s3_bucket and var.lambda_s3_key are not null before
    setting s3_bucket, s3_key, and s3_object_version. This will make the configuration more
    robust and avoid potential misconfigurations.

    github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [7-9]

    -s3_bucket         = var.lambda_s3_bucket != null ? var.lambda_s3_bucket : null
    +s3_bucket         = var.lambda_s3_bucket != null && var.lambda_s3_key != null ? var.lambda_s3_bucket : null
     s3_key            = var.lambda_s3_key != null ? var.lambda_s3_key : null
    -s3_object_version = var.lambda_s3_object_version != null ? var.lambda_s3_object_version : null
    +s3_object_version = var.lambda_s3_object_version != null && var.lambda_s3_bucket != null && var.lambda_s3_key != null ? var.lambda_s3_object_version : null
     
    Best practice
    Parameterize instance types for flexibility.

    To ensure the scalability and flexibility of your Terraform configuration, consider
    parameterizing the instance_types to allow different instance types for different
    environments or use cases.

    github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [49]

    -instance_types = ["m6a.4xlarge", "c6a.4xlarge"]
    +instance_types = var.instance_types
     
    Avoid hardcoding AWS region for security and flexibility.

    To avoid potential security risks, consider removing the hardcoded AWS region and instead
    use a variable. This will also increase the flexibility of your Terraform configuration.

    github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [3]

    -aws_region  = "us-west-2"
    +aws_region  = var.aws_region
     
    Use a more specific version constraint for the amazon plugin.

    It's recommended to use a more specific version constraint for the amazon plugin to avoid
    potential compatibility issues with future versions. Instead of ">= 0.0.2", consider
    specifying a range that caps the major version, such as "~> 0.0.2" which allows patch
    updates but prevents major version changes.

    github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [3-5]

     amazon = {
    -  version = ">= 0.0.2"
    +  version = "~> 0.0.2"
       source  = "github.com/hashicorp/amazon"
     }
     
    Use the null coalescing operator for default values to improve code readability.

    For better maintainability and readability, consider using Terraform's null coalescing
    operator instead of ternary operators for default values. This operator simplifies the
    expression and makes the code cleaner.

    github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [2-3]

    -lambda_zip = var.lambda_zip == null ? "${path.module}/../../lambdas/functions/ami-housekeeper/ami-housekeeper.zip" : var.lambda_zip
    -role_path  = var.role_path == null ? "/${var.prefix}/" : var.role_path
    +lambda_zip = var.lambda_zip ?? "${path.module}/../../lambdas/functions/ami-housekeeper/ami-housekeeper.zip"
    +role_path  = var.role_path ?? "/${var.prefix}/"
     
    Maintainability
    Use a local variable for webhook_secret to reduce duplication.

    For better maintainability and to avoid duplication, consider using a local variable or a
    module output for common values such as webhook_secret instead of repeating the
    random_id.random.hex in multiple places.

    github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [32]

    -webhook_secret = random_id.random.hex
    +webhook_secret = local.webhook_secret
     
    Move variable definitions to a separate file for better maintainability.

    To improve the maintainability and readability of the code, consider using a separate
    variables file for defining variables. This helps in separating the configuration from the
    template logic and makes the template cleaner.

    github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [10-12]

    -variable "runner_version" {
    -  description = "The version (no v prefix) of the runner software to install https://github.com/actions/runner/releases. The latest release will be fetched from GitHub if not provided."
    -  default     = null
    -}
    +# Variables should be defined in a separate file, e.g., variables.tf
     
    Use a variable for Terraform required version.

    To improve maintainability and avoid hardcoding, consider using a Terraform variable for
    the required_version of Terraform in versions.tf. This allows for easier updates and
    version management.

    github-runners/terraform/autoscaling/main.tf [2]

    -required_version = ">= 1"
    +required_version = var.terraform_required_version
     
    Use a separate file for IAM policies and roles to improve maintainability.

    To improve the maintainability of your Terraform code, consider using a separate file for
    IAM policies and roles. This approach makes it easier to manage and review IAM policies,
    especially as they grow in complexity.

    github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [57-63]

    +# In a separate file, e.g., iam.tf
     resource "aws_iam_role" "ami_housekeeper" {
       name                 = "${var.prefix}-ami-housekeeper-role"
       assume_role_policy   = data.aws_iam_policy_document.lambda_assume_role_policy.json
       path                 = local.role_path
       permissions_boundary = var.role_permissions_boundary
     
       tags = var.tags
     }
     
    Security
    Review security practices when using public IP addresses.

    For better security practices, avoid using public IP addresses if not necessary. If
    associate_public_ip_address is set to true, ensure that the instances are secured and only
    necessary ports are exposed.

    github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [104]

    -associate_public_ip_address = var.associate_public_ip_address
    +# Ensure security measures are in place if using public IP addresses
     
    Dynamically fetch AMI names using AWS SSM Parameter Store for enhanced security.

    To ensure the EC2 instances are using the latest and most secure AMIs, consider
    dynamically fetching the source_ami_filter name using AWS SSM Parameter Store instead of
    hardcoding the AMI name pattern.

    github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [107-115]

     source_ami_filter {
       filters = {
    -    name                = "*ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-arm64-server-*"
    +    name                = "{{ data.aws_ssm_parameter.ubuntu_ami.value }}"
         root-device-type    = "ebs"
         virtualization-type = "hvm"
       }
       most_recent = true
       owners      = ["099720109477"]
     }
     
    Restrict SQS queue policy actions to only necessary ones.

    Consider using a more restrictive condition for the SQS queue policy to limit the actions
    to only what's necessary, rather than allowing all SQS actions ("sqs:*"). This follows the
    principle of least privilege, enhancing the security of your Terraform configuration.

    github-runners/terraform/autoscaling/main.tf [33-35]

     actions = [
    -  "sqs:*"
    +  "sqs:SendMessage",
    +  "sqs:ReceiveMessage",
    +  "sqs:DeleteMessage",
    +  "sqs:GetQueueAttributes"
     ]
     
    Enable server-side encryption on SQS queues by default.

    For better security, consider enabling server-side encryption on the SQS queues by
    default, unless there's a specific reason to disable it. This can be done by setting
    sqs_managed_sse_enabled to true in the SQS resource configurations.

    github-runners/terraform/autoscaling/main.tf [73]

    -sqs_managed_sse_enabled = var.queue_encryption.sqs_managed_sse_enabled
    +sqs_managed_sse_enabled = true
     
    Specify specific ARNs in IAM policy resources instead of using a wildcard.

    It's recommended to avoid using a broad wildcard ("*") for IAM policy resources when
    possible. Specify the ARNs of the resources that the policy should apply to, enhancing the
    security by following the principle of least privilege.

    github-runners/terraform/autoscaling/main.tf [38-39]

     resources = [
    -  "*"
    +  "arn:aws:sqs:region:account-id:queue-name"
     ]
     
    Specify a more restrictive assume_role_policy for enhanced security.

    To enhance security, consider specifying a more restrictive assume_role_policy for the
    aws_iam_role resource. The current policy allows any Lambda function to assume this role.
    You should restrict it to only the Lambda functions that need it.

    github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [67-73]

     statement {
       actions = ["sts:AssumeRole"]
     
       principals {
         type        = "Service"
         identifiers = ["lambda.amazonaws.com"]
       }
    +  condition {
    +    StringEquals = {
    +      "aws:SourceArn": aws_lambda_function.ami_housekeeper.arn
    +    }
    +  }
     }
     

    ✨ Improve tool usage guide:

    Overview:
    The improve tool scans the PR code changes, and automatically generates suggestions for improving the PR code. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on a PR.

    • When commenting, to edit configurations related to the improve tool (pr_code_suggestions section), use the following template:
    /improve --pr_code_suggestions.some_config1=... --pr_code_suggestions.some_config2=...
    
    [pr_code_suggestions]
    some_config1=...
    some_config2=...
    

    See the improve usage page for a comprehensive guide on using this tool.

    @DaMandal0rian DaMandal0rian marked this pull request as ready for review April 5, 2024 11:48
    @DaMandal0rian DaMandal0rian deleted the ephemeral-runners branch October 1, 2024 13:31
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    1 participant