5/27/2024
8/11/2024
Empirical Risk Minimization (ERM) is a mathematical framework for performing supervised machine learning (ML). This will be part 1 in a series of blog posts exploring ERM from the ground up.
In part 1, we will broadly cover the task of supervised machine learning in a non-rigorous manner. By the end of this post, you should understand what supervised machine learning entails, so that later parts can formalize these concepts using math.
So on that note - what even is supervised machine learning? Let's start with a formal definition:
In supervised machine learning, a model is trained on a dataset containing both features and labels, with the goal of predicting the appropriate label for novel features.
By the end of this post, all of those words should make sense - but let's break it down piece by piece for now.
The notion of a "model" is in no way unique to machine learning, but it is nevertheless a core concept. In machine learning specifically, a model is simply a set of instructions (i.e., a program) that specifies how to predict an output value based on the inputs given to it.
Differing machine learning tasks require different types of models. In supervised machine learning, there are two common types of prediction tasks, each requiring different model behavior:
Example:
To build intuition about models (and to establish a running example), consider the problem of fraud detection in credit card purchases.
An ML researcher may choose to implement a classification model for fraud detection as follows:
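The author's original implementation isn't reproduced here, but at its core such a classification model is just a function from features to a predicted label. A hypothetical sketch (the feature names and placeholder logic are invented for illustration):

```python
# Hypothetical sketch - not the actual model from the example.
# A classification "model" is just a program: features in, predicted label out.
def predict_is_fraud(purchase_amount: float, hour_of_day: int) -> bool:
    """Predict whether a credit card purchase is fraudulent."""
    # Placeholder logic - how a real model makes this decision is exactly
    # what the Training and Learning section covers.
    return False

print(predict_is_fraud(19.99, 14))  # -> False (a benign-looking purchase)
```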
We'll defer a conversation on how the model predicts whether a credit card purchase is fraudulent to the Training and Learning section.
In machine learning, inputs and outputs of models are given special names which we'll use going forward:
In this sense, a model is really a labelling function - some set of instructions to assign labels to input features.
As before, different supervised machine learning tasks use different label types:
You might have read the fraud detection example above and wondered:
"Well, how does a model know what feature values correspond to a fraudulent purchase or not?"
If you were approaching this challenge like a traditional computer scientist, you might decide to develop a fixed algorithm with rules that determine if a credit card purchase is fraudulent or not. Anyone who has tried something like this before can attest - coming up with rules that are precise enough to catch fraud is tricky.
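To see why, here is the kind of hand-crafted rule set such an approach might produce. Every rule and threshold below is invented for illustration:

```python
# A hypothetical fixed algorithm for fraud detection - no learning involved.
def looks_fraudulent(amount: float, country: str, hour_of_day: int) -> bool:
    if amount > 5000:  # rule 1: unusually large purchases are suspicious
        return True
    if country != "US" and hour_of_day < 5:  # rule 2: foreign purchases at odd hours
        return True
    return False  # everything else passes

print(looks_fraudulent(6200.0, "US", 13))  # -> True, caught by rule 1
print(looks_fraudulent(4999.0, "US", 13))  # -> False, slips under the threshold
```

Each rule invites counterexamples (legitimate large purchases, fraud that stays just under the thresholds), and patching them multiplies the rules - which is exactly the difficulty noted above.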
In place of painstakingly crafting an algorithm to catch most cases of fraud, why not:
Steps 1 and 2 above form the process of training a machine learning model - having the model learn to recognize patterns in how the features relate to the labels so that it can accurately predict labels in the future. This is where the "learning" in "machine learning" comes from!
To classify new data, the model just has to identify any patterns that exist, and then predict the label that corresponds to those patterns.
Comparison of label prediction strategies, contrasting traditional and machine learning based approaches. Image by author.
The central requirement for training a model is access to historical data, which we collect as part of a training dataset. A training dataset is a collection of training examples (also called examples) that describe past occurrences of the phenomena we'd like to make a prediction for. Each example in the training dataset has two parts:
The fact that our dataset contains ground truth labels makes this a supervised learning task. This contrasts with other methods that don't include labels in the training dataset (which are thus called unsupervised machine learning tasks).
Assuming that the training data is representative of the phenomena at hand, the goal is to train our model so that it learns to recognize general patterns in how the features relate to the labels.
Example:
Building on the fraud detection example, the model described in the prior excerpt can be trained as follows:
If everything works out well, we should ideally have a strong model that accurately predicts whether a purchase is fraudulent or not.
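To make the whole setup concrete, here is a toy sketch of training and prediction - a tiny made-up dataset and a deliberately simplistic nearest-neighbor model (not the approach described above, just an illustration):

```python
# Toy training dataset: each example pairs features with a ground-truth label.
# Features (invented): (purchase amount in dollars, 1 if daytime else 0)
train_features = [(12.5, 1), (2500.0, 0), (8.0, 1), (4300.0, 0)]
train_labels = [False, True, False, True]  # True = fraudulent

def predict(features):
    """Predict a label by copying the label of the most similar past example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    closest = min(range(len(train_features)),
                  key=lambda i: squared_distance(train_features[i], features))
    return train_labels[closest]

# A new purchase resembling past fraudulent examples gets labelled fraudulent.
print(predict((4000.0, 0)))  # -> True
print(predict((10.0, 1)))    # -> False
```

Here "training" is simply memorizing the examples; real models compress the patterns instead, but the interface - features in, predicted label out - is the same.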
Note that we still have many questions left unanswered. For example:
To summarize what we've learned so far - in a supervised machine learning task, we have two components:
The goal in a supervised machine learning task is to train the model to recognize patterns in the training dataset, so that the model can predict labels for features it potentially hasn't seen before.
Overview of Supervised Machine Learning setup. Image by author.
In the next part (coming soon), we'll formalize this intuition into mathematics and discuss how ERM presents us with a framework that can describe the training process behind a machine learning algorithm.
I recently got to play around with deploying Lambda functions on AWS, which wasn't as straightforward as I imagined. Here are the steps I took to launch a Python-based AWS Lambda function running in a Docker environment.
A serverless function (like AWS Lambda) is just a cloud-hosted version of the regular functions we're used to in code. All you need to provide is a function definition - your cloud provider handles the details of running your function when called (or "triggered", in Lambda parlance).
It's the familiarity of serverless functions that makes them so useful. (Good) developers already use functions as modular, maintainable pieces of logic in codebases - serverless functions encourage the same design patterns in the cloud. Your system can be decoupled into individual serverless functions, each running independently and scaling automatically to varying workloads.
In code, a Lambda serverless function is implemented by a handler function.
def lambda_handler(event, context):
    pass
As part of the setup process, we tell AWS to call lambda_handler() when the Lambda is triggered by an event. Lambda will then call the function with the following two parameters:
- event - the JSON payload sent by the triggering event in dict format. You control the exact payload schema and contents when defining an event.
- context - info about the current execution environment. See the docs for what context info you have access to.
You can import modules or define variables and classes all outside the handler function - AWS Lambda will load those dependencies in when calling the handler. Simple enough, right?
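For instance, a handler can read execution metadata off the context object - a sketch using a few attributes documented for the Python runtime (see the docs for the full list):

```python
def lambda_handler(event, context):
    # Execution metadata from the context object (per the Python runtime docs):
    print(context.aws_request_id)      # unique ID for this invocation
    print(context.function_name)       # name of this Lambda function
    print(context.memory_limit_in_mb)  # configured memory limit
    # Milliseconds left before Lambda times out this invocation:
    print(context.get_remaining_time_in_millis())
    return {"request_id": context.aws_request_id}
```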
The rest of this post will cover the steps needed to launch a Python-based Lambda function via Docker. We'll be using Python 3.12 for this tutorial, but you can use any Python version supported by Lambda.
We'll start simple by creating a folder to store all of our Lambda-related files. The name of this folder doesn't really matter; I went with the straightforward my-lambda.
mkdir my-lambda
cd my-lambda
Of course - for an actual Lambda with real use, please name it something more descriptive and identifiable for the sake of maintainability.
As we discussed earlier, a Lambda function calls a handler in our code when triggered. Our next step will be to define and implement that handler function.
First, create a Python file named handler.py - our handler function definition will reside here. If you choose to name it something else, keep track of the name for the next step.
touch handler.py
We'll then use our favorite text editor (perhaps Neovim?) to implement the handler. For this example, we'll have our handler do two things: print each key-value pair in the incoming event, then fetch our IP information from ipinfo.io and return it.
import requests

def lambda_handler(event, context):
    for key in event.keys():
        print(f"{key}: {event[key]}")

    with requests.get("https://ipinfo.io") as response:
        response.raise_for_status()
        return response.json()
If you're using Python's logging module, the base Lambda container image defines its own stream handler for the root logger on top of any handlers you define manually. You'll want to disable log propagation to avoid log message duplication.
import logging

logger = logging.getLogger(__name__)
logger.propagate = False  # don't propagate messages to root logger
Once you have your handler function implemented to perform your desired logic, we can move on to creating the Docker container image to execute the handler function in.
In general, there are multiple ways to package the handler function so that Lambda can run it. For this tutorial, we'll use a container-based approach. We won't use the alternative archive-based approach, as it is much more prone to third-party dependency errors (as I learned the hard way) - though it works perfectly fine for simple Lambda functions.
AWS offers multiple ways to define a container image for a Lambda function, but we'll take the simplest route and start from the official AWS Lambda base images for Python. As we are using Python 3.12, we'll use the version tagged with 3.12 - but you should use the image that corresponds to your Python version as needed.
With that out of the way, here is the Dockerfile for the Lambda container image.
FROM public.ecr.aws/lambda/python:3.12

RUN pip install requests
COPY handler.py ${LAMBDA_TASK_ROOT}
CMD [ "handler.lambda_handler" ]
Apart from pulling the 3.12 version of the base image, the image recipe above contains three other steps:
- Install the requests package, as needed by our handler function. Additional Python packages can also be installed at this step.
- Copy handler.py into the Lambda execution root folder, whose value is given by the LAMBDA_TASK_ROOT environment variable. You can see the list of Lambda-specific environment variables in the base image definition.
- Set the command Lambda runs when invoking the function: our handler is called lambda_handler() and can be found in handler.py. More generally, this step's input should be <handler_source_file>.<handler_function_name> - so your command will be different if you named your Python file or handler function differently.
And that's it for the image definition! We'll go ahead and build this image using Docker to make sure everything looks correct. I named my image my-lambda-image - but as before, you can choose your own name. Just keep track of the image name for later parts.
docker build -t my-lambda-image .
Everything should build correctly if your setup is valid, and we can move on to the next step.
You may have to install additional libraries for your Lambda based on what third-party packages you use. The base image derives from Amazon Linux 2023, so if a library is not included in AL2023's package repository, then you may have to install it from source.
At this point, we should have two files in the my-lambda directory:
- handler.py, which defines the handler function for the Lambda
- Dockerfile, which defines the container image creation recipe
my-lambda
├── Dockerfile
└── handler.py
A benefit of using the official AWS base image is that we can test our Lambda locally to make sure it runs correctly in the containerized setup. First, run the container, making sure to expose port 8080 of the container to your system. We'll use this port to send a message that triggers the Lambda function.
docker run -p 8080:8080 my-lambda-image
You'll know the container is running correctly if you see a message similar to this:
09 Aug 2024 15:22:28,064 [INFO] (rapid) exec '/var/runtime/bootstrap' (cwd=/var/task, handler=)
For local testing, you will need to set environment variables with your AWS credentials if your Lambda accesses other AWS services via boto3. However, you will not need to do this in production - each Lambda function runs with a specific IAM role that can be assigned permissions to access necessary services. In the actual Lambda infrastructure, the requisite credentials corresponding to this IAM role will be set.
Then open up another terminal window. We are going to send an HTTP POST request using curl to the endpoint at port 8080 to trigger our Lambda. This POST request includes a test event payload, which in this example is simply a few random keys and values.
curl -XPOST "http://localhost:8080/2015-03-31/functions/function/invocations" -d \
'{"key1": 1, "key2": -3.14159265, "key3": {"innerkey": "hello!"}}'
Send this POST request, and if everything runs correctly, the container should produce an output that looks a bit like the following.
09 Aug 2024 15:24:49,199 [INFO] (rapid) INIT START(type: on-demand, phase: init)
START RequestId: 6fa7492a-5313-478f-8193-1c6ecc5e0bb1 Version: $LATEST
09 Aug 2024 15:24:49,199 [INFO] (rapid) The extension's directory "/opt/extensions" does not exist, assuming no extensions to be loaded.
09 Aug 2024 15:24:49,199 [INFO] (rapid) Starting runtime without AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN , Expected?: false
09 Aug 2024 15:24:49,308 [INFO] (rapid) INIT RTDONE(status: success)
09 Aug 2024 15:24:49,308 [INFO] (rapid) INIT REPORT(durationMs: 109.507000)
09 Aug 2024 15:24:49,308 [INFO] (rapid) INVOKE START(requestId: 03bc93e8-7dcc-4bf2-bdca-d615198a48a5)
key1: 1
key2: -3.14159265
key3: {'innerkey': 'hello!'}
END RequestId: 03bc93e8-7dcc-4bf2-bdca-d615198a48a5
REPORT RequestId: 03bc93e8-7dcc-4bf2-bdca-d615198a48a5 Init Duration: 0.11 ms Duration: 201.31 ms Billed Duration: 202 ms Memory Size: 3008 MB Max Memory Used: 3008 MB
09 Aug 2024 15:24:49,400 [INFO] (rapid) INVOKE RTDONE(status: success, produced bytes: 0, duration: 91.613000ms)
There's a lot to parse here, so let's look at the log messages by phase:
- The initialization phase, covered by the INIT * log messages. In this phase, the Lambda loads the handler definition file plus any extensions (which we didn't cover in this tutorial).
- The invocation phase, covered by the INVOKE * log messages. Each invocation is given a request ID, in my case 03bc93e8-7dcc-4bf2-bdca-d615198a48a5. Our handler function is called, which then prints out the key-value pairs and makes an IP info request.
Note that the Lambda doesn't fully shut down here - instead, the environment keeps running, waiting for new invocations.
If you return now to the terminal window from where you sent the POST request, you should see that the JSON-formatted IP info response was returned to the client:
{
  "ip": "***.***.***.***",
  "hostname": "**************",
  "city": "**************",
  "region": "New York",
  "country": "US",
  "loc": "*******,*******",
  "org": "**************",
  "postal": "*****",
  "timezone": "America/New_York",
  "readme": "https://ipinfo.io/missingauth"
}
I removed some identifying information for my privacy, but your response should look similar. We receive this data because the JSON response was the return value of lambda_handler() in our implementation, which is then returned to the requesting client. In general, you can return any payload from the handler function, and can then send that return value to a number of destinations, including other Lambda functions.
Once we're satisfied with our Lambda function and are certain that everything works correctly, let's start deploying to production. The first step in this process will be uploading our container image to a repository on AWS Elastic Container Registry (ECR).
We'll need to create a repository to store our container images in. Navigate to the ECR page on the AWS Console and press the "Create repository" button.
ECR landing page. To create a new repository, click the 'Create repository' button.
Then complete the configuration page to create an ECR repository. I'll call mine my-lambda-images and set it to private access. This repository name becomes part of the repository URI that we'll use in later steps to access our images.
ECR Repository Creation Screen.
Check out the docs to learn about other configuration options for your repository.
Now let's return to the terminal so we can upload the container image to ECR. If you haven't already, make sure you have the AWS CLI installed and authenticated with an IAM role that supports pushing to ECR.
You will also need to log in to ECR with Docker so you have authenticated access to push to ECR. AWS already has docs on how to do this, so walk through that page before following the rest of this tutorial.
In the previous step, if you're using IAM Identity Center, then you will have to specify your profile name using the --profile option so AWS can validate your permissions. The AWS docs unfortunately do not make this explicit.
aws ecr get-login-password --region <your-region> --profile <your-profile-name>
Let's start by tagging our container image with the repository info so Docker knows where to push it.
docker tag my-lambda-image:latest [aws_account_id].dkr.ecr.[region].amazonaws.com/my-lambda-images:[tag]
The last argument after the repository URI specifies the name to tag this image with inside the my-lambda-images repository, which can be different from your local tag name.
Then all we have to do is push the tagged image to the repository. Since we tagged the image and logged in earlier, Docker will take care of the rest!
docker push [aws_account_id].dkr.ecr.[region].amazonaws.com/my-lambda-images:[tag]
Once everything is done, we should see our tagged version of the image in the AWS Console view of the my-lambda-images repository.
Home page for 'my-lambda-images' repository after image with tag 'latest' was pushed.
Almost done - now we can create our Lambda function from our container image. Navigate to the Lambda home page on the AWS Console, then click the "Create function" button to create the Lambda.
Lambda landing page in the AWS Console. Click 'Create function' to create a Lambda function.
When creating your Lambda function, there are three primary options you should pay attention to: the deployment package type (choose the container image option), the function name (I went with my-lambda), and the architecture.
Important: Make sure you choose the same architecture as the computer your Lambda container image was built on! When we pulled the base Python Lambda image a few steps ago, Docker chose the correct version of the image to run on your computer's architecture, so we need to make sure that the Lambda also uses the same architecture. I use an M-chip MacBook, so I selected "arm64" for my Lambda.
Lambda function creation page in the AWS Console.
Go ahead and create the function, and if everything worked smoothly, you should be presented with the my-lambda function home page, similar to below.
'my-lambda' function home page in the AWS Console after a successful creation.
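(As an aside, the console flow above can also be scripted. A hedged boto3 sketch - the image URI and role ARN are placeholders you'd substitute with your own values:)

```python
def create_my_lambda(image_uri: str, role_arn: str):
    """Create the function from a container image via boto3 (illustrative sketch)."""
    import boto3  # imported lazily; requires configured AWS credentials to run

    client = boto3.client("lambda")
    return client.create_function(
        FunctionName="my-lambda",
        PackageType="Image",           # container-image deployment
        Code={"ImageUri": image_uri},  # e.g. the my-lambda-images URI from ECR
        Role=role_arn,                 # execution role ARN for the function
        Architectures=["arm64"],       # must match the arch the image was built for
    )
```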
Note that we can't edit the function definition since it's packaged as part of the container. This is one of the downsides of a container-based deployment approach. Every time we want to update our Lambda function, we have to:
- Update handler.py
- Rebuild the container image
- Push the new image to ECR
- Update the Lambda function to use the new image
This makes testing locally even more imperative, so that there are fewer bugs in production and fewer times that the entire deployment cycle has to be performed. Or you could just build a CI/CD pipeline to streamline this so you can continue pushing buggy code to production 😔.
Our Lambda is now created and should be ready to go!
If you need to change any Lambda settings, including environment variables and the maximum allowed execution time, go to the "Configuration" tab on the function home page and edit these settings.
We can test the Lambda by navigating to the "Test" tab on the function home page and creating a test case. Below is a simple test case that sends the exact same payload as the local test we performed earlier in this tutorial.
Test creation page for the 'my-lambda' function in the AWS Console. This test sends the same payload as the local test performed earlier in this tutorial.
Press the "Test" button to run the test. A window will appear above the test case editor containing information about the test and whether it succeeded. If everything went alright, you should see a successful execution!
We can view the attached execution report and see how the statistics compare to our local test:
START RequestId: 246837cf-30cb-455b-9df4-d7052574154d Version: $LATEST
key1: 1
key2: -3.14159265
key3: {'innerkey': 'hello!'}
END RequestId: 246837cf-30cb-455b-9df4-d7052574154d
REPORT RequestId: 246837cf-30cb-455b-9df4-d7052574154d Duration: 117.93 ms Billed Duration: 1464 ms Memory Size: 128 MB Max Memory Used: 57 MB Init Duration: 1346.07 ms
Notice that while the Lambda only took ~118 milliseconds to execute, AWS billed us for nearly 1.5 seconds. This is due to the time needed to initialize and spin up the container - which was nearly 1.35 seconds! This is another downside of a container-based approach - actually creating and running the container can dominate the billed duration of your Lambda invocations.
In that same window, we can also see the return value of the Lambda function - in our case, the IP information data.
{
  "ip": "34.230.24.96",
  "hostname": "ec2-34-230-24-96.compute-1.amazonaws.com",
  "city": "Ashburn",
  "region": "Virginia",
  "country": "US",
  "loc": "39.0437,-77.4875",
  "org": "AS14618 Amazon.com, Inc.",
  "postal": "20147",
  "timezone": "America/New_York",
  "readme": "https://ipinfo.io/missingauth"
}
There is some surprisingly neat information here:
- The "hostname" value looks suspiciously like a hostname for an EC2 host - it could indicate that Lambdas are implemented on EC2, or at least use similar DNS structures internally.
value looks suspiciously like a hostname for a EC2 host - it could indicate that Lambdas are implemented on EC2, or at least use similar DNS structures internally.And that's it! We've launched a simple Python-based Lambda function with Docker. AWS Lambdas are a fairly versatile tool, and there's a bunch more you can explore:
+boto3
.
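As a parting example - once deployed, the function can be invoked from code as well as from the console. A hedged boto3 sketch (assumes credentials with permission to invoke the function):

```python
def invoke_my_lambda(payload: dict) -> dict:
    """Synchronously invoke the deployed 'my-lambda' function (illustrative sketch)."""
    import json
    import boto3  # imported lazily; requires configured AWS credentials to run

    client = boto3.client("lambda")
    result = client.invoke(
        FunctionName="my-lambda",
        InvocationType="RequestResponse",  # wait for the handler's return value
        Payload=json.dumps(payload),
    )
    # The handler's return value comes back as a streaming body of JSON bytes.
    return json.loads(result["Payload"].read())
```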