forked from cjackson202/AOAI-FED-CIV-Workshop
workshop_embedding.py
"""
Generate text embeddings for a dataset with the Azure OpenAI API.

Functions:
    - get_embedding(text): Returns the embedding for a given text, using the
      deployment named in the AZURE_EMBEDDINGS_DEPLOYMENT environment variable.
        - text: The input text to generate the embedding for.

Variables:
    - client: An instance of the AzureOpenAI class configured with the API version,
      endpoint, and key from the environment variables.
    - embedding_model_deployment: The name of the embedding model deployment,
      read from AZURE_EMBEDDINGS_DEPLOYMENT.
    - df: A pandas DataFrame containing the data read from 'microsoft-earnings.csv'.
        - Columns used: 'text' (input), 'embedding' (added by this script)

Usage:
    1. Set the required environment variables, for example in a '.env' file that
       load_dotenv() reads: AZURE_OPENAI_VERSION, AZURE_OPENAI_ENDPOINT,
       AZURE_OPENAI_KEY, AZURE_EMBEDDINGS_DEPLOYMENT.
    2. Run the script. It calls get_embedding on every value in the 'text' column.
    3. The DataFrame with the embeddings is saved to a new CSV file named
       'microsoft-earnings_embeddings.csv'.
"""
import os
import time

import pandas as pd
from dotenv import load_dotenv
from openai import AzureOpenAI
# Load variables from .env
load_dotenv()

# Configure the Azure OpenAI client
client = AzureOpenAI(
    api_version=os.environ['AZURE_OPENAI_VERSION'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
    api_key=os.environ['AZURE_OPENAI_KEY'],
)
embedding_model_deployment = os.environ['AZURE_EMBEDDINGS_DEPLOYMENT']

# Function to get embeddings
def get_embedding(text):
    """Return the embedding vector for `text` from the configured deployment."""
    response = client.embeddings.create(model=embedding_model_deployment, input=text)
    return response.data[0].embedding
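
# The function above sends one request per text. What follows is a minimal
# sketch, not part of the original workshop code, of embedding several texts
# per request instead. It assumes the embeddings endpoint accepts a list for
# `input` and returns items that carry an `index` field. The helper name
# `get_embeddings_batched`, its parameters, and the default batch size of 16
# are hypothetical; the real per-request limit depends on the deployment and
# API version.
def get_embeddings_batched(texts, embed_client, deployment, batch_size=16):
    """Return one embedding per input text, in the same order as `texts`."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = embed_client.embeddings.create(model=deployment, input=batch)
        # Sort by index so the results line up with the inputs.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings

# Possible usage with the objects defined in this script:
# df['embedding'] = get_embeddings_batched(
#     df['text'].tolist(), client, embedding_model_deployment)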

# Read the data file to be embedded
df = pd.read_csv('microsoft-earnings.csv')
print(df)

# Calculate an embedding for each row of text
df['embedding'] = df['text'].apply(get_embedding)
df.to_csv('microsoft-earnings_embeddings.csv')

# Brief pause before printing the final DataFrame
time.sleep(3)
print(df)
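
# The CSV stores each embedding as the text form of a Python list, so the
# 'embedding' column comes back as strings when the file is read again. What
# follows is a minimal sketch, not part of the original workshop code, of
# turning those strings back into lists of floats. The helper name
# `parse_embedding` is hypothetical.
import ast


def parse_embedding(cell):
    """Convert a stored cell such as '[0.1, -0.2]' back into a list of floats."""
    return [float(value) for value in ast.literal_eval(cell)]

# Possible usage when reloading the saved file:
# df_loaded = pd.read_csv('microsoft-earnings_embeddings.csv',
#                         converters={'embedding': parse_embedding})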