
# Delayed embed

You may decide to combine pgai with TimescaleDB background actions: the pgai extension can run inside a background action to process work asynchronously.

## Security concerns

There are some security concerns when running background actions, especially when you need to access an external service such as the OpenAI API.

You should be aware of how you're storing your keys and how they can be accessed.

The background action should run as a user with the necessary permissions, and it also needs access to the API key, either through shared server configuration or through the job payload.

If you store the key in the payload, make sure you're using a secure connection and that your secrets are rotated frequently.

You can also set this up at the server level. For example, you can add a new entry to the server configuration:

echo "ai.openai_api_key=${API_KEY}" >> /path/to/postgresql.conf

If you're running as a superuser, you can set the key for the current session and use ALTER SYSTEM to persist the configuration:

```sql
SET ai.openai_api_key TO :'OPENAI_API_KEY';
ALTER SYSTEM SET ai.openai_api_key TO :'OPENAI_API_KEY';
```

⚠️ Remember your keys are stored in plain text in the configuration file. ⚠️

⚠️ Remember that ALTER SYSTEM persists the setting server-wide, so it will be visible to every session once the configuration is reloaded. ⚠️
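
As a quick sanity check, you can reload the configuration and confirm the setting is visible. This is a minimal sketch; `pg_reload_conf()` and `SHOW` are standard PostgreSQL features:

```sql
-- Reload the configuration so the ALTER SYSTEM change takes effect
SELECT pg_reload_conf();

-- Confirm the key is now visible (note: this prints the plain-text key!)
SHOW ai.openai_api_key;
```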

## Delayed vector example

Let's imagine you're receiving documents and want to process the embeddings in the background. You can use the pgai extension to generate the embedding in a background action and store the result later.

Want to run it? Check the full script here.

First, install the pgvectorscale extension:

```sql
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
```


This table will store the documents and their embeddings:

```sql
CREATE TABLE IF NOT EXISTS document_embedding (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(1536)
);
```

Let's also add a DiskANN index to speed up the search:

```sql
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);
```
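
Once embeddings are populated, that index speeds up nearest-neighbor queries such as the sketch below. The `:'query_embedding'` psql variable is a hypothetical placeholder for a vector produced by the same model; `<=>` is pgvector's cosine-distance operator:

```sql
-- Return the five documents closest to a query embedding
SELECT id, contents
FROM document_embedding
ORDER BY embedding <=> :'query_embedding'::vector
LIMIT 5;
```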

Now, we're going to build a TimescaleDB background job that will populate the embedding column with the embeddings of the contents:

⚠️ Note that the api_key is passed as a parameter to the function. If you're using the ALTER SYSTEM approach, you can remove the parameter.

⚠️ Remember your keys are plain text, so they'll also be copied as part of the payload and may appear in logs.

```sql
CREATE OR REPLACE FUNCTION populate_embedding(job_id int, config jsonb) RETURNS void AS $$
DECLARE
  r record;
  api_key text;
BEGIN
  api_key := config->>'api_key';
  -- Grab one unprocessed document per run; SKIP LOCKED keeps concurrent runs from colliding
  FOR r IN SELECT id, contents FROM document_embedding
           WHERE embedding IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED LOOP
    UPDATE document_embedding
    SET embedding = openai_embed('text-embedding-ada-002', r.contents, api_key)
    WHERE id = r.id;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
```
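
If you persisted the key with ALTER SYSTEM instead, a variant of the function could read it from the server setting rather than the payload. A minimal sketch; `current_setting()` with the `missing_ok` flag is standard PostgreSQL:

```sql
-- Inside the function body, replace the payload lookup with:
api_key := current_setting('ai.openai_api_key', true);  -- returns NULL if unset
```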

Schedule the job to run every 10 seconds:

```sql
SELECT add_job('populate_embedding', '10s', fixed_schedule => true,
  config => format('{"api_key": "%s"}', :'OPENAI_API_KEY')::jsonb);
```
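
To confirm the job was registered, you can query TimescaleDB's standard jobs catalog view:

```sql
-- Inspect the scheduled job, its interval, and its config payload
SELECT job_id, proc_name, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name = 'populate_embedding';
```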

Now you can insert a new document:

```sql
INSERT INTO document_embedding (contents) VALUES
  ('Test document embedding');
```

Wait at least 10 seconds and then check the table:

```sql
SELECT * FROM document_embedding;
```
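
If the job has already run, the embedding column should be filled in. A narrower check, using only the columns defined above:

```sql
-- Show which documents have been embedded so far
SELECT id, contents, embedding IS NOT NULL AS embedded
FROM document_embedding;
```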