Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenSearch migration for Alegre, including dedicated migration tasks in ECS #364

Open
wants to merge 27 commits into
base: develop
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 14 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
cb2c13c
Run migrations as their own target in Makefile, rather than by default.
Oct 26, 2023
16c1671
Merge branch 'develop' into bugfix/no-migrations-test
Oct 26, 2023
fd0f018
Test migration task in QA
Oct 29, 2023
3702010
Fix typo.
Oct 29, 2023
749667c
Add migration step to Live deployments and remove branch builds in pr…
Oct 30, 2023
0859d64
Build and deploy for this branch.
Nov 4, 2023
ca2fe08
ELASTICSEARCH is now OPENSEARCH
Nov 21, 2023
0960ced
OpenSearch refactor.
Nov 25, 2023
8e3df67
OpenSearch refactor.
Nov 27, 2023
2c20b38
Merge develop
Nov 27, 2023
0de0d5d
Support both elasticsearch and opensearch model names for compatibility.
Nov 30, 2023
127c572
Remove builds for branch in preparation for merge.
Dec 4, 2023
4e2890b
Merge branch 'develop' into bugfix/no-migrations-test
Dec 4, 2023
5c43df7
OpenSearch is the future
Dec 4, 2023
a033a64
Typo.
Dec 5, 2023
bf674ac
Fix merge conflict.
Jan 10, 2024
91b7025
Remove dependency on kibana. We don't use it
Jan 11, 2024
dd85c98
More kibana deprecation.
Jan 11, 2024
f499882
Revert API breaking changes.
Jan 16, 2024
d137bce
Deprecate unused model services in ECS deploy.
Jan 18, 2024
cc97ff4
Build and deploy to QA from this branch.
Jan 18, 2024
ff6c560
Remove builds for branch in preparation for merge.
Jan 22, 2024
a2e9fed
Merge branch 'develop' into bugfix/no-migrations-test
Jan 23, 2024
fc845ae
Build and deploy to QA from this branch.
Jan 23, 2024
5c01ee4
Remove builds for branch in preparation for merge.
Jan 24, 2024
37fbbe9
Merge branch 'develop' into bugfix/no-migrations-test
Feb 9, 2024
69809a7
Merge branch 'develop' into bugfix/no-migrations-test
Jun 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .env_file.example
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
IMAGE_MODEL=phash
ELASTICSEARCH_URL=http://elasticsearch:9200
OPENSEARCH_URL=http://opensearch:9200
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_DATABASE=0
Expand Down
14 changes: 12 additions & 2 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,6 @@ build_qa:
- docker push "$QA_ECR_API_BASE_URL:latest"
only:
- develop
- feature/CV2-3482_deploy-new-model

deploy_qa:
image: python:3-alpine
Expand All @@ -44,6 +43,12 @@ deploy_qa:
- pip install ecs-deploy==1.14.0
- pip install awscli==1.29.59
- aws ssm get-parameters-by-path --region $AWS_DEFAULT_REGION --path /qa/alegre/ --recursive --with-decryption --output text --query "Parameters[].[Name]" | sed -E 's#/qa/alegre/##' > env.qa.names
- for NAME in `cat env.qa.names`; do echo -n "-s qa-alegre-migration $NAME /qa/alegre/$NAME " >> qa-alegre-migration.env.args; done
- ecs update qa-alegre-migration --image qa-alegre-migration $QA_ECR_API_BASE_URL:$CI_COMMIT_SHA --exclusive-env -e qa-alegre-migration APP alegre -e qa-alegre-migration DEPLOY_ENV qa -e qa-alegre-migration AWS_REGION $AWS_DEFAULT_REGION -e qa-alegre-migration ALEGRE_PORT 8000 --exclusive-secrets `cat qa-alegre-migration.env.args`
- taskArn=$(aws ecs run-task --cluster ecs-qa --task-definition qa-alegre-migration --query 'tasks[].taskArn' --output text)
- echo "Migration task started - $taskArn"
- aws ecs wait tasks-stopped --cluster ecs-qa --tasks $taskArn
- echo "Migration task finished."
- for NAME in `cat env.qa.names`; do echo -n "-s qa-alegre-c $NAME /qa/alegre/$NAME " >> qa-alegre-c.env.args; done
- ecs deploy ecs-qa qa-alegre --diff --image qa-alegre-c $QA_ECR_API_BASE_URL:$CI_COMMIT_SHA --timeout 1200 --exclusive-env -e qa-alegre-c APP alegre -e qa-alegre-c DEPLOY_ENV qa -e qa-alegre-c ALEGRE_PORT 8000 --exclusive-secrets `cat qa-alegre-c.env.args`
- for NAME in `cat env.qa.names`; do echo -n "-s qa-alegre-indiansbert $NAME /qa/alegre/$NAME " >> qa-alegre-indiansbert.env.args; done
Expand All @@ -63,7 +68,6 @@ deploy_qa:
- echo "new Image was deployed $QA_ECR_API_BASE_URL:$CI_COMMIT_SHA"
only:
- develop
- feature/CV2-3482_deploy-new-model

build_live:
image: docker:latest
Expand Down Expand Up @@ -104,6 +108,12 @@ deploy_live:
- pip install ecs-deploy==1.14.0
- pip install awscli==1.29.59
- aws ssm get-parameters-by-path --region $AWS_DEFAULT_REGION --path /live/alegre/ --recursive --with-decryption --output text --query "Parameters[].[Name]" | sed -E 's#/live/alegre/##' > env.live.names
- for NAME in `cat env.live.names`; do echo -n "-s live-alegre-migration $NAME /live/alegre/$NAME " >> live-alegre-migration.env.args; done
- ecs update live-alegre-migration --image live-alegre-migration $QA_ECR_API_BASE_URL:$CI_COMMIT_SHA --exclusive-env -e live-alegre-migration APP alegre -e live-alegre-migration DEPLOY_ENV live -e live-alegre-migration AWS_REGION $AWS_DEFAULT_REGION -e live-alegre-migration ALEGRE_PORT 8000 --exclusive-secrets `cat live-alegre-migration.env.args`
sonoransun marked this conversation as resolved.
Show resolved Hide resolved
- taskArn=$(aws ecs run-task --cluster ecs-live --task-definition live-alegre-migration --query 'tasks[].taskArn' --output text)
- echo "Migration task started - $taskArn"
- aws ecs wait tasks-stopped --cluster ecs-live --tasks $taskArn
- echo "Migration task finished."
- for NAME in `cat env.live.names`; do echo -n "-s live-alegre-c $NAME /live/alegre/$NAME " >> live-alegre-c.env.args; done
- ecs deploy ecs-live live-alegre --image live-alegre-c $LIVE_ECR_API_BASE_URL:$CI_COMMIT_SHA --timeout 1200 --exclusive-env -e live-alegre-c APP alegre -e live-alegre-c DEPLOY_ENV live -e live-alegre-c ALEGRE_PORT 8000 --exclusive-secrets `cat live-alegre-c.env.args`
- for NAME in `cat env.live.names`; do echo -n "-s live-alegre-indiansbert $NAME /live/alegre/$NAME " >> live-alegre-indiansbert.env.args; done
Expand Down
2 changes: 1 addition & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ before_script:
- docker-compose build --pull
- docker-compose -f docker-compose.yml -f docker-test.yml up -d
- docker-compose logs -t -f &
- echo "Waiting for Elasticsearch indexes..." && until curl --silent --fail -I "http://localhost:9200/alegre_similarity_test"; do sleep 1; done
- echo "Waiting for OpenSearch indexes..." && until curl --silent --fail -I "http://localhost:9200/alegre_similarity_test"; do sleep 1; done
- until curl --silent --fail -I "http://localhost:3100"; do sleep 1; done
- echo "Waiting for model servers..." && while [[ ! '2' =~ $(redis-cli -n 1 SCARD 'SharedModel') ]]; do sleep 1; done
#comment until fix timeout curl: (28) Operation timed out
Expand Down
7 changes: 5 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
.PHONY: run test wait

run: wait
migration: wait
python manage.py init_perl_functions
python manage.py init
python manage.py db stamp head
python manage.py db upgrade
echo "Migrations complete."

run: wait
python manage.py run

# The model and worker entry points run repeatedly to
Expand All @@ -28,7 +31,7 @@ test: wait
coverage run --source=app/main/ manage.py test

wait:
until curl --silent -XGET --fail $(ELASTICSEARCH_URL); do printf '.'; sleep 1; done
until curl --silent -XGET --fail $(OPENSEARCH_URL); do printf '.'; sleep 1; done

contract_testing: wait
curl -vvv -X POST "http://alegre:3100/image/similarity/" -H "Content-Type: application/json" -d '{"url":"https://i.pinimg.com/564x/0f/73/57/0f7357637b2b203e9f32e73c24d126d7.jpg","threshold":0.9,"context":{}}'
Expand Down
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ A media analysis service. Part of the [Check platform](https://meedan.com/check)
The Alegre API Swagger UI unfortunately [does not support sending body payloads to GET methods](https://github.com/swagger-api/swagger-ui/issues/2136). To test those API methods, you can still fill in your arguments, and click "Execute" - Swagger will fail, but show you a `curl` command that you can use in your console.

- Open http://localhost:5601 for the Kibana UI
- Open http://localhost:9200 for the Elasticsearch API
- Open http://localhost:9200 for the OpenSearch API
- `docker-compose exec alegre flask shell` to get inside a Python shell in docker container with the loaded app

## Testing
Expand All @@ -30,7 +30,7 @@ To test individual modules:

## Troubleshooting

- If you're having trouble starting Elasticsearch on macOS, with the error `container_name exited with code 137`, you will need to adjust your Docker settings, as per https://www.petefreitag.com/item/848.cfm
- If you're having trouble starting OpenSearch on macOS, with the error `container_name exited with code 137`, you will need to adjust your Docker settings, as per https://www.petefreitag.com/item/848.cfm
- Note that the alegre docker service definitions in the `alegre` repo may not align with the alegre service definitions in the `check` repository, so different variations of the service may be spun up depending on the directory where `docker-compose up` is executed.


Expand Down
6 changes: 3 additions & 3 deletions app/main/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,8 @@
class Config:
SECRET_KEY = os.getenv('SECRET_KEY', 'my_precious_secret_key')
DEBUG = False
ELASTICSEARCH_URL = os.getenv('ELASTICSEARCH_URL', 'http://elasticsearch:9200')
ELASTICSEARCH_SIMILARITY = 'alegre_similarity'
OPENSEARCH_URL = os.getenv('OPENSEARCH_URL', 'http://opensearch:9200')
OPENSEARCH_SIMILARITY = 'alegre_similarity'
REDIS_HOST = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT = os.getenv('REDIS_PORT', 6379)
REDIS_DATABASE = os.getenv('REDIS_DATABASE', 0)
Expand Down Expand Up @@ -49,7 +49,7 @@ class TestingConfig(Config):
DEBUG = True
TESTING = True
PRESERVE_CONTEXT_ON_EXCEPTION = False
ELASTICSEARCH_SIMILARITY = 'alegre_similarity_test'
OPENSEARCH_SIMILARITY = 'alegre_similarity_test'
REDIS_DATABASE = os.getenv('REDIS_DATABASE', 1)
SQLALCHEMY_DATABASE_URI = 'postgresql+psycopg2://%(user)s:%(password)s@%(host)s/%(dbname)s?client_encoding=utf8' % {
'user': os.getenv('DATABASE_USER', 'postgres'),
Expand Down
4 changes: 2 additions & 2 deletions app/main/controller/about_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,8 @@ def get(self):
return {
'text/langid': AboutResource.list_providers('app.main.lib.langid', 'LangidProvider'),
'text/translation': ['google'],
'text/similarity': ['elasticsearch'] + SharedModel.get_servers(),
'text/bulk_similarity': ['elasticsearch'],
'text/similarity': ['opensearch'] + SharedModel.get_servers(),
'text/bulk_similarity': ['opensearch'],
'text/bulk_upload_similarity': SharedModel.get_servers(),
'image/classification': AboutResource.list_providers('app.main.lib.image_classification', 'ImageClassificationProvider'),
'image/similarity': ['phash'],
Expand Down
4 changes: 2 additions & 2 deletions app/main/controller/bulk_similarity_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ class BulkSimilarityResource(Resource):
def get_bulk_write_object(self, doc_id, body, op_type="index"):
return {
"_op_type": op_type,
'_index': app.config['ELASTICSEARCH_SIMILARITY'],
'_index': app.config['OPENSEARCH_SIMILARITY'],
'_id': doc_id,
'_source': body
}
Expand All @@ -46,7 +46,7 @@ def get_bodies_for_request(self):
return doc_ids, bodies

def submit_bulk_request(self, doc_ids, bodies, op_type="index"):
es = OpenSearch(app.config['ELASTICSEARCH_URL'])
es = OpenSearch(app.config['OPENSEARCH_URL'])
writables = []
for doc_body_set in each_slice(list(zip(doc_ids, bodies)), 8000):
to_write = []
Expand Down
6 changes: 3 additions & 3 deletions app/main/controller/bulk_update_similarity_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
from app.main.controller.bulk_similarity_controller import BulkSimilarityResource
from app.main.lib import similarity
from app.main.lib.text_similarity import get_document_body
from app.main.lib.elasticsearch import merge_contexts
from app.main.lib.opensearch import merge_contexts
def get_documents_by_ids(index, ids, es):
query = {
"query": {
Expand Down Expand Up @@ -71,9 +71,9 @@ def get_cases(params, existing_docs, updateable=True):
class BulkUpdateSimilarityResource(Resource):
# Assumes less than 10k documents at a time.
def get_writeable_data_for_request(self):
es = OpenSearch(app.config['ELASTICSEARCH_URL'], timeout=30)
es = OpenSearch(app.config['OPENSEARCH_URL'], timeout=30)
params = request.json
existing_docs = get_documents_by_ids(app.config['ELASTICSEARCH_SIMILARITY'], [e.get("doc_id") for e in params.get("documents", [])], es)
existing_docs = get_documents_by_ids(app.config['OPENSEARCH_SIMILARITY'], [e.get("doc_id") for e in params.get("documents", [])], es)
updated_cases = get_cases(params, existing_docs)
new_cases = get_cases(params, existing_docs, False)
return updated_cases, new_cases
Expand Down
16 changes: 8 additions & 8 deletions app/main/controller/healthcheck_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,24 +15,24 @@ class HealthcheckResource(Resource):
@api.doc('Make a healthcheck query')
def get(self):
result = {
'ELASTICSEARCH': False,
'ELASTICSEARCH_SIMILARITY': False,
'OPENSEARCH': False,
'OPENSEARCH_SIMILARITY': False,
'REDIS': False,
'DATABASE': False,
'LANGID': False
}

# Elasticsearch
try:
es = OpenSearch(app.config['ELASTICSEARCH_URL'], timeout=10, max_retries=3, retry_on_timeout=True)
es = OpenSearch(app.config['OPENSEARCH_URL'], timeout=10, max_retries=3, retry_on_timeout=True)

except Exception as e:
result['ELASTICSEARCH'] = str(e)
result['OPENSEARCH'] = str(e)
else:
result['ELASTICSEARCH'] = True
result['ELASTICSEARCH_SIMILARITY'] = True if es.indices.exists(
index=[app.config['ELASTICSEARCH_SIMILARITY']]
) else 'Index not found `%s`' % app.config['ELASTICSEARCH_SIMILARITY']
result['OPENSEARCH'] = True
result['OPENSEARCH_SIMILARITY'] = True if es.indices.exists(
index=[app.config['OPENSEARCH_SIMILARITY']]
) else 'Index not found `%s`' % app.config['OPENSEARCH_SIMILARITY']

# Redis
try:
Expand Down
8 changes: 4 additions & 4 deletions app/main/controller/similarity_async_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,18 +11,18 @@
'url': fields.String(required=False, description='url for item to be stored or queried for similarity'),
'callback_url': fields.String(required=False, description='callback_url for final search results'),
'doc_id': fields.String(required=False, description='text ID to constrain uniqueness'),
'models': fields.List(required=False, description='similarity models to use: ["elasticsearch"] (pure Elasticsearch, default) or the key name of an active model', cls_or_instance=fields.String),
'models': fields.List(required=False, description='similarity models to use: ["opensearch"] (pure OpenSearch, default) or the key name of an active model. Legacy elasticsearch model supported for migration purposes.', cls_or_instance=fields.String),
'language': fields.String(required=False, description='language code for the analyzer to use during the similarity query (defaults to standard analyzer)'),
'threshold': fields.Float(required=False, description='minimum score to consider, between 0.0 and 1.0 (defaults to 0.9)'),
'context': JsonObject(required=True, description='context'),
'fuzzy': fields.Boolean(required=False, description='whether or not to use fuzzy search on GET queries (only used when model is set to \'elasticsearch\')'),
'fuzzy': fields.Boolean(required=False, description='whether or not to use fuzzy search on GET queries (only used when model is set to \'opensearch\')'),
'requires_callback': fields.Boolean(required=False, description='whether or not to trigger a callback event to the provided URL'),
})
@api.route('/<string:similarity_type>')
class AsyncSimilarityResource(Resource):
@api.response(200, 'text similarity successfully queried.')
@api.doc('Make a text similarity query. Note that we currently require GET requests with a JSON body rather than embedded params in the URL. You can achieve this via curl -X GET -H "Content-type: application/json" -H "Accept: application/json" -d \'{"text":"Some Text", "threshold": 0.5, "model": "elasticsearch"}\' "http://[ALEGRE_HOST]/text/similarity"')
sonoransun marked this conversation as resolved.
Show resolved Hide resolved
@api.doc(params={'text': 'text to be stored or queried for similarity', 'threshold': 'minimum score to consider, between 0.0 and 1.0 (defaults to 0.9)', 'model': 'similarity model to use: "elasticsearch" (pure Elasticsearch, default) or the key name of an active model'})
@api.doc('Make a text similarity query. Note that we currently require GET requests with a JSON body rather than embedded params in the URL. You can achieve this via curl -X GET -H "Content-type: application/json" -H "Accept: application/json" -d \'{"text":"Some Text", "threshold": 0.5, "model": "opensearch"}\' "http://[ALEGRE_HOST]/text/similarity"')
@api.doc(params={'text': 'text to be stored or queried for similarity', 'threshold': 'minimum score to consider, between 0.0 and 1.0 (defaults to 0.9)', 'model': 'similarity model to use: "opensearch" (pure Elasticsearch, default) or the key name of an active model'})
def post(self, similarity_type):
args = request.json
app.logger.debug(f"Args are {args}")
Expand Down
10 changes: 5 additions & 5 deletions app/main/controller/similarity_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,12 +10,12 @@
similarity_request = api.model('similarity_request', {
'text': fields.String(required=False, description='text to be stored or queried for similarity'),
'doc_id': fields.String(required=False, description='text ID to constrain uniqueness'),
'model': fields.String(required=False, description='similarity model to use: "elasticsearch" (pure Elasticsearch, default) or the key name of an active model'),
'models': fields.List(required=False, description='similarity models to use: ["elasticsearch"] (pure Elasticsearch, default) or the key name of an active model', cls_or_instance=fields.String),
'model': fields.String(required=False, description='similarity model to use: "opensearch" (pure Elasticsearch, default) or the key name of an active model. Legacy elasticsearch model supported for migration purposes.'),
sonoransun marked this conversation as resolved.
Show resolved Hide resolved
'models': fields.List(required=False, description='similarity models to use: ["opensearch"] (pure Elasticsearch, default) or the key name of an active model. Legacy elasticsearch model supported for migration purposes.', cls_or_instance=fields.String),
sonoransun marked this conversation as resolved.
Show resolved Hide resolved
'language': fields.String(required=False, description='language code for the analyzer to use during the similarity query (defaults to standard analyzer)'),
'threshold': fields.Float(required=False, description='minimum score to consider, between 0.0 and 1.0 (defaults to 0.9)'),
'context': JsonObject(required=False, description='context'),
'fuzzy': fields.Boolean(required=False, description='whether or not to use fuzzy search on GET queries (only used when model is set to \'elasticsearch\')'),
'fuzzy': fields.Boolean(required=False, description='whether or not to use fuzzy search on GET queries (only used when model is set to \'opensearch\')'),
sonoransun marked this conversation as resolved.
Show resolved Hide resolved
})
@api.route('/')
class SimilarityResource(Resource):
Expand All @@ -42,8 +42,8 @@ def post(self):
@api.route('/search/')
class SimilaritySearchResource(Resource):
@api.response(200, 'text similarity successfully queried.')
@api.doc('Make a text similarity query. Note that we currently require GET requests with a JSON body rather than embedded params in the URL. You can achieve this via curl -X GET -H "Content-type: application/json" -H "Accept: application/json" -d \'{"text":"Some Text", "threshold": 0.5, "model": "elasticsearch"}\' "http://[ALEGRE_HOST]/text/similarity"')
@api.doc(params={'text': 'text to be stored or queried for similarity', 'threshold': 'minimum score to consider, between 0.0 and 1.0 (defaults to 0.9)', 'model': 'similarity model to use: "elasticsearch" (pure Elasticsearch, default) or the key name of an active model'})
@api.doc('Make a text similarity query. Note that we currently require GET requests with a JSON body rather than embedded params in the URL. You can achieve this via curl -X GET -H "Content-type: application/json" -H "Accept: application/json" -d \'{"text":"Some Text", "threshold": 0.5, "model": "opensearch"}\' "http://[ALEGRE_HOST]/text/similarity"')
@api.doc(params={'text': 'text to be stored or queried for similarity', 'threshold': 'minimum score to consider, between 0.0 and 1.0 (defaults to 0.9)', 'model': 'similarity model to use: "opensearch" (pure Elasticsearch, default) or the key name of an active model'})
def post(self):
args = request.json
app.logger.debug(f"Args are {args}")
Expand Down
Loading
Loading