Build embeddings with the load data script. #4402
base: master
keyurva commented Jun 26, 2024
- Ran the script with the floret dataset and here are the build embeddings results.
# TODO: try using the large model
-f https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
en_core_web_sm==3.7.1
@shifucun - Needed to add this to the embeddings requirements to eliminate installing the nl_server requirements. Let me know if this is ok.
I was even thinking of getting rid of the lib here and making a shared requirements.txt. Duplicating the lib makes the versions diverge very easily. I saw a bug before due to this.
)
local start_ts=$(date +%s)
set -x
python -m tools.nl.embeddings.build_embeddings \
@ajaits - The run_cmd function did not work for argument values with spaces, so I'm calling the script directly here. Let me know if there's a better way, or whether we can go with this.
If there are no spaces within the JSON values in CUSTOM_CATALOG_DICT, you can try:
cmd=(python -m tools.nl.embeddings.build_embeddings)
cmd+=(--embeddings_name "$CUSTOM_EMBEDDING_INDEX")
cmd+=(--output_dir "$NL_EMBEDDINGS_DIR")
cmd+=(--catalog "$(echo $CUSTOM_CATALOG_DICT | sed -e 's/ //g')")
run_cmd ${cmd[@]}
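Whether spaces survive depends on how the array is expanded: an unquoted ${cmd[@]} is word-split again, while "${cmd[@]}" passes each element as exactly one argument. A minimal sketch of the difference, assuming a hypothetical count_args helper standing in for run_cmd (whose real definition is not shown in this thread) and a made-up catalog value:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for run_cmd: prints how many arguments it received.
count_args() { echo "$#"; }

# Illustrative value containing spaces (not the real CUSTOM_CATALOG_DICT).
catalog='{"name": "my index", "version": "1"}'

cmd=(python -m tools.nl.embeddings.build_embeddings)
cmd+=(--catalog "$catalog")

# Unquoted expansion: the JSON value is split on whitespace into several words.
unquoted=$(count_args ${cmd[@]})
# Quoted expansion: one argument per array element, spaces preserved.
quoted=$(count_args "${cmd[@]}")

echo "unquoted=$unquoted quoted=$quoted"
```

Quoting the expansion only helps if run_cmd itself forwards its arguments with "$@"; that is an assumption about a function not shown here. If it does, stripping spaces with sed may be unnecessary.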
--output_dir "$NL_EMBEDDINGS_DIR" \
--catalog "$CUSTOM_CATALOG_DICT" >> $LOG 2>&1
set +x
status=$?
Move this before set +x, right after the python command, so we get the exit status of python (not of set).
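The reason is that $? always holds the exit status of the most recently executed command, so anything placed between the Python call and status=$? overwrites it. A minimal sketch, using false as a hypothetical stand-in for a failing Python command:

```shell
#!/usr/bin/env bash
# Keep going after a failing command so its status can be inspected.
set +e

# `false` stands in for a failing python command (exit status 1).

# Status read after `set +x`: this captures the status of `set`, which succeeds.
false
set +x
status_after_set=$?

# Status read immediately after the command: this captures the real failure.
false
status_direct=$?
set +x

echo "after set: $status_after_set, direct: $status_direct"
```

Capturing the status on the very next line keeps the check correct even if tracing or logging commands are added around the call later.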
Thanks - done.
Thanks Keyur!
Added @pradh to also take a look. We had some offline discussion about maintaining and testing the script. Would like to get more opinions.