add unique_id to subjects table to track subject across ML workloads #22

Open
camallen opened this issue Sep 23, 2022 · 0 comments

GZ uses a unique_id string in the subject metadata to uniquely identify the subject in the research domain context, e.g.

def unique_id
  unique_id = payload.dig('subject', 'metadata', '#name')
  return unique_id if unique_id

  # staging has older data with different subject metadata - fallback to handling this special env case
  payload.dig('subject', 'metadata', '!SDSS_ID') if Rails.env.staging? || Rails.env.test?
end

This is the data that flows into the catalogues and ML systems to uniquely identify the subjects, not the subject_id in our systems. As such, we'll need to add this attribute to the subjects table with a unique index and backfill it when importing the subject data to the system.
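
Before a unique index goes on, the backfill probably needs to surface subjects with no unique_id and subjects that share one. A minimal plain-Ruby sketch of that check follows. SubjectRow and backfill_report are hypothetical names, an in-memory list stands in for the subjects table, and only the '#name' metadata key comes from the snippet above.

```ruby
# Hypothetical in-memory stand-in for a row of the subjects table.
SubjectRow = Struct.new(:id, :metadata, :unique_id)

# Fill unique_id from the '#name' metadata key, then report rows that would
# block a unique index: those with no value and those sharing a value.
def backfill_report(rows)
  rows.each { |row| row.unique_id = row.metadata['#name'] }

  missing = rows.select { |row| row.unique_id.nil? }.map(&:id)
  duplicates = rows
    .reject { |row| row.unique_id.nil? }
    .group_by(&:unique_id)
    .select { |_unique_id, group| group.size > 1 }
    .transform_values { |group| group.map(&:id) }

  { missing: missing, duplicates: duplicates }
end
```

An empty report (no missing ids, no duplicates) would suggest the unique index can be added safely; otherwise the listed subject ids need a decision first.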

One solution is to add the metadata import to the subject backfiller job, Import::SubjectLocations.new(subject).run. Alternatively, this metadata comes through in the Caesar reductions payload, so we can use this flow of data to extract the information as it arrives.
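
The payload route could look roughly like the sketch below, which records the unique_id the first time a payload for a subject is seen. UniqueIdStore is a hypothetical in-memory stand-in for the subjects table, and the payload.dig('subject', 'id') key is an assumption about the payload shape; only the '#name' metadata path is taken from the snippet above.

```ruby
# Hypothetical in-memory store mapping subject id => unique_id.
class UniqueIdStore
  def initialize
    @by_subject_id = {}
  end

  # Record the unique_id carried by a payload, keeping any value already stored.
  # Returns the stored unique_id, or nil when the payload lacks either id.
  def record(payload)
    subject_id = payload.dig('subject', 'id') # assumed payload key
    unique_id = payload.dig('subject', 'metadata', '#name')
    return nil if subject_id.nil? || unique_id.nil?

    @by_subject_id[subject_id] ||= unique_id
  end

  def unique_id_for(subject_id)
    @by_subject_id[subject_id]
  end
end
```

Keeping the first value seen is a design choice for the sketch only; whether a later payload should be allowed to overwrite the stored unique_id is left open here.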

We can then use this field to uniquely identify the subject linkage when importing / upserting ML results (vector representations, predictions, etc.).
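
As an illustration of that linkage, here is a plain-Ruby sketch that resolves each incoming ML result to a subject through unique_id and merges it into whatever is already stored. The method name, the two hashes, and the :prediction / :embedding keys are all hypothetical; a real import would go through the database rather than in-memory hashes.

```ruby
# subject_ids_by_unique_id: unique_id => subject id (hypothetical lookup)
# results_by_subject_id:    subject id => stored ML result hash (mutated in place)
# incoming:                 array of hashes, each carrying a :unique_id key
# Returns the unique_ids that matched no known subject.
def upsert_ml_results(subject_ids_by_unique_id, results_by_subject_id, incoming)
  unmatched = []

  incoming.each do |row|
    subject_id = subject_ids_by_unique_id[row[:unique_id]]
    if subject_id.nil?
      unmatched << row[:unique_id]
      next
    end

    # Insert when nothing is stored yet, otherwise overwrite the supplied keys.
    attributes = row.reject { |key, _value| key == :unique_id }
    existing = results_by_subject_id.fetch(subject_id, {})
    results_by_subject_id[subject_id] = existing.merge(attributes)
  end

  unmatched
end
```

Returning the unmatched unique_ids, rather than raising, lets the caller decide whether an unknown subject is an error or just data that has not been imported yet.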
