add unique_id to subjects table to track subject across ML workloads #22

camallen · 2022-09-23T12:41:52Z

GZ uses a unique_id string in the subject metadata to uniquely identify the subject in the research domain context, for example:

Lines 68 to 74 in bcb057c

    
    def unique_id
      unique_id = payload.dig('subject', 'metadata', '#name')
      return unique_id if unique_id

      # staging has older data with different subject metadata - fallback to handling this special env case
      payload.dig('subject', 'metadata', '!SDSS_ID') if Rails.env.staging? || Rails.env.test?
    end

This is the data that flows into the catalogues and ML systems to uniquely identify subjects, not the subject_id in our systems. As such, we'll need to add this attribute to the subjects table with a unique index and backfill it when importing the subject data into the system.

One solution is to add the metadata import to the subject backfiller job:

kade/app/sidekiq/subject_backfiller_job.rb

Line 8 in bcb057c

    Import::SubjectLocations.new(subject).run

Alternatively, this metadata comes through via the Caesar reductions payload, so we can extract the information from that flow of data as it arrives.
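A minimal sketch of the payload-driven option, assuming the payload is a hash shaped like the one the `unique_id` method above reads from. `SubjectRecord`, `zooniverse_subject_id`, `extract_unique_id` and `backfill_unique_id` are hypothetical names used for illustration only, the identifier value is made up, and a plain Struct stands in for the real model so the snippet runs without Rails.

```ruby
# Hypothetical stand-in for the Subject model (not the real kade class).
SubjectRecord = Struct.new(:zooniverse_subject_id, :unique_id, keyword_init: true)

# Read the research-domain identifier from the payload, mirroring the
# '#name' lookup quoted above (the staging fallback is left out here).
def extract_unique_id(payload)
  payload.dig('subject', 'metadata', '#name')
end

# Set unique_id only when the record lacks one and the payload carries one,
# so repeated payloads for the same subject never overwrite a stored value.
def backfill_unique_id(subject, payload)
  return subject if subject.unique_id

  unique_id = extract_unique_id(payload)
  subject.unique_id = unique_id if unique_id
  subject
end

# Made-up payload for illustration.
payload = { 'subject' => { 'id' => 1, 'metadata' => { '#name' => 'J000000.00+000000.0' } } }
subject = backfill_unique_id(SubjectRecord.new(zooniverse_subject_id: 1), payload)
```

The same helper could be called from either the backfiller job or the reductions import path. Which one is preferable is the open question in this issue.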

We can then use this field to uniquely identify the subject linkage when importing or upserting ML results (vector representations, predictions, etc.).
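To illustrate the linkage, here is a rough in-memory sketch of upserting results keyed by unique_id. `PredictionStore` and its methods are hypothetical and only model the behaviour; the hash keyed by unique_id stands in for a table with a unique index on that column.

```ruby
# Hypothetical in-memory model of result rows keyed by unique_id.
class PredictionStore
  # subjects_by_unique_id maps unique_id => internal subject id.
  def initialize(subjects_by_unique_id)
    @subjects = subjects_by_unique_id
    @rows = {}
  end

  # Insert a new row or update the existing one for this unique_id.
  # Returns false when no subject is known for the identifier.
  def upsert(unique_id, attributes)
    subject_id = @subjects[unique_id]
    return false unless subject_id

    row = (@rows[unique_id] ||= { subject_id: subject_id })
    row.merge!(attributes)
    true
  end

  def find(unique_id)
    @rows[unique_id]
  end

  def count
    @rows.size
  end
end

# Made-up identifiers and values for illustration.
store = PredictionStore.new({ 'example-unique-id' => 42 })
store.upsert('example-unique-id', { prediction: 0.8 })
store.upsert('example-unique-id', { prediction: 0.9 }) # updates the same row
```

In the Rails app itself, ActiveRecord's `upsert_all` with the `unique_by:` option could play this role once the unique index exists, though that is an assumption about how the import would be written.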
