Update aggregation to match ancestry output #79

nebfield · 2024-02-20T16:06:27Z

Doing this in the quarto report with data.table was causing problems with building the dockerfile.

Also, this fixes an issue where the DENOM column is missing when multiple custom scoring files are aggregated.

smlmbrt

Will need to change read_pgs in ancestry_analysis, as it reads in the old version of the aggregated scores.

pgscatalog_utils/pgscatalog_utils/ancestry/read.py

Lines 74 to 87 in c672be7

    def read_pgs(loc_aggscore, onlySUM: bool):
        """
        Function to read the output of aggreagte_scores
        :param loc_aggscore: path to aggregated scores output
        :param onlySUM: whether to return only _SUM columns (e.g. not _AVG)
        :return:
        """
        logger.debug('Reading aggregated score data: {}'.format(loc_aggscore))
        df = pd.read_csv(loc_aggscore, sep='\t', index_col=['sampleset', 'IID'], converters={"IID": str}, header=0)
        if onlySUM:
            df = df[[x for x in df.columns if x.endswith('_SUM')]]
            rn = [x.rstrip('_SUM') for x in df.columns]
            df.columns = rn
        return df
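
A side note on the snippet above, since the follow-up commits mention "provides the correct column names back": `str.rstrip('_SUM')` removes any trailing run of the characters `_`, `S`, `U` and `M`, not the literal suffix `_SUM`. The sketch below illustrates the difference. The helper name `strip_sum_suffix` and the custom score name `BMI_PLUS` are hypothetical and only for illustration; this is not necessarily how the PR changed `read_pgs`.

```python
def strip_sum_suffix(column: str) -> str:
    """Return the column name without a trailing '_SUM', if present."""
    suffix = "_SUM"
    return column[: -len(suffix)] if column.endswith(suffix) else column


# Hypothetical column names: one catalog-style ID and one custom score name
columns = ["PGS000001_SUM", "BMI_PLUS_SUM", "PGS000001_AVG"]

# rstrip treats its argument as a set of characters to strip from the right
naive = [c.rstrip("_SUM") for c in columns if c.endswith("_SUM")]

# Slicing off the exact suffix keeps the rest of the name intact
safe = [strip_sum_suffix(c) for c in columns if c.endswith("_SUM")]

print(naive)  # ['PGS000001', 'BMI_PL']
print(safe)   # ['PGS000001', 'BMI_PLUS']
```

On Python 3.9 or later, `str.removesuffix("_SUM")` does the same as the helper. Catalog-style IDs that end in a digit are unaffected by the `rstrip` behaviour, so the difference only shows up for names that end in one of the stripped characters.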

Signed-off-by: smlmbrt <[email protected]>

smlmbrt

I think my last commit solves it

* update vulnerable dependencies
* Update aggregation to match ancestry output (#79)
* match ancestry aggregation output
* bump version
* fix column name (accession -> PGS)
* fix column name
* add aggregate tests
* fix not respecting outdir
* read new version of pgs
* drop onlySUM parameter
* Make sure it only reads SUM and provides the correct column names back

Signed-off-by: smlmbrt <[email protected]>

* drop deprecated parameter

---------

Signed-off-by: smlmbrt <[email protected]>
Co-authored-by: smlmbrt <[email protected]>

---------

Signed-off-by: smlmbrt <[email protected]>
Co-authored-by: smlmbrt <[email protected]>

match ancestry aggregation output

2a8ae0e

nebfield requested a review from smlmbrt February 20, 2024 16:29

nebfield self-assigned this Feb 20, 2024

nebfield mentioned this pull request Feb 21, 2024

Drop aggregation during report generation PGScatalog/pgsc_calc#249

Merged

nebfield added 3 commits February 21, 2024 09:22

bump version

7f7da3d

fix column name (accession -> PGS)

8b72fff

fix column name

c816125

smlmbrt requested changes Feb 21, 2024


nebfield added 4 commits February 21, 2024 11:04

add aggregate tests

236a1b6

fix not respecting outdir

9a6d0d7

read new version of pgs

a4b9fd9

drop onlySUM parameter

df52a69

nebfield requested a review from smlmbrt February 21, 2024 11:36

Make sure it only reads SUM and provides the correct column names back

707a268

Signed-off-by: smlmbrt <[email protected]>

smlmbrt approved these changes Feb 21, 2024


drop deprecated parameter

e9397c5

nebfield merged commit a4df14d into dev Feb 21, 2024
1 check passed

nebfield deleted the fix-aggregate branch February 21, 2024 12:34

nebfield mentioned this pull request Feb 21, 2024

Make aggregated output consistent #72

Closed
