MAXINELSX/pysarg

 
 


Pysarg

Python implementation of ARGs_OAP.

Warning

This repo is for testing only; please do not use it!

Installation

Pre-compiled conda packages (osx-64/linux-64, python>=3.8).

conda install -c bioconda -c conda-forge xinehc::pysarg

If you encounter dependency conflicts, or your current Python version is lower than 3.8, you may want to create a new conda environment with a name of your choice (-n pysarg is used here as an example), then switch to it with conda activate.

conda create -n pysarg -c bioconda -c conda-forge xinehc::pysarg
conda activate pysarg

Pysarg depends on python>=3.8, diamond>=2.0.15, bwa>=0.7.17, blast>=2.12, samtools>=1.15, pandas>=1.4. If your system already has all the dependencies, you can build Pysarg from source.

git clone https://github.com/xinehc/pysarg
cd pysarg
python setup.py install # use python3 if needed

Example

Download the example files

Two examples (100k paired-end reads, 100 bp each) can be found here. The compressed archive can be downloaded with wget:

wget https://dl.dropboxusercontent.com/s/054ufvfahchfk7f/example.tar.gz
tar -xvf example.tar.gz
cd example

Step 1: Make database

Pysarg supports both protein (prot) and nucleotide (nucl) databases. By default, it uses the full version of the SARG v3.0 database and the corresponding structure file as input to build a database named sarg. If you are interested in a customized database or structure (e.g. SARG without the multidrug resistance type), you can change the default parameters --input, --struc and --db (see pysarg makedb --help for more details). The type of a customized database (prot or nucl) is detected automatically.

Please note that the structure file --struc needs to be tab-separated and have no header. The first column of --struc needs to match the sequence IDs of --input and cannot contain any white space (BLAST treats everything before the first white space as the sequence ID). For SARG, the three columns of the hierarchical structure file are gene, subtype and type, e.g. AAB20441 AAC(3)-Ia aminoglycoside.
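The constraints on --struc listed above can be checked before running pysarg makedb. The following is a minimal sketch and not part of Pysarg; the function names read_fasta_ids and check_structure are hypothetical, and the check for a consistent number of columns is an added assumption rather than a documented requirement.

```python
def read_fasta_ids(fasta_text):
    """Collect sequence IDs: the text between '>' and the first white space."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def check_structure(struc_text, fasta_ids):
    """Return a list of problems found in a tab-separated, header-less structure file."""
    problems = []
    n_cols = None
    for lineno, line in enumerate(struc_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        cols = line.split("\t")
        if n_cols is None:
            n_cols = len(cols)  # the first row defines the expected width (assumption)
        elif len(cols) != n_cols:
            problems.append(f"line {lineno}: expected {n_cols} columns, found {len(cols)}")
        seq_id = cols[0]
        if any(ch.isspace() for ch in seq_id):
            problems.append(f"line {lineno}: white space in first column {seq_id!r}")
        elif seq_id not in fasta_ids:
            problems.append(f"line {lineno}: ID {seq_id!r} not found in the input sequences")
    return problems


if __name__ == "__main__":
    fasta = ">AAB20441 example description\nMKT\n"
    struc = "AAB20441\tAAC(3)-Ia\taminoglycoside\n"
    print(check_structure(struc, read_fasta_ids(fasta)))
```

A file that uses spaces instead of tabs is reported as having white space in its first column, because the whole line is then read as a single field.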

pysarg makedb
# pysarg makedb --input *.fa --struc *.txt --db yourfavdb 

After pysarg makedb, information on the available databases is printed on screen, for example:

db    size      type  directory
sarg  28517690  prot  .../pysarg/DB/sarg
gg85  15295160  nucl  .../pysarg/DB/gg85
ko30  7159023   prot  .../pysarg/DB/ko30

Databases gg85 and ko30 are the default databases for quantifying the 16S rRNA or cell numbers in samples; they are used in stageone (see below). Database sarg (or a customized yourfavdb) is used in stagetwo (see below). If you no longer want to use a database, please delete it (rm -rf) using the absolute file path given in the table.

Run stageone

The 16S rRNA and cell numbers of each file in inputfqs are estimated in stageone. Compressed .gz files are not supported by default, as they may slow down overall I/O; please decompress them first, e.g. with gzip -d.
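If gzip is not at hand, the same decompression step can be sketched with the Python standard library. This is an illustrative helper, not a Pysarg command; the name decompress_gz_files is hypothetical. Unlike gzip -d, it keeps the original .gz files.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz_files(directory):
    """Decompress every *.gz file in `directory`; return the paths written."""
    written = []
    for gz_path in sorted(Path(directory).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # e.g. sample_1.fq.gz -> sample_1.fq
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large files are not held in memory
        written.append(out_path)
    return written
```

For example, decompress_gz_files("inputfqs") would write the decompressed reads next to the compressed ones, assuming inputfqs is the directory later passed to stageone.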

If reads are paired, the forward and reverse files must be named *{_1, _2}{.fa, .fq, .fasta, .fastq}. The extension (.fa/.fasta or .fq/.fastq) is detected automatically.
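To see in advance which files will be recognized as forward/reverse mates, the naming rule above can be expressed as a regular expression. This is a sketch of the rule as described here, not Pysarg's own matching code; PAIR_PATTERN and group_paired_files are hypothetical names.

```python
import re
from collections import defaultdict

# <sample>_1 or <sample>_2, followed by one of the four supported extensions.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.(?:fa|fq|fasta|fastq)$")


def group_paired_files(filenames):
    """Return ({sample: {mate: filename}}, [names that do not follow the paired pattern])."""
    pairs = defaultdict(dict)
    unmatched = []
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:
            pairs[match.group("sample")][match.group("mate")] = name
        else:
            unmatched.append(name)
    return dict(pairs), unmatched


if __name__ == "__main__":
    names = ["STAS_1.fa", "STAS_2.fa", "SWHAS104_1.fa", "SWHAS104_2.fa"]
    pairs, unmatched = group_paired_files(names)
    for sample, mates in pairs.items():
        print(sample, sorted(mates.values()))
```

Names in the unmatched list are not necessarily invalid input; they simply do not follow the paired-end naming scheme.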

pysarg stageone -i inputfqs -o outputdir --clean

After stageone, a metadata.txt file can be found in outputdir (if the --clean flag is given, all temporary files are removed and the metadata file should be the only one left in outputdir). It summarizes the 16S and cell numbers of each sample, for example:

filename    n_reads  n_16s               n_cells             filepath
STAS_1      100000   4.174475545952642   1.6004431914056478  .../example/inputdir/STAS_1.fa
STAS_2      100000   4.054822333841412   1.545260664720151   .../example/inputdir/STAS_2.fa
SWHAS104_1  100000   3.4944371029452244  1.7516589512278578  .../example/inputdir/SWHAS104_1.fa
SWHAS104_2  100000   3.515110704179947   1.7600058056139576  .../example/inputdir/SWHAS104_2.fa

Please check whether column filepath contains all files in your inputdir. If not, please make sure the extensions of the missing files are in {.fa, .fq, .fasta, .fastq}.
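This check can be automated with a few lines of Python. The sketch below assumes that metadata.txt is tab-separated with the header shown in the example above (an assumption, as the delimiter is not stated here); missing_from_metadata is a hypothetical helper name.

```python
import csv
import io
from pathlib import Path

SUPPORTED_EXTENSIONS = {".fa", ".fq", ".fasta", ".fastq"}


def missing_from_metadata(metadata_text, input_filenames):
    """Return (missing, unsupported): input files absent from the filepath column,
    and the subset of those whose extension is not supported."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    listed = {Path(row["filepath"]).name for row in reader}
    missing = sorted(set(input_filenames) - listed)
    unsupported = [name for name in missing if Path(name).suffix not in SUPPORTED_EXTENSIONS]
    return missing, unsupported
```

A typical call would pass Path("outputdir/metadata.txt").read_text() together with the file names found in the input directory; a file that is missing but has a supported extension points to a different problem than a wrong extension.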

Run stagetwo

The number of ARGs (or other types of genes, if you use a different database) is estimated in stagetwo. The input directory of stagetwo (parameter -i) needs to contain the metadata.txt produced by stageone (outputdir in the example above). If no output directory (-o) is given, stagetwo saves everything to -i. By default, stagetwo uses the sarg database; if you want to use a customized database, please change parameter -d.

pysarg stagetwo -i outputdir --clean
# pysarg stagetwo -i outputdir -o otheroutputdir -d yourfavdb --clean

After stagetwo, the normalized ARG copies (or copies of other types of genes) per 16S/cell, or hits per read, are written to several *_normalized_*.txt files. For example, the name sarg_normalized_16s_struc2.txt breaks down as:

  • sarg - the database name (default sarg)
  • normalized_16s - hits are normalized against 16S rRNA
  • struc2 - the coarsest level of the hierarchical structure file. In the sarg case, struc0 means genes, struc1 means subtypes and struc2 means types.

struc2          STAS                   SWHAS104
MLS             0.0                    0.02062257391617481
aminoglycoside  0.016202273302162947   0.0702348721175952
bacitracin      0.014243756749154238   0.029022199416685344
beta-lactam     0.0                    0.07435959439262274
multidrug       0.014763429463243433   0.042417186801122636
mupirocin       0.0029667248478081566  0.004557467877130284
quinolone       0.14468645528642876    0.04399873193528583
sulfonamide     0.013452071929840085   0.06808763199694166
tetracycline    0.004659396079993178   0.04969500656817937

Please note that the forward/reverse files are merged in stagetwo, so only two columns (two samples) are available in the example.
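For downstream analysis, the output file names and tables can be parsed with the standard library. The sketch below assumes the <db>_normalized_<normalization>_<level>.txt naming pattern suggested by the example and a tab-separated table with one column per sample; both are assumptions, and parse_output_name and top_category_per_sample are hypothetical helpers.

```python
import csv
import io


def parse_output_name(filename):
    """Split e.g. 'sarg_normalized_16s_struc2.txt' into (db, normalization, level)."""
    stem = filename[:-len(".txt")] if filename.endswith(".txt") else filename
    db, _, rest = stem.partition("_normalized_")
    normalization, _, level = rest.rpartition("_")
    return db, normalization, level


def top_category_per_sample(table_text):
    """Return {sample: (category, value)} holding the largest value in each sample column."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    samples = next(reader)[1:]  # first header cell names the structure level
    best = {sample: (None, float("-inf")) for sample in samples}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        for sample, cell in zip(samples, row[1:]):
            value = float(cell)
            if value > best[sample][1]:
                best[sample] = (row[0], value)
    return best
```

Applied to the example table, this would report the type with the highest normalized abundance for each of the two merged samples.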

Change log
