Python implementation of ARGs_OAP.
This repo is for testing only; please do not use it yet!
Pre-compiled conda packages are available (`osx-64`/`linux-64`, `python>=3.8`):
conda install -c bioconda -c conda-forge xinehc::pysarg
If you encounter dependency conflicts, or your current Python version is lower than 3.8 (`python<3.8`), create a new conda environment with a name of your choice (`-n pysarg` is used here as an example), then switch to it with `conda activate`.
conda create -n pysarg -c bioconda -c conda-forge xinehc::pysarg
conda activate pysarg
Pysarg depends on `python>=3.8`, `diamond>=2.0.15`, `bwa>=0.7.17`, `blast>=2.12`, `samtools>=1.15` and `pandas>=1.4`. If your system has all of these dependencies, you can build it from source.
git clone https://github.com/xinehc/pysarg
cd pysarg
python setup.py install # use python3 if needed
Two examples (100k paired-end reads, 100 bp each) are provided. The compressed archive can be downloaded using `wget`:
wget https://dl.dropboxusercontent.com/s/054ufvfahchfk7f/example.tar.gz
tar -xvf example.tar.gz
cd example
Pysarg supports both protein (`prot`) and nucleotide (`nucl`) databases. By default, it uses the full version of the SARG v3.0 database and the corresponding structure files as input to build a database named `sarg`. If customized databases or structures (e.g. SARG without the multidrug resistance type) are of interest, you can change the default values of `--input`, `--struc` and `--db` (see `pysarg makedb --help` for more details). The type of a customized database (`prot` or `nucl`) is detected automatically.
Please note that the structure file `--struc` needs to be tab-separated and have no header. The first column of `--struc` needs to match the sequence IDs of `--input` and cannot contain any whitespace (`blast` treats everything before the first whitespace as the sequence ID). For SARG, the three columns of the hierarchical structure file are `gene`, `subtype` and `type`, e.g. `AAB20441 AAC(3)-Ia aminoglycoside` (with tabs between the columns).
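The constraints above can be checked before running `pysarg makedb`. The following is a minimal sketch, not part of pysarg: the helper names `fasta_ids` and `check_struc` are hypothetical, and it only tests the rules stated here (tab-separated rows, no whitespace in the first column, first column present among the sequence IDs of the input). A header line is reported indirectly, because its first field will not match any sequence ID.

```python
def fasta_ids(handle):
    """Collect sequence IDs from a FASTA stream.

    As with blast, the ID is everything between '>' and the first whitespace.
    """
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def check_struc(struc_handle, seq_ids):
    """Return a list of problems found in a structure file (hypothetical helper)."""
    problems = []
    for lineno, line in enumerate(struc_handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        row = line.split("\t")
        if len(row) < 2:
            problems.append(f"line {lineno}: no tab separator found")
            continue
        first = row[0]
        if any(ch.isspace() for ch in first):
            problems.append(f"line {lineno}: whitespace in first column {first!r}")
        elif first not in seq_ids:
            problems.append(f"line {lineno}: {first!r} is not a sequence ID of the input")
    return problems
```

To use it, open the FASTA file passed to `--input` and the file passed to `--struc`, call `check_struc(struc_file, fasta_ids(fasta_file))`, and inspect the returned list; an empty list means none of the checked rules were violated.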
pysarg makedb
# pysarg makedb --input *.fa --struc *.txt --db yourfavdb
After `pysarg makedb`, information on the available databases is printed on screen, for example:
db | size | type | directory |
---|---|---|---|
sarg | 28517690 | prot | .../pysarg/DB/sarg |
gg85 | 15295160 | nucl | .../pysarg/DB/gg85 |
ko30 | 7159023 | prot | .../pysarg/DB/ko30 |
Databases `gg85` and `ko30` are the default databases for quantifying 16S rRNA and cell numbers in samples; they are used in `stageone` (see below). Database `sarg` (or a customized `yourfavdb`) is used in `stagetwo` (see below). If you no longer need a database, delete it (`rm -rf`) using the absolute file path given in the table.
The 16S rRNA and cell numbers of each file in `inputfqs` are estimated in `stageone`. Compressed `.gz` files are not supported by default, as they may slow down overall I/O; please decompress them first, e.g. with `gzip -d`.
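If you prefer to decompress from Python rather than the shell, the standard library is enough. This is a minimal sketch; the function name `gunzip_all` is hypothetical and not part of pysarg. Unlike `gzip -d`, it keeps the original `.gz` files in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(directory):
    """Decompress every *.gz file in `directory`, leaving the originals untouched.

    Returns the list of decompressed paths.
    """
    written = []
    for gz_path in sorted(Path(directory).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # "S_1.fq.gz" -> "S_1.fq"
        # Stream the data so large read files are not loaded into memory at once.
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

Note that the decompressed files land next to the archives, so you may want to move or delete the `.gz` files before pointing `stageone` at the directory.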
If reads are paired, the forward/reverse file names need to follow the pattern `*{_1, _2}{.fa, .fq, .fasta, .fastq}`. Whether the extension is `.fa`/`.fasta` or `.fq`/`.fastq` is detected automatically.
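The naming pattern can be checked up front with a regular expression. The sketch below is an illustration only: `PAIRED` and `group_pairs` are hypothetical names, and the regular expression is my reading of the pattern above, not the one pysarg uses internally.

```python
import re
from pathlib import Path

# <sample>_1 or <sample>_2, followed by one of the four supported extensions.
PAIRED = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.(fa|fq|fasta|fastq)$")


def group_pairs(filenames):
    """Group file names by sample and report names that do not fit the pattern.

    Returns (pairs, unmatched, incomplete):
      pairs      - {sample: {"1": name, "2": name}}
      unmatched  - names that do not follow *_1/_2 with a supported extension
      incomplete - samples for which only one mate was found
    """
    pairs = {}
    unmatched = []
    for name in filenames:
        match = PAIRED.match(Path(name).name)
        if match:
            pairs.setdefault(match["sample"], {})[match["mate"]] = name
        else:
            unmatched.append(name)
    incomplete = sorted(s for s, mates in pairs.items() if set(mates) != {"1", "2"})
    return pairs, unmatched, incomplete
```

For example, passing the names of all files in your input directory shows at a glance which samples are missing a mate and which files (such as a leftover `.gz`) do not fit the pattern.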
pysarg stageone -i inputfqs -o outputdir --clean
After `stageone`, a `metadata.txt` file can be found in `outputdir` (if the `--clean` flag is given, all temporary files are removed and the metadata file should be the only file left in `outputdir`). It summarizes the 16S and cell numbers of each sample, for example:
filename | n_reads | n_16s | n_cells | filepath |
---|---|---|---|---|
STAS_1 | 100000 | 4.174475545952642 | 1.6004431914056478 | .../example/inputdir/STAS_1.fa |
STAS_2 | 100000 | 4.054822333841412 | 1.545260664720151 | .../example/inputdir/STAS_2.fa |
SWHAS104_1 | 100000 | 3.4944371029452244 | 1.7516589512278578 | .../example/inputdir/SWHAS104_1.fa |
SWHAS104_2 | 100000 | 3.515110704179947 | 1.7600058056139576 | .../example/inputdir/SWHAS104_2.fa |
Please check whether the `filepath` column contains all files in your input directory. If not, make sure the extension of each missing file is one of `.fa`, `.fq`, `.fasta` or `.fastq`.
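This check can be scripted. The sketch below rests on an assumption that is not confirmed here: that `metadata.txt` is tab-delimited with a header row containing a `filepath` column, as the table above suggests. The function name `missing_from_metadata` is hypothetical.

```python
import csv
from pathlib import Path

EXTENSIONS = {".fa", ".fq", ".fasta", ".fastq"}


def missing_from_metadata(metadata_path, input_dir):
    """List input files that do not appear in the `filepath` column.

    Assumes a tab-delimited metadata file with a header row (an assumption).
    Returns (file name, has_supported_extension) tuples, sorted by name.
    """
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        listed = {Path(row["filepath"]).name for row in reader}
    present = {p.name for p in Path(input_dir).iterdir() if p.is_file()}
    return [(name, Path(name).suffix in EXTENSIONS) for name in sorted(present - listed)]
```

A file reported with `False` has an unsupported extension (for instance a still-compressed `.gz`), which is the first thing to fix.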
The number of ARGs (or other types of genes if you use a different database) is estimated in `stagetwo`. The input directory (parameter `-i`) of `stagetwo` needs to contain the `metadata.txt` produced by `stageone` (`outputdir` in the example above). If no `-o` (output) is given, `stagetwo` saves everything to `-i`. By default, `stagetwo` uses the `sarg` database; to use a customized database, change parameter `-d`.
pysarg stagetwo -i outputdir --clean
# pysarg stagetwo -i outputdir -o otheroutputdir -d yourfavdb --clean
After `stagetwo`, the normalized ARG copies (or copies of other types of genes) per 16S/cell, or hits per read, are written to several `*_normalized_*.txt` files. For example, the name `sarg_normalized_16s_struc2.txt` means:
- `sarg` - the database name (default `sarg`)
- `normalized_16s` - hits are normalized against 16S rRNA
- `struc2` - the coarsest level of the hierarchical structure file. In the `sarg` case, `struc0` means genes, `struc1` means subtypes and `struc2` means types.
struc2 | STAS | SWHAS104 |
---|---|---|
MLS | 0.0 | 0.02062257391617481 |
aminoglycoside | 0.016202273302162947 | 0.0702348721175952 |
bacitracin | 0.014243756749154238 | 0.029022199416685344 |
beta-lactam | 0.0 | 0.07435959439262274 |
multidrug | 0.014763429463243433 | 0.042417186801122636 |
mupirocin | 0.0029667248478081566 | 0.004557467877130284 |
quinolone | 0.14468645528642876 | 0.04399873193528583 |
sulfonamide | 0.013452071929840085 | 0.06808763199694166 |
tetracycline | 0.004659396079993178 | 0.04969500656817937 |
Please note that the forward/reverse files are merged in `stagetwo`, so only two columns (two samples) appear in the example.
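When post-processing several output files, it can help to split each file name back into its parts. The sketch below assumes that all outputs follow the form `<db>_normalized_<unit>_<level>.txt`, generalized from the single example name given above; the function name `parse_output_name` is hypothetical.

```python
def parse_output_name(filename):
    """Split an output file name into database, normalization unit and level.

    Assumes the form <db>_normalized_<unit>_<level>.txt (an assumption based on
    the one example name documented above).
    """
    stem = filename[:-4] if filename.endswith(".txt") else filename
    db, sep, rest = stem.partition("_normalized_")
    if not sep:
        raise ValueError(f"not a normalized output file: {filename!r}")
    # The level (struc0, struc1, ...) is the last underscore-separated field.
    unit, _, level = rest.rpartition("_")
    return {"db": db, "normalized_by": unit, "level": level}


print(parse_output_name("sarg_normalized_16s_struc2.txt"))
```

For the example name this yields the database `sarg`, the unit `16s` and the level `struc2`, matching the three parts described in the list above.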