Usage: Model Estimation & Analysis

Jordi Abante edited this page Jun 13, 2018 · 3 revisions

Command:

informME_run.sh [OPTIONS] MAT_FILES PHENO CHR_NUM

This step consists of two phases. In the first phase, informME learns the parameters of the Ising probability distribution by combining the methylation data matrices provided through the argument MAT_FILES (a comma-separated list) for chromosome number CHR_NUM. By default, the MAT_FILES are expected to be in a subdirectory of INTERDIR named after CHR_NUM, and the output of this phase is stored in a subdirectory of INTERDIR named after CHR_NUM as well. The output file name is PHENO followed by the suffix '_fit.mat' (e.g. if PHENO is 'normal' and CHR_NUM is 10, the output is stored as INTERDIR/chr10/normal_fit.mat). The file contains the following information:

  • CpG distances

  • CpG densities

  • estimated alpha, beta, and gamma parameters of the Ising model

  • initial and transition probabilities of the inhomogeneous Markov chain representation of the Ising model

  • marginal probabilities at each CpG site

  • the log partition function of the estimated Ising model
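
The naming convention described above can be sketched as a small shell helper. This is only an illustration: the function name `fit_path` is hypothetical and not part of informME, the INTERDIR value is a placeholder, and the `chr` prefix on the subdirectory is taken from the example path given above.

```shell
# Hypothetical helper (not part of informME): print the path where the
# estimation phase is expected to write its output, following the
# INTERDIR/chr<CHR_NUM>/<PHENO>_fit.mat convention described above.
# Arguments: INTERDIR PHENO CHR_NUM
fit_path() {
    printf '%s/chr%s/%s_fit.mat\n' "$1" "$3" "$2"
}

# Example with a placeholder INTERDIR, PHENO=normal and CHR_NUM=10.
fit_path "/path/to/INTERDIR" normal 10
```

With the placeholder values shown, the helper prints /path/to/INTERDIR/chr10/normal_fit.mat.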

The second phase analyzes the learned model by computing a number of statistical summaries of the methylation state, including probability distributions of methylation levels, mean methylation levels, and normalized methylation entropies, as well as mean- and entropy-based classifications. It also computes entropic sensitivity indices, methylation sensitivity indices, and information-theoretic quantities associated with methylation channels, such as turnover ratios, channel capacities, and relative dissipated energies. The output of this phase is stored in the same directory as the output of the first phase, with the same PHENO prefix but the suffix '_analysis.mat' (e.g. continuing the previous example, the output file of this phase is stored as INTERDIR/chr10/normal_analysis.mat). This file contains the following information:

  • the locations of the CpG sites within the genomic region

  • numbers of CpG sites within the analysis subregions

  • which analysis subregions are modeled and which are not

  • estimated parameters of the Ising model in the genomic region

  • methylation level probabilities in modeled subregions

  • coarse methylation level probabilities

  • mean methylation levels

  • normalized methylation entropies

  • entropic sensitivity indices

  • methylation sensitivity indices

  • turnover ratios

  • channel capacities

  • relative dissipated energies
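
Since each phase writes one file per chromosome, a quick check that both '_fit.mat' and '_analysis.mat' exist can be useful after a run. The following is a minimal sketch under the default layout described above (both files under INTERDIR/chr<CHR_NUM>/); the function name `check_outputs` is hypothetical and not part of informME, and the check does not apply if the output directories were changed through the script's options.

```shell
# Hypothetical helper (not part of informME): report which expected output
# files are missing for a phenotype and a list of chromosomes.
# Arguments: INTERDIR PHENO CHR_NUM [CHR_NUM ...]
# Returns 0 if every file exists, 1 otherwise.
check_outputs() {
    local interdir="$1" pheno="$2" missing=0 chr suffix f
    shift 2
    for chr in "$@"; do
        for suffix in fit analysis; do
            f="$interdir/chr$chr/${pheno}_${suffix}.mat"
            if [ ! -f "$f" ]; then
                echo "missing: $f"
                missing=1
            fi
        done
    done
    return "$missing"
}
```

For instance, `check_outputs "$INTERDIR" normal 10 || echo "some outputs are missing"` would list any absent file for chromosome 10 of the 'normal' phenotype.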

NOTE1: We recommend taking advantage of the array feature available in SGE and SLURM based clusters to submit an individual job for each chromosome.
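
As an illustration of the array approach, a SLURM batch script might look like the sketch below. The sample names, phenotype, and thread count are placeholders borrowed from the help-file example; the array range 1-22 is an assumption and should be adjusted to the chromosomes being processed; the `run_chr` function and the `DRY_RUN` switch are hypothetical conveniences added here, not informME features.

```shell
#!/bin/bash
#SBATCH --job-name=informME_run
#SBATCH --array=1-22          # assumption: chromosomes 1-22; adjust as needed
#SBATCH --cpus-per-task=5     # assumption: one CPU per informME thread (-q)

# Placeholder inputs taken from the help-file example.
MAT_FILES="sample2,sample3,sample4"
PHENO="pheno_2"
THREADS=5

# Run the command for one chromosome, or only print it when DRY_RUN=1.
run_chr() {
    local chr_num="$1"
    set -- informME_run.sh -q "$THREADS" "$MAT_FILES" "$PHENO" "$chr_num"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# SLURM sets SLURM_ARRAY_TASK_ID to the index of each task in the array,
# which is used here as the chromosome number.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    run_chr "$SLURM_ARRAY_TASK_ID"
fi
```

The script would be submitted with `sbatch`. On SGE, the equivalent array request is `#$ -t 1-22` and the task index is available as `$SGE_TASK_ID`.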

NOTE2: Here is the full help file from informME_run.sh:

Description:
    This function learns the parameters of the Ising model and performs methylation analysis. 
    It estimates the parameters of the Ising probability distribution used to model 
    methylation within equally sized (in base pairs) non-overlapping regions of the genome. 
    The input is expected to be in INTERDIR, and the output is also stored in INTERDIR by 
    default. The output file produced by the learning phase contains the following 
    information for each genomic region used in model estimation:
    o CpG distances
    o CpG densities
    o estimated alpha, beta, and gamma parameters of the Ising model
    o initial and transition probabilities of the inhomogeneous Markov chain representation of 
      the Ising model
    o marginal probabilities at each CpG site
    o the log partition function of the estimated Ising model
    The output file produced by the analysis phase contains the following information:
    o the locations of the CpG sites within the genomic region
    o numbers of CpG sites within the analysis subregions
    o which analysis subregions are modeled and which are not
    o estimated parameters of Ising model in genomic region
    o methylation level probabilities in modeled subregions
    o coarse methylation level probabilities
    o mean methylation levels
    o normalized methylation entropies
    o entropic sensitivity indices
    o methylation sensitivity indices
    o turnover ratios
    o channel capacities
    o relative dissipated energies

Usage:
    informME_run.sh  [OPTIONS]  MAT_FILES PHENO CHR_NUM

Mandatory arguments:
    o MAT_FILES: list of methylation matrices to be modeled
    o PHENO: prefix of output files (name of phenotype)
    o CHR_NUM: chromosome to be processed

Options:
    -h|--help           help
    -r|--refdir         directory of reference genome and CpG location files (default: $REFGENEDIR)
    -m|--matdir         matrices directory (default: $INTERDIR)
    -e|--estdir         modeling directory (default: $INTERDIR)
    -d|--outdir         output analysis directory (default: $INTERDIR)
    -q|--threads        number of threads used (default: 1)
    --tmpdir            temporary directory (default: $SCRATCHDIR)
    --time_limit        maximum time (in minutes) allowed for each thread to complete (default: 60)
    -l|--MATLICENSE     path to MATLAB's license

Example:
    * Running informME on chromosome 1 using 5 threads:
    	informME_run.sh -q 5 sample1 pheno_1 1
    * Running informME on chromosome 1 using 5 threads and 3 samples pooled into one model:
    	informME_run.sh -q 5 sample2,sample3,sample4 pheno_2 1

Output:
    MATLAB .mat file

Dependencies:
    * MATLAB
    * estimation.sh
    * mergeEstimation.sh
    * singleMethAnalysis.sh
    * mergeSingleMethAnalysis.sh

Upstream:
    getMatrices.sh

Downstream:
    singleMethAnalysisToBed.sh
    diffMethAnalysisToBed.sh

Authors:
    Garrett Jenkinson <[email protected]>
    Jordi Abante <[email protected]>