tackling problems in medical genomics.


MAMBA is a computer program for the analysis and interpretation of genomics data with an emphasis on understanding the genetic basis of biomedical traits.

Download and dependencies


The following libraries and packages are required by MAMBA:

URGENT NOTE: We recommend download and installation of Anaconda Python 2.7.

MAMBA Download


This software is supplied without any warranty or guaranteed support whatsoever. University of Oxford cannot be made responsible for its use, misuse, or functionality.

Python egg distribution

A Python egg distribution and PyPLINK/SEQ linkers. Check your Linux environment. If not listed below please contact us.


Installing MAMBA and its dependencies.

In order to install MAMBA setuptools should be installed:

$ tar xvfz mamba.1.0.0.[environment].tar.gz
$ cd mamba-1.0.0;
# Make sure to run in bash
$ bash;
$ export MAMBA=/path/to/anaconda/bin/;
#WARNING: DON'T EXPORT TO LITERAL /path/to/anaconda/...
$ export PYTHONPATH=/path/to/anaconda/lib/python2.7/site-packages/:${PYTHONPATH};
$ wget http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg;
$ sh setuptools-0.6c11-py2.7.egg --install-dir=$MAMBA;
$ export LD_LIBRARY_PATH=[current-directory]/python/dist/:${LD_LIBRARY_PATH};

## We include linker files for protobuf dependency, boost, and pyplinkseq dependencies.

$ ./configure --prefix=/path/to/anaconda; $ make clean ; make uninstall # Clear up $ make ; make install $ export PATH=$MAMBA:${PATH};

## You may want to add the export statements to a .my.bashrc file.

A list of options can be obtained with mamba -h.

To run some test data please visit the SEM/GEM page.

Rare variant association studies

With advances in sequencing technologies rare variant association analysis are presenting new challenges and opportunities. In MAMBA we implement statistical methods for the analysis of multiple study designs including case-control, continuous trait, cross-disorder, and cross-phenotype analysis. Zuk, Lander, and colleagues present a nice manuscript on designing rare variant association studies.

Protein truncating variants and quantitative traits

SEM (similar effects model) and GEM (grouped effects model) are Bayesian statistical methods for the analysis of protein truncating variants and quantative traits introduced in Rivas, Pirinen et al Bioinformatics 2013.

In order to run SEM and GEM analyses we make available some test data:

$ mamba --project mambaproj --locdb /PATH/TO/locdb --seqdb /PATH/TO/seqdb --mask "loc.group=refseq" --locset refseq --phenotype phe  --perm 1 --a 1.5 --t .7 --module SEMGEM --fileout testSEMGEM
where mambaproj is a PLINK/SEQ project; locdb and seqdb are reference LOCDB and SEQDB databases; and --mask are mask specifications ; --a and --t are hyperparameters for the priors.

To prepare a PLINK/SEQ project we can run the following commands:

NOTE: PLINK/SEQ v0.09 can be downloaded from PLINK/SEQ DOWNLOAD PAGE

# Create new PLINK/SEQ project.
$ pseq mambaproj new-project --vcf semgemex1.vcf --resources /path/to/pseq/resources/ 
# Load VCF file.
$ pseq mambaproj load-vcf
# Load Phenotype file.
$ pseq mambaproj load-pheno --file semgem.phe 

Results Output.

Results output for example should have the following header and results:

NM_017774 20883687 chr6:20534749..21232625 978388.8562 652473.2522 652473.2522 5.9905 5.8146 5.8146 p.R40*fsX0,p.R40SfsX16 frameshift,frameshift 20546698,20546700 NA NA 2

Header Description.

Results output header description.

NAME - Transcript NAME.
MIDPOS - Middle Position of transcript coordinate.
COORD - Transcript Coordinates.
BFSEM - Bayes Factor for SEM.
BFGEM - Bayes Factor for GEM.
BFGEMNMD - Bayes Factor for GEM-NMD.
log10BFSEM - log10 Bayes Factor for SEM. 
log10BFGEM - log10 Bayes Factor for GEM. 
log10BFGEMNMD - log10 Bayes Factor for GEM-NMD. 
HGVS - Human Genome Variation Society (HGVS) Nomenclature for the description of sequence variants.
NMDANNOT - Annotation of variants predicted to trigger NMD.
POSNMD - Genomic position of variants predicted to trigger NMD.
ESCNMDANNOT - Annotation of variants predicted to escape NMD.
POSESCNMD - Genomic position of variants predicted to escape NMD.
PTVVARS - Number of protein truncating variants.

Test genotype data was generated using HapGen2 and phenotype data was simulated according to an N(-2,1) distribution for protein truncating variant carriers.

C-alpha for multiple study designs

Binomial C-alpha

Binomial C-alpha was introduced in Neale, Rivas, et al. PLoS Genetics 2011. It was designed to assess signal of overdispersion in the distribution of copies of rare variants in a case-control study design. In a burden test we sum up all copies, but this has a disadvantage in the setting of risk and protective effects, e.g.

variant case control
var1 16 0
var2 0 16

Binomial C-alpha is implemented in PLINK/SEQ as --tests calpha.

MAMBA also provides a module for running binomial C-alpha (due to the generalization in MRP C-alpha) with --module crossdisorder.

Generalized C-alpha

Generalized C-alpha is an extension, reported in a publication by Clarke, Rivas, and Morris, PLoS Genetics 2013, suitable for quantitative traits and summary statistics. Given that generalized C-alpha is a special case of MRP C-alpha you can simply use --module MRPQT.

MRP C-alpha for cross disorder and cross phenotype analysis

Correlation of genetic effects between variants.

Correlation of genetic effects between traits.

A C-alpha test statistic can be derived for any likelihood function. Namely, the binomial likelihood for binomial C-alpha, and the normal likelihood for the generalized C-alpha.

Similarly one can derive a C-alpha test for the multivariate and the multinomial likelihood incorporating covariance estimates of the effect sizes. This allows a cross-phenotype and cross-disorder analysis of rare variants.

Cross-phenotype analysis

To run the multi-trait analysis for multiple continous traits the following options are suitable:

        --Q @[input file]
        --mask [string]
        --locset [gencode,refseq,ccds]
        --locdb [locdb]
        --seqdb [seqdb]
        --protdb [protdb]
        --perms [int]
        --phenotype @[input file]
        --project [pseq project]

MAMBA is built using the PyPLINK/SEQ bindings (written by Manuel Rivas and Jeff Hammerbacher). As a result, the mask syntax is the same as PLINK/SEQ.

locdb, seqdb, and protdb can be obtained from the RESOURCES site. This is up to date as of PSEQ 0.09.

For a single phenotype, simply using a label like --phenotype phen.1 suffice. When analyzing multiple phenotypes a file with the list of phenotypes is preferred with the @ character indicating a file is to be used, e.g. --phenotype @phenotypes.list.

Whenever estimates of the correlation of genetic effects across phenotypes is available we can exploit it for rare variant association studies. --Q @ points to a file with input file containing string separated by commas denoting a symmetric matrix (alternatively one may use --Q [comma separated string]). For example, assume I am performing a cross-phenotype analysis of total cholesterol (TC), high density lipoprotein cholesterol (HDL-C), low density lipoprotein cholesterol (LDL-C), and triglycerides (TG) we use a file containing

as a string, which represented more clearly is

1 .35 .95 .30
.35 1 .13 -.49
.95 .13 1 .26
.30 -.48 .26 1
Cross-disorder analysis of rare variants

In the setting of multiple diseases (or in cross-disorder analysis) we can derive a C-alpha statistic based on the multinomial distribution. As in the continuous trait setting we can exploit estimates of the correlation of genetic effects to improve power to detect association. Estimates of the correlation of genetic effects have been generated for psychiatric disorders.

        --Q @[input file]
        --mask [string]
        --locset [gencode,refseq,ccds]
        --locdb [locdb]
        --seqdb [seqdb]
        --protdb [protdb]
        --perms [int]
        --phenotype psych
        --project [pseq project]
        --ncpu [int]

where --phenotype [string] is coded as 1,2,3,... where 1 is the control coding and 2,3,... are disorder id in the study. Cross-disorder analysis can be run in parallel in --ncpu [int].

Manuscripts are currently in preparation.

Models for extreme phenotypes

We have developed methodology extending the SEM/GEM models for protein truncating variants and quantitative traits to the multivariate setting. One may be interested in identifying genetic variants that lead to a profile that is of clinical interest, e.g. metabolic complications of obesity (commonly referred to as Metabolic Syndrome) or a protective profile, e.g. lower TG, lower LDL-C, lower TC, and higher HDL-C.

Options to use when calling --module MRP include:

        --amv [comma separated string, e.g. 1,-1,1,1 for TG,HDL-C,LDL-C,TC]
        --mask [string]
        --locset [gencode,refseq,ccds]
        --locdb [locdb]
        --seqdb [seqdb]
        --protdb [protdb]
        --perms [int]
        --phenotype [comma separated string]
        --project [pseq project]
        --T [effect size covariance matrix]       


As in PLINK/SEQ, MAMBA allows stratified permutations on phenotype labels by using --strata [label].

Adaptive permutations are implemented in MAMBA with a maximum cap given by --perms [n].

Analysis of RNA-seq data

Assessing allele specific expression (ASE) across multiple tissues from RNA-seq read data

Allele specific expression (ASE) analysis is commonly applied to RNA-seq data for the study of rare regulatory variants. We have developed statistical methods for assessing allele specific expression across multiple tissues from RNA-seq read data (written by Matti Pirinen and Manuel Rivas with additional colleagues) as part of work done for the GTEx consortium. Manuscript is posted on bioRxiv.

Three approaches for ASE analysis are available in MAMBA, using a Bayesian model comparison framework and computation done via a Gibbs sampler (For a great introduction to Gibbs sampling on univariate normal distributions read Matthew Stephens' thesis.):

  • Independent Tissue Model (ITM);
  • Grouped Tissue Model (GTM);
  • and the Hierarchical Grouped Tissue Model (GTM*).

Required inputs at the moment are file formats generated from scripts available from Tuuli Lappalainen's lab at NYGC.

The Independent Tissue Model (ITM) relaxes the assumption that all tissues in one group have exactly the same reference allele read count frequency. Useful when read count (coverage) is large. More info is available in the Supplementary Note.

The Grouped Tissue Model (GTM) assumes all tissues in one group have exactly the same reference allele read count frequency. Useful when read count is small as RNA-seq read data where information is borrowed. More info is available in the main text and Supplementary Note.

The Hierarchical Grouped Tissue Model (GTM*) is a Hierarchical model that borrows information across tissues and across variants, useful for intepreting global properties of a class of variants such as those variants predicted to trigger nonsense-mediated decay. More info is available in the main text and Supplementary Note.

Simulated Dataset

In order to run ASE analyses we make available some simulated test data:

$ mamba --module ASE --fileout testASE --nt [1] --grep "include|exclude|freqthreshold" 

The option --grep allows the user to subset the variants studied. For example, we may be interested in obtaining a summary of the variants that are predicted to trigger NMD, i.e. what proportion have no ASE effects across all tissues, moderate ASE, or strong ASE. In addition, it also gives summary of the amount of heterogeneous effects that exist in the dataset.

To run ITM simply use

$ --indep 

stored as a boolean value.

To run the two-sided ASE model (as you would do for any other ASE analysis) simply use

$ --two_sided

stored as a boolean value.

Sample ASE dataset

To explore the application of the models download the GEUVADIS ASE file.

mamba --module ASE

Required Input.

Required input --fileout [asefile.gz] has the required headers/columns:

 INDIVIDUAL - individual id. 
 RSID - variant identifier.
 REF_COUNT - number of reads supporting reference allele.
 NONREF_COUNT - number of reads supporting alternate allele.
 TOTAL_COUNT - total number of reads at variant site.
 EXON_INFO - variant annotation label.
 TISSUE - tissue label.

Results Output.

Output files for sample dataset provided should include:

Parameters file

pr.beta 2000 2000 36 12 80 1
pr.intv 0 0.52 0.52 0.95 0.95 1
pr.p0 0.75
niter 100
two.sided 0 
independent 0 
group.distance 1 1 0.5

  • pr.beta - are the beta priors
  • pr.intv - truncated intervals
  • pr.p0 - prior for no ASE state
  • niter - number of Gibbs sampling iterations
  • two.sided - two.sided ASE model (boolean)
  • independent - ITM (boolean)
  • group.distance - prior distance between groups (no ase, moderate ase, strong ase)


BREAST 0.92 0.08 0
NERVET 0.9 0.1 0

Column 1 - Tissue label, Column 2,3,4 are posterior probabilities for variant belonging to the NO ASE, MODERATE ASE, or STRONG ASE group, respectively.


NOASE 0.91
SNGASE 4.48e-12
HET0 0.05
HET1 2.54e-08
TIS_SPE 0.05

State posterior probabilities for the variant including NOASE, MODASE, SNGASE (shared across all tissues), and HET0 (heterogeneous no ASE and moderate or strong ASE), HET1 (heterogeneous moderate and strong ASE), and TIS_SPE (tissue specific).

pdf file A pdf graphic file is generated for each variant analyzed. This figure contains a barplot for state classification, for individual group membership, and a representation of the non-ref read ratio. Below three examples are highlighted: i) no ASE across all tissues, ii) moderate ASE across all tissues, and iii) heterogeneous ASE effects.

To make results exactly reproducible use the --seed [int] option.

Application to simulated dataset.

We simulated ASE data for 300 variants and 3 tissues. RNA-seq coverage is simulated as a Poisson distribution with expected coverage of 30,20, and 40 for each tissue.

A 67%, 23%, 10% mixture dataset with   θ= .8, .5, .95 (binomial distribution) was generated.

To apply the models we run mamba as follows

mamba --module ASE --fileout asetest.txt.gz --nt 2 --seed 10 --grep ">,0,=STOP_GAINED|DFGFDG" --niter 100 --burn 10

where --niter [int] and --burn [int] are the number of iterations and burn-in period for the Gibbs sampler, respectively.

Results from the GTM and the GTM* (when multiple tissues are available) are generated. GTM estimates:

      NOASE    MODASE     SNGASE       HET0       HET1
1 0.2166671 0.6467365 0.08923985 0.02666694 0.02068966
Whereas GTM* estimates:
               est        low95       up95
NOASE  0.226645280 0.1887562079 0.27998550
MODASE 0.664520022 0.5992223252 0.71721846
SNGASE 0.094709155 0.0621824885 0.12861006
HET0   0.004789438 0.0001708784 0.01900929
HET1   0.009336103 0.0004127940 0.03151795
with tissue-specificity estimate:
                est        low95       up95
TIS_SPE 0.006466667	0.00	0.01553708

The estimates are not very different for the two models, but you can see that heterogeneity is lower in the GTM* results (we have not introduced any heterogeneous effects).

Assessing impact of rare variants proximal to splice junctions: Splice Disruption Model (SDM)

Genetic variants proximal to splice junctions have been shown to have an impact on splicing. These splicing alterations can have a profound impact on disease predisposition.

Signal sought using the Splice Disruption Model.

The Splice Disruption Model (SDM) is implemented in MAMBA for splicing analysis of rare variants. The intuition behind SDM is that for rare variants we will not have enough information for a single variant to assess their impact on splicing. Thus, we combine the normalized splice junction values of variant carriers and assume that the distribution is a mixture reflecting the notion that some variants will impact splicing and others will not. We use the mixture distribution and the SDM to obtain estimates of the proportion of variants disrupting splicing and the extent of the distribution.

Required Input.

Required input files for running the SDM approximation algorithm are:
  1. VCF file including individuals with RNA sequencing data,
  2. ANNOTATION file that contains information for variants in VCF,
  4. Reference transcript set model (e.g. GENCODEv12).

Results Output.

Output files include:

SDM summary file

            mu       var        pi         T
MEAN -1.841570 0.7277086 0.6926318  1.270284
LCI  -1.965539 0.4744203 0.6282431 14.083931
UCI  -1.717600 1.0515369 0.7570205  0.000000

  • mu - the estimated mean of the disruption mixture component
  • var - the estimated variants of the disruption mixture component
  • pi - the estimated proportion of variants in the disruption mixture component
  • T - empirical statistic(s)

empirical stats a file containing the empirical statistics (used to calculate p-value after several runs in parallel)

simulated stats a file containing the stats computed on the simulated (null) data (used to calculate a p-value).

png file containing MCMC summary plots (trace plots and histograms)

MCMC summary plots

posterior file

chr:offset splicejunction gene posteriorprob subject traitvalue reference splicejunctionid
16:734065 16_733909_734066 ENSG00000161999 0.8365 NA19144 -1.29 G ENSG00000161999.6_734066_734155 16_733910_734065
16:15102705 16_15102704_15103538 ENSG00000179889 0.6035 HG02215 -0.76 G ENSG00000179889.14_15102636_15102704 16_15102705_15103537

To apply SDM we run mamba as follows

 mamba --module SDM --vcfs @GEUVADIS.VCF.LIST --annot myannotfile --grep "SEVERE_SPLICE_ACCEPTOR_DONOR=DONOR,SEVERE_SPLICE_DIST=1;|RANDOMTEXT" --sj GD462.JunctionQuantCount.45N.50F.samplename.resk10.norm.txt.gz --perms 100 --fileout DONOR.1.1 --seed 1 --trnscmodel gencode.v12.genes.patched_contigs.gtf.gz > out.summary.DONOR.1.1

where --seed [int] is the seed used for initializing and reproducing results.

NMD predictions

A random forest algorithm was applied to the GTEx data set to train a model to predict NMD using the caret package in R.

We make available the predictive model, which will be updated with future releases of GTEx. In addition, we make available a sample input data set and apply using the commands below:
#Use 38 predictors
> predict(modelFit.RF,newdata=tmp.data,type="prob",na.action=na.roughfix)
  noASE someASE
1 0.79   0.21

To prepare the predictor input for variants of interest we make available an auxiliary python script (getpredictors.py), which takes in: 1) CADD input; 2) a PLINK/SEQ meta (annotation) file; and 3) a reference transcript set model (e.g. GENCODEv12 gtf).


Manuel A. Rivas
Wellcome Trust Centre for Human Genetics, University of Oxford
Oxford, UK OX3 7BN

Collaborators: Matti Pirinen, Juan Fernandez, Loukas Moutsianas, Kyle Gaulton, Andrew Rimmer, Shaun Purcell, Jeff Hammerbacher.