Omics Pipe: An Automated Framework for Next Generation Sequencing Analysis¶
Introduction¶
Welcome to the documentation for Omics Pipe! Omics Pipe is an open-source, modular computational platform that automates ‘best practice’ multi-omics data analysis pipelines published in Nature Protocols and other commonly used pipelines, such as GATK. It currently automates and provides summary reports for two RNA-seq pipelines, two miRNA-seq pipelines, variant calling from whole exome sequencing (WES), variant calling and copy number variation analysis from whole genome sequencing (WGS), two ChIP-seq pipelines and a custom RNA-seq pipeline for personalized genomic medicine reporting. It also provides automated support for interacting with The Cancer Genome Atlas (TCGA) datasets, including automatic download and processing of the samples in this database.
Online Resources¶
Homepage: | http://sulab.org/tools/omics-pipe/ |
Repository: | https://bitbucket.org/sulab/omics_pipe |
Online Docs: | http://packages.python.org/omics_pipe |
Download & PyPI: | http://pypi.python.org/pypi/omics_pipe |
About Omics Pipe¶
Omics Pipe is an open-source, modular computational platform that automates ‘best practice’ multi-omics data analysis pipelines published in Nature Protocols and other commonly used pipelines, such as GATK. It currently automates and provides summary reports for two RNA-seq pipelines, two miRNA-seq pipelines, variant calling from whole exome sequencing (WES), variant calling and copy number variation analysis from whole genome sequencing (WGS), two ChIP-seq pipelines and a custom RNA-seq pipeline for personalized genomic medicine reporting. It also provides automated support for interacting with The Cancer Genome Atlas (TCGA) datasets, including automatic download and processing of the samples in this database.

Omics Pipe is a Python package that can be installed on a compute cluster, on a local machine, or in the cloud. It can be downloaded directly from the Omics Pipe website for local and cluster installation, or used on AWS in Amazon EC2. Its modular design lets researchers add new analysis tools as Bash scripts wrapped in modules, which can then be assembled into a new analysis pipeline. Omics Pipe uses Ruffus to chain the analysis modules into a parallel, automated pipeline. Because it relies on Ruffus, only the steps that need updating are rerun when a pipeline is restarted after an error. In addition, Sumatra is built into Omics Pipe and provides version control for each run of the pipeline, improving the reproducibility and documentation of your analyses. Omics Pipe uses the Distributed Resource Management Application API (DRMAA) to submit, control and monitor jobs on a Distributed Resource Management system, such as a compute cluster or Grid computing infrastructure. This allows samples and pipeline steps to run in parallel in a computationally efficient, distributed fashion, without the need to schedule and monitor each job individually. For each supported pipeline, Omics Pipe generates results files from each step and an analysis summary report in HTML, produced with the R package knitr. The summary report provides quality control metrics and visualizations of the results for each sample so that researchers can quickly interpret the results of the pipeline.
Available Pipelines¶
- Omics Pipe Available Pipelines
- Pipelines supported by this version of omics pipe.

Users¶
Projects that have used Omics Pipe for solving biological problems. Please submit your story if you would like to share how you use the pipeline for your own research.
- The Scripps Research Institute, Lotz Lab: The Lotz Lab in the Department of Molecular and Experimental Medicine at TSRI uses Omics Pipe to perform RNA-seq and miRNA-seq analyses on human articular cartilage samples to elucidate molecular pathways dysregulated in osteoarthritis.
- Avera Health: Researchers working in collaboration with Avera Health use Omics Pipe to analyze sequence data from multiple platforms to provide personalized medicine to breast cancer patients.
- Scripps Laboratories for tRNA Synthetase Research: Researchers working in collaboration with Scripps Laboratories for tRNA Synthetase Research use Omics Pipe to analyze ChIP-seq data to explore transcription factor binding sites under experimental conditions.
- Dorris Neuroscience Center: Researchers working in collaboration with The Maximov Lab in the Department of Molecular and Cellular Neuroscience at The Scripps Research Institute use Omics Pipe to analyze RNA-seq data to determine how extensively activity-dependent alternative mRNA splicing occurs in the transcriptome of a mouse model that is born and develops to adulthood without synaptic transmission in the forebrain.
- Sanford Burnham Medical Research Institute: Researchers working in collaboration with The Peterson Lab in the Bioinformatics and Structural Biology Program at Sanford Burnham Medical Research Institute are using Omics Pipe to perform RNA-seq based global gene expression analysis of dental plaque microbiota derived from twin pairs to identify functional networks of the dental microbiome in relation to dental health and disease.
Developers¶
Omics Pipe is developed by Kathleen Fisch, Tobias Meissner and Louis Gioia at The Su Lab in the Department of Molecular and Experimental Medicine at The Scripps Research Institute in beautiful La Jolla, CA.
Contact¶
Feedback, questions, bug reports, contributions, collaborations, etc. welcome!
Email: kfisch@scripps.edu
Twitter: @kathleenfisch
Using Omics Pipe¶
Omics Pipe is a Python framework for automating ‘best practice’ next generation sequencing pipelines. Omics Pipe can be run from the command line by providing it with a YAML parameter file that specifies your directory structure and software-specific parameters. This executes a parallel, automated pipeline on a Distributed Resource Management system (a local cluster or Amazon Web Services (AWS)) that efficiently handles job resource allocation, monitoring and restarting. The goals of Omics Pipe are to provide researchers with an open-source computational solution for implementing ‘best practice’ pipelines with minimal development overhead, and to provide visual outputs that aid biological interpretation.
To install Omics Pipe, first determine if you are going to be using it on a local compute cluster or on AWS. If you are going to be installing it on your local cluster, follow the directions below (or have your system administrator install it globally). If you are going to create a local installation in your home directory on your cluster but you do not have administrative permissions, you can create a Python Virtual Environment and then follow the instructions below within the virtual environment.
Requirements¶
- HPC Cluster or AWS StarCluster (see Resource Requirements)
- Python >= 2.6
- R >= 2.15
  - R Packages (see Third Party Software Dependencies)
- Third Party Software Dependencies (see Third Party Software Dependencies)
- Reference Databases (see Reference Databases Needed)
Installation¶
Option 1: Install from PyPI using pip:
pip install omics_pipe
Option 2: Install from PyPI using easy_install:
easy_install omics_pipe
Option 3: Install from source: Download/extract the source code and run:
python setup.py install
Option 4: Install the latest code directly from the repository:
pip install -e hg+https://bitbucket.org/sulab/omics_pipe#egg=omics_pipe
Option 5: If you do not have administrator privileges on your system:
Step 1: Set up a Python Virtual Environment.
Step 2: Use one of the Options (1-4) above to install Omics Pipe within your virtual environment.
Usage¶
Once you have successfully installed Omics Pipe, you can run a pipeline by typing the command:
omics_pipe [-h] [--custom_script_path CUSTOM_SCRIPT_PATH]
[--custom_script_name CUSTOM_SCRIPT_NAME]
[--compression {gzip, bzip}]
{RNAseq_Tuxedo, RNAseq_count_based, RNAseq_cancer_report, RNAseq_TCGA, RNAseq_TCGA_counts, Tumorseq_MUTECT, miRNAseq_count_based, miRNAseq_tuxedo, WES_GATK, WGS_GATK, SomaticInDels, ChIPseq_MACS, ChIPseq_HOMER, custom}
parameter_file
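The synopsis above can also be driven from a script. The following is a minimal sketch, assuming `omics_pipe` is on your `PATH`; the helper `build_omics_pipe_command` is a hypothetical name and not part of Omics Pipe. It only assembles and validates the argument list from the pipeline names and compression options listed in the synopsis.

```python
import subprocess

# Pipeline names copied from the usage synopsis above.
PIPELINES = {
    "RNAseq_Tuxedo", "RNAseq_count_based", "RNAseq_cancer_report",
    "RNAseq_TCGA", "RNAseq_TCGA_counts", "Tumorseq_MUTECT",
    "miRNAseq_count_based", "miRNAseq_tuxedo", "WES_GATK", "WGS_GATK",
    "SomaticInDels", "ChIPseq_MACS", "ChIPseq_HOMER", "custom",
}
# Values accepted by --compression according to the synopsis.
COMPRESSION = {"gzip", "bzip"}


def build_omics_pipe_command(pipeline, parameter_file, compression=None):
    """Return the argument list for one omics_pipe run (hypothetical helper)."""
    if pipeline not in PIPELINES:
        raise ValueError("unknown pipeline: %s" % pipeline)
    command = ["omics_pipe"]
    if compression is not None:
        if compression not in COMPRESSION:
            raise ValueError("unsupported compression: %s" % compression)
        command += ["--compression", compression]
    command += [pipeline, parameter_file]
    return command


cmd = build_omics_pipe_command("RNAseq_count_based", "/path/to/parameter_file.yaml")
print(" ".join(cmd))
# To actually launch the run, uncomment the next line:
# subprocess.check_call(cmd)
```

Validating the pipeline name before submission catches typos early, before any cluster jobs are scheduled.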
Running Omics Pipe on Amazon Web Services (AWS)¶
- AWS Installation Instructions
- Installation instructions for setting up the AWS Omics Pipe AMI
Tutorial¶
- Tutorial
- Step-by-step tutorial for running Omics Pipe
- Creating a custom pipeline
- Tutorial for creating and running a custom pipeline in Omics Pipe using existing modules
- Adding new modules/tools
- Tutorial for adding new modules to Omics Pipe to be used in a custom pipeline
Version history¶
Documentation¶
- The latest copy of this documentation should always be available at:
- http://packages.python.org/omics_pipe
OmicsPipe on the Amazon Cloud (AWS EC2) Tutorial¶
OmicsPipe on AWS uses a custom StarCluster image, created with docker.io (which installs docker.io, environment-modules, and easybuild on an AWS EC2 cluster). All you have to do is get the docker image, upload your data, launch the Amazon cluster and run a single command to analyze all of your data according to published, best-practice methods.
Step 1: Create an AWS Account¶
- Create an AWS account by following the instructions at Amazon-AWS
- Note your AWS ACCESS KEY ID, AWS SECRET ACCESS KEY and AWS USER ID
Step 2 (Mac or Linux): Install StarCluster and download config/plugins¶
- Install StarCluster on your machine following the StarCluster instructions
- Download the template Omics Pipe StarCluster configuration file (config) and three plugin files (sge.py, sgeconfig.py, omicspipe_config_prebuilt.py) from Omics Pipe Bitbucket
- Move downloaded config file to ~/.starcluster/config
- Move downloaded plugin files to the ~/.starcluster/plugins/ folder.
- Go on to configure StarCluster by following directions below in Step 3.
Step 2 (Windows): Load the OmicsPipe on AWS docker image on your machine¶
- Download docker.io following the instructions for your operating system at Get-Docker
From inside the Docker environment, run the command:
docker run -i -t omicspipe/aws_readymade /bin/bash
Note
If you want to share a file from your local computer with the docker container, follow the instructions for Docker Folder Sharing, put your desired file within the shared folder and run the command below (this is recommended for saving your ~/.starcluster/config file from the next step):
docker run -it --volumes-from NameofSharedDataFolder omicspipe/aws_readymade /bin/bash
- If you are on a local Ubuntu installation, skip this step and install the StarCluster client directly.
- If you are using Windows, it might be necessary to update your BIOS to enable virtualization before installing Docker
Step 3: Configure StarCluster¶
After running the omicspipe/aws_readymade Docker container, run the command below to edit the StarCluster configuration file:
nano ~/.starcluster/config
Or, if you prefer vim:
vim ~/.starcluster/config
Enter your “AWS ACCESS KEY ID”, “AWS SECRET ACCESS KEY”, and “AWS USER ID”
If you are not in the AWS us-west region, change the AWS REGION NAME and AWS REGION HOST variables to the appropriate region (see AWS Regions).
Select your desired pre-configured cluster by editing the “DEFAULT_TEMPLATE” variable or creating a custom cluster. The default is a test cluster with 5 c3.large nodes.
Create your starcluster SSH key by running the command:
starcluster createkey omicspipe -o ~/.ssh/omicspipe.rsa
To remove a key from the AWS registry, run the command:
starcluster removekey omicspipe
For more information on editing the StarCluster configuration file, see the StarCluster website.
Step 4: Create AWS Volumes¶
Create AWS volumes to store the raw data and results of your analyses. From within the Docker environment, run:
starcluster createvolume --name=data -i ami-52112317 -d -s <volume size in GB> us-west-1a
starcluster createvolume --name=results -i ami-52112317 -d -s <volume size in GB> us-west-1a
- Specify the <volume size in GB> as a number large enough to accommodate all of your raw data, and ~4x that size for your results folder
- Change us-west-1a to an availability zone in your region, as described in AWS Regions.
- Make a volume from the provided snapshot of reference databases (currently only supports H. sapiens)
- Go to the AWS-Console
- Click on the EC2 option
- Click on Volumes
- Click on “Create Volume”
- Set availability zone
- In Snapshot ID search for “omicspipe_db” and click on the resulting Snapshot ID
- Click Create
- From the Volumes tab, note the “VOLUME_ID” of the database snapshot
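The sizing guidance above (a data volume large enough for the raw data, and roughly 4x that for results) can be estimated with a short script. This is a sketch only: the function names are hypothetical, sizes are rounded up to whole GiB, and the factor of 4 is simply the rule of thumb quoted above.

```python
import math
import os

GIB = 1024 ** 3


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def suggested_volume_sizes_gb(raw_bytes, results_factor=4):
    """Return (data_gb, results_gb), rounded up, with a 1 GB minimum."""
    data_gb = max(1, math.ceil(raw_bytes / GIB))
    return data_gb, data_gb * results_factor


# Example: 50 GiB of raw data suggests a 50 GB data volume
# and a 200 GB results volume.
print(suggested_volume_sizes_gb(50 * GIB))
```

In practice you would pass `directory_size_bytes("/path/to/raw/fastq")` as `raw_bytes` and add headroom for data you expect to upload later.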
Edit your StarCluster configuration file to add your volume IDs. Run the command below and edit the VOLUME_ID variables for data, results, and database:
nano ~/.starcluster/config
Edit the fields below:
[volume results]
VOLUME_ID =
MOUNT_PATH = /data/results

[volume data]
VOLUME_ID =
MOUNT_PATH = /data/data

[volume database]
VOLUME_ID =
MOUNT_PATH = /data/database
Save your StarCluster configuration file to ~/.starcluster/config
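Before launching the cluster, it can help to confirm that all three VOLUME_ID fields were filled in. The sketch below assumes the StarCluster configuration file is plain INI text that Python's `configparser` can read; `missing_volume_ids` is a hypothetical helper, not part of StarCluster or Omics Pipe.

```python
import configparser

# Volume sections named in the configuration snippet above.
REQUIRED_VOLUMES = ("results", "data", "database")


def missing_volume_ids(config_text):
    """Return the volume names whose section or VOLUME_ID value is missing."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    missing = []
    for name in REQUIRED_VOLUMES:
        section = "volume %s" % name
        if not parser.has_section(section):
            missing.append(name)
            continue
        # Option names are case-insensitive in configparser.
        if not parser.get(section, "volume_id", fallback="").strip():
            missing.append(name)
    return missing


example = """
[volume results]
VOLUME_ID = vol-11111111
MOUNT_PATH = /data/results

[volume data]
VOLUME_ID =
MOUNT_PATH = /data/data
"""
print(missing_volume_ids(example))
```

To check your real file, read `~/.starcluster/config` (for example with `open(os.path.expanduser("~/.starcluster/config")).read()`) and pass the text to the function.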
Step 5: Launch the Cluster¶
From the Docker container, run the command below to start a new cluster with the name “mypipe”:
starcluster start mypipe
SSH into the cluster by running the command below:
starcluster sshmaster mypipe
Step 6: Upload data to the cluster¶
Now that you are in your cluster, you can use it like any other cluster. Before running Omics Pipe on your own data, you will want to upload your data, unless it is already present in your attached data volume (you can skip this step if you are running the test data). There are several options to upload your data:
Upload data from your local machine or cluster using StarCluster put:
starcluster put mypipe <myfile> /data/raw
Copy a file to or from a remote host with scp:
scp <localfile> username@tohostname:<remotefile>
Get a file from an S3 bucket with S3cmd:
s3cmd get s3://BUCKET/OBJECT LOCAL_FILE
Use Webmin to transfer files from your local system to the cluster (recommended for small files only, like parameter files).
- In the AWS Management Console go to “Security Groups”
- Select the “StarCluster-0_95_5” group associated with your cluster’s name
- On the Inbound tab click on “Edit”
- Click on “Add Rule” and a new “Custom TCP Rule” will appear. On “Port Range” enter “10000” and on “Source” select “My IP”
- Hit “Save”
- Select Instances in the AWS Management Console and note the “Public IP” of your instance
- In a Web browser, enter https://the_public_ip:10000. Type in the Login info when prompted: user: root password: sulab
- This will take a few seconds to load, and once it does, you can navigate your cluster’s file structure with the tabs on the left
- To upload a file from your local file system, click “upload” and specify the directory /data/data to upload your data.
Step 7: Run the test pipelines¶
Once you have successfully started the cluster, you may run Omics Pipe with the following commands for the different pipelines. *Note: Small .fastq files are provided on the instance for the tests below to demonstrate the functionality of the pipelines, but they may not generate meaningful results. Larger test files can be uploaded to the cluster by following the instructions in the documentation above.
RNA-seq Count Based Pipeline
omics_pipe RNAseq_count_based /root/src/omics-pipe/tests/test_params_RNAseq_counts_AWS.yaml
RNA-seq Tuxedo Pipeline
omics_pipe RNAseq_Tuxedo /root/src/omics-pipe/tests/test_params_RNAseq_Tuxedo_AWS.yaml
Whole Exome Sequencing:
omics_pipe WES_GATK /root/src/omics-pipe/tests/test_params_WES_GATK_AWS.yaml
ChIP-seq Homer
omics_pipe ChIPseq_HOMER /root/src/omics-pipe/tests/test_params_ChIPseq_HOMER_AWS.yaml
Installing extra software¶
Both the GATK and MuTect software are used by OmicsPipe, but they require licenses from The Broad Institute and cannot be distributed with the OmicsPipe software. GATK and MuTect are free to download after accepting the license agreement.
To install GATK:
Upload the GenomeAnalysisTK.jar file to the /root/.local/easybuild/software/gatk/3.2-2 directory using either Webmin or StarCluster put
Make the jar file executable by running the command:
chmod +x /root/.local/easybuild/software/gatk/3.2-2/GenomeAnalysisTK.jar
To install MuTect:
Upload the muTect-1.1.4.jar file to the /root/.local/easybuild/software/mutect/1.1.4 directory using either Webmin or StarCluster put
Make the jar file executable by running the command:
chmod +x /root/.local/easybuild/software/mutect/1.1.4/muTect-1.1.4.jar
Adding software that OmicsPipe was not built with might require a little more configuration, but OmicsPipe is designed as a foundation to which new software can be added. New software can obviously be added in any manner that the user prefers, but to follow the structure that was used to build OmicsPipe, please refer to the “custombuild” scripts.
Important
- If you configure software that you think extends the functionality of OmicsPipe, please create a pull request on our Bitbucket page.
To build your own docker image¶
Download docker.io following the instructions at Get-Docker
Run the command:
docker build -t <Repository Name> https://bitbucket.org/sulab/omics_pipe/downloads/Dockerfile_AWS_prebuiltAMI_public
This will store the dockercluster image in the Repository Name of your choice.
There is also an AWS_custombuild Dockerfile, which can be used to build an Amazon Machine Image from scratch
Add storage > 1TB to the cluster using LVM (for advanced users)¶
Within StarCluster create x new volumes by running:
nvolumes=2 #number of volumes
vsize=1000 #in gb
instance=`curl -s http://169.254.169.254/latest/meta-data/instance-id`
akey=<AWS KEY>
skey=<AWS SECRET KEY>
region=us-west-1
zone=us-west-1a
for x in $(seq 1 $nvolumes)
do
    ec2-create-volume \
        --aws-access-key $akey \
        --aws-secret-key $skey \
        --size $vsize \
        --region $region \
        --availability-zone $zone
done > /tmp/vols.txt
Attach the volumes to the head node:
i=0
for vol in $(awk '{print $2}' /tmp/vols.txt)
do
    i=$(( i + 1 ))
    ec2-attach-volume $vol \
        -O $akey \
        -W $skey \
        -i $instance \
        --region $region \
        -d /dev/sdh${i}
done > /tmp/attach.txt
Mark the EBS volumes as physical volumes:
for i in $(find /dev/xvdh*)
do
    pvcreate $i
done
Create a volume group:
vgcreate vg /dev/xvdh*
Create a logical volume:
lvcreate -l100%VG -n lv vg
Create the file system:
mkfs -t xfs /dev/vg/lv
Create the mount point and mount the file system:
mkdir /data/data_large
mount /dev/vg/lv /data/data_large
Add new mountpoint to /etc/exports:
for x in $(qconf -sh | tail -n +2)
do
    echo '/data/data_large' ${x}'(async,no_root_squash,no_subtree_check,rw)' >> /etc/exports
done
Reload /etc/exports:
exportfs -a
Mount the new folder on all nodes:
for x in $(qconf -sh | tail -n +2)
do
    ssh $x 'mkdir /data/data_large'
    ssh $x 'mount -t nfs master:/data/data_large /data/data_large'
done
How to increase volume size?
Create and attach EBS volumes as described in the first two steps above, and then create the additional physical volumes:
for i in $(cat /tmp/attach.txt | cut -f 4 | sed 's/[^0-9]*//g')
do
    pvcreate /dev/xvdh${i}
done
Add new volumes to the volume group:
for i in $(cat /tmp/attach.txt | cut -f 4 | sed 's/[^0-9]*//g')
do
    vgextend vg /dev/xvdh${i}
done
lvextend -l100%VG /dev/mapper/vg-lv
Grow the file system to the new size:
xfs_growfs /data/data_large
Add storage > 1TB to the cluster using RAID 0 (for advanced users)¶
Within StarCluster create x new volumes by running:
nvolumes=2 #number of volumes
vsize=1000 #in gb
instance=`curl -s http://169.254.169.254/latest/meta-data/instance-id`
akey=<AWS KEY>
skey=<AWS SECRET KEY>
region=us-west-1
zone=us-west-1a
for x in $(seq 1 $nvolumes)
do
    ec2-create-volume \
        --aws-access-key $akey \
        --aws-secret-key $skey \
        --size $vsize \
        --region $region \
        --availability-zone $zone
done > /tmp/vols.txt
Attach the volumes to the head node:
i=0
for vol in $(awk '{print $2}' /tmp/vols.txt)
do
    i=$(( i + 1 ))
    ec2-attach-volume $vol \
        -O $akey \
        -W $skey \
        -i $instance \
        --region $region \
        -d /dev/sdh${i}
done
Create a raid 0 volume:
mdadm --create -l 0 -n $nvolumes /dev/md0 /dev/xvdh*
Create a file system:
mkfs -t ext4 /dev/md0
Create mount point and mount the device:
mkdir /data/data_large
mount /dev/md0 /data/data_large
Add new mountpoint to /etc/exports:
for x in $(qconf -sh | tail -n +2)
do
    echo '/data/data_large' ${x}'(async,no_root_squash,no_subtree_check,rw)' >> /etc/exports
done
Reload /etc/exports:
exportfs -a
Mount the new folder on all nodes:
for x in $(qconf -sh | tail -n +2)
do
    ssh $x 'mkdir /data/data_large'
    ssh $x 'mount -t nfs master:/data/data_large /data/data_large'
done
Backing up your data to S3¶
Run:
s3cmd --configure
and follow the instructions
Create an S3 bucket:
s3cmd mb s3://backup
Copy data to the bucket:
s3cmd put -r /data/results s3://backup
More info on s3cmd here: https://github.com/s3tools/s3cmd
Omics Pipe Tutorial¶
Installation¶
Test your installation by typing:
omics_pipe
on the command line. If you get the omics_pipe help readout, it has been installed correctly and you can continue.
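The same check can be scripted, for example as part of a cluster setup routine. The sketch below is written for Python 3 (which may differ from the interpreter Omics Pipe itself runs under) and assumes that `omics_pipe -h` exits with status 0 when the installation is healthy; `omics_pipe_available` is a hypothetical helper name.

```python
import shutil
import subprocess


def omics_pipe_available(executable="omics_pipe"):
    """Return True if the executable is on PATH and its -h option exits cleanly."""
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "-h"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("omics_pipe found:", omics_pipe_available())
```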
Before Running Omics Pipe: Configuring Parameters File¶
Note
Before running omics_pipe, you must configure the parameters file, which is a YAML document. Follow the instructions here: Configuring the parameters file
Running Omics Pipe¶
When you are ready to run omics pipe, simply type the command:
omics_pipe RNAseq_count_based /path/to/parameter_file.yaml
This runs the basic RNAseq_count_based pipeline with your parameter file. Additional usage instructions are shown below and are also available by typing omics_pipe -h:
omics_pipe [-h] [--custom_script_path CUSTOM_SCRIPT_PATH]
[--custom_script_name CUSTOM_SCRIPT_NAME]
[--compression {gzip, bzip}]
{RNAseq_Tuxedo, RNAseq_count_based, RNAseq_cancer_report, RNAseq_TCGA, RNAseq_TCGA_counts,
Tumorseq_MUTECT, miRNAseq_count_based, miRNAseq_tuxedo, WES_GATK, WGS_GATK, SomaticInDels, ChIPseq_MACS, ChIPseq_HOMER, custom}
parameter_file
If your .fastq files are compressed, please use the compression option and indicate the type of compression used for your files. Currently supported compression types are gzip and bzip.
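If you are unsure how a file was compressed, its first few bytes usually tell you. The sketch below inspects the standard gzip and bzip2 magic numbers and returns the matching value for the compression option; it assumes that the `bzip` option refers to bzip2 files, and `detect_compression` is a hypothetical helper name.

```python
def detect_compression(path):
    """Return 'gzip', 'bzip', or None based on the first bytes of the file."""
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic[:2] == b"\x1f\x8b":   # gzip magic number
        return "gzip"
    if magic == b"BZh":            # bzip2 magic number
        return "bzip"
    return None
```

A return value of `None` means the file looks uncompressed, in which case the compression option can be omitted.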
Running Omics Pipe with the Test Data and Parameters¶
To run Omics Pipe with the test parameter files and data, type the commands below to run each pipeline.
Note
Replace the ~ with the path to your Omics Pipe installation.
RNA-seq (Tuxedo):
omics_pipe RNAseq_Tuxedo ~/tests/test_params_RNAseq_Tuxedo.yaml
RNA-seq (Anders 2013):
omics_pipe RNAseq_count_based ~/tests/test_params_RNAseq_counts.yaml
Whole Exome Sequencing (GATK):
omics_pipe WES_GATK ~/tests/test_params_WES_GATK.yaml
Whole Genome Sequencing (GATK):
omics_pipe WGS_GATK ~/tests/test_params_WGS_GATK.yaml
Whole Genome Sequencing (MUTECT):
omics_pipe Tumorseq_MUTECT ~/tests/test_params_MUTECT.yaml
ChIP-seq (MACS):
omics_pipe ChIPseq_MACS ~/tests/test_params_MACS.yaml
ChIP-seq (HOMER):
omics_pipe ChIPseq_HOMER ~/tests/test_params_HOMER.yaml
Breast Cancer Personalized Genomics Report - RNAseq:
omics_pipe RNAseq_cancer_report ~/tests/test_params_RNAseq_cancer.yaml
TCGA Reanalysis Pipeline - RNAseq:
omics_pipe RNAseq_TCGA ~/tests/test_params_RNAseq_TCGA.yaml
TCGA Reanalysis Pipeline - RNAseq Counts:
omics_pipe RNAseq_TCGA_counts ~/tests/test_params_RNAseq_TCGA_counts.yaml
miRNAseq Counts (Anders 2013):
omics_pipe miRNAseq_count_based ~/tests/test_params_miRNAseq_counts.yaml
miRNAseq (Tuxedo):
omics_pipe miRNAseq_tuxedo ~/tests/test_params_miRNAseq_Tuxedo.yaml
Running Omics Pipe with your own data¶
Copy the test parameter file for the pipeline that you want to run into your home directory:
cp ~/tests/test_params_RNAseq_counts.yaml ~/my_params.yaml
Configure the parameter file to point to the path to your data (fastq files), results directories, correct software versions, third party software tool parameters and the correct genome/annotations as described here: Configuring the parameters file.
Ensure that your fastq files follow the naming convention Sample1_1.fastq and Sample1_2.fastq for paired end samples.
Type the Omics Pipe command corresponding to your parameter file/pipeline of interest to run the pipeline:
omics_pipe RNAseq_count_based ~/my_params.yaml
Omics Pipe logs to the screen as it runs through the steps in the pipeline.
- The log reports the progress of the analysis, including the status of each step in the pipeline.
- Individual log files for each job are written to /data/results/logs (the LOG_PATH parameter in the parameter file).
- If flag files are present in the /data/results/flags folder (the FLAG_PATH parameter), the corresponding steps are skipped because they have already completed successfully. To redo these steps on the next run, delete their flag files and rerun the pipeline.
- Monitor the standard output of the Omics Pipe command, along with the individual log files for each job, to confirm that the pipeline completes.
- Each job (one step in the pipeline for one sample) runs on one of the slave nodes, and Omics Pipe (the controller script) runs on the master node.
- To check the status of your jobs, type the qstat command.
Wait for the pipeline to finish completely and check the results folders you specified for result files.
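Deleting flag files to force a step to rerun can be done by hand, or with a small helper such as the sketch below. It assumes that a flag file's name contains the name of the step it belongs to, which is an assumption about the naming scheme and should be checked against the files in your own FLAG_PATH; `clear_flags` is a hypothetical helper. By default it only lists matches and deletes nothing.

```python
import os


def clear_flags(flag_dir, step_name, dry_run=True):
    """List (and, if dry_run is False, delete) flag files whose name contains step_name."""
    matches = sorted(
        name for name in os.listdir(flag_dir)
        if step_name in name and os.path.isfile(os.path.join(flag_dir, name))
    )
    if not dry_run:
        for name in matches:
            os.remove(os.path.join(flag_dir, name))
    return matches
```

For example, `clear_flags("/data/results/flags", "star")` shows which files would be removed, and passing `dry_run=False` removes them so that the step runs again on the next invocation.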
Omics Pipe Tutorial – Configuring the Parameter File¶
Before running Omics Pipe, you must configure the parameters file, which is a YAML document. Example parameters files are located within the omics_pipe/tests folder for each pipeline. Copy one of these parameters files into your working directory, and edit the parameters to work with your sample names, directory structure, software options and software versions. Make sure to keep the formatting and parameter names exactly the same as in the example parameters files.
Note
Make sure to follow the YAML format exactly. Ensure that there is only one space after each colon.
Note
For parameters in quotes in the test parameters file, please make sure to keep them in quotes in your custom parameter file.
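The one-space-after-the-colon rule is easy to break when editing by hand. The sketch below is a simple line-based check, not a YAML parser: it assumes a flat file of `KEY: value` lines like the example parameter files, skips comments and anything it does not recognise, and reports lines where the key's colon is not followed by exactly one space. The function name is hypothetical.

```python
import re

# A key, then the first colon, then the rest of the line.
_KEY_LINE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*):(.*)$")


def colon_spacing_problems(text):
    """Return (line_number, line) pairs that violate the one-space rule."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = _KEY_LINE.match(stripped)
        if match is None:
            continue  # not a KEY: value line, leave it to a real YAML parser
        rest = match.group(2)
        one_space = rest.startswith(" ") and not rest.startswith("  ")
        if rest and not one_space:
            problems.append((number, line))
    return problems
```

Only the first colon on a line is treated as the separator, so values such as URLs are not flagged.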
The STEP parameter should be the function name of the last step in the pipeline that you want to run (e.g. run_tophat). To run the pre-installed pipelines all the way through, this should be “last_function”.
Warning
Do not change the STEPS or STEPS_DE parameters for a pre-installed pipeline.
Note
Fastq files: paired-end samples need two files, “Name_1.fastq” and “Name_2.fastq”, representing read 1 and read 2. Keep all fastq files in the same raw data folder.
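To check the naming convention before a run, and to build the sample names for the parameter file, something like the following sketch can help. It assumes uncompressed files ending in `.fastq` and treats everything before `_1` or `_2` as the sample name; `paired_end_samples` is a hypothetical helper.

```python
import os
import re

_PAIRED = re.compile(r"^(?P<name>.+)_(?P<read>[12])\.fastq$")


def paired_end_samples(raw_data_dir):
    """Return (complete, incomplete) sorted sample names found in raw_data_dir."""
    reads = {}
    for filename in os.listdir(raw_data_dir):
        match = _PAIRED.match(filename)
        if match:
            reads.setdefault(match.group("name"), set()).add(match.group("read"))
    complete = sorted(n for n, r in reads.items() if r == {"1", "2"})
    incomplete = sorted(n for n, r in reads.items() if r != {"1", "2"})
    return complete, incomplete
```

The first list contains samples with both reads present; the second lists samples that are missing one of the two files and should be fixed before running a paired-end pipeline.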
Warning
Default parameters have been included for each third party software tool included in each of the pipelines. Before running, please view the documentation for each software tool to determine if these parameters are appropriate for your analysis. We do not advise using the default parameters included in Omics Pipe without a full understanding of the tools/parameters.
Example Omics Pipe Parameter File¶
test_params.yaml in omics_pipe/tests:
SAMPLE_LIST: [test1, test2, test3]
STEP: last_function
STEPS: [fastqc, star, htseq, last_function]
RAW_DATA_DIR: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests
FLAG_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/flags
HTSEQ_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/counts
LOG_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/logs
QC_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
RESULTS_PATH: /gpfs/home/kfisch/test
STAR_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/star
WORKING_DIR: /gpfs/home/kfisch/scripts/omics_pipeline-devel/omics_pipe/scripts
REPORT_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
ENDS: SE
FASTQC_VERSION: '0.10.1'
GENOME: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
HTSEQ_OPTIONS: -m intersection-nonempty -s no -t exon
PIPE_MULTIPROCESS: 100
PIPE_REBUILD: 'True'
PIPE_VERBOSE: 5
REF_GENES: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
RESULTS_EMAIL: kfisch@scripps.edu
STAR_INDEX: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/star_genome
STAR_OPTIONS: --readFilesCommand cat --runThreadN 8 --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical
STAR_VERSION: '2.3.0'
TEMP_DIR: /scratch/kfisch
QUEUE: workq
USERNAME: kfisch
DRMAA_PATH: /opt/applications/pbs-drmaa/current/gnu/lib/libdrmaa.so
DPS_VERSION: '1.3.1111'
BAM_FILE_NAME: Aligned.out.bam
PARAMS_FILE: '/gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_params_RNAseq_counts.yaml'
DESEQ_META: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/counts_meta.csv
DESIGN: '~ condition'
PVAL: '0.05'
DESEQ_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/DESEQ
SUMATRA_DB_PATH: /gpfs/home/kfisch/sumatra
SUMATRA_RUN_NAME: test_counts_sumatra_project
REPOSITORY: https://kfisch@bitbucket.org/sulab/omics_pipe
HG_USERNAME: Kathleen Fisch <kfisch@scripps.edu>
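A few of the rules described in this tutorial can be checked mechanically once the file is loaded. The sketch below assumes the parameters are already available as a Python dict (for example, loaded with a YAML library such as PyYAML's `yaml.safe_load`, which is not shown here); `check_parameters` is a hypothetical helper and covers only a handful of keys from the example above.

```python
def check_parameters(params):
    """Return a list of human-readable problems found in a parameter dict."""
    problems = []
    steps = params.get("STEPS", [])
    step = params.get("STEP")
    if step not in steps:
        problems.append("STEP %r is not listed in STEPS" % (step,))
    if params.get("ENDS") not in ("SE", "PE"):
        problems.append("ENDS must be SE or PE")
    if not params.get("SAMPLE_LIST"):
        problems.append("SAMPLE_LIST is empty")
    # These values are quoted in the example file, so they should load as strings.
    for key in ("PIPE_REBUILD", "PVAL"):
        if key in params and not isinstance(params[key], str):
            problems.append("%s should stay quoted (a string)" % key)
    return problems


example = {
    "SAMPLE_LIST": ["test1", "test2", "test3"],
    "STEP": "last_function",
    "STEPS": ["fastqc", "star", "htseq", "last_function"],
    "ENDS": "SE",
    "PIPE_REBUILD": "True",
    "PVAL": "0.05",
}
print(check_parameters(example))
```

An empty list means none of these particular checks failed; it does not guarantee that paths exist or that tool options are appropriate for your analysis.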
Explanation of Variables in Omics Pipe Parameter File¶
Parameters vary by pipeline and the correct parameter file for each pipeline must be used. See examples in the /tests/ folder.
RNAseq Count Based Pipeline¶
test_params_RNAseq_counts.yaml in omics_pipe/tests:
#sample names ie “Name” for paired and single end reads. So, “Name” for paired-end would expect two fastq files named “Name_1.fastq” and “Name_2.fastq”
SAMPLE_LIST: [test1, test2, test3]
#Function to be run within pipeline. If you want to run the whole pipeline, leave this as last_function
STEP: last_function
#All steps within the pipeline. DO NOT CHANGE this parameter for pre-installed pipelines. If you create your own pipeline, you will need to modify this by listing all of the steps in your pipeline.
STEPS: [fastqc, star, htseq, last_function]
#Directory where your raw .fastq files are located.
RAW_DATA_DIR: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests
#Directory where you would like to have the flag files created. Flag files are empty files that indicate if a step in the pipeline has completed successfully.
FLAG_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/flags
#Directory for HTSEQ results
HTSEQ_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/counts
#Directory where log files will be written
LOG_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/logs
#Directory for QC results
QC_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
#Upper level results directory. Sumatra will check all subfolders of this directory for new files to add to the run tracking database.
RESULTS_PATH: /gpfs/home/kfisch/test
#Directory where STAR results will be written
STAR_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/star
#Where omics_pipe is installed, this path will be pointing to ~/omics_pipe/scripts.
WORKING_DIR: /gpfs/home/kfisch/scripts/omics_pipeline-devel/omics_pipe/scripts
#Directory for the summary report
REPORT_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
#SE is single end, PE is paired-end sequencing reads
ENDS: SE
#Version number of FASTQC
FASTQC_VERSION: '0.10.1'
#Full path to Genome fasta file
GENOME: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
#Options for HTSEQ. Please see HTSEQ-count documentation for parameter options. http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#options
HTSEQ_OPTIONS: -m intersection-nonempty -s no -t exon
#Number of multiple processes you want Ruffus to spawn at once
PIPE_MULTIPROCESS: 100
#Ruffus parameter. No need to change.
PIPE_REBUILD: 'True'
#Ruffus parameter. No need to change.
PIPE_VERBOSE: 5
#Full path to reference gene annotations
REF_GENES: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
#Your email.
RESULTS_EMAIL: kfisch@scripps.edu
#Directory pointing to STAR_INDEX (you may have to create this)
STAR_INDEX: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/star_genome
#Options for STAR. Please read parameter options https://code.google.com/p/rna-star/
STAR_OPTIONS: --readFilesCommand cat --runThreadN 8 --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical
#Version number of STAR
STAR_VERSION: '2.3.0'
#Path to temporary directory
TEMP_DIR: /scratch/kfisch
#Name of the queue on your local cluster you wish to use
QUEUE: workq
#Username for local cluster
USERNAME: kfisch
#Path to your local cluster installation of DRMAA (ask your sys admin for this)
DRMAA_PATH: /opt/applications/pbs-drmaa/current/gnu/lib/libdrmaa.so
#Name of created Bam file. Will be Aligned.out.bam if you are using STAR and accepted_hits.bam if you are using TopHat
BAM_FILE_NAME: Aligned.out.bam
#Full path to your parameter file. Make sure to include the single quotes.
PARAMS_FILE: '/gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_params_RNAseq_counts.yaml'
#Location of the meta data csv file for DESEQ. See tests/counts_meta.csv for an example. This file contains the Design file for your study. http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html
DESEQ_META: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/counts_meta.csv
#Design for DESEQ differential expression. Leave as is if you use the exact design as in the counts_meta.csv file.
DESIGN: '~ condition'
#P-value threshold
PVAL: '0.05'
#Directory for DESEQ results
DESEQ_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/DESEQ
#Directory where you want to store your Sumatra database. Once you run this once, you do not have to change this.
SUMATRA_DB_PATH: /gpfs/home/kfisch/sumatra
#Name of your project. You do not need to change this for subsequent runs of the pipeline, but you can if you wish.
SUMATRA_RUN_NAME: test_counts_sumatra_project
#Location of omics pipe repository (you can leave this)
REPOSITORY: https://kfisch@bitbucket.org/sulab/omics_pipe
#Your Mercurial username
HG_USERNAME: Kathleen Fisch <kfisch@scripps.edu>
#Version of Python installed on system
PYTHON_VERSION: 2.6.5
#Type of cluster scheduler (options: PBS, SGE)
SCHEDULER: PBS
#Full path to WORKING_DIR/reporting on your system
R_SOURCE_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/omics_pipe/scripts/reporting
#Are your raw fastq files compressed? If not, leave this parameter as none. If so, please type 'GZIP' and it will automatically process your gzip files.
COMPRESSION: none
RNAseq Tuxedo Pipeline¶
test_params_RNAseq_Tuxedo.yaml in omics_pipe/tests:
#Sample names, i.e. "Name" for paired and single end reads. For paired-end reads, "Name" would expect two fastq files named "Name_1.fastq" and "Name_2.fastq"
SAMPLE_LIST: [test1, test2]
#Function to be run within pipeline. If you want to run the whole pipeline, leave this as last_function
STEP: last_function
#All steps within the pipeline. DO NOT CHANGE this parameter for pre-installed pipelines. If you create your own pipeline, you will need to modify this by listing all of the steps in your pipeline.
STEPS: [fastqc, tophat, rseqc, cufflinks]
#All steps within the pipeline. DO NOT CHANGE this parameter for pre-installed pipelines. If you create your own pipeline, you will need to modify this by listing all of the steps in your pipeline.
STEPS_DE: [cuffmerge, cuffmergetocompare, cuffdiff, RNAseq_report_tuxedo, last_function]
#SE is single end, PE is paired-end sequencing reads
ENDS: SE
#Your email address
RESULTS_EMAIL: kfisch@scripps.edu
#Path to temporary directory (make sure this is a large, writable directory)
TEMP_DIR: /scratch/kfisch
#Name of the queue on your cluster
QUEUE: workq
#Your username on the cluster
USERNAME: kfisch
#Full path to your raw data files
RAW_DATA_DIR: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests
#Full path to where you want the Flag files written
FLAG_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
#Full path to where you want the log files for each step written
LOG_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
#Full path to where you want QC results written
QC_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
#Full path to upper level results path
RESULTS_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run
#Where omics_pipe is installed, this path will be pointing to ~/omics_pipe/scripts
WORKING_DIR: /gpfs/home/kfisch/scripts/omics_pipeline-devel/omics_pipe/scripts
#Full path to where you want Tophat results written
TOPHAT_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/alignments
#Full path to where you want Cufflinks results written
CUFFLINKS_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/assemblies
#Full path to where you want Cuffmerge results written
CUFFMERGE_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/cuffmerge
#Full path to where you want Cuffdiff results written
CUFFDIFF_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/cuffdiff
#List of full paths to alignment files divided by condition. Each sample will have the path TOPHAT_RESULTS/SAMPLE_NAME/accepted_hits.bam. Divide these up into your two conditions for differential expression analysis. See http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/#cuffdiff-arguments for more details.
CUFFDIFF_INPUT_LIST_COND1: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/alignments/test1/accepted_hits.bam
CUFFDIFF_INPUT_LIST_COND2: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_run/alignments/test2/accepted_hits.bam
#Options for Tophat. Please read the TopHat documentation to set these options for your analysis. http://ccb.jhu.edu/software/tophat/manual.shtml#toph
TOPHAT_OPTIONS: -p 8 -a 5 --microexon-search --library-type fr-secondstrand
CUFFLINKS_OPTIONS: -u -N
CUFFMERGE_OPTIONS: -p 8
CUFFMERGETOCOMPARE_OPTIONS: -CG
CUFFDIFF_OPTIONS: -p 8 -FDR 0.01 -L Group1,Group2 -N --compatible-hits-norm
#Software versions installed on your system
FASTQC_VERSION: '0.10.1'
TOPHAT_VERSION: '2.0.9'
CUFFLINKS_VERSION: '2.1.1'
R_VERSION: '3.0.1'
BOWTIE_VERSION: 2.2.3
SAMTOOLS_VERSION: 0.1.19
PYTHON_VERSION: 2.6.5
#Ruffus specific parameters. See above or documentation for details. http://www.ruffus.org.uk/pipeline_functions.html#pipeline-functions-pipeline-run
PIPE_MULTIPROCESS: 100
PIPE_REBUILD: 'True'
PIPE_VERBOSE: 5
#Full path to gene annotation gtf file
REF_GENES: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
#Full path to genome file
GENOME: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
#Full path to BOWTIE index
BOWTIE_INDEX: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
#Location of chromosomes folder
CHROM: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/Chromosomes
#Path to your local cluster installation of DRMAA (ask your sys admin for this)
DRMAA_PATH: /opt/applications/pbs-drmaa/current/gnu/lib/libdrmaa.so
#Full path to directory where you want report results written
REPORT_RESULTS: /gpfs/home/kfisch/scripts/omics_pipeline-devel/tests
#Full path to your parameter file. Make sure to include the single quotes.
PARAMS_FILE: '/gpfs/home/kfisch/scripts/omics_pipeline-devel/tests/test_params_RNAseq_Tuxedo.yaml'
#Gene IDs of interest for visualization with CummeRbund
GENEIDS: [GAPDH, COL2A1, BRCA2]
#Name of samples in Condition 1
COND1: Group1
#Name of samples in Condition 2
COND2: Group2
#Name of your project. You do not need to change this for subsequent runs of the pipeline, but you can if you wish.
NAME: Test_run_date
#Directory where you want to store your Sumatra database. Once you run this once, you do not have to change this.
SUMATRA_DB_PATH: /gpfs/home/kfisch/sumatra
#Name of your project. You do not need to change this for subsequent runs of the pipeline, but you can if you wish.
SUMATRA_RUN_NAME: default_sumatra_project
#Location of omics pipe repository (you can leave this)
REPOSITORY: https://kfisch@bitbucket.org/sulab/omics_pipe
#Your Mercurial username
HG_USERNAME: Kathleen Fisch <kfisch@scripps.edu>
#Type of cluster scheduler (options: PBS, SGE)
SCHEDULER: PBS
#Full path to WORKING_DIR/reporting on your system
R_SOURCE_PATH: /gpfs/home/kfisch/scripts/omics_pipeline-devel/omics_pipe/scripts/reporting
#Are your raw fastq files compressed? If not, leave this parameter as none. If so, please type 'GZIP' and it will automatically process your gzip files.
COMPRESSION: none
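Parameter files like the ones above are flat lists of KEY: value pairs, and a key that is accidentally listed twice is easy to miss. The following is a minimal sketch of a duplicate-key check using only the Python standard library (Python 3 syntax). The helper name find_duplicate_keys is hypothetical and not part of omics_pipe, and it assumes a flat file with one top-level key per line.

```python
from collections import Counter


def find_duplicate_keys(yaml_text):
    """Return top-level keys that appear more than once in a flat KEY: value file."""
    keys = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        # Only simple, unindented "KEY: value" lines are considered.
        if line[0].isspace() or ":" not in stripped:
            continue
        keys.append(stripped.split(":", 1)[0].strip())
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical excerpt of a parameter file, for illustration only.
example = """\
#Software versions installed on your system
BOWTIE_VERSION: 2.2.3
SAMTOOLS_VERSION: 0.1.19
PYTHON_VERSION: 2.6.5
BOWTIE_VERSION: 2.2.3
"""

print(find_duplicate_keys(example))
```

Running a check like this before starting a pipeline can catch a repeated key, which otherwise may be silently overridden or rejected depending on the YAML loader.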
Omics Pipe Tutorial – Creating a Custom Pipeline Script¶
A pipeline script is a .py file that contains the steps you want to run in your analysis. To create a custom pipeline, create a new Python script (*.py) and place it in your working directory (or wherever you want). The analysis steps built into omics_pipe are listed in the List of currently available omics_pipe analysis steps.
You may add new modules directly to the module directory (see Adding Custom Modules), and they will become available steps that you can use in your custom pipeline.
There are three steps to creating a custom pipeline:
- Designing the structure of your pipeline
- Creating the script
- Updating your parameters file
The sections below detail each of these steps.
Designing the structure of the pipeline¶
Omics_pipe depends upon the pipelining module Ruffus to handle the automation. Please read the documentation at the Ruffus website http://www.ruffus.org.uk/ for more information. To design your pipeline, you need to decide:
- What analysis modules you want to run,
- What order you want the analysis modules to run in,
- Which, if any, of the analysis modules depend upon the results from another analysis module.
For example, we will create a custom pipeline that runs fastqc, star and htseq (depends on output from star).
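One way to reason about the design questions above is to write the dependencies down as a small table and derive a valid run order from it. The sketch below is only an illustration in Python 3. The function order_steps and the dictionary layout are assumptions made for this example and are not part of omics_pipe or Ruffus, which resolve the order themselves from the @follows decorators.

```python
def order_steps(dependencies):
    """Return steps ordered so that every step comes after the steps it depends on."""
    ordered = []
    visiting = set()

    def visit(step):
        if step in ordered:
            return
        if step in visiting:
            raise ValueError("circular dependency involving %s" % step)
        visiting.add(step)
        # Place every prerequisite before the step itself.
        for parent in dependencies.get(step, []):
            visit(parent)
        visiting.discard(step)
        ordered.append(step)

    for step in dependencies:
        visit(step)
    return ordered


# The example pipeline from the text: htseq depends on output from star.
example_dependencies = {
    "fastqc": [],
    "star": [],
    "htseq": ["star"],
}

print(order_steps(example_dependencies))
```

Steps with no dependencies between them (here fastqc and star) are the ones that can run in parallel.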
Creating the script¶
To create the script, create a text file named custom_script.py (or whatever name you choose). At the top of the file, copy and paste this text:
#!/usr/bin/env python
from ruffus import *
import sys
import os
import time
import datetime
import drmaa
from omics_pipe.utils import *
from omics_pipe.parameters.default_parameters import default_parameters
p = Bunch(default_parameters)
os.chdir(p.WORKING_DIR)
now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d %H:%M")
print p
for step in p.STEPS:
    vars()['inputList_' + step] = []
    for sample in p.SAMPLE_LIST:
        vars()['inputList_' + step].append([sample, "%s/%s_%s_completed.flag" % (p.FLAG_PATH, step, sample)])
    print vars()['inputList_' + step]
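To make the loop at the end of this header easier to follow, here is a minimal standalone sketch (Python 3 syntax) of what it builds: for every step, a list of [sample, flag file] pairs. The function build_input_lists and the example path are hypothetical. The header itself stores each list in a module-level variable named inputList_<step> rather than in a dictionary.

```python
def build_input_lists(steps, samples, flag_path):
    """Build, for each step, the list of [sample, flag_file] jobs used by @parallel."""
    input_lists = {}
    for step in steps:
        input_lists[step] = [
            [sample, "%s/%s_%s_completed.flag" % (flag_path, step, sample)]
            for sample in samples
        ]
    return input_lists


# Hypothetical values standing in for p.STEPS, p.SAMPLE_LIST and p.FLAG_PATH.
lists = build_input_lists(["fastqc", "star"], ["test1", "test2"], "/path/to/flags")
for step, jobs in lists.items():
    print(step, jobs)
```

Each inner pair matches the (sample, analysis_name_flag) signature of the step functions described in this tutorial.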
After this has been completed, you will need to import each of the analysis modules that you will use in your pipeline. For each analysis module, write the line below (fill in analysis_name with the name of the analysis module):
from omics_pipe.modules.analysis_name import analysis_name
In our example, you will have three lines (see below):
from omics_pipe.modules.fastqc import fastqc
from omics_pipe.modules.star import star
from omics_pipe.modules.htseq import htseq
Now you are ready to write the functions that run each step of the analysis. Each step in the pipeline needs its own function. You can copy these from the pre-packaged analysis pipeline scripts (or, eventually, from a function reference) or write them yourself. Each function needs two decorators from Ruffus: @parallel(inputList_analysis_name), which specifies that the step should run in parallel when there is more than one sample, and @check_if_uptodate(check_file_exists), which calls a function to check whether that step is up to date. Name each function after the analysis, prefixed by “run_”. The function input should always be (sample, analysis_name_flag). Within the function, call the analysis module that you imported above.
If an analysis module should run only after a module it depends upon finishes, add the @follows() Ruffus decorator before the function, with the name of the function that it depends upon. For example, if htseq needs to run after star, put @follows(run_star) above the run_htseq function. If some steps have no other steps depending on them, you can tie all steps of your pipeline together by creating a “Last Function” that follows them. The last function below is an example of such a function, and it also produces a PDF diagram of your pipeline when it completes. The functions for our example are below.
@parallel(inputList_fastqc)
@check_if_uptodate(check_file_exists)
def run_fastqc(sample, fastqc_flag):
    fastqc(sample, fastqc_flag)
    return
@parallel(inputList_star)
@check_if_uptodate(check_file_exists)
def run_star(sample, star_flag):
    star(sample, star_flag)
    return
@parallel(inputList_htseq)
@check_if_uptodate(check_file_exists)
@follows(run_star)
def run_htseq(sample, htseq_flag):
    htseq(sample, htseq_flag)
    return
@parallel(inputList_last_function)
@check_if_uptodate(check_file_exists)
@follows(run_fastqc, run_htseq)
def last_function(sample, last_function_flag):
    print "PIPELINE HAS FINISHED SUCCESSFULLY!!! YAY!"
    pipeline_graph_output = p.FLAG_PATH + "/pipeline_" + sample + "_" + str(date) + ".pdf"
    pipeline_printout_graph(pipeline_graph_output, 'pdf', step, no_key_legend=False)
    stage = "last_function"
    flag_file = "%s/%s_%s_completed.flag" % (p.FLAG_PATH, stage, sample)
    open(flag_file, 'w').close()
    return
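The functions above rely on a checker passed to @check_if_uptodate. With Ruffus, such a checker receives the job's parameters and returns whether the job still needs to run, optionally with a message. The sketch below (Python 3 syntax) shows the general shape of a flag-file checker. The name flag_is_missing is hypothetical and the body is an assumption about how such a check could work, not the code of check_file_exists in omics_pipe.utils.

```python
import os
import tempfile


def flag_is_missing(sample, flag_file):
    """Return (needs_update, message): the job must run when its flag file is absent."""
    if not os.path.exists(flag_file):
        return True, "Missing flag file %s for %s" % (flag_file, sample)
    return False, "Flag file %s exists for %s" % (flag_file, sample)


# Demonstration in a temporary directory standing in for FLAG_PATH.
flag_dir = tempfile.mkdtemp()
flag = os.path.join(flag_dir, "star_test1_completed.flag")

print(flag_is_missing("test1", flag))   # flag not written yet, so the step would run
open(flag, "w").close()                 # simulate a completed step
print(flag_is_missing("test1", flag))   # flag present, so the step would be skipped
```

This is why deleting a flag file is enough to make the corresponding step run again on the next invocation.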
Once you have created all of the functions for each step of your pipeline, cut and copy the code below to the bottom of your script:
if __name__ == '__main__':
    pipeline_run(p.STEP, multiprocess = p.PIPE_MULTIPROCESS, verbose = p.PIPE_VERBOSE, gnu_make_maximal_rebuild_mode = p.PIPE_REBUILD)
At this point, please save your script and move on to step 3.
Updating your parameters file¶
In order for your script to run successfully, you need to configure your parameter file so that each analysis module has the parameters it needs. The full list of parameters for all modules in the current version of omics_pipe is located in the omics_pipe/parameters/default_parameters.py file. You can view the parameters required by each analysis module by importing the module into an interactive Python session (from omics_pipe.modules.analysis_module import analysis_module) and typing analysis_module.__doc__. The parameters necessary for that analysis module are listed under “parameters from parameters file.” These parameters must be put into your parameters.yaml file and spelled exactly as shown (including all caps). Below is the list of parameters that are necessary to run omics_pipe in addition to the module-specific parameters.
SAMPLE_LIST: [test, test1]
STEP: last_function
STEPS: [fastqc, star, htseq, last_function]
RAW_DATA_DIR: /gpfs/group/sanford/patient/SSKT/test_patient/RNA/RNA_seq/data
FLAG_PATH: /gpfs/group/sanford/patient/SSKT/test_patient/RNA/RNA_seq/logs/flags
LOG_PATH: /gpfs/group/sanford/patient/SSKT/test_patient/RNA/RNA_seq/logs
WORKING_DIR: /gpfs/home/kfisch/virt_env/virt2/lib/python2.6/site-packages/omics_pipe-1.0.7-py2.6.egg/omics_pipe/scripts
ENDS: PE
PIPE_MULTIPROCESS: 100
PIPE_REBUILD: 'True'
PIPE_VERBOSE: 5
RESULTS_EMAIL: kfisch@scripps.edu
TEMP_DIR: /scratch/kfisch
DPS_VERSION: '1.3.1111'
QUEUE: bigmem
PARAMS_FILE: /gpfs/home/kfisch/omics_pipe_docs/test_params.yaml
USERNAME: kfisch
GENOME: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
CHROM: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/Chromosomes
REF_GENES: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
STAR_INDEX: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/star_genome
BOWTIE_INDEX: /gpfs/group/databases/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
Once you have all of the necessary parameters in your parameters.yaml file, you will need to change the STEP and STEPS parameters for your custom script. In the STEP parameter, write the name of the last function in your pipeline that you want to run, exactly as it is defined in your script; it should be configured so that it captures all steps in the pipeline (as in the example above). Remember that STEP names a function, not an analysis module, so a step function such as run_htseq keeps its run_ prefix here. In order for omics_pipe to know what steps you have in your pipeline, list each analysis module name in the STEPS parameter, separated by commas (without the run_ prefix). You are now ready to run your custom script.
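A quick consistency check of these general parameters can catch simple mistakes before a run is submitted. The sketch below (Python 3 syntax) is hypothetical: check_parameters is not an omics_pipe function, the key list is only a subset of the parameters shown above, and the dictionary stands in for a loaded parameters file.

```python
# Subset of the general parameters listed above, used here for illustration.
GENERAL_KEYS = [
    "SAMPLE_LIST", "STEP", "STEPS", "RAW_DATA_DIR", "FLAG_PATH",
    "LOG_PATH", "WORKING_DIR", "ENDS", "QUEUE", "PARAMS_FILE",
]


def check_parameters(params, required_keys=GENERAL_KEYS):
    """Return a list of human-readable problems found in a parameters dictionary."""
    problems = []
    for key in required_keys:
        if key not in params:
            problems.append("missing parameter: %s" % key)
    # STEPS lists analysis module names, so entries carry no run_ prefix.
    for step in params.get("STEPS", []):
        if step.startswith("run_"):
            problems.append("STEPS entry should not carry the run_ prefix: %s" % step)
    return problems


# Hypothetical, deliberately incomplete parameters.
example_params = {
    "SAMPLE_LIST": ["test", "test1"],
    "STEPS": ["fastqc", "run_star", "htseq", "last_function"],
}

for problem in check_parameters(example_params):
    print(problem)
```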
Running omics_pipe with a custom pipeline script¶
When you call the omics_pipe function, you will specify the path to your custom script using the command
omics_pipe custom --custom_script_path ~/path/to/the/script --custom_script_name customscript /path/to/parameters.yaml
This will automatically load your custom script and run through the steps in your pipeline using the default modules available in omics_pipe.
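If you launch runs from another Python script, it can help to assemble the command as an argument list so the order of the options stays explicit. The sketch below (Python 3 syntax) only builds and prints the command and does not execute anything. build_custom_command is a hypothetical helper, the paths are placeholders, and the option names are the ones given in the command above.

```python
def build_custom_command(script_dir, script_name, params_file):
    """Assemble the omics_pipe command line for a custom pipeline as an argument list."""
    return [
        "omics_pipe", "custom",
        "--custom_script_path", script_dir,
        "--custom_script_name", script_name,
        params_file,
    ]


# Placeholder values, for illustration only.
command = build_custom_command(
    "~/path/to/the/script", "customscript", "/path/to/parameters.yaml"
)
print(" ".join(command))
```

A list in this form could be handed to subprocess.call once the paths point at real files.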
Omics Pipe Tutorial – Adding a New Module (Tool)¶
Users can easily create new analysis modules for use within omics_pipe. There are two options for creating new analysis modules:
1. Adding analysis modules directly within the omics_pipe/scripts installation directory
2. Creating a new working directory where all analysis module scripts are located (this can be changed in the parameters file by changing the WORKING_DIR parameter to the desired location)
If you choose option 2 and want to use pre-installed analysis modules, for the time being you must copy these analysis modules to your new working directory. If you choose option 1, you can simply add additional analysis modules and they will be accessible along with the pre-installed analysis modules.
To create a new analysis module, you need to perform four steps:
1. Create a Bash script with the command to be sent to the cluster
2. Create a Python module that calls the Bash script
3. Add your module to your custom pipeline
4. Add new module parameters to the parameters file
The sections below detail each of these steps.
1. Create a Bash script¶
The first step in creating your custom module is to create the Bash script with the command you would like to run. If you are unsure how to write a Bash script, you can look at the examples in omics_pipe/scripts or work through this tutorial (http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html). In many cases, this will be a simple script with a one line command to call the analysis program. You should name your script something that will be easily identifiable and it should have the suffix .sh (e.g. analysis_script.sh). At the beginning of your analysis script, you should put the following lines:
#!/bin/bash
set -x
#Source modules for current shell
source $MODULESHOME/init/bash
#Make output directory if it doesn't exist
mkdir -p ${variable} #RESULTS_DIR
#Move tmp dir to scratch
export TMPDIR=${variable} #TEMP_DIR
#Load specified software version
module load fastqc/${variable} #VERSION
The ${variable} will be changed to ${number} (e.g. $1) based on the location of the variable in the input script (more on this below). These settings are assuming you are working on a cluster with a modular structure. If not, “module load” may not be appropriate to load the software, so please ask your system administrator to provide assistance with this if your cluster has a different system. After you specify the software and other configuration variables, you can write the commands for the software you would like to use. When you are finished with the commands, exit the script with ‘exit 0.’ An example script for running the software program FASTQC is below.
#Runs fastqc with $1=SAMPLE, $2=RAW_DATA_DIR, $3=QC_PATH
fastqc -o $3 $2/$1.fastq
exit 0
Substitute every value that you would like to set from the parameter file with a positional variable, in the form of $1, $2, $3, etc. for the first, second, third, etc. input parameter that will be passed to the script. Once you have appropriately parameterized the script, save it either in your working directory (along with all the other scripts you will need, possibly copied from omics_pipe/scripts) or in the omics_pipe/scripts directory.
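The link between the list of arguments passed from Python and the $1, $2, $3 variables in the Bash script is purely positional. A small sketch (Python 3 syntax) makes the mapping explicit. positional_arguments is a hypothetical helper written for this illustration, and the values are placeholders.

```python
def positional_arguments(args_list):
    """Map each argument to the Bash positional variable it will be seen as."""
    return dict(("$%d" % (index + 1), value) for index, value in enumerate(args_list))


# Placeholder values in the order SAMPLE, RAW_DATA_DIR, QC_PATH, FASTQC_VERSION.
mapping = positional_arguments(["test1", "/path/to/raw", "/path/to/qc", "0.10.1"])
for name in sorted(mapping):
    print(name, "=", mapping[name])
```

Reordering the list changes which value each $N refers to, so the list and the script header comment should be kept in sync.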
2. Create a Python module¶
Now that you have created your custom script, you can create the Python module that will handle that script and schedule a job on the compute cluster using DRMAA (https://code.google.com/p/drmaa-python/wiki/Tutorial). You should name the Python module the same name as your custom analysis module, but with the extension .py. In this example, your Python module would be named analysis_script.py and the function within it would also be called analysis_script. Save your custom Python module within the same directory as your custom pipeline script. At the top of your Python module, cut and copy the text below.
#!/usr/bin/env python
import drmaa
from omics_pipe.parameters.default_parameters import default_parameters
from omics_pipe.utils import *
p = Bunch(default_parameters)
You will then write a simple Python function that takes the form of the function below. You can copy this function directly and then change the necessary names/parameters to fit your custom analysis.
def fastqc(sample, fastqc_flag):
    '''QC check of raw .fastq files using FASTQC
    input: .fastq file
    output: folder and zipped folder containing html, txt and image files
    citation: Babraham Bioinformatics
    link: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
    parameters from parameters file: RAW_DATA_DIR,QC_PATH, FASTQC_VERSION'''
    spawn_job(jobname = 'fastqc', SAMPLE = sample, LOG_PATH = p.LOG_PATH, RESULTS_EMAIL = p.RESULTS_EMAIL, walltime = "12:00:00", queue = p.QUEUE, nodes = 1, ppn = 8, memory = "16gb", script = "/fastqc_drmaa.sh", args_list = [sample, p.RAW_DATA_DIR,p.QC_PATH, p.FASTQC_VERSION])
    job_status(jobname = 'fastqc', resultspath = p.QC_PATH, SAMPLE = sample, outputfilename = sample + "_fastq/" + "fastqc_data.txt", FLAG_PATH = p.FLAG_PATH)
    return
For consistency, give your function the same name as the Bash and Python scripts you just created. In our example, the first line would look like: “def analysis_script(sample, analysis_script_flag):”. Note that both the name of the function and the name of the flag input have changed. The docstring should be changed to describe what your analysis module does, what type of input file it takes, a citation and link to the tool that you are calling, and the parameters needed in the parameters file that will be passed to the Bash script that you created.
After you are done documenting your function, change a few items within the spawn_job and job_status functions, which are called from the omics_pipe.utils module. In the spawn_job function, change the job name to match the name of your function, customize the resources your job will request from the compute cluster, change the name of the script to match that of the Bash script that you just created, and change the parameters listed in the variable “args_list.” The variable “sample” is lower case because it is passed to this function from omics_pipe, but input parameters coming from the parameters file must be prefixed with “p.” List the parameters that you need to feed into your custom analysis script in the order that you numbered them in the Bash script. In the example above, $1 corresponds to sample, $2 corresponds to p.RAW_DATA_DIR, etc.
Once you have updated the spawn_job function, update the job_status function with the job name, the results path and the name of an output file that will be produced by your Bash script. This can be any file that is created. The function checks that this file exists in the specified results directory and that its size is greater than zero, and if so it creates a flag file. Once you complete this, you are finished creating your custom Python module.
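The completion check described above (the output file exists, is not empty, and a flag file is then written) can be sketched as follows in Python 3 syntax. This is an illustration under assumptions, not the implementation of job_status in omics_pipe.utils: the function name write_flag_if_output_ok and its return value are hypothetical, and only the flag file naming follows the pattern used in the pipeline script header.

```python
import os
import tempfile


def write_flag_if_output_ok(jobname, sample, results_path, output_filename, flag_path):
    """Create a flag file when the expected output exists and is non-empty.

    Returns the flag file path on success, or None when the output is missing or empty.
    """
    output_file = os.path.join(results_path, output_filename)
    if os.path.isfile(output_file) and os.path.getsize(output_file) > 0:
        flag_file = "%s/%s_%s_completed.flag" % (flag_path, jobname, sample)
        open(flag_file, "w").close()
        return flag_file
    return None


# Demonstration with temporary directories standing in for results and flag paths.
results_dir = tempfile.mkdtemp()
flags_dir = tempfile.mkdtemp()

print(write_flag_if_output_ok("analysis_script", "test1", results_dir, "out.txt", flags_dir))

with open(os.path.join(results_dir, "out.txt"), "w") as handle:
    handle.write("some result\n")

print(write_flag_if_output_ok("analysis_script", "test1", results_dir, "out.txt", flags_dir))
```

Choosing an output file that is written last by your Bash script makes the flag a more reliable signal that the step actually finished.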
3. Add custom Python module to your custom pipeline¶
In order to use your custom analysis module, you will need to create a custom pipeline with your custom analysis module included as a step in the pipeline. For a tutorial on how to create a custom pipeline, see Section “Creating a Custom Pipeline Script.” Once you have a custom pipeline script, please make sure your custom analysis module and custom pipeline script are in the same directory.
4. Add new parameters to parameters file¶
The final step in creating a custom analysis module is to add the parameters it needs to the parameters file. Simply add the parameters to your parameters file, save it, and then run your custom pipeline.
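Because each module lists its required parameters in its docstring after “parameters from parameters file:”, that list can be extracted and compared with your parameters file. The sketch below (Python 3 syntax) is a hypothetical helper, not part of omics_pipe. It assumes the parameter names follow that marker and are separated by commas or line breaks, and it uses a stand-in function so that it runs without omics_pipe installed.

```python
def parameters_from_docstring(docstring):
    """Extract parameter names listed after 'parameters from parameters file:'."""
    marker = "parameters from parameters file:"
    lowered = docstring.lower()
    if marker not in lowered:
        return []
    tail = docstring[lowered.index(marker) + len(marker):]
    # Names may be separated by commas or by line breaks, with an optional trailing colon.
    names = [name.strip().rstrip(":") for name in tail.replace("\n", ",").split(",")]
    return [name for name in names if name]


def analysis_script(sample, analysis_script_flag):
    '''Stand-in module used only for this example.
    parameters from parameters file: RAW_DATA_DIR,QC_PATH, FASTQC_VERSION'''


needed = parameters_from_docstring(analysis_script.__doc__)
print(needed)

# Hypothetical parameters loaded from a parameters file.
params = {"RAW_DATA_DIR": "/path/to/raw", "QC_PATH": "/path/to/qc"}
print([name for name in needed if name not in params])
```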
Omics Pipe Available Pipelines¶
RNA-seq (Tuxedo)¶
- RNA-seq Tuxedo Modules
- Modules included in the Tuxedo RNA-seq pipeline.
- FASTQC
- TopHat
- Cufflinks
- Cuffmerge
- Cuffmergetocompare
- Cuffdiff
- R Summary Report - CummeRbund
RNA-seq (Anders 2013)¶
- RNA-seq Count Based Modules
- Modules included in the count-based RNA-seq pipeline.
- FASTQC
- STAR
- HTSEQ
- R Summary Report - DESEQ2
Whole Exome Sequencing (GATK)¶
- Whole Genome and Whole Exome Sequencing Modules
- Modules included in the whole exome sequencing pipeline.
- FASTQC
- BWA-MEM
- PICARD Mark Duplicates
- GATK Preprocessing
- GATK Variant Discovery
- GATK Variant Filtering
Whole Genome Sequencing (GATK)¶
- Whole Genome and Whole Exome Sequencing Modules
- Modules included in the whole genome sequencing pipeline.
- FASTQC
- BWA-MEM
- PICARD Mark Duplicates
- GATK Preprocessing
- GATK Variant Discovery
- GATK Variant Filtering
Whole Genome Sequencing (MUTECT)¶
- Whole Genome Sequencing (MUTECT)
- Modules included in the cancer (paired tumor/normal) whole genome sequencing pipeline.
- FASTQC
- BWA-MEM
- MUTECT
ChIP-seq (MACS)¶
- ChIP-seq Modules – MACS
- Modules included in the ChIP-seq MACS pipeline.
- FASTQC
- Homer ChIP Trim
- Bowtie
- MACS
ChIP-seq (HOMER)¶
- ChIP-seq Modules – HOMER
- Modules included in the ChIP-seq HOMER pipeline.
- FASTQC
- Homer ChIP Trim
- Bowtie
- Homer Read Density
- Homer Peaks
- Homer Peak Track
- Homer Annotate Peaks
- Homer Find Motifs
Breast Cancer Personalized Genomics Report- RNAseq¶
- Breast Cancer Personalized Genomics Report- RNAseq
- Modules included in the RNAseq Cancer pipeline.
- FASTQC
- STAR
- RSEQC
- Fusion Catcher
- BWA/SNPiR
- Filter Variants
- HTseq
- Intogen
- OncoRep Cancer Report
TCGA Reanalysis Pipeline - RNAseq¶
- TCGA Reanalysis Pipeline - RNAseq
- Modules included in the RNAseq Cancer pipeline.
- TCGA Download (GeneTorrent)
- FASTQC
- STAR
- RSEQC
- Fusion Catcher
- BWA/SNPiR
- Filter Variants
- HTseq
- Intogen
- OncoRep Cancer Report
TCGA Reanalysis Pipeline - RNAseq Counts¶
- RNA-seq Count Based Modules- TCGA
- Modules included in the RNAseq counts pipeline for TCGA reanalysis.
- TCGA Download (GeneTorrent)
- FASTQC
- STAR
- HTSEQ
- Report
miRNAseq Counts (Anders 2013)¶
- miRNA-seq Count Based Modules
- Modules included in the miRNAseq counts pipeline.
- Cutadapt
- FASTQ Length Filter
- FASTQC
- STAR
- HTSEQ
- Report
miRNAseq (Tuxedo)¶
- miRNA-seq Tuxedo Modules
- Modules included in the miRNAseq Tuxedo pipeline.
- Cutadapt
- FASTQ Length Filter
- TopHat
- Cufflinks
- Cuffmerge
- Cuffmergetocompare
- Cuffdiff
- R Summary Report
All Available Modules¶
Reference Databases Needed¶
To run the pipelines, you will need to have reference databases installed on your cluster. If you are using the AWS installation, these databases are provided for you. If you need to install your references, please install the ones below. Omics Pipe is compatible with all species genome files. Examples below are for hg19, but you can substitute them for the equivalent files from other species.
All Pipelines¶
Genome¶
- .fa file can be downloaded from: http://cufflinks.cbcb.umd.edu/igenomes.html
Reference Annotation Files¶
You can use any reference annotations you would like, as long as they are GTF files.
Examples include:
- gencode.v18.annotation.gtf
- UCSC genes.gtf
Reference Data for Cancer Reporting Scripts (RNAseq cancer, TCGA pipelines)¶
For the cancer pipelines, please download the file from the link below, extract it and put the files in the respective directories. Reporting_data
In your omics pipe installation directory under omics_pipe/scripts/reporting/ref place the files.¶
In your omics pipe installation directory under omics_pipe/scripts/reporting/data place the remaining files.¶
- brca_mol_class/*
- DoG/*
- geneLists/*
- SPIA
- deseq.tcga_brca.Rdata
- loggeoameansBRCA.Rdata
References for Variants (RNA-seq cancer, RNA-seq cancer TCGA, WES and WGS pipelines)¶
For pipelines performing variant calling, please download the references below and put them in the specified directories.
You can put these files in any directory. You will point to their location in the parameters file.¶
Available within the GATK resource bundle v.2.5:
- dbsnp_137.hg19.vcf
- Mills_and_1000G_gold_standard.indels.hg19.vcf
- 1000G_phase1.indels.hg19.vcf
- hapmap_3.3.hg19.vcf
- 1000G_omni2.5.hg19.vcf
In your omics pipe installation directory under omics_pipe/scripts/reporting/ref place these files.¶
- cadd.tsv.gz from http://cadd.gs.washington.edu/download
- drugbank.tsv
- cosmic.tsv
- clinvar.txt
from PharmGKB:
- pharmgkbAllele.tsv
- pharmgkbRSID.csv
WES Pipeline¶
ChIP-seq Pipelines¶
SNPiR Pipelines (RNA-seq cancer and RNA-seq cancer TCGA pipelines)¶
- BWA Index
- RNA editing sites (Human_AG_all_hg19.bed)
- RepeatMasker.bed
- anno_combined_sorted
- knowngene.bed
Third Party Software Dependencies¶
Omics Pipe is dependent upon several third-party software packages. Before running Omics Pipe, please install all of the required tools for the pipeline you will be running (see below) as Modules on your local cluster. If you are running the AWS distribution, all third party software is already installed.
R Packages Needed¶
In R, you can cut and copy this to install all required packages:
install.packages(c("bibtex", "AnnotationDbi", "cluster", "cummeRbund", "data.table", "DBI", "DESeq2", "devtools", "dplyr", "gdata",
"ggplot2", "graphite", "igraph", "KEGGREST","knitr", "knitrBootstrap", "lattice", "locfit", "pamr", "pander", "pathview",
"plyr","RColorBrewer","Rcpp", "RcppArmadillo", "RCurl", "ReactomePA", "RefManageR","RJSONIO","RSQLite",
"stringr","survival", "XML", "xtable", "yaml"))
RNA-seq (Tuxedo)¶
- FASTQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- TOPHAT: http://tophat.cbcb.umd.edu/
- CUFFLINKS: http://cufflinks.cbcb.umd.edu/
RNA-seq (Anders 2013)¶
Whole Exome Sequencing (GATK)¶
Whole Genome Sequencing (GATK)¶
Whole Genome Sequencing (MUTECT)¶
ChIP-seq (MACS)¶
ChIP-seq (HOMER)¶
Breast Cancer Personalized Genomics Report- RNAseq¶
- FASTQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- STAR: http://code.google.com/p/rna-star/
- SAMTOOLS: http://samtools.sourceforge.net/
- HTSEQ: http://www-huber.embl.de/users/anders/HTSeq/doc/index.html
- RSEQC: http://rseqc.sourceforge.net/
- PICARD: http://picard.sourceforge.net/
- GATK: https://www.broadinstitute.org/gatk/download
- FusionCatcher: https://code.google.com/p/fusioncatcher/
- Oncofuse: http://www.unav.es/genetica/oncofuse.html
- BWA: http://bio-bwa.sourceforge.net/
- DNANEXUS SAMTOOLS: https://github.com/dnanexus/samtools
- BEDTOOLS: https://github.com/arq5x/bedtools2
- BLAT: https://genome.ucsc.edu/FAQ/FAQblat.html#blat3
- SNPiR: http://lilab.stanford.edu/SNPiR/
- SNPEFF: http://snpeff.sourceforge.net/
- SNPSIFT: http://snpeff.sourceforge.net/SnpSift.html
- VCFTOOLS: http://vcftools.sourceforge.net/
TCGA Reanalysis Pipeline - RNAseq¶
- GeneTorrent: https://cghub.ucsc.edu/docs/user/software.html
- FASTQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- STAR: http://code.google.com/p/rna-star/
- SAMTOOLS: http://samtools.sourceforge.net/
- HTSEQ: http://www-huber.embl.de/users/anders/HTSeq/doc/index.html
- RSEQC: http://rseqc.sourceforge.net/
- PICARD: http://picard.sourceforge.net/
- GATK: https://www.broadinstitute.org/gatk/download
- FusionCatcher: https://code.google.com/p/fusioncatcher/
- Oncofuse: http://www.unav.es/genetica/oncofuse.html
- BWA: http://bio-bwa.sourceforge.net/
- DNANEXUS SAMTOOLS: https://github.com/dnanexus/samtools
- BEDTOOLS: https://github.com/arq5x/bedtools2
- BLAT: https://genome.ucsc.edu/FAQ/FAQblat.html#blat3
- SNPiR: http://lilab.stanford.edu/SNPiR/
- SNPEFF: http://snpeff.sourceforge.net/
- SNPSIFT: http://snpeff.sourceforge.net/SnpSift.html
- VCFTOOLS: http://vcftools.sourceforge.net/
TCGA Reanalysis Pipeline - RNAseq Counts¶
miRNAseq Counts (Anders 2013)¶
miRNAseq (Tuxedo)¶
- CutAdapt: http://code.google.com/p/cutadapt/
- FASTQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- TOPHAT: http://tophat.cbcb.umd.edu/
- CUFFLINKS: http://cufflinks.cbcb.umd.edu/
System Requirements¶
If you are running Omics Pipe on a local high-performance compute cluster, please ensure that the cluster meets the following minimum resource requirements.
- A minimum of 2 processors (nodes) with at least 32GB of memory (the more nodes you have available, the more the pipeline can be parallelized)
- Scratch space that is at least 3x the size of your expected results files.
- Storage space available for ~200 GB of reference-related data
- Storage space available for raw data files
- Storage space for results files (~10x that of raw data)
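The storage figures above can be turned into a rough estimate for a given project. The following is a minimal sketch that applies only the multipliers stated in this list (about 200 GB of reference data, results at about 10x the raw data, scratch at 3x the results). The function name `estimate_storage_gb` and the 50 GB example are hypothetical and are not part of Omics Pipe.

```python
def estimate_storage_gb(raw_data_gb, reference_gb=200.0,
                        results_factor=10.0, scratch_factor=3.0):
    """Estimate storage needs in GB from the size of the raw data.

    The default factors mirror the rough guidance in the requirements list;
    treat the result as an order-of-magnitude planning figure.
    """
    results_gb = raw_data_gb * results_factor
    scratch_gb = results_gb * scratch_factor
    return {
        "raw": raw_data_gb,
        "reference": reference_gb,
        "results": results_gb,
        "scratch": scratch_gb,
        "total": raw_data_gb + reference_gb + results_gb + scratch_gb,
    }


# Hypothetical project with 50 GB of raw .fastq files.
estimate = estimate_storage_gb(50.0)
for name, size_gb in estimate.items():
    print(f"{name:>9}: {size_gb:8.1f} GB")
```

Because scratch space scales with the results, which in turn scale with the raw data, the raw data size dominates the estimate for all but the smallest projects.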
RNA-seq Tuxedo Modules¶
Modules available in the RNA-seq Tuxedo Pipeline.
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
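Each module reads the keys listed under "parameters from parameters file". As a minimal sketch, assuming the parameters have already been loaded into a plain Python dictionary, a pre-flight check for the FASTQC keys listed above could look like the following. The helper `missing_parameters` and all example values are hypothetical and are not part of Omics Pipe.

```python
# Keys the FASTQC module reads, as listed in this section.
FASTQC_KEYS = ["RAW_DATA_DIR", "QC_PATH", "FASTQC_VERSION", "COMPRESSION"]


def missing_parameters(params, required_keys):
    """Return the required keys that are absent from the params dictionary."""
    return [key for key in required_keys if key not in params]


# Hypothetical parameter values, for illustration only.
example_params = {
    "RAW_DATA_DIR": "/data/raw",
    "QC_PATH": "/data/results/QC",
    "FASTQC_VERSION": "0.10.1",
}

missing = missing_parameters(example_params, FASTQC_KEYS)
if missing:
    print("Missing parameters:", ", ".join(missing))
```

The same check can be reused for any other module by passing the key list given in that module's entry.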
TopHat¶
- omics_pipe.modules.tophat.tophat(sample, tophat_flag)[source]¶
Runs TopHat to align .fastq files.
- input:
- .fastq file
- output:
- accepted_hits.bam
- citation:
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
- link:
- http://tophat.cbcb.umd.edu/
- parameters from parameters file:
RAW_DATA_DIR:
REF_GENES:
TOPHAT_RESULTS:
BOWTIE_INDEX:
TOPHAT_VERSION:
TOPHAT_OPTIONS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
Cufflinks¶
- omics_pipe.modules.cufflinks.cufflinks(sample, cufflinks_flag)[source]¶
Runs Cufflinks to assemble transcripts from the .bam files produced by TopHat.
- input:
- accepted_hits.bam
- output:
- transcripts.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
TOPHAT_RESULTS:
CUFFLINKS_RESULTS:
REF_GENES:
GENOME:
CUFFLINKS_OPTIONS:
CUFFLINKS_VERSION:
Cuffmerge¶
- omics_pipe.modules.cuffmerge.cuffmerge(step, cuffmerge_flag)[source]¶
Runs cuffmerge to merge .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
REF_GENES:
GENOME:
CUFFMERGE_OPTIONS:
CUFFLINKS_VERSION:
Cuffmergetocompare¶
- omics_pipe.modules.cuffmergetocompare.cuffmergetocompare(step, cuffmergetocompare_flag)[source]¶
Runs cuffcompare to annotate merged .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
REF_GENES:
GENOME:
CUFFMERGETOCOMPARE_OPTIONS:
CUFFLINKS_VERSION:
Cuffdiff¶
- omics_pipe.modules.cuffdiff.cuffdiff(step, cuffdiff_flag)[source]¶
Runs Cuffdiff to perform differential expression. Runs after Cufflinks. Part of Tuxedo Suite.
- input:
- .bam files
- output:
- differential expression results
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFDIFF_RESULTS:
GENOME:
CUFFDIFF_OPTIONS:
CUFFMERGE_RESULTS:
CUFFDIFF_INPUT_LIST_COND1:
CUFFDIFF_INPUT_LIST_COND2:
CUFFLINKS_VERSION:
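The parameters above map naturally onto the cuffdiff command line, which takes a transcript annotation followed by one comma-separated list of alignment files per condition. The sketch below only illustrates that mapping and is not the Omics Pipe implementation. It assumes that merged.gtf sits directly inside CUFFMERGE_RESULTS and that the two CUFFDIFF_INPUT_LIST values are already comma-separated .bam paths. The helper name and all paths are hypothetical.

```python
import os
import shlex


def build_cuffdiff_command(params):
    """Assemble a cuffdiff argument list from a parameter dictionary."""
    # Assumption: cuffmerge wrote merged.gtf directly into CUFFMERGE_RESULTS.
    merged_gtf = os.path.join(params["CUFFMERGE_RESULTS"], "merged.gtf")
    command = ["cuffdiff", "-o", params["CUFFDIFF_RESULTS"]]
    # Extra options are split the way a shell would split them.
    command += shlex.split(params.get("CUFFDIFF_OPTIONS", ""))
    # Positional arguments: annotation, then one file list per condition.
    command += [
        merged_gtf,
        params["CUFFDIFF_INPUT_LIST_COND1"],
        params["CUFFDIFF_INPUT_LIST_COND2"],
    ]
    return command


# Hypothetical values, for illustration only.
example_params = {
    "CUFFMERGE_RESULTS": "/results/cuffmerge",
    "CUFFDIFF_RESULTS": "/results/cuffdiff",
    "CUFFDIFF_OPTIONS": "-p 8",
    "CUFFDIFF_INPUT_LIST_COND1": "/results/tophat/s1/accepted_hits.bam,/results/tophat/s2/accepted_hits.bam",
    "CUFFDIFF_INPUT_LIST_COND2": "/results/tophat/s3/accepted_hits.bam,/results/tophat/s4/accepted_hits.bam",
}
print(" ".join(build_cuffdiff_command(example_params)))
```

Building the command as a list rather than a single string keeps paths containing spaces intact if the list is later passed to `subprocess.run`.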
R Summary Report¶
- omics_pipe.modules.RNAseq_report_tuxedo.RNAseq_report_tuxedo(sample, RNAseq_report_tuxedo_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
DPS_VERSION:
PARAMS_FILE:
RNA-seq Count Based Modules¶
Modules available in the count-based RNA-seq Pipeline.
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
STAR Aligner¶
- omics_pipe.modules.star.star(sample, star_flag)[source]¶
Runs STAR to align .fastq files.
- input:
- .fastq file
- output:
- Aligned.out.bam
- citation:
- Dobin et al, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts635 “STAR: ultrafast universal RNA-seq aligner”
- link:
- https://code.google.com/p/rna-star/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
STAR_INDEX:
STAR_OPTIONS:
STAR_RESULTS:
SAMTOOLS_VERSION:
STAR_VERSION:
COMPRESSION:
REF_GENES:
HTSEQ-count¶
- omics_pipe.modules.htseq.htseq(sample, htseq_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
STAR_RESULTS:
HTSEQ_OPTIONS:
REF_GENES:
HTSEQ_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
PYTHON_VERSION:
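The counts.txt output of htseq-count is a two-column, tab-separated table of feature identifiers and read counts, followed by a few summary rows. The sketch below shows one way to read such a table for downstream checks. It assumes the summary rows carry a double-underscore prefix (for example `__no_feature`), as written by recent HTSeq releases; adjust the prefix test if your version differs. The function name and the example data are hypothetical.

```python
import io


def read_htseq_counts(handle):
    """Split htseq-count output into gene counts and special counters."""
    gene_counts = {}
    special_counters = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        feature_id, count = line.split("\t")
        # Summary rows such as __no_feature are kept apart from real genes.
        if feature_id.startswith("__"):
            special_counters[feature_id] = int(count)
        else:
            gene_counts[feature_id] = int(count)
    return gene_counts, special_counters


# Made-up example content standing in for a counts.txt file.
example = io.StringIO(
    "GENE_A\t120\n"
    "GENE_B\t0\n"
    "__no_feature\t35\n"
    "__ambiguous\t4\n"
)
genes, special = read_htseq_counts(example)
print(sum(genes.values()), "reads assigned to", len(genes), "genes")
```

Keeping the summary rows separate avoids treating them as genes when the counts are passed on to differential expression analysis.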
R Summary Report - DESEQ2¶
- omics_pipe.modules.RNAseq_report_counts.RNAseq_report_counts(sample, RNAseq_report_counts_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
PARAMS_FILE:
Breast Cancer Personalized Genomics Report- RNAseq¶
Modules included in the RNAseq Cancer pipeline.
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
STAR Aligner¶
- omics_pipe.modules.star.star(sample, star_flag)[source]¶
Runs STAR to align .fastq files.
- input:
- .fastq file
- output:
- Aligned.out.bam
- citation:
- Dobin et al, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts635 “STAR: ultrafast universal RNA-seq aligner”
- link:
- https://code.google.com/p/rna-star/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
STAR_INDEX:
STAR_OPTIONS:
STAR_RESULTS:
SAMTOOLS_VERSION:
STAR_VERSION:
COMPRESSION:
REF_GENES:
HTSEQ-count¶
- omics_pipe.modules.htseq.htseq(sample, htseq_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
STAR_RESULTS:
HTSEQ_OPTIONS:
REF_GENES:
HTSEQ_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
PYTHON_VERSION:
RSEQC¶
- omics_pipe.modules.rseqc.rseqc(sample, rseqc_flag)[source]¶
Runs rseqc to determine insert size as QC for alignment.
- input:
- .bam
- output:
- pdf plot
- link:
- http://rseqc.sourceforge.net/
- parameters from parameters file:
STAR_RESULTS:
QC_PATH:
BAM_FILE_NAME:
RSEQC_REF:
RSEQC_VERSION:
TEMP_DIR:
Fusion Catcher¶
- omics_pipe.modules.fusion_catcher.fusion_catcher(sample, fusion_catcher_flag)[source]¶
Detects fusion genes in paired-end RNAseq data.
- input:
- paired end .fastq files
- output:
- list of candidate fusion genes
- citation:
- S. Kangaspeska, S. Hultsch, H. Edgren, D. Nicorici, A. Murumägi, O.P. Kallioniemi, Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms, PLOS One, Oct. 2012. http://dx.plos.org/10.1371/journal.pone.0048745
- link:
- https://code.google.com/p/fusioncatcher
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
FUSION_RESULTS:
FUSIONCATCHERBUILD_DIR:
TEMP_DIR:
SAMTOOLS_VERSION:
FUSIONCATCHER_VERSION:
FUSIONCATCHER_OPTIONS:
TISSUE:
PYTHON_VERSION:
BWA/SNPiR¶
BWA¶
- omics_pipe.modules.bwa.bwa1(sample, bwa1_flag)[source]¶
BWA aligner for read1 of paired_end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
- omics_pipe.modules.bwa.bwa2(sample, bwa2_flag)[source]¶
BWA aligner for read2 of paired_end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
- omics_pipe.modules.bwa.bwa_RNA(sample, bwa_flag)[source]¶
BWA aligner for single end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
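The three BWA functions above split the work by read layout: bwa1 and bwa2 handle read 1 and read 2 of paired-end data, and bwa_RNA handles single-end data. A minimal sketch of that choice follows. The helper `bwa_steps` is hypothetical and only returns the function names documented above; it does not call Omics Pipe.

```python
def bwa_steps(paired_end):
    """Return the names of the BWA module functions that apply to a sample."""
    if paired_end:
        # One alignment step per mate of the read pair.
        return ["bwa1", "bwa2"]
    # Single-end reads need only one alignment step.
    return ["bwa_RNA"]


for layout, paired in (("paired-end", True), ("single-end", False)):
    print(layout, "->", ", ".join(bwa_steps(paired)))
```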
SNPiR¶
- omics_pipe.modules.snpir_variants.snpir_variants(sample, snpir_variants_flag)[source]¶
Calls variants using the SNPiR pipeline.
- input:
- Aligned.out.sort.bam or accepted_hits.bam
- output:
- final_variants.vcf file
- citation:
- Piskol, R., et al. (2013). “Reliable Identification of Genomic Variants from RNA-Seq Data.” The American Journal of Human Genetics 93(4): 641-651.
- link:
- http://lilab.stanford.edu/SNPiR/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
PICARD_VERSION:
GATK_VERSION:
BEDTOOLS_VERSION:
UCSC_TOOLS_VERSION:
GENOME:
REPEAT_MASKER:
SNPIR_ANNOTATION:
RNA_EDIT:
DBSNP:
MILLS:
G1000:
WORKING_DIR:
BWA_RESULTS:
SNPIR_VERSION:
SNPIR_CONFIG:
SNPIR_DIR:
ENCODING:
Filter Variants¶
- omics_pipe.modules.filter_variants.filter_variants(sample, filter_variants_flag)[source]¶
Filters variants to remove common variants.
- input:
- .bam or .sam file
- output:
- .vcf file
- citation:
- Piskol et al. 2013. Reliable identification of genomic variants from RNA-seq data. The American Journal of Human Genetics 93: 641-651.
- link:
- http://lilab.stanford.edu/SNPiR/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
PICARD_VERSION:
GATK_VERSION:
BEDTOOLS_VERSION:
UCSC_TOOLS_VERSION:
GENOME:
REPEAT_MASKER:
SNPIR_ANNOTATION:
RNA_EDIT:
DBSNP:
MILLS:
G1000:
WORKING_DIR:
BWA_RESULTS:
SNPIR_VERSION:
SNPIR_CONFIG:
SNPIR_DIR:
SNPEFF_VERSION:
dbNSFP:
VCFTOOLS_VERSION:
SNP_FILTER_OUT_REF:
Intogen¶
- omics_pipe.modules.intogen.intogen(sample, intogen_flag)[source]¶
Runs Intogen to rank mutations and assess their implications for the cancer phenotype. Runs after variant calling.
- input:
- .vcf
- output:
- variant list
- citation:
- Gonzalez-Perez et al. 2013. Intogen mutations identifies cancer drivers across tumor types. Nature Methods 10, 1081-1082.
- link:
- http://www.intogen.org/
- parameters from parameter file:
VCF_FILE:
INTOGEN_OPTIONS:
INTOGEN_RESULTS:
INTOGEN_VERSION:
USERNAME:
WORKING_DIR:
TEMP_DIR:
SCHEDULER:
VARIANT_RESULTS:
OncoRep Cancer Report¶
- omics_pipe.modules.BreastCancer_RNA_report.BreastCancer_RNA_report(sample, BreastCancer_RNA_report_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
PARAMS_FILE:
TABIX_VERSION:
TUMOR_TYPE:
GENELIST:
COSMIC:
CLINVAR:
PHARMGKB_rsID:
PHARMGKB_Allele:
DRUGBANK:
CADD:
TCGA Reanalysis Pipeline - RNAseq¶
Modules included in the TCGA RNAseq Cancer pipeline.
TCGA Download¶
- omics_pipe.modules.TCGA_download.TCGA_download(sample, TCGA_download_flag)[source]¶
Downloads and unzips TCGA data from Manifest.xml downloaded from CGHub.
- input:
- TCGA XML file
- output:
- downloaded files from TCGA
- citation:
- The Cancer Genome Atlas
- link:
- https://cghub.ucsc.edu/software/downloads.html
- parameters from parameters file:
TCGA_XML_FILE:
TCGA_KEY:
TCGA_OUTPUT_PATH:
CGATOOLS_VERSION:
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
STAR Aligner¶
- omics_pipe.modules.star.star(sample, star_flag)[source]¶
Runs STAR to align .fastq files.
- input:
- .fastq file
- output:
- Aligned.out.bam
- citation:
- Dobin et al, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts635 “STAR: ultrafast universal RNA-seq aligner”
- link:
- https://code.google.com/p/rna-star/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
STAR_INDEX:
STAR_OPTIONS:
STAR_RESULTS:
SAMTOOLS_VERSION:
STAR_VERSION:
COMPRESSION:
REF_GENES:
HTSEQ-count¶
- omics_pipe.modules.htseq.htseq(sample, htseq_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
STAR_RESULTS:
HTSEQ_OPTIONS:
REF_GENES:
HTSEQ_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
PYTHON_VERSION:
RSEQC¶
- omics_pipe.modules.rseqc.rseqc(sample, rseqc_flag)[source]¶
Runs rseqc to determine insert size as QC for alignment.
- input:
- .bam
- output:
- pdf plot
- link:
- http://rseqc.sourceforge.net/
- parameters from parameters file:
STAR_RESULTS:
QC_PATH:
BAM_FILE_NAME:
RSEQC_REF:
RSEQC_VERSION:
TEMP_DIR:
Fusion Catcher¶
- omics_pipe.modules.fusion_catcher.fusion_catcher(sample, fusion_catcher_flag)[source]¶
Detects fusion genes in paired-end RNAseq data.
- input:
- paired end .fastq files
- output:
- list of candidate fusion genes
- citation:
- S. Kangaspeska, S. Hultsch, H. Edgren, D. Nicorici, A. Murumägi, O.P. Kallioniemi, Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms, PLOS One, Oct. 2012. http://dx.plos.org/10.1371/journal.pone.0048745
- link:
- https://code.google.com/p/fusioncatcher
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
FUSION_RESULTS:
FUSIONCATCHERBUILD_DIR:
TEMP_DIR:
SAMTOOLS_VERSION:
FUSIONCATCHER_VERSION:
FUSIONCATCHER_OPTIONS:
TISSUE:
PYTHON_VERSION:
BWA/SNPiR¶
BWA¶
- omics_pipe.modules.bwa.bwa1(sample, bwa1_flag)[source]¶
BWA aligner for read1 of paired_end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
- omics_pipe.modules.bwa.bwa2(sample, bwa2_flag)[source]¶
BWA aligner for read2 of paired_end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
- omics_pipe.modules.bwa.bwa_RNA(sample, bwa_flag)[source]¶
BWA aligner for single end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
SNPiR¶
- omics_pipe.modules.snpir_variants.snpir_variants(sample, snpir_variants_flag)[source]¶
Calls variants using the SNPiR pipeline.
- input:
- Aligned.out.sort.bam or accepted_hits.bam
- output:
- final_variants.vcf file
- citation:
- Piskol, R., et al. (2013). “Reliable Identification of Genomic Variants from RNA-Seq Data.” The American Journal of Human Genetics 93(4): 641-651.
- link:
- http://lilab.stanford.edu/SNPiR/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
PICARD_VERSION:
GATK_VERSION:
BEDTOOLS_VERSION:
UCSC_TOOLS_VERSION:
GENOME:
REPEAT_MASKER:
SNPIR_ANNOTATION:
RNA_EDIT:
DBSNP:
MILLS:
G1000:
WORKING_DIR:
BWA_RESULTS:
SNPIR_VERSION:
SNPIR_CONFIG:
SNPIR_DIR:
ENCODING:
Filter Variants¶
- omics_pipe.modules.filter_variants.filter_variants(sample, filter_variants_flag)[source]¶
Filters variants to remove common variants.
- input:
- .bam or .sam file
- output:
- .vcf file
- citation:
- Piskol et al. 2013. Reliable identification of genomic variants from RNA-seq data. The American Journal of Human Genetics 93: 641-651.
- link:
- http://lilab.stanford.edu/SNPiR/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
PICARD_VERSION:
GATK_VERSION:
BEDTOOLS_VERSION:
UCSC_TOOLS_VERSION:
GENOME:
REPEAT_MASKER:
SNPIR_ANNOTATION:
RNA_EDIT:
DBSNP:
MILLS:
G1000:
WORKING_DIR:
BWA_RESULTS:
SNPIR_VERSION:
SNPIR_CONFIG:
SNPIR_DIR:
SNPEFF_VERSION:
dbNSFP:
VCFTOOLS_VERSION:
SNP_FILTER_OUT_REF:
Intogen¶
- omics_pipe.modules.intogen.intogen(sample, intogen_flag)[source]¶
Runs Intogen to rank mutations and assess their implications for the cancer phenotype. Runs after variant calling.
- input:
- .vcf
- output:
- variant list
- citation:
- Gonzalez-Perez et al. 2013. Intogen mutations identifies cancer drivers across tumor types. Nature Methods 10, 1081-1082.
- link:
- http://www.intogen.org/
- parameters from parameter file:
VCF_FILE:
INTOGEN_OPTIONS:
INTOGEN_RESULTS:
INTOGEN_VERSION:
USERNAME:
WORKING_DIR:
TEMP_DIR:
SCHEDULER:
VARIANT_RESULTS:
OncoRep Cancer Report¶
- omics_pipe.modules.BreastCancer_RNA_report.BreastCancer_RNA_report(sample, BreastCancer_RNA_report_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
PARAMS_FILE:
TABIX_VERSION:
TUMOR_TYPE:
GENELIST:
COSMIC:
CLINVAR:
PHARMGKB_rsID:
PHARMGKB_Allele:
DRUGBANK:
CADD:
RNA-seq Count Based Modules- TCGA¶
Modules available in the TCGA count-based RNA-seq Pipeline.
TCGA Download¶
- omics_pipe.modules.TCGA_download.TCGA_download(sample, TCGA_download_flag)[source]¶
Downloads and unzips TCGA data from Manifest.xml downloaded from CGHub.
- input:
- TCGA XML file
- output:
- downloaded files from TCGA
- citation:
- The Cancer Genome Atlas
- link:
- https://cghub.ucsc.edu/software/downloads.html
- parameters from parameters file:
TCGA_XML_FILE:
TCGA_KEY:
TCGA_OUTPUT_PATH:
CGATOOLS_VERSION:
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
STAR Aligner¶
- omics_pipe.modules.star.star(sample, star_flag)[source]¶
Runs STAR to align .fastq files.
- input:
- .fastq file
- output:
- Aligned.out.bam
- citation:
- Dobin et al, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts635 “STAR: ultrafast universal RNA-seq aligner”
- link:
- https://code.google.com/p/rna-star/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
STAR_INDEX:
STAR_OPTIONS:
STAR_RESULTS:
SAMTOOLS_VERSION:
STAR_VERSION:
COMPRESSION:
REF_GENES:
HTSEQ-count¶
- omics_pipe.modules.htseq.htseq(sample, htseq_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
STAR_RESULTS:
HTSEQ_OPTIONS:
REF_GENES:
HTSEQ_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
PYTHON_VERSION:
R Summary Report - DESEQ2¶
- omics_pipe.modules.RNAseq_report_counts.RNAseq_report_counts(sample, RNAseq_report_counts_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
PARAMS_FILE:
miRNA-seq Tuxedo Modules¶
Modules available in the miRNA-seq Tuxedo Pipeline.
CutAdapt¶
- omics_pipe.modules.cutadapt_miRNA.cutadapt_miRNA(sample, cutadapt_miRNA_flag)[source]¶
Runs Cutadapt to trim adapters from reads.
- input:
- .fastq
- output:
- .fastq
- citation:
- Martin 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 10-12.
- link:
- https://code.google.com/p/cutadapt/
- parameters from parameters file:
RAW_DATA_DIR:
ADAPTER:
TRIMMED_DATA_PATH:
PYTHON_VERSION:
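To make the trimming step concrete, the sketch below removes a 3' adapter from a single read using exact string matching. It is a deliberately simplified illustration of what adapter trimming does, not Cutadapt's algorithm, which also tolerates mismatches and indels. The function name, the `min_overlap` default, and the example sequences are assumptions chosen for illustration.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter (exact matches only) from a read sequence."""
    # Case 1: the whole adapter occurs somewhere in the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: only the start of the adapter fits at the end of the read.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter found: return the read unchanged.
    return read


# Example values: a 22-nt insert followed by the first 12 bases of an adapter.
adapter = "TGGAATTCTCGGGTGCCAAGG"
insert = "TGAGGTAGTAGGTTGTATAGTT"
read = insert + adapter[:12]
print(trim_3prime_adapter(read, adapter))
```

The partial-match case matters for miRNA-seq because the inserts are short, so many reads run into the adapter before the end of the sequencing cycle.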
Fastq Length Filter¶
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
TopHat¶
- omics_pipe.modules.tophat.tophat(sample, tophat_flag)[source]¶
Runs TopHat to align .fastq files.
- input:
- .fastq file
- output:
- accepted_hits.bam
- citation:
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
- link:
- http://tophat.cbcb.umd.edu/
- parameters from parameters file:
RAW_DATA_DIR:
REF_GENES:
TOPHAT_RESULTS:
BOWTIE_INDEX:
TOPHAT_VERSION:
TOPHAT_OPTIONS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
Cufflinks¶
- omics_pipe.modules.cufflinks.cufflinks(sample, cufflinks_flag)[source]¶
Runs Cufflinks to assemble transcripts from the .bam files produced by TopHat.
- input:
- accepted_hits.bam
- output:
- transcripts.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
TOPHAT_RESULTS:
CUFFLINKS_RESULTS:
REF_GENES:
GENOME:
CUFFLINKS_OPTIONS:
CUFFLINKS_VERSION:
Cuffmerge¶
- omics_pipe.modules.cuffmerge.cuffmerge(step, cuffmerge_flag)[source]¶
Runs cuffmerge to merge .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
REF_GENES:
GENOME:
CUFFMERGE_OPTIONS:
CUFFLINKS_VERSION:
Cuffmergetocompare¶
- omics_pipe.modules.cuffmergetocompare.cuffmergetocompare(step, cuffmergetocompare_flag)[source]¶
Runs cuffcompare to annotate merged .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
REF_GENES:
GENOME:
CUFFMERGETOCOMPARE_OPTIONS:
CUFFLINKS_VERSION:
Cuffdiff¶
- omics_pipe.modules.cuffdiff.cuffdiff(step, cuffdiff_flag)[source]¶
Runs Cuffdiff to perform differential expression. Runs after Cufflinks. Part of Tuxedo Suite.
- input:
- .bam files
- output:
- differential expression results
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFDIFF_RESULTS:
GENOME:
CUFFDIFF_OPTIONS:
CUFFMERGE_RESULTS:
CUFFDIFF_INPUT_LIST_COND1:
CUFFDIFF_INPUT_LIST_COND2:
CUFFLINKS_VERSION:
R Summary Report¶
- omics_pipe.modules.RNAseq_report_tuxedo.RNAseq_report_tuxedo(sample, RNAseq_report_tuxedo_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
DPS_VERSION:
PARAMS_FILE:
miRNA-seq Count Based Modules¶
Modules available in the count-based miRNA-seq Pipeline.
CutAdapt¶
- omics_pipe.modules.cutadapt_miRNA.cutadapt_miRNA(sample, cutadapt_miRNA_flag)[source]¶
Runs Cutadapt to trim adapters from reads.
- input:
- .fastq
- output:
- .fastq
- citation:
- Martin 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 10-12.
- link:
- https://code.google.com/p/cutadapt/
- parameters from parameters file:
RAW_DATA_DIR:
ADAPTER:
TRIMMED_DATA_PATH:
PYTHON_VERSION:
Fastq Length Filter¶
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
STAR Aligner¶
- omics_pipe.modules.star.star(sample, star_flag)[source]¶
Runs STAR to align .fastq files.
- input:
- .fastq file
- output:
- Aligned.out.bam
- citation:
- Dobin et al, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts635 “STAR: ultrafast universal RNA-seq aligner”
- link:
- https://code.google.com/p/rna-star/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
STAR_INDEX:
STAR_OPTIONS:
STAR_RESULTS:
SAMTOOLS_VERSION:
STAR_VERSION:
COMPRESSION:
REF_GENES:
HTSEQ¶
- omics_pipe.modules.htseq.htseq(sample, htseq_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
STAR_RESULTS:
HTSEQ_OPTIONS:
REF_GENES:
HTSEQ_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
PYTHON_VERSION:
R Summary Report - DESEQ2¶
- omics_pipe.modules.RNAseq_report_counts.RNAseq_report_counts(sample, RNAseq_report_counts_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
PARAMS_FILE:
ChIP-seq Modules – HOMER¶
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
ChIP trim¶
- omics_pipe.modules.ChIP_trim.ChIP_trim(sample, ChIP_trim_flag)[source]¶
Runs Homer Tools to trim adapters from .fastq files.
- input:
- .fastq file
- output:
- .fastq file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
HOMER_TRIM_OPTIONS:
TRIMMED_DATA_PATH:
HOMER_VERSION:
Bowtie¶
- omics_pipe.modules.bowtie.bowtie(sample, bowtie_flag)[source]¶
Runs Bowtie to align .fastq files.
- input:
- .fastq file
- output:
- sample.bam
- citation:
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25
- link:
- http://bowtie-bio.sourceforge.net/index.shtml
- parameters from parameters file:
ENDS:
TRIMMED_DATA_PATH:
BOWTIE_OPTIONS:
BOWTIE_INDEX:
BOWTIE_RESULTS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
BEDTOOLS_VERSION:
TEMP_DIR:
Read Density -HOMER¶
- omics_pipe.modules.read_density.read_density(sample, read_density_flag)[source]¶
Runs HOMER to visualize read density from ChIPseq data.
- input:
- .bam file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
BOWTIE_RESULTS:
CHROM_SIZES:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
Peak Detection - HOMER¶
- omics_pipe.modules.homer_peaks.homer_peaks(step, homer_peaks_flag)[source]¶
Runs HOMER to call peaks from ChIPseq data.
- input:
- .tag input file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_PEAKS_OPTIONS:
HOMER_VERSION:
TEMP_DIR:
Peak Annotation & Visualization - HOMER¶
- omics_pipe.modules.peak_track.peak_track(step, peak_track_flag)[source]¶
Runs HOMER to create peak track from ChIPseq data.
- input:
- .tag input file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
- omics_pipe.modules.annotate_peaks.annotate_peaks(step, annotate_peaks_flag)[source]¶
Runs HOMER to annotate peak track from ChIPseq data.
- input:
- .tag input file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
HOMER_GENOME:
HOMER_ANNOTATE_OPTIONS:
Find Motifs - HOMER¶
- omics_pipe.modules.find_motifs.find_motifs(step, find_motifs_flag)[source]¶
Runs HOMER to find motifs from ChIPseq data.
- input:
- .txt peak file from Homer
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
HOMER_GENOME:
HOMER_MOTIFS_OPTIONS:
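Each module above reads a fixed set of keys from the parameters file. The following is a minimal sketch of a check for the find_motifs keys before a run. It assumes the parameters have already been loaded into a plain Python dict. The helper `missing_parameters` and all example values are hypothetical and not part of Omics Pipe.

```python
# Keys copied from the find_motifs parameter list above.
FIND_MOTIFS_KEYS = [
    "PAIR_LIST",
    "HOMER_RESULTS",
    "HOMER_VERSION",
    "TEMP_DIR",
    "HOMER_GENOME",
    "HOMER_MOTIFS_OPTIONS",
]


def missing_parameters(params, required):
    """Return the required keys that are absent or empty, in listed order."""
    return [key for key in required if params.get(key) in (None, "")]


# Hypothetical, partially filled parameters (all values are placeholders).
example_params = {
    "PAIR_LIST": "placeholder",
    "HOMER_RESULTS": "/path/to/homer_results",
    "HOMER_VERSION": "",
    "TEMP_DIR": "/path/to/tmp",
}

missing = missing_parameters(example_params, FIND_MOTIFS_KEYS)
print(missing)
```

The same helper can be reused for any other module by passing that module's key list.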
ChIP-seq Modules – MACS¶
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
ChIP trim¶
- omics_pipe.modules.ChIP_trim.ChIP_trim(sample, ChIP_trim_flag)[source]¶
Runs Homer Tools to trim adapters from .fastq files.
- input:
- .fastq file
- output:
- .fastq file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
HOMER_TRIM_OPTIONS:
TRIMMED_DATA_PATH:
HOMER_VERSION:
Bowtie¶
- omics_pipe.modules.bowtie.bowtie(sample, bowtie_flag)[source]¶
Runs Bowtie to align .fastq files.
- input:
- .fastq file
- output:
- sample.bam
- citation:
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25
- link:
- http://bowtie-bio.sourceforge.net/index.shtml
- parameters from parameters file:
ENDS:
TRIMMED_DATA_PATH:
BOWTIE_OPTIONS:
BOWTIE_INDEX:
BOWTIE_RESULTS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
BEDTOOLS_VERSION:
TEMP_DIR:
MACS¶
- omics_pipe.modules.macs.macs(step, macs_flag)[source]¶
Runs MACS to call peaks from ChIPseq data.
- input:
- .fastq file
- output:
- peaks and .bed file
- citation:
- Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137
- link:
- http://liulab.dfci.harvard.edu/MACS/
- parameters from parameters file:
PAIR_LIST:
BOWTIE_RESULTS:
CHROM_SIZES:
MACS_RESULTS:
MACS_VERSION:
TEMP_DIR:
BEDTOOLS_VERSION:
PYTHON_VERSION:
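Several keys in this section are shared between steps (for example TEMP_DIR and BOWTIE_RESULTS). As a rough sketch, the distinct keys a parameters file would need for the four steps listed here can be collected as follows. The key lists are copied from this section. The function `pipeline_keys` is a hypothetical helper, not an Omics Pipe API.

```python
# Parameter keys per step, copied from the lists in this section.
MACS_PIPELINE_KEYS = {
    "fastqc": ["RAW_DATA_DIR", "QC_PATH", "FASTQC_VERSION", "COMPRESSION"],
    "ChIP_trim": [
        "ENDS", "RAW_DATA_DIR", "HOMER_TRIM_OPTIONS",
        "TRIMMED_DATA_PATH", "HOMER_VERSION",
    ],
    "bowtie": [
        "ENDS", "TRIMMED_DATA_PATH", "BOWTIE_OPTIONS", "BOWTIE_INDEX",
        "BOWTIE_RESULTS", "BOWTIE_VERSION", "SAMTOOLS_VERSION",
        "BEDTOOLS_VERSION", "TEMP_DIR",
    ],
    "macs": [
        "PAIR_LIST", "BOWTIE_RESULTS", "CHROM_SIZES", "MACS_RESULTS",
        "MACS_VERSION", "TEMP_DIR", "BEDTOOLS_VERSION", "PYTHON_VERSION",
    ],
}


def pipeline_keys(keys_by_step):
    """Return every key once, in order of first appearance across the steps."""
    seen = []
    for keys in keys_by_step.values():
        for key in keys:
            if key not in seen:
                seen.append(key)
    return seen


all_keys = pipeline_keys(MACS_PIPELINE_KEYS)
print(len(all_keys), "distinct keys")
```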
Whole Genome and Whole Exome Sequencing Modules¶
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
BWA-MEM¶
- omics_pipe.modules.bwa.bwa_mem(sample, bwa_mem_flag)[source]¶
BWA aligner with BWA-MEM algorithm.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
GENOME:
RAW_DATA_DIR:
BWA_OPTIONS:
COMPRESSION:
PICARD Mark Duplicates¶
- omics_pipe.modules.picard_mark_duplicates.picard_mark_duplicates(sample, picard_mark_duplicates_flag)[source]¶
Picard tools Mark Duplicates.
- input:
- sorted.bam
- output:
- _sorted.rg.md.bam
- citation:
- http://picard.sourceforge.net/
- link:
- http://picard.sourceforge.net/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
PICARD_VERSION:
SAMTOOLS_VERSION:
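Many parameter lists include keys ending in _VERSION. Assuming these keys record which version of each tool a run should use (the lists here do not say so explicitly), a short sketch for pulling them out of a loaded parameters dict might look like this. The helper `tool_versions` and the example values are hypothetical.

```python
def tool_versions(params):
    """Map tool name to value for every key ending in _VERSION."""
    suffix = "_VERSION"
    return {
        key[: -len(suffix)]: value
        for key, value in params.items()
        if key.endswith(suffix)
    }


# Keys from the picard_mark_duplicates list; values are placeholders.
example_params = {
    "BWA_RESULTS": "/path/to/bwa_results",
    "TEMP_DIR": "/path/to/tmp",
    "PICARD_VERSION": "x.y",
    "SAMTOOLS_VERSION": "x.y.z",
}

versions = tool_versions(example_params)
print(versions)
```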
GATK Preprocessing¶
WES¶
- omics_pipe.modules.GATK_preprocessing_WES.GATK_preprocessing_WES(sample, GATK_preprocessing_WES_flag)[source]¶
GATK preprocessing steps for whole exome sequencing.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
G1000:
CAPTURE_KIT_BED:
SAMTOOLS_VERSION:
WGS¶
- omics_pipe.modules.GATK_preprocessing_WGS.GATK_preprocessing_WGS(sample, GATK_preprocessing_WGS_flag)[source]¶
GATK preprocessing steps for whole genome sequencing.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
G1000:
SAMTOOLS_VERSION:
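The WES and WGS preprocessing steps take nearly the same parameters. A quick way to see the difference is a set comparison of the two key lists as documented above. This is only an illustration of the lists in this section, not of how Omics Pipe reads them.

```python
# Key lists copied from the WES and WGS entries above.
WES_KEYS = {
    "BWA_RESULTS", "TEMP_DIR", "GATK_VERSION", "GENOME", "DBSNP",
    "MILLS", "G1000", "CAPTURE_KIT_BED", "SAMTOOLS_VERSION",
}
WGS_KEYS = {
    "BWA_RESULTS", "TEMP_DIR", "GATK_VERSION", "GENOME", "DBSNP",
    "MILLS", "G1000", "SAMTOOLS_VERSION",
}

wes_only = sorted(WES_KEYS - WGS_KEYS)  # keys needed only for exomes
wgs_only = sorted(WGS_KEYS - WES_KEYS)  # keys needed only for genomes
print(wes_only, wgs_only)
```

According to these lists, the only extra key for WES is CAPTURE_KIT_BED.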
GATK Variant Discovery¶
- omics_pipe.modules.GATK_variant_discovery.GATK_variant_discovery(sample, GATK_variant_discovery_flag)[source]¶
Runs the GATK variant discovery step.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
VARIANT_RESULTS:
GATK Variant Filtering¶
- omics_pipe.modules.GATK_variant_filtering.GATK_variant_filtering(sample, GATK_variant_filtering_flag)[source]¶
Runs the GATK variant filtering step.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
OMNI:
HAPMAP:
R_VERSION:
G1000_SNPs:
G1000_Indels:
- omics_pipe.modules.GATK_variant_filtering.GATK_variant_filtering_group(sample, GATK_variant_filtering_group_flag)[source]¶
Runs the GATK variant filtering step (group version).
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS_G1000:
OMNI:
HAPMAP:
R_VERSION:
G1000:
Whole Genome Sequencing (MUTECT)¶
FASTQC¶
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
BWA-MEM¶
- omics_pipe.modules.bwa.bwa_mem(sample, bwa_mem_flag)[source]¶
BWA aligner with BWA-MEM algorithm.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
GENOME:
RAW_DATA_DIR:
BWA_OPTIONS:
COMPRESSION:
MUTECT¶
- omics_pipe.modules.mutect.mutect(sample, mutect_flag)[source]¶
Runs MuTect on paired tumor/normal samples to detect somatic point mutations in cancer genomes.
- input:
- .bam
- output:
- call_stats.txt
- citation:
- Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnology (2013). doi:10.1038/nbt.2514
- link:
- http://www.broadinstitute.org/cancer/cga/mutect
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
G1000:
CAPTURE_KIT_BED:
All Available Modules¶
Below are all available modules in the current release of Omics Pipe in alphabetical order. When creating a custom pipeline, you can choose from these modules or create your own.
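Each entry is listed by its dotted path in the form package.module.function. The module file and the function usually share a name, but not always (for example omics_pipe.modules.bwa.bwa_mem). A minimal sketch for splitting such a path and, where Omics Pipe is installed, importing the function is given here. The helpers `split_entry` and `load_step` are hypothetical and not part of Omics Pipe.

```python
import importlib


def split_entry(dotted):
    """Split 'package.module.function' into (module path, function name)."""
    module_path, _, function_name = dotted.rpartition(".")
    return module_path, function_name


def load_step(dotted):
    """Import and return the step function; requires omics_pipe to be installed."""
    module_path, function_name = split_entry(dotted)
    return getattr(importlib.import_module(module_path), function_name)


parts = split_entry("omics_pipe.modules.bwa.bwa_mem")
print(parts)
```

Only `split_entry` is called here, so the sketch runs without Omics Pipe being installed.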
- omics_pipe.modules.annotate_peaks.annotate_peaks(step, annotate_peaks_flag)[source]¶
Runs HOMER to annotate peak track from ChIPseq data.
- input:
- .tag input file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
HOMER_GENOME:
HOMER_ANNOTATE_OPTIONS:
- omics_pipe.modules.annotate_variants.annotate_variants(sample, annotate_variants_flag)[source]¶
Annotates variants with ANNOVAR variant annotator. Follows VarCall.
- input:
- .vcf
- output:
- .vcf
- citation:
- Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data Nucleic Acids Research, 38:e164, 2010
- link:
- http://www.openbioinformatics.org/annovar/
- parameters from parameters file:
VARIANT_RESULTS:
ANNOVARDB:
ANNOVAR_OPTIONS:
ANNOVAR_OPTIONS2:
TEMP_DIR:
ANNOVAR_VERSION:
VCFTOOLS_VERSION:
- omics_pipe.modules.bowtie.bowtie(sample, bowtie_flag)[source]¶
Runs Bowtie to align .fastq files.
- input:
- .fastq file
- output:
- sample.bam
- citation:
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25
- link:
- http://bowtie-bio.sourceforge.net/index.shtml
- parameters from parameters file:
ENDS:
TRIMMED_DATA_PATH:
BOWTIE_OPTIONS:
BOWTIE_INDEX:
BOWTIE_RESULTS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
BEDTOOLS_VERSION:
TEMP_DIR:
- omics_pipe.modules.BreastCancer_RNA_report.BreastCancer_RNA_report(sample, BreastCancer_RNA_report_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
PARAMS_FILE:
TABIX_VERSION:
TUMOR_TYPE:
GENELIST:
COSMIC:
CLINVAR:
PHARMGKB_rsID:
PHARMGKB_Allele:
DRUGBANK:
CADD:
- omics_pipe.modules.bwa.bwa1(sample, bwa1_flag)[source]¶
BWA aligner for read1 of paired_end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
- omics_pipe.modules.bwa.bwa2(sample, bwa2_flag)[source]¶
BWA aligner for read2 of paired_end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
- omics_pipe.modules.bwa.bwa_RNA(sample, bwa_flag)[source]¶
BWA aligner for single end reads.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
BWA_INDEX:
RAW_DATA_DIR:
GATK_READ_GROUP_INFO:
COMPRESSION:
- omics_pipe.modules.bwa.bwa_mem(sample, bwa_mem_flag)[source]¶
BWA aligner with BWA-MEM algorithm.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
GENOME:
RAW_DATA_DIR:
BWA_OPTIONS:
COMPRESSION:
- omics_pipe.modules.bwa.bwa_mem_pipe(sample, bwa_mem_pipe_flag)[source]¶
BWA aligner with BWA-MEM algorithm.
- input:
- .fastq
- output:
- .sam
- citation:
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- link:
- http://bio-bwa.sourceforge.net/bwa.shtml
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
GENOME:
RAW_DATA_DIR:
BWA_OPTIONS:
COMPRESSION:
SAMBAMBA_VERSION:
SAMBLASTER_VERSION:
SAMBAMBA_OPTIONS:
- omics_pipe.modules.call_variants.call_variants(sample, call_variants_flag)[source]¶
Calls variants from alignment .bam files using Varcall.
- input:
- Aligned.out.sort.bam or accepted_hits.bam
- output:
- .vcf file
- citation:
- Erik Aronesty (2011). ea-utils: “Command-line tools for processing biological sequencing data”.
- link:
- https://code.google.com/p/ea-utils/wiki/Varcall
- parameters from parameters file:
STAR_RESULTS:
GENOME:
VARSCAN_PATH:
VARSCAN_OPTIONS:
VARIANT_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
ANNOVAR_VERSION:
VCFTOOLS_VERSION:
VARSCAN_VERSION:
SAMTOOLS_OPTIONS:
- omics_pipe.modules.ChIP_trim.ChIP_trim(sample, ChIP_trim_flag)[source]¶
Runs Homer Tools to trim adapters from .fastq files.
- input:
- .fastq file
- output:
- .fastq file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
HOMER_TRIM_OPTIONS:
TRIMMED_DATA_PATH:
HOMER_VERSION:
- omics_pipe.modules.cuffdiff.cuffdiff(step, cuffdiff_flag)[source]¶
Runs Cuffdiff to perform differential expression. Runs after Cufflinks. Part of Tuxedo Suite.
- input:
- .bam files
- output:
- differential expression results
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFDIFF_RESULTS:
GENOME:
CUFFDIFF_OPTIONS:
CUFFMERGE_RESULTS:
CUFFDIFF_INPUT_LIST_COND1:
CUFFDIFF_INPUT_LIST_COND2:
CUFFLINKS_VERSION:
- omics_pipe.modules.cuffdiff_miRNA.cuffdiff_miRNA(step, cuffdiff_miRNA_flag)[source]¶
Runs Cuffdiff to perform differential expression. Runs after Cufflinks. Part of Tuxedo Suite.
- input:
- .bam files
- output:
- differential expression results
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFDIFF_RESULTS:
GENOME:
CUFFDIFF_OPTIONS:
CUFFMERGE_RESULTS:
CUFFDIFF_INPUT_LIST_COND1:
CUFFDIFF_INPUT_LIST_COND2:
CUFFLINKS_VERSION:
- omics_pipe.modules.cufflinks.cufflinks(sample, cufflinks_flag)[source]¶
Runs cufflinks to assemble .bam files from TopHat.
- input:
- accepted_hits.bam
- output:
- transcripts.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
TOPHAT_RESULTS:
CUFFLINKS_RESULTS:
REF_GENES:
GENOME:
CUFFLINKS_OPTIONS:
CUFFLINKS_VERSION:
- omics_pipe.modules.cufflinks_miRNA.cufflinks_miRNA(sample, cufflinks_miRNA_flag)[source]¶
Runs cufflinks to assemble .bam files from TopHat. Takes parameter MIRNA_GTF.
- input:
- accepted_hits.bam
- output:
- transcripts.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
TOPHAT_RESULTS:
CUFFLINKS_RESULTS:
miRNA_GTF:
GENOME:
CUFFLINKS_OPTIONS:
CUFFLINKS_VERSION:
- omics_pipe.modules.cufflinks_ncRNA.cufflinks_ncRNA(sample, cufflinks_ncRNA_flag)[source]¶
Runs cufflinks to assemble .bam files from TopHat. Takes parameters LNCRNA_GTF and NONCODE_FASTA.
- input:
- accepted_hits.bam
- output:
- transcripts.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
TOPHAT_RESULTS:
CUFFLINKS_RESULTS:
LNCRNA_GTF:
NONCODE_FASTA:
CUFFLINKS_OPTIONS:
CUFFLINKS_VERSION:
- omics_pipe.modules.cuffmerge.cuffmerge(step, cuffmerge_flag)[source]¶
Runs cuffmerge to merge .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
REF_GENES:
GENOME:
CUFFMERGE_OPTIONS:
CUFFLINKS_VERSION:
- omics_pipe.modules.cuffmerge_miRNA.cuffmerge_miRNA(step, cuffmerge_miRNA_flag)[source]¶
Runs cuffmerge to merge .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
miRNA_GTF:
GENOME:
CUFFMERGE_OPTIONS:
CUFFLINKS_VERSION:
- omics_pipe.modules.cuffmergetocompare.cuffmergetocompare(step, cuffmergetocompare_flag)[source]¶
Runs cuffcompare to annotate merged .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
REF_GENES:
GENOME:
CUFFMERGETOCOMPARE_OPTIONS:
CUFFLINKS_VERSION:
- omics_pipe.modules.cuffmergetocompare_miRNA.cuffmergetocompare_miRNA(step, cuffmergetocompare_miRNA_flag)[source]¶
Runs cuffcompare to annotate merged .gtf files from Cufflinks.
- input:
- assembly_GTF_list.txt
- output:
- merged.gtf
- citation:
- Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
- link:
- http://cufflinks.cbcb.umd.edu/
- parameters from parameters file:
CUFFMERGE_RESULTS:
miRNA_GTF:
GENOME:
CUFFMERGETOCOMPARE_OPTIONS:
CUFFLINKS_VERSION:
- omics_pipe.modules.custom_R_report.custom_R_report(sample, custom_R_report_flag)[source]¶
Runs R script with knitr to produce report from omics pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameter file:
REPORT_SCRIPT:
R_VERSION:
REPORT_RESULTS:
R_MARKUP_FILE:
DPS_VERSION:
PARAMS_FILE:
- omics_pipe.modules.cutadapt_miRNA.cutadapt_miRNA(sample, cutadapt_miRNA_flag)[source]¶
Runs Cutadapt to trim adapters from reads.
- input:
- .fastq
- output:
- .fastq
- citation:
- Martin 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 10-12.
- link:
- https://code.google.com/p/cutadapt/
- parameters from parameters file:
RAW_DATA_DIR:
ADAPTER:
TRIMMED_DATA_PATH:
PYTHON_VERSION:
- omics_pipe.modules.fastq_length_filter_miRNA.fastq_length_filter_miRNA(sample, fastq_length_filter_miRNA_flag)[source]¶
Runs custom Python script to filter miRNA reads by length.
- input:
- .fastq
- output:
- .fastq
- parameters from parameter file:
TRIMMED_DATA_PATH:
- omics_pipe.modules.fastqc.fastqc(sample, fastqc_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
COMPRESSION:
- omics_pipe.modules.fastqc_miRNA.fastqc_miRNA(sample, fastqc_miRNA_flag)[source]¶
QC check of raw .fastq files using FASTQC.
- input:
- .fastq file
- output:
- folder and zipped folder containing html, txt and image files
- citation:
- Babraham Bioinformatics
- link:
- http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- parameters from parameters file:
RAW_DATA_DIR:
QC_PATH:
FASTQC_VERSION:
- omics_pipe.modules.filter_variants.filter_variants(sample, filter_variants_flag)[source]¶
Filters variants to remove common variants.
- input:
- .bam or .sam file
- output:
- .vcf file
- citation:
- Piskol et al. 2013. Reliable identification of genomic variants from RNA-seq data. The American Journal of Human Genetics 93: 641-651.
- link:
- http://lilab.stanford.edu/SNPiR/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
PICARD_VERSION:
GATK_VERSION:
BEDTOOLS_VERSION:
UCSC_TOOLS_VERSION:
GENOME:
REPEAT_MASKER:
SNPIR_ANNOTATION:
RNA_EDIT:
DBSNP:
MILLS:
G1000:
WORKING_DIR:
BWA_RESULTS:
SNPIR_VERSION:
SNPIR_CONFIG:
SNPIR_DIR:
SNPEFF_VERSION:
dbNSFP:
VCFTOOLS_VERSION:
SNP_FILTER_OUT_REF:
- omics_pipe.modules.find_motifs.find_motifs(step, find_motifs_flag)[source]¶
Runs HOMER to find motifs from ChIPseq data.
- input:
- .txt peak file from Homer
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
HOMER_GENOME:
HOMER_MOTIFS_OPTIONS:
- omics_pipe.modules.fusion_catcher.fusion_catcher(sample, fusion_catcher_flag)[source]¶
Detects fusion genes in paired-end RNAseq data.
- input:
- paired end .fastq files
- output:
- list of candidate fusion genes
- citation:
- S. Kangaspeska, S. Hultsch, H. Edgren, D. Nicorici, A. Murumägi, O.P. Kallioniemi, Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms, PLOS One, Oct. 2012. http://dx.plos.org/10.1371/journal.pone.0048745
- link:
- https://code.google.com/p/fusioncatcher
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
FUSION_RESULTS:
FUSIONCATCHERBUILD_DIR:
TEMP_DIR:
SAMTOOLS_VERSION:
FUSIONCATCHER_VERSION:
FUSIONCATCHER_OPTIONS:
TISSUE:
PYTHON_VERSION:
- omics_pipe.modules.GATK_preprocessing_WES.GATK_preprocessing_WES(sample, GATK_preprocessing_WES_flag)[source]¶
GATK preprocessing steps for whole exome sequencing.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
G1000:
CAPTURE_KIT_BED:
SAMTOOLS_VERSION:
- omics_pipe.modules.GATK_preprocessing_WGS.GATK_preprocessing_WGS(sample, GATK_preprocessing_WGS_flag)[source]¶
GATK preprocessing steps for whole genome sequencing.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
G1000:
SAMTOOLS_VERSION:
- omics_pipe.modules.GATK_variant_discovery.GATK_variant_discovery(sample, GATK_variant_discovery_flag)[source]¶
Runs the GATK variant discovery step.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
VARIANT_RESULTS:
- omics_pipe.modules.GATK_variant_filtering.GATK_variant_filtering(sample, GATK_variant_filtering_flag)[source]¶
Runs the GATK variant filtering step.
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
OMNI:
HAPMAP:
R_VERSION:
G1000_SNPs:
G1000_Indels:
- omics_pipe.modules.GATK_variant_filtering.GATK_variant_filtering_group(sample, GATK_variant_filtering_group_flag)[source]¶
Runs the GATK variant filtering step (group version).
- input:
- sorted.rg.md.bam
- output:
- .ready.bam
- citation:
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-303.
- link:
- http://www.broadinstitute.org/gatk/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS_G1000:
OMNI:
HAPMAP:
R_VERSION:
G1000:
- omics_pipe.modules.homer_peaks.homer_peaks(step, homer_peaks_flag)[source]¶
Runs HOMER to call peaks from ChIPseq data.
- input:
- .tag input file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_PEAKS_OPTIONS:
HOMER_VERSION:
TEMP_DIR:
- omics_pipe.modules.htseq.htseq(sample, htseq_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
STAR_RESULTS:
HTSEQ_OPTIONS:
REF_GENES:
HTSEQ_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
PYTHON_VERSION:
- omics_pipe.modules.htseq_gencode.htseq_gencode(sample, htseq_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
STAR_RESULTS:
HTSEQ_OPTIONS:
REF_GENES_GENCODE:
HTSEQ_GENCODE_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
- omics_pipe.modules.htseq_miRNA.htseq_miRNA(sample, htseq_miRNA_flag)[source]¶
Runs htseq-count to get raw count data from alignments.
- input:
- Aligned.out.sort.bam
- output:
- counts.txt
- citation:
- Simon Anders, EMBL
- link:
- http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- parameters from parameters file:
TOPHAT_RESULTS:
HTSEQ_OPTIONS:
miRNA_GFF:
HTSEQ_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BAM_FILE_NAME:
- omics_pipe.modules.intogen.intogen(sample, intogen_flag)[source]¶
Runs Intogen to rank mutations by their implication in the cancer phenotype. Follows variant calling.
- input:
- .vcf
- output:
- variant list
- citation:
- Gonzalez-Perez et al. 2013. Intogen mutations identifies cancer drivers across tumor types. Nature Methods 10, 1081-1082.
- link:
- http://www.intogen.org/
- parameters from parameter file:
VCF_FILE:
INTOGEN_OPTIONS:
INTOGEN_RESULTS:
INTOGEN_VERSION:
USERNAME:
WORKING_DIR:
TEMP_DIR:
SCHEDULER:
VARIANT_RESULTS:
- omics_pipe.modules.macs.macs(step, macs_flag)[source]¶
Runs MACS to call peaks from ChIPseq data.
- input:
- .fastq file
- output:
- peaks and .bed file
- citation:
- Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137
- link:
- http://liulab.dfci.harvard.edu/MACS/
- parameters from parameters file:
PAIR_LIST:
BOWTIE_RESULTS:
CHROM_SIZES:
MACS_RESULTS:
MACS_VERSION:
TEMP_DIR:
BEDTOOLS_VERSION:
PYTHON_VERSION:
- omics_pipe.modules.mutect.mutect(sample, mutect_flag)[source]¶
Runs MuTect on paired tumor/normal samples to detect somatic point mutations in cancer genomes.
- input:
- .bam
- output:
- call_stats.txt
- citation:
- Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnology (2013). doi:10.1038/nbt.2514
- link:
- http://www.broadinstitute.org/cancer/cga/mutect
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
GATK_VERSION:
GENOME:
DBSNP:
MILLS:
G1000:
CAPTURE_KIT_BED:
- omics_pipe.modules.peak_track.peak_track(step, peak_track_flag)[source]¶
Runs HOMER to create peak track from ChIPseq data.
- input:
- .tag input file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
PAIR_LIST:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
- omics_pipe.modules.picard_mark_duplicates.picard_mark_duplicates(sample, picard_mark_duplicates_flag)[source]¶
Runs Picard tools MarkDuplicates to mark duplicate reads.
- input:
- sorted.bam
- output:
- _sorted.rg.md.bam
- citation:
- http://picard.sourceforge.net/
- link:
- http://picard.sourceforge.net/
- parameters from parameters file:
BWA_RESULTS:
TEMP_DIR:
PICARD_VERSION:
SAMTOOLS_VERSION:
- omics_pipe.modules.read_density.read_density(sample, read_density_flag)[source]¶
Runs HOMER to visualize read density from ChIPseq data.
- input:
- .bam file
- output:
- .txt file
- citation:
- Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
- link:
- http://homer.salk.edu/homer/
- parameters from parameters file:
BOWTIE_RESULTS:
CHROM_SIZES:
HOMER_RESULTS:
HOMER_VERSION:
TEMP_DIR:
- omics_pipe.modules.RNAseq_QC.RNAseq_QC(sample, RNAseq_QC_flag)[source]¶
Runs rseqc to determine insert size as QC for alignment.
- input:
- .bam
- output:
- pdf plot
- link:
- http://rseqc.sourceforge.net/
- parameters from parameters file:
STAR_RESULTS:
QC_PATH:
BAM_FILE_NAME:
RSEQC_REF:
TEMP_DIR:
PICARD_VERSION:
R_VERSION:
- omics_pipe.modules.RNAseq_report.RNAseq_report(sample, RNAseq_report_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameters file:
REPORT_SCRIPT:
R_VERSION:
REPORT_RESULTS:
R_MARKUP_FILE:
DPS_VERSION:
PARAMS_FILE:
- omics_pipe.modules.RNAseq_report_counts.RNAseq_report_counts(sample, RNAseq_report_counts_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameters file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
PARAMS_FILE:
- omics_pipe.modules.RNAseq_report_tuxedo.RNAseq_report_tuxedo(sample, RNAseq_report_tuxedo_flag)[source]¶
Runs R script with knitr to produce report from RNAseq pipeline.
- input:
- results from other steps in RNAseq pipelines
- output:
- html report
- citation:
- Meissner
- parameters from parameters file:
WORKING_DIR:
R_VERSION:
REPORT_RESULTS:
DPS_VERSION:
PARAMS_FILE:
- omics_pipe.modules.rseqc.rseqc(sample, rseqc_flag)[source]¶
Runs rseqc to determine insert size as QC for alignment.
- input:
- .bam
- output:
- pdf plot
- link:
- http://rseqc.sourceforge.net/
- parameters from parameters file:
STAR_RESULTS:
QC_PATH:
BAM_FILE_NAME:
RSEQC_REF:
RSEQC_VERSION:
TEMP_DIR:
- omics_pipe.modules.snpir_variants.snpir_variants(sample, snpir_variants_flag)[source]¶
Calls variants using SNPIR pipeline.
- input:
- Aligned.out.sort.bam or accepted_hits.bam
- output:
- final_variants.vcf file
- citation:
- Piskol, R., et al. (2013). “Reliable Identification of Genomic Variants from RNA-Seq Data.” The American Journal of Human Genetics 93(4): 641-651.
- link:
- http://lilab.stanford.edu/SNPiR/
- parameters from parameters file:
VARIANT_RESULTS:
TEMP_DIR:
SAMTOOLS_VERSION:
BWA_VERSION:
PICARD_VERSION:
GATK_VERSION:
BEDTOOLS_VERSION:
UCSC_TOOLS_VERSION:
GENOME:
REPEAT_MASKER:
SNPIR_ANNOTATION:
RNA_EDIT:
DBSNP:
MILLS:
G1000:
WORKING_DIR:
BWA_RESULTS:
SNPIR_VERSION:
SNPIR_CONFIG:
SNPIR_DIR:
ENCODING:
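The snpir_variants entry has the longest parameter list in this section, and the names follow visible suffix patterns (`_VERSION`, `_RESULTS`, `_DIR`). The sketch below only groups the names copied from the entry above by suffix as a reading aid. What each suffix means is an assumption drawn from the names themselves, and `group_by_suffix` is a hypothetical helper, not an Omics Pipe function.

```python
from collections import defaultdict

# Parameter names copied from the snpir_variants entry above.
SNPIR_PARAMETERS = [
    "VARIANT_RESULTS", "TEMP_DIR", "SAMTOOLS_VERSION", "BWA_VERSION",
    "PICARD_VERSION", "GATK_VERSION", "BEDTOOLS_VERSION",
    "UCSC_TOOLS_VERSION", "GENOME", "REPEAT_MASKER", "SNPIR_ANNOTATION",
    "RNA_EDIT", "DBSNP", "MILLS", "G1000", "WORKING_DIR", "BWA_RESULTS",
    "SNPIR_VERSION", "SNPIR_CONFIG", "SNPIR_DIR", "ENCODING",
]

# Suffixes observed in the list; their interpretation is an assumption.
SUFFIXES = ("_VERSION", "_RESULTS", "_DIR")


def group_by_suffix(names):
    """Group parameter names by the first matching suffix, else 'other'."""
    groups = defaultdict(list)
    for name in names:
        suffix = next((s for s in SUFFIXES if name.endswith(s)), "other")
        groups[suffix].append(name)
    return dict(groups)


groups = group_by_suffix(SNPIR_PARAMETERS)
for suffix, names in groups.items():
    print(suffix, len(names), names)
```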
- omics_pipe.modules.star.star(sample, star_flag)[source]¶
Runs STAR to align .fastq files.
- input:
- .fastq file
- output:
- Aligned.out.bam
- citation:
- Dobin et al, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts635 “STAR: ultrafast universal RNA-seq aligner”
- link:
- https://code.google.com/p/rna-star/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
STAR_INDEX:
STAR_OPTIONS:
STAR_RESULTS:
SAMTOOLS_VERSION:
STAR_VERSION:
COMPRESSION:
REF_GENES:
- omics_pipe.modules.star_piRNA.star_piRNA(sample, star_flag)[source]¶
Runs STAR to align .fastq files.
- input:
- .fastq file
- output:
- Aligned.out.bam
- citation:
- Dobin et al, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts635 “STAR: ultrafast universal RNA-seq aligner”
- link:
- https://code.google.com/p/rna-star/
- parameters from parameters file:
ENDS:
RAW_DATA_DIR:
STAR_INDEX:
STAR_OPTIONS:
STAR_RESULTS:
SAMTOOLS_VERSION:
STAR_VERSION:
- omics_pipe.modules.TCGA_download.TCGA_download(sample, TCGA_download_flag)[source]¶
Downloads and unzips TCGA data using the Manifest.xml file downloaded from CGHub.
- input:
- TCGA XML file
- output:
- downloaded files from TCGA
- citation:
- The Cancer Genome Atlas
- link:
- https://cghub.ucsc.edu/software/downloads.html
- parameters from parameters file:
TCGA_XML_FILE:
TCGA_KEY:
TCGA_OUTPUT_PATH:
CGATOOLS_VERSION:
- omics_pipe.modules.tophat.tophat(sample, tophat_flag)[source]¶
Runs TopHat to align .fastq files.
- input:
- .fastq file
- output:
- accepted_hits.bam
- citation:
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
- link:
- http://tophat.cbcb.umd.edu/
- parameters from parameters file:
RAW_DATA_DIR:
REF_GENES:
TOPHAT_RESULTS:
BOWTIE_INDEX:
TOPHAT_VERSION:
TOPHAT_OPTIONS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
- omics_pipe.modules.tophat_miRNA.tophat_miRNA(sample, tophat_miRNA_flag)[source]¶
Runs TopHat to align .fastq files.
- input:
- .fastq file
- output:
- accepted_hits.bam
- citation:
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
- link:
- http://tophat.cbcb.umd.edu/
- parameters from parameters file:
RAW_DATA_DIR:
miRNA_GTF:
TOPHAT_RESULTS:
miRNA_BOWTIE_INDEX:
TOPHAT_VERSION:
TOPHAT_OPTIONS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
- omics_pipe.modules.tophat_ncRNA.tophat_ncRNA(sample, tophat_ncRNA_flag)[source]¶
Runs TopHat to align .fastq files.
- input:
- .fastq file
- output:
- accepted_hits.bam
- citation:
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
- link:
- http://tophat.cbcb.umd.edu/
- parameters from parameters file:
RAW_DATA_DIR:
REF_GENES:
TOPHAT_RESULTS:
NONCODE_BOWTIE_INDEX:
TOPHAT_VERSION:
TOPHAT_OPTIONS:
BOWTIE_VERSION:
SAMTOOLS_VERSION:
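The three TopHat modules above read nearly identical parameters and differ only in their index and annotation entries. The sketch below compares the parameter names copied from the three entries using Python sets. It does not call Omics Pipe, and the variable names are chosen here for illustration.

```python
# Parameter names copied from the tophat, tophat_miRNA and tophat_ncRNA entries.
TOPHAT = {
    "RAW_DATA_DIR", "REF_GENES", "TOPHAT_RESULTS", "BOWTIE_INDEX",
    "TOPHAT_VERSION", "TOPHAT_OPTIONS", "BOWTIE_VERSION", "SAMTOOLS_VERSION",
}
TOPHAT_MIRNA = {
    "RAW_DATA_DIR", "miRNA_GTF", "TOPHAT_RESULTS", "miRNA_BOWTIE_INDEX",
    "TOPHAT_VERSION", "TOPHAT_OPTIONS", "BOWTIE_VERSION", "SAMTOOLS_VERSION",
}
TOPHAT_NCRNA = {
    "RAW_DATA_DIR", "REF_GENES", "TOPHAT_RESULTS", "NONCODE_BOWTIE_INDEX",
    "TOPHAT_VERSION", "TOPHAT_OPTIONS", "BOWTIE_VERSION", "SAMTOOLS_VERSION",
}

variants = {
    "tophat": TOPHAT,
    "tophat_miRNA": TOPHAT_MIRNA,
    "tophat_ncRNA": TOPHAT_NCRNA,
}

# Names read by all three modules.
shared = TOPHAT & TOPHAT_MIRNA & TOPHAT_NCRNA

# Names each module reads beyond the shared set.
specific = {name: sorted(keys - shared) for name, keys in variants.items()}

print("shared:", sorted(shared))
for name, keys in specific.items():
    print(name, "->", keys)
```

Going by the listings, a parameters file that serves all three modules needs the six shared names plus the index and annotation entries for each variant.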
Version History¶
1.1.2b (2014/08/05)¶
New Features¶
- Added support for latest GATK version
- Added GATK Group Variant Calling pipeline
- Added noncoding RNA HTseq module
Bug Fixes¶
- AMI memory handling
- Fixed Sumatra parameters file handling
- RNAseq count-based pipeline now produces a single report for all samples
1.1.0 (2014/07/09)¶
First public release!
Copyright & License¶
Omics Pipe¶
MIT License (MIT)
Copyright (c) 2013 Kathleen Marie Fisch
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.