How to Use
GRAPE requires a single multi-sample VCF file as input and has a separate workflow for downloading and checking the integrity of reference datasets.
The pipeline has three main steps:
Downloading of the Reference Datasets.
Quality Control and Data Preprocessing.
Relatedness Inference Workflow.
The pipeline is implemented with the Snakemake framework and is accessible through the snakemake command.
All the pipeline functionality is embedded into the launcher (launcher.py), which invokes Snakemake workflows under the hood. By default, all the commands just check the accessibility of the data and build the computational graph.
To actually perform the computation, one should add the --real-run flag to the commands.
Read the sections below to run GRAPE on your data.
Quality Control and Data Preprocessing
GRAPE has a versatile and configurable preprocessing workflow. One part of the preprocessing is required and must be performed before the relatedness inference workflow.
Preprocessing is launched with GRAPE's preprocess command.
Preprocessing steps
Along with some necessary technical procedures, preprocessing includes the following steps:
- [Required] SNPs quality control by minor allele frequency (MAF) and the missingness rate.
We discovered that blocks of rare SNPs with low MAF values in genotype arrays may produce false-positive IBD segments. To address this problem, we filter SNPs by minor allele frequency, removing SNPs with a MAF below 0.02. Additionally, we remove multiallelic SNPs, insertions / deletions, and SNPs with a high missingness rate, because such SNPs are not handled well by IBD detection tools.
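The MAF and missingness criteria above can be sketched as follows. This is a hypothetical illustration on a toy genotype matrix (samples × SNPs, coded 0/1/2 with -1 for missing); GRAPE itself operates on VCF files, and the missingness threshold here is an assumed value, not GRAPE's exact setting.

```python
import numpy as np

def filter_snps(genotypes, maf_threshold=0.02, max_missing_rate=0.1):
    """Filter SNP columns of a samples x SNPs matrix coded 0/1/2 (-1 = missing)."""
    observed = genotypes != -1
    n_obs = observed.sum(axis=0)                       # called genotypes per SNP
    alt_counts = np.where(observed, genotypes, 0).sum(axis=0)
    freq = alt_counts / (2 * n_obs)                    # alternate-allele frequency
    maf = np.minimum(freq, 1 - freq)                   # fold to the minor allele
    missing_rate = 1 - n_obs / genotypes.shape[0]
    keep = (maf >= maf_threshold) & (missing_rate <= max_missing_rate)
    return genotypes[:, keep], keep

# Toy data: 4 samples x 3 SNPs; SNP 1 has a missing call, SNP 2 is monomorphic.
g = np.array([[0, 1, 0],
              [1, 2, 0],
              [0, 1, 0],
              [1, -1, 0]])
filtered, kept = filter_snps(g)
```

Only the first SNP survives: the second exceeds the missingness threshold and the third has MAF 0.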
- [Required] Per-sample quality control, using missingness and heterozygosity.
Extensive testing revealed that samples with an unusually low level of heterozygosity can produce many false relative matches among individuals. GRAPE excludes such samples from the analysis and creates a report file describing the reason for each exclusion.
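The per-sample heterozygosity check can be sketched like this. The matrix encoding and the standard-deviation cutoff are assumptions for illustration, not GRAPE's exact rule.

```python
import numpy as np

def flag_low_heterozygosity(genotypes, n_sd=1.0):
    """Flag samples whose heterozygosity rate is unusually low.

    `genotypes` is a samples x SNPs matrix coded 0/1/2, -1 = missing.
    A sample is flagged if its heterozygosity rate falls more than `n_sd`
    standard deviations below the cohort mean (an assumed threshold).
    """
    observed = genotypes != -1
    het = ((genotypes == 1) & observed).sum(axis=1) / observed.sum(axis=1)
    mean, sd = het.mean(), het.std()
    return het < mean - n_sd * sd

# Toy cohort: 4 samples with ~50% heterozygous calls, 1 fully homozygous.
g = np.array([[0, 1] * 5] * 4 + [[0] * 10])
flags = flag_low_heterozygosity(g, n_sd=1.0)
```

The fully homozygous sample is the only one flagged for exclusion.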
- [Required] Control for strands and SNP IDs mismatches.
During this step GRAPE fixes inconsistencies in strands and reference alleles.
- [Optional] LiftOver from hg38 to hg37.
Currently GRAPE uses the hg37 build of the human genome reference. The pipeline supports input in the hg38 and hg37 builds (see the --assembly flag of the pipeline launcher). If the hg38 build is selected (--assembly hg38), GRAPE applies the liftOver tool to the input data to match the hg37 reference assembly.
- [Optional] Phasing and imputation.
GRAPE supports phasing and genotype imputation using the 1000 Genomes Project reference panel. The GERMLINE IBD detection tool requires phased data, so if the input data is unphased, one should include phasing (Eagle 2.4.1) in the preprocessing (--phase flag) before running the GERMLINE workflow. If the input data is highly _heterogeneous_ in terms of available SNP positions, one can also add an imputation (Minimac4) procedure to the preprocessing (--impute flag).
- [Optional] Removal of imputed SNPs.
We found that if the input data is _homogeneous_ in terms of SNP positions, the presence of imputed SNPs does not affect the overall IBD detection accuracy of the IBIS tool, but it significantly slows down the overall run. For this particular case, when the input data initially contains many imputed SNPs, we recommend removing them by passing the --remove-imputation flag to the GRAPE launcher. GRAPE removes all SNPs marked with the IMPUTED flag in the input VCF file.
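The removal of IMPUTED-flagged SNPs can be sketched as a simple filter over VCF text lines. This is an illustration only; it assumes the IMPUTED flag appears in the INFO column, as described above, and GRAPE's actual implementation may differ.

```python
def drop_imputed_records(vcf_lines):
    """Keep header lines and records whose INFO column lacks the IMPUTED flag."""
    kept_lines = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept_lines.append(line)              # header lines pass through
            continue
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
        if "IMPUTED" not in info.split(";"):
            kept_lines.append(line)
    return kept_lines

# Toy VCF fragment: one header line, one genotyped SNP, one imputed SNP.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t.\tPASS\tAF=0.10",
    "1\t200\trs2\tC\tT\t.\tPASS\tIMPUTED;AF=0.20",
]
kept = drop_imputed_records(records)
```

Only the header and the non-imputed record survive the filter.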
Workflows Usage
- Preprocessing for the IBIS + KING relatedness inference workflow.
Input file is located at /media/input.vcf.gz.
GRAPE working directory is /media/data.
Assembly of the input VCF file is hg37.
docker run --rm -it -v /media:/media -v /etc/localtime:/etc/localtime:ro \
grape:latest launcher.py preprocess --ref-directory /media/ref \
--vcf-file /media/input.vcf.gz --directory /media/data --assembly hg37 --real-run
- Preprocessing for the GERMLINE + KING workflow.
Assembly of the input VCF file is hg38.
GERMLINE can work with phased data only, so we add phasing procedure to the preprocessing.
Genotype imputation is also added.
docker run --rm -it -v /media:/media -v /etc/localtime:/etc/localtime:ro \
grape:latest launcher.py preprocess --ref-directory /media/ref \
--vcf-file /media/input.vcf.gz --directory /media/data \
--assembly hg38 --phase --impute --real-run
IBD Segments Weighting
The distribution of IBD segments among non-related (background) individuals within a population may be quite heterogeneous. There may exist genome regions with extremely high rates of overall matching that are not inherited from recent common ancestors; instead, these regions more likely reflect other demographic factors of the population.
The implication is that IBD segments detected in such regions are expected to be less useful for estimating recent relationships. Moreover, such regions are potentially prone to false-positive IBD segments.
GRAPE provides two options to address this issue:
Genome Regions Exclusion Mask
Some regions are completely excluded from consideration. This approach is implemented in ERSA and is used by GRAPE by default.
IBD Segments Weighting
The key idea is to down-weight an IBD segment, i.e. reduce its length, if the segment crosses regions with high rates of matching. Down-weighted segments are then passed to the ERSA algorithm.
GRAPE provides an ability to compute the weight mask from the VCF file with presumably unrelated individuals.
This mask is used during the relatedness detection by specifying the --weight-mask flag of the launcher.
See more information in the GRAPE paper.
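The down-weighting idea can be sketched as follows. The (start, end, weight) mask representation and coordinate units are assumptions for illustration; GRAPE's actual mask is a JSON file whose exact schema may differ.

```python
def down_weight_segment(seg_start, seg_end, masked_regions):
    """Return the effective (down-weighted) length of an IBD segment.

    `masked_regions` is a list of non-overlapping (start, end, weight)
    tuples, where weight in [0, 1] scales the contribution of the part of
    the segment falling inside the region (1 = keep fully, 0 = discard).
    """
    length = seg_end - seg_start
    for start, end, weight in masked_regions:
        overlap = max(0.0, min(seg_end, end) - max(seg_start, start))
        length -= overlap * (1 - weight)   # shrink by the down-weighted part
    return length

# A length-10 segment crossing a half-weight region loses half of the overlap.
effective = down_weight_segment(0.0, 10.0, [(4.0, 8.0, 0.5)])
```

A region with weight 0 behaves like an exclusion mask restricted to the overlap, which is how weighting generalizes the exclusion approach above.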
Computation of the IBD segments weighting mask
docker run --rm -it -v /media:/media \
grape:latest launcher.py compute-weight-mask \
--directory /media/background --assembly hg37 \
--real-run --ibis-seg-len 5 --ibis-min-snp 400
The resulting files, stored in the /media/background/weight-mask/ folder, consist of a weight mask file in JSON format and a visualization of the mask.
Usage of the IBD segments weighting mask
docker run --rm -it -v /media:/media \
grape:latest launcher.py find --flow ibis --ref-directory /media/ref \
--weight-mask /media/background/weight-mask/mask.json \
--directory /media/data --assembly hg37 \
--real-run --ibis-seg-len 5 --ibis-min-snp 400
Execution by Scheduler
The pipeline can be executed using the lightweight scheduler Funnel, which implements the Task Execution Service (TES) schema developed by GA4GH.
During execution, input data for analysis can be obtained in several ways: locally, or via FTP, HTTPS, S3, Google Cloud Storage, etc. The resulting files can be uploaded in the same ways.
It is possible to add other features, such as writing results to a database or sending them to a REST service.
The scheduler itself can work in various environments from a regular VM to a Kubernetes cluster with resource quotas support.
For more information see the Funnel Documentation.
How to use Funnel:
# Start the Funnel server
/path/to/funnel server run
# Use Funnel as client
/path/to/funnel task create funnel/grape-real-run.json