注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

Bioinformatics home

 
 
 

日志

 
 

用gatk找SNP  

2012-05-04 05:04:21|  分类: 生物信息编程 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

The Unified Genotyper

The GATK Unified Genotyper is a multiple-sample, technology-aware SNP and indel caller. It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting an accurate posterior probability of there being a segregating variant allele at each locus as well as for the genotype of each sample. The system can either emit just the variant sites or complete genotypes (which includes homozygous reference calls) satisfying some phred-scaled confidence value. The genotyper can make accurate calls on both single sample data and multi-sample data.

GATK Unified Genotype genotype likelihoods
The Broad Unified Genotyper SNP caller multiple-sample allele frequency and genotype estimates

Recent Changes

  • One can now emit SNP and indel calls simultaneously. Use -glm SNP (default) or -glm INDEL to only output a single class of variants or -glm BOTH to do both.
  • We no longer use one of the intrinsic GATK filters (previously used to determine whether to include bases in a pileup: the -mm40 (number of mismatches in a 40 base window in the read around the current locus) has been removed.
  • When enabled, the BAQ calculation is now only used when calculating genotype likelihoods. Annotation values such as strand bias and haplotype score are now using the unBAQed quality scores and are now much more empowered as error covariates.
  • The site-wide depth (DP) annotation has been disabled by default. Sample-specific depth is still emitted.
  • Some of the input arguments have changes recently. Specifically, we now use the --genotyping_mode and --output_mode arguments instead of the now deprecated -genotype and -all_bases arguments. Furthermore, we no longer support the use of "trigger tracks."


Inputs

Genotype Calculation Model

The -pnrm,--p_nonref_model option controls the model used to calculate the probability that a site is variant plus the various sample genotypes in the data at a given locus.

  • EXACT is the default model with the best performance in all cases. For posterity we have kept around the older GRID_SEARCH model, but this gives inferior results and shouldn't be used.

Heterozygosity

You can use the --heterozygosity argument to change the expected heterozygosity value used to compute prior likelihoods for any locus. The default priors are:

    • het = 1e-3
    • P(hom-ref genotype) = 1 - 3 * het / 2
    • P(het genotype) = het
    • P(hom-var genotype) = het / 2

Minimum Confidence Threshold

  • -stand_call_conf,--standard_min_confidence_threshold_for_calling is the minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. Only genotypes with confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 for high-pass calling (this is the default). Note that the confidence (QUAL) values for multi-sample low-pass (e.g. 4x per sample) calling might be significantly smaller with the new EXACT model than with our older GRID_SEARCH model, as the latter tended to over-estimate the confidence; for low-pass calling we tend to use much smaller thresholds (e.g. 4).
  • -stand_emit_conf,--standard_min_confidence_threshold_for_emitting is the minimum phred-scaled Qscore threshold to emit low confidence calls. Genotypes with confidence >= this but less than the calling threshold are emitted but marked as filtered. The default value is 30.

Arguments Specifying Outputs

  • -o specifies the output file, e.g. -o my.calls.vcf
  • -sites_only,--sites_only writes just the sites (without genotypes) to the output VCF (i.e. just the first 8 columns).
  • -out_mode,--output_mode specifies which sites to emit; possible values are EMIT_VARIANTS_ONLY (the default), EMIT_ALL_CONFIDENT_SITES (include confident reference sites), or EMIT_ALL_SITES (any callable site regardless of confidence).
  • -gt_mode,--genotyping_mode specifies how to determine the alternate allele to use for genotyping; possible values are DISCOVERY (the default; the Unified Genotyper will choose the most likely alternate allele) or GENOTYPE_GIVEN_ALLELES (only the alleles passed in from a VCF rod bound to the name "alleles" will be used for genotyping).
  • -glm, --genotype_likelihoods_model specifies which class of variants to call. By default only SNPs are called, but one can state that only indels (-glm INDEL) or both (-glm BOTH) should be called.
  • -deletions,--max_deletion_fraction specifies the maximum fraction of reads with deletions spanning a locus tolerated by the genotyper (the genotyper won't make calls at loci with too many deletions). Default is 0.05.
  • -debug_file,--debug_file specifies the output file used to print debugging information.
  • -metrics,--metrics_file specifies the output file used to print various metrics about callability.

Arguments Describing Good Bases

  • -mbq,--min_base_quality_score specifies the minimum base quality required to consider a base for calling. Default is 17.
  • -mmq,--min_mapping_quality_score specifies the minimum read mapping quality required to consider a read for calling. Only used in indel calling. Default is 20.

Arguments for Annotations

  • -nsl,--noSLOD instructs the Genotyper not to calculate the SLOD.
  • -A,--annotations tells the Genotyper to apply one or more annotations to variant calls (e.g. '-A DepthOfCoverage -A AlleleBalance' would include those 2 annotations); see theVariantAnnotator for more details.
  • -G,--groups instructs the Genotyper to apply one or more annotation interfaces/groups to variant calls (e.g. '-G Standard' would apply all annotations which implement the StandardAnnotation interface); see the VariantAnnotator for more details.

Miscellaneous

  • -dcov, --downsample_to_coverage specifies the maximum coverage at a locus for any given sample. Downsampled reads are randomly selected from all possible reads at a locus. The default is 250 reads. This replaces the --max_reads_at_locus (-mrl) argument as of v1.0.4168 (2010/08/31).


The Allele Frequency Calculation

The allele frequency calculation model is now the EXACT model (instead of the heuristic GRID_SEARCH model used previously). The exact model computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools' mpileup tool. More info on Samtools available here: [1]Note that the confidence (QUAL) values for multi-sample low-pass calling might be significantly smaller with the EXACT model than with GRID_SEARCH as the latter tended to over-estimate the confidence. A presentation detailing the performance gains obtained by using this model is available here: [2]


Explanation of the VCF Output

See Understanding the Unified Genotyper's VCF files.


The Unified Genotyper didn't call my SNP! What happened?

It is important to point out that just because something looks like a SNP in IGV, it doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs), so before you post this issue to our support forum you will first need to do a little investigation on your own. To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here. Here are some pointers to get you started in the right direction:

  • What do the base qualities look like for the non-reference bases? Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base.
  • What do the mapping qualities look like for the reads with the non-reference bases? A base's quality is capped by the mapping quality of its read as low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome.
  • Are you working with SOLiD data? SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site?
  • Remember that the depth reported in the VCF is the unfiltered depth - the Unified Genotyper ignores bases if they don't look good. Also, keep in mind that currently the Unified Genotyper will only call a single alternate allele (for SNPs) at a position (based on the most likely one).


How do I make tumor vs. normal calls with the Unified Genotyper?

See Cancer-specific Variant Discovery Tools.


Example commands

Generic command for calling single or multiple samples, whole genome or whole exome

java -jar GenomeAnalysisTK.jar \ -R resources/Homo_sapiens_assembly18.fasta \ -T UnifiedGenotyper \ -I sample1.bam [-I sample2.bam ...] \ -B:dbsnp,VCF dbSNP.vcf \ -o snps.raw.vcf \ -stand_call_conf [50.0] \ -stand_emit_conf 10.0 \ -dcov [50] \ [-L targets.interval_list]

The above command will call all of the samples in your provided BAM files [-I arguments] together and produce a VCF file with sites and genotypes for all samples. The easiest way to get the dbSNP file is from the GATK resource bundle. Several arguments have parameters that should be chosen based on the average coverage per sample in your data:

-dcov 
downsample term per sample. The UG sees only this amount of coverage per sample. It's used to avoid overwhelming the engine at sites where misalignments lead to 10-10,000,000 fold increases in coverage per sample. A safe value here is 10x the average coverage, so for a 100x data set, use 1000x, but for a 4x data set 50x is fine. The engine defaults to around 1000x.
-stand_call_conf and -stand_emit_conf 
we recommend keeping stand_emit_conf at 10, so that you will always see variant sites with at least Q10 confidence that they are non-reference, but they will be filter out as LowQual unless they exceed the stand_call_conf. The stand_call_conf term needs to be chosen based again on the expected coverage per sample. If you have deep data (10x or better) a stand_call_conf threshold of 50 is fine. If you have very low-pass data, such as the 1000 Genomes low-coverage wing with 4x average and you want to call everything possibly variant, resulting in lots of FPs to be sorted out later, you can set this to a lower value of Q30 or even Q10.
-L targets.interval_list 
if you have an target capture data set, it's best to call only the targeted intervals (for a variety of reasons, beyond the scope of this document). You should provide the target list in one of the interval file formats.

Indel Calling with the Unified Genotyper

While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).

Note that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.

Indel-specific inputs to the Unified Genotyper

The following arguments affect operation of the Indel calling ability of the Unified Genotyper:

-indelHeterozygosity, --indel_heterozygosity (optional, default value = 1/8000)

This argument informs the prior probability of having an indel at a site.

-minIndelCnt, --min_indel_count_for_genotyping (optional, default value = 5)

A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site. Decreasing this value will increase sensitivity but at the cost of larger calling time and a larger number of false positives.

Other Examples

1000 genomes pilot 2

Example command using a data file from the 1000 genomes archive with the empirical calling table for the first 10K of chr1 starting at 11Mb.

java -jar /path/to/GenomeAnalysisTK.jar \ -l INFO \ -L 1:11,000,000-11,010,000 \ -R /broad/1KG/reference/human_b36_both.fasta \ -T UnifiedGenotyper \ -I /broad/1KG/DCC/ftp/pilot_data/NA12878/alignment/NA12878.chrom1.SLX.SRP000032.2009_07.bam \ -o my.vcf \ -stand_call_conf 30.0 \ -stand_emit_conf 10.0

Generating calls at all sites

The following example command generates calls for the low coverage 1000 genomes pilot 1 bam file using the empirical confusion matrix at all sites.

java -jar /path/to/GenomeAnalysisTK.jar \ -l INFO \ -R /broad/1KG/reference/human_b36_both.fasta \ -T UnifiedGenotyper \ -I /broad/1KG/DCC/ftp/pilot_data/data/NA12878/alignment/NA12878.SLX.maq.SRP000031.2009_08.bam \ -o my.vcf \ --output_mode EMIT_ALL_SITES


It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console. Turn off logging completely (-l OFF) so that the GATK operates in silent mode.

Caveats

  • The system is under active and continuous development. All outputs, the underlying likelihood model, arguments, and file formats are liable to change.
  • The system can be very aggressive in calling variants. In the 1000 genomes project for pilot 2 (deep coverage of ~35x) we expect the raw Qscore > 50 variants to contain at least ~10% FP calls. We use extensive post-calling filters to eliminate most of these FPs. Variant Quality Score Recalibration is a tool to perform this filtering.
  • We only handle diploid genotypes

Callable bases

At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:

INFO 00:23:29,795 UnifiedGenotyper - Visited bases 247249719 INFO 00:23:29,796 UnifiedGenotyper - Callable bases 219998386 INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases 219936125 INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci 88.978 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - Actual calls made 303126

Here are the meaning of the lines:

  • The visited bases is the total number of reference bases that were visited.
  • Callable bases are visited bases minus reference Ns and places with no coverage, which we never try to call.
  • Confidently called bases are callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and or QUAL > T for the site being non-reference in at least one sample

Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the % of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.

Note that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.

Calling sex chromosomes

The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in GSA Public Drop Box

  评论这张
 
阅读(3737)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017