首页手记 GATK - Read groups

GATK - Read groups

标签：

Python

使用 ValidateSamFile确定bam文件是否有问题？

Picard ValidateSamFile

Validates a SAM or BAM file.

ValidateSamFile下有两种mode:
VERBOSE : 检查到100个错误之后退出，并且输出错误到终端；
SUMMARYL: 输出结果是一个表格，展示errors和warnings的数目；

$ java -jar picard.jar ValidateSamFile \
   I=input.bam \
   MODE=SUMMARY   
Error Type  Count
ERROR:MISSING_READ_GROUP    1WARNING:RECORD_MISSING_READ_GROUP   1287903

问题在bam文件 MISSING_READ_GROUP；请自动屏蔽WARNING结果；

获取GATK格式bam文件

方法法1

bwa 比对使用参数-R

-R STR: Complete read group header line. ’\t’ can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’. [null]

bwa -R 参数的使用

# 不加参数Rbwa mem ref.fa read1.fq read2.fq > aln-pe.sam# 加参数Rbwa mem -R '@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n' ref.fa read1.fq read2.fq > aln-pe.R.sam'@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n'# ID:每一个Read group独有的ID；# LB:DNA preparation library identifier# PL:测序使用的平台： ILLUMINA, SOLID, LS454, HELICOS and PACBIO;# PU:Platform Unit,由三部分组成： {FLOWCELL_BARCODE}.{LANE};# SM:样品名read group参数详细信息见本文后半部分

注：GATK 使用时，PU不是必须要求的；但是PU与ID同时存在时，PU优先级高于ID。# 查看文件aln-pe.sam与aln-pe.R.sam中的groupsamtools view -H aln-pe.sam  | grep '@RG'# 不会有输出samtools view -H aln-pe.R.sam  | grep '@RG'@RG ID:group_n  LB:library_n    PL:illumina PU:unit1    SM:sample_n# 多个比对的时候，使用循环ls *.gRNA.fastq | while read id ;dobwa mem -R "@RG\tID:group_${id}\tLB:library_${id}\tPL:illumina\tSM:sample_${id}" ref.fa $id > ${id}.R.samdone

方法2
Picard AddOrReplaceReadGroups

Replace read groups in a BAM file.This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.

$ java -jar picard.jar AddOrReplaceReadGroups \
       I=input.bam \
       O=output.bam \
       RGID=4 \
       RGLB=library1 \
       RGPL=illumina \
       RGPU=unit1 \
       RGSM=sample1
Option  Description
INPUT (String)  Input file (BAM or SAM or a GA4GH url). Required.
OUTPUT (File)   Output file (BAM or SAM). Required.
SORT_ORDER (SortOrder)  Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
RGID (String)   Read Group ID Default value: 1. This option can be set to 'null' to clear the default value.
RGLB (String)   Read Group library Required.
RGPL (String)   Read Group platform (e.g. illumina, solid) Required.
RGPU (String)   Read Group platform unit (eg. run barcode) Required.
RGSM (String)   Read Group sample name Required.

Read groups详细介绍

英文原版查看

There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

测序时：
样本建一个库在同一条lane上完成测序产生reads sets可定义为一个Read group;
样本建成分开独立的库测序得到的reads sets也就分属于不同的Read groups;
存在于SAM/BAM /CRAM 文件中Read groups是由一系列标签组成；这些标签代表着测序中的一些技术特征；有了这些数据之后，大家就可以对bam文件进行重复序列标识和碱基质量重新矫正。
GATK要求输入的bam文件包含Read groups，如果没有就会报错。

例子：

samtools view -H sample.bam | grep '@RG'@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

GATk 要求read group的格式
Read group是@RG开始。

ID = Read group identifier
每一个Read group独有的ID；
Illumina 测序数据中，read group IDs由flowcell ，lane name 和number组成。
在矫正碱基质量时，read group IDs对区分技术批次效应是必须的；在这过程中，同一read group的reads假定为有一样的技术误差。

PU = Platform Unit
Platform Unit由三部分组成： {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}
{FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell;
The {LANE} indicates the lane of the flow cell ;
The {SAMPLE_BARCODE} is a sample/library-specific identifier;
GATK 使用时，PU不是必须要求的；但是PU与ID同时存在时，PU优先级高于ID。

SM = Sample
reads属于的样品名；SM要设定正确，因为GATK产生的VCF文件也使用这个名字。

PL= Platform/technology used to produce the read
测序使用的平台： ILLUMINA, SOLID, LS454, HELICOS and PACBIO。

LB = DNA preparation library identifier
对一个read group的reads进行重复序列标记时，需要使用LB来区分reads来自那条lane;有时候，同一个库可能在不同的lane上完成测序;为了加以区分，同一个或不同库只要是在不同的lane产生的reads都要单独给一个ID。

从read names中提取ID与PU

H0164ALXX140820:2:1101:10003:23460H0164ALXX140820:2:1101:15118:25288H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane

例子-Multi-sample and multiplexed example
三个样品：MOM, DAD, KID；
建库：每个样品分别建两个库，一个insert为200bp，一个insert为400bp；
上机：每个测序库使用Illumina HiSeq的两条lane；
reads：来自 3 x 2 x 2 = 12条lane,可以产生12个bam文件，结果如下：

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

#解释
样本DAD建了两个库LIB-DAD-1(200bp)和LIB-DAD-2(400bp)，使用了FLOWCELL1上面的4条lane;

Mom's data:@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

作者：_eason_
链接：https://www.jianshu.com/p/9a29bfc87a50

点击查看更多内容

为 TA 点赞

若觉得本文不错，就分享一下吧！

评论

评论

共同学习，写下你的评论

评论加载中...

展开查看更多评论

作者其他优质文章

正在加载中

慕姐8265434

手记
篇

粉丝

224

获赞与收藏

1070

关注作者，订阅最新文章

阅读免费教程

Python 办公自动化教程

17个小节 27548 930

Python 算法入门教程

15个小节 30288 1172

Python 进阶应用教程

38个小节 72909 1146

推荐

评论

收藏

共同学习，写下你的评论



感谢您的支持，我会继续努力的～

扫码打赏，你说多少就多少

赞赏金额会直接到老师账户

支付方式

打开微信扫一扫，即可进行扫码打赏哦

今天注册有机会得

100积分直接送

付费专栏免费学

大额优惠券免费领

立即参与放弃机会

点击
抽奖

慕课手记新用户专享福利

恭喜你，你的运气太好了，居然抽中了 100个积分！

恭喜你，抽中了价值元的专栏！

太棒了，直接落到你账户里！

积分商城里的罗技鼠标、机械键盘、
Kindle 阅读器、小米平衡车
Apple iPad （10.2英寸）、大额优惠券
在等着你去兑换了噢

作者：

免费赠送

兑换码：1111222211 复制

优惠券可用于购买实战课、体系课
无门槛使用

先去看看，有什么好东西马上兑换我爱学习，选课去


热搜

最近搜索清空

GATK - Read groups

使用 ValidateSamFile确定bam文件是否有问题？

获取GATK格式bam文件

Read groups详细介绍

阅读免费教程