写在前面
要帮朋友做RNA-seq的分析,cases vs. controls总共4个样本(2 vs. 2),看到文献(实验设计比较类似)里用的是EBSeq较为频繁,所以用这个来做。
但从文献来看,EBSeq对样本量依赖较大。
EBSeq
TPR relatively independent of sample size and presence of outliers.
Poor FDR control in most situations, relatively unaffected by outliers.
Medium computational time requirement, increases slightly with sample size.
EBSeq的输入数据是原始的read count,可以通过featureCounts、HTSeq-count等软件包获得。
安装
数据不算大,PC应该够用了,所以在window下安装。EBSeq依赖的安装包blockmodeling
一直无法成功安装,隔了两天再看,才想到要去仔细看报错文件,发现其中有特殊字符无法识别,索性直接改了,就能直接加载啦。
source("http://bioconductor.org/biocLite.R") biocLite("EBSeq")require(EBSeq) 载入需要的程辑包:blockmodelingError: package or namespace load failed for ‘blockmodeling’: attachNamespace()里算'blockmodeling'时.onAttach失败了,详细内容: 调用: parse(con) 错误: 16:28: 意外的INCOMPLETE_STRING15: journal = "Social Networks",16: author = as.person("Ale ^ 错误: 无法载入程辑包‘blockmodeling’ In addition: Warning message: In parse(con) : invalid input found on input connection 'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION' library(blockmodeling) #出现同样的报错
修改文件'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION'
的内容,其中对应报错的行有Aleš Žiberna
——带帽子的字符,删掉或者改成别的常见字符就OK了。当然如果担心以后出类似问题,改起来琐碎,可以直接设置R的识别语言。
Note:搜索时使用关键词INCOMPLETE_STRING
,而非最后的invalid input found on input connection
。搜索方向决定了解决问题的效率。
library(EBSeq) 载入需要的程辑包:gplots 载入程辑包:‘gplots’ The following object is masked from ‘package:stats’: lowess 载入需要的程辑包:testthat
实战
Example
#exampledata(GeneMat) head(GeneMat)#[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]#Gene_1 1879 2734 2369 2636 2188 9743 9932 10099 9829 9831#Gene_2 24 40 22 27 31 118 108 144 117 113Sizes=MedianNorm(GeneMat) #EBSeq requires the library size factor ls for each sample s. Here, ls may be obtained via the function MedianNorm, which reproduces the median normalization approach in DESeqConditions=as.factor(rep(c("C1","C2"),each=5)) #设置样本类型Conditions#[1] C1 C1 C1 C1 C1 C2 C2 C2 C2 C2#Levels: C1 C2EBOut = EBTest(GeneMat,Conditions=as.factor(rep(c("C1","C2"),each=5)),sizeFactors=Sizes, maxround=5) EBDERes=GetDEResults(EBOut, FDR=0.05) summary(EBDERes) str(EBDERes$DEfound) #a list of genes identified with 5% FDRhead(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each genestr(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq
实例
paste KD_In1.count KD_In2.count EGFP_In1.count EGFP_In2.count |awk '{print $1"\t"$2"\t"$4"\t"$6"\t"$8}'|sort -k 1 >combined.count head combined.count tail combined.count sed -i '1,2d'combined.count #删除1-2行sed -i '53800,$d' combined.count #删除53800-last 行sed -i '1igene\tKD1\tKD2\tEGFP1\tEGFP2' combined.count #增加首行data=read.table("combined.count",header=T,row.names=1) head(data)# EGFP1 EGFP2 KD1 KD2#ENSMUSG00000000001 281 243 295 233#ENSMUSG00000000003 0 0 0 0#ENSMUSG00000000028 20 25 25 12data=as.matrix(data)#pre-filtering 这里仅保留了readcounts总和>=10的基因keep <- rowSums(data[,1:4])>=10data <- data[keep,]#跑出结果require(EBSeq) Sizes=MedianNorm(data) EBOut = EBTest(data,Conditions=as.factor(rep(c("EGFP","KD"),each=2)),sizeFactors=Sizes, maxround=15) #maxround迭代次数,一般要求最后结果趋于收敛比较可信EBDERes=GetDEResults(EBOut, FDR=0.05) summary(EBDERes) str(EBDERes$DEfound) #a list of genes identified with 5% FDRhead(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each gene# PPEE PPDE#ENSMUSG00000000001 1.0000000 3.203313e-17#ENSMUSG00000000003 NA NAstr(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq#输出DEG(FDR<0.05)write.table(EBDERes$DEfound,file="DEG_FDR005.txt",quote=F,sep="\t",row.names=F,col.names=F)#计算 Fold Change (FC) of the raw data as well as the posterior FC of the normalized data.GeneFC=PostFC(EBOut) str(GeneFC) PlotPostVsRawFC(EBOut,GeneFC) #FC vs.Posterior FC的图write.table(GeneFC,file="FC.txt",quote=F,sep="\t") #注意$Direction
如何确定maxround?
查看参数值:EBOut$Alpha
,EBOut$P
,EBOut$Beta
,最后两个值相差<0.01
就差不多了。
关于差异表达(RNA-seq)
标准化方法不同+检验不同=多种组合/软件,用之前需要结合自己的样本量来考虑,多参考有相似实验设计的文献,常用的方法都跑一下,自己评估下结果差异,再做定夺。(研究本来就是充满了不确定性,一切都只能用“可能性”来定义,所以,采用同样参数仍然无法完全重复出文献中的结果也是常见。即使你心有芥蒂...)
针对配合meRIP-seq的RNA-seq,因为朋友需要看了几篇文献,基本都是2 vs. 2的样本量,文献里的作法也有差异。
直接取了fold change >2 (2017Cancer Cell-m6A Demethylase ALKBH5 Maintains Tumorigenicity of Glioblastoma Stem-like Cells by Sustaining FOXM1 Expression and Cell Proliferation Program)
EBSeq (2017Cancer Cell-FTO Plays an Oncogenic Role in Acute Myeloid Leukemia as a N6-Methyladenosine RNA Demethylase)
DESeq2
作者:fatlady
链接:https://www.jianshu.com/p/364650e8bd3e
共同学习,写下你的评论
评论加载中...
作者其他优质文章