使用BICseq2提取拷贝数变异
介绍BICseq2的用法。
实际上总共有三大步骤,针对无匹配样本的情况:
-
Get the uniquely mapped reads from the bam file (you may use the modified samtools as provided here).
-
Use BICseq2-norm to remove the biases in the data.
-
Use BICseq2-seg to detect CNVs based on the normalized data.
有匹配样本的情况:
-
Get the uniquely mapped reads from the case and the control genome bam files, respectively
-
Normalize the case and control genome individually using BICseq2-norm
-
Detect CNV in the case genome based on the normalized data of the case genome and the conrol genome.
1. 安装
# 创建一个新的环境
conda create -n BICseq2
conda activate BICseq2
# 下载
conda install -c bioconda bicseq2-norm
conda install -c bioconda bicseq2-seg
安装成功后,直接输入NBICseq-norm.pl
或NBICseq-seg.pl
会出现使用说明。
2. 准备seq文件
使用官方提供的修改后的samtools,下载:
wget https://compbio.med.harvard.edu/BIC-seq/BICseq2/samtools-0.1.7a_getUnique-0.1.3.tar.gz
gunzip samtools-0.1.7a_getUnique-0.1.3.tar.gz
tar -xvf samtools-0.1.7a_getUnique-0.1.3.tar.gz
注意,如果已经安装过samtools,使用修改后的samtools要写明新工具的路径。
创建一个存放结果的文件夹seq_file,运行时在该路径下运行:
cd /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file
/home/tzy/tools/samtools-0.1.7a_getUnique-0.1.3/samtools view -U BWA,ERR174341.hg19.1X,N,N /home/data/ERR174341.hg19.1.bam
基本语法:
samtools view [options] <in.bam>|<in.sam> [region1 [...]]
- -U :字符串,通常是<Aligner,OutputPrefix,ChromNameReport?,StrandReport?> or <Aligner,OutputPrefix,ChromNameReport?,StrandReport?minLen,maxLen> ,比如这里的
BWA,ERR174341.hg19,N,N
,aligner为BWA,输出的前缀是ERR174341.hg19.1X,染色体名和链不会报告。
输出文件:
ERR174341.hg19.1Xchr10.seq
ERR174341.hg19.1Xchr11.seq
ERR174341.hg19.1Xchr12.seq
...
3. 拆分染色体
该工具针对单条染色体处理,对下载的ucsc.hg19.fasta进行拆分,创建chromosome文件夹存储拆分后的fasta文件:
cd /home/data/sde/tzy/NA12878/ucsc.hg19/chromosome
faidx -x /home/data/sde/tzy/NA12878/ucsc.hg19/ucsc.hg19.fasta
4. 下载mappability文件
# download from https://compbio.med.harvard.edu/BIC-seq/
# Mappability files
# Human hg19 100mer
wget https://compbio.med.harvard.edu/BIC-seq/Mappability/hg19.CRG.100bp.tar.gz
gunzip hg19.CRG.100bp.tar.gz
tar -xvf hg19.CRG.100bp.tar
目录下文件:
hg19.CRC.100mer.chr10.txt
hg19.CRC.100mer.chr11.txt
hg19.CRC.100mer.chr12.txt
...
5. 生成configfile
生成一个faname文件【这里我只分析1-22条染色体】:
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
chr20
chr21
chr22
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
生成configFile,configFile.txt为输出的文件名:
file=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname
configFile=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/configFile.txt
# 检查file文件是否存在
if [ ! -f "$file" ]; then
echo "File '$file' does not exist."
exit 1
fi
# 生成config文件
echo -e "chromName\tfaFile\tMapFile\treadPosFile\tbinFileNorm" > "$configFile"
# 读取file文件中的每行,并生成config文件
while IFS= read -r prefix || [ -n "$prefix" ]; do
chrom="$prefix"
faFile="/home/data/sde/tzy/NA12878/ucsc.hg19/chromosome/${chrom}.fasta"
mapFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp/hg19.CRC.100mer.${chrom}.txt"
readPosFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file/ERR174341.hg19.1X${chrom}.seq"
binFileNorm="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/binFileNorm/${chrom}.norm.bin"
echo -e "${chrom}\t${faFile}\t${mapFile}\t${readPosFile}\t${binFileNorm}" >> "$configFile"
done < "$file"
echo "Config file generated: $configFile"
生成的configFile文件:
chromName faFile MapFile readPosFile binFileNorm
chr10 /home/data/sde/tzy/NA12878/ucsc.hg19/chromosome/chr10.fasta /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp/hg19.CRC.100mer.chr10.txt /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file/ERR174341.hg19.1Xchr10.seq /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/binFileNorm/chr10.norm.bin
...
官方文档对config文件的要求:

6. normalize
start=$(date +%s) # 记录脚本开始时间
cd /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script
NBICseq-norm.pl -l 100 -s 100 --tmp $file/tmp configFile.txt BICseq2-norm.txt
end=$(date +%s) # 记录脚本结束时间
runtime=$((end - start)) # 计算脚本运行时间
echo "脚本运行时间为:$runtime 秒"
参数:
-l=<int>: read length
-s=<int>: fragment size
-p=<float>: a subsample percentage: default 0.0002.
-b=<int>: bin the expected and observed as <int> bp bins; Default 100.
--gc_bin: if specified, report the GC-content in the bins
--NoMapBin: if specified, do NOT bin the reads according to the mappability
--bin_only: only bin the reads without normalization
--fig=<string>: plot the read count VS GC figure in the specified file (in pdf format)
--title=<string>: title of the figure
--tmp=<string>: the tmp directory;
normalize的输出BICseq2-norm.txt:
#FragmentLength 100
#ReadLen 100
#NumberOfExtendedBp 5
#Parameter estimates from the negative binomial model
# 265 lines
estimate std zvalue pvalue
(Intercept) -4.08729958461365 0.0814275642617771 -50.1955280336472 0
G_Start1 -0.0786228026868843 0.0396487170352376 -1.98298478654451 0.0473691289124137
C_Start1 -0.102595852102073 0.0406397501756252 -2.52451975365752 0.0115856493521211
...
7. 提取拷贝数
生成该步骤需要的config文件configFile2.txt
,提取的是configFile.txt
的第一列(染色体名称)和最后一列(normalize步骤生成的norm.bin文件),并使用分隔符\t
分开:
awk -v OFS='\t' '{print $1, $NF}' configFile.txt > configFile2.txt
cd /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script
NBICseq-seg.pl --bootstrap configFile2.txt CNV.txt
语法:
BICseq2-seg.pl [options] <configFile> <output>
<configFile> stores the necessary information for BICseq2-seg to detect CNV
<output> stores the final CNV detection results.
官方文档对config文件的要求:

附:第5步的复杂版
针对不同样本想要call的染色体不同的情况:根据hg19CRG.100bp目录下的文件名,来自动生成configfile:
#!/bin/bash
# 设置file文件名和config文件名
# 删除之前存在的文件
if [ -f "/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname" ]; then
rm /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname
fi
if [ -f "/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt" ]; then
rm /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt
fi
# 提取染色体名
ls /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp | grep hg19.CRC.100mer* > /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt
filepre=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt
# 提取每一行的 chr[num]
while IFS= read -r line || [ -n "$line" ]; do
# chr=$(echo "$line" | sed 's/.*\.chr\([0-9]*\)\.txt/\1/')
chr=$(echo "$line" | sed -n 's/.*\.chr\([1-9][0-9]*\)\.txt/\1/p')
echo "Extracted chr: $chr"
echo -e "chr$chr" >> "/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname"
done < "$filepre"
file=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname
configFile=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/configFile.txt
# 检查file文件是否存在
if [ ! -f "$file" ]; then
echo "File '$file' does not exist."
exit 1
fi
# 生成config文件
echo -e "chromName\tfaFile\tMapFile\treadPosFile\tbinFileNorm" > "$configFile"
# 读取file文件中的每行,并生成config文件
while IFS= read -r prefix || [ -n "$prefix" ]; do
chrom="$prefix"
faFile="/home/data/sde/tzy/NA12878/ucsc.hg19/chromosome/${chrom}.fasta"
mapFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp/hg19.CRC.100mer.${chrom}.txt"
readPosFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file/ERR174341.hg19.1X${chrom}.seq"
binFileNorm="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/binFileNorm/${chrom}.norm.bin"
echo -e "${chrom}\t${faFile}\t${mapFile}\t${readPosFile}\t${binFileNorm}" >> "$configFile"
done < "$file"
echo "Config file generated: $configFile"
这里hg19CRG.100bp目录下,我把以下几个文件单独放置了一个文件夹drop
,因为我这里分析只需要1-22条染色体:
hg19.CRC.100mer.chrM.txt
hg19.CRC.100mer.chrX.txt
hg19.CRC.100mer.chrY.txt