使用BICseq2提取拷贝数变异

介绍BICseq2的用法。


实际上总共有三大步骤,针对无匹配样本的情况:

  • Get the uniquely mapped reads from the bam file (you may use the modified samtools as provided here).

  • Use BICseq2-norm to remove the biases in the data.

  • Use BICseq2-seg to detect CNVs based on the normalized data.

有匹配样本的情况:

  • Get the uniquely mapped reads from the case and the control genome bam files, respectively

  • Normalize the case and control genome individually using BICseq2-norm

  • Detect CNV in the case genome based on the normalized data of the case genome and the conrol genome.

1. 安装

# 创建一个新的环境
conda create -n BICseq2 
conda activate BICseq2

# 下载
conda install -c bioconda bicseq2-norm
conda install -c bioconda bicseq2-seg

安装成功后,直接输入NBICseq-norm.plNBICseq-seg.pl会出现使用说明。

2. 准备seq文件

使用官方提供的修改后的samtools,下载:

wget https://compbio.med.harvard.edu/BIC-seq/BICseq2/samtools-0.1.7a_getUnique-0.1.3.tar.gz
gunzip samtools-0.1.7a_getUnique-0.1.3.tar.gz
tar -xvf samtools-0.1.7a_getUnique-0.1.3.tar.gz

注意,如果已经安装过samtools,使用修改后的samtools要写明新工具的路径。

创建一个存放结果的文件夹seq_file,运行时在该路径下运行:

cd /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file
/home/tzy/tools/samtools-0.1.7a_getUnique-0.1.3/samtools view -U BWA,ERR174341.hg19.1X,N,N /home/data/ERR174341.hg19.1.bam

基本语法:

samtools view [options] <in.bam>|<in.sam> [region1 [...]]
  • -U :字符串,通常是<Aligner,OutputPrefix,ChromNameReport?,StrandReport?> or <Aligner,OutputPrefix,ChromNameReport?,StrandReport?minLen,maxLen> ,比如这里的BWA,ERR174341.hg19,N,N,aligner为BWA,输出的前缀是ERR174341.hg19.1X,染色体名和链不会报告。

输出文件:

ERR174341.hg19.1Xchr10.seq
ERR174341.hg19.1Xchr11.seq
ERR174341.hg19.1Xchr12.seq
...

3. 拆分染色体

该工具针对单条染色体处理,对下载的ucsc.hg19.fasta进行拆分,创建chromosome文件夹存储拆分后的fasta文件:

cd /home/data/sde/tzy/NA12878/ucsc.hg19/chromosome
faidx -x /home/data/sde/tzy/NA12878/ucsc.hg19/ucsc.hg19.fasta

4. 下载mappability文件

# download from https://compbio.med.harvard.edu/BIC-seq/
# Mappability files
# Human hg19 100mer
wget https://compbio.med.harvard.edu/BIC-seq/Mappability/hg19.CRG.100bp.tar.gz
gunzip hg19.CRG.100bp.tar.gz
tar -xvf hg19.CRG.100bp.tar

目录下文件:

hg19.CRC.100mer.chr10.txt  
hg19.CRC.100mer.chr11.txt  
hg19.CRC.100mer.chr12.txt
...

5. 生成configfile

生成一个faname文件【这里我只分析1-22条染色体】:

chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
chr20
chr21
chr22
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9

生成configFile,configFile.txt为输出的文件名:

file=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname
configFile=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/configFile.txt

# 检查file文件是否存在
if [ ! -f "$file" ]; then
    echo "File '$file' does not exist."
    exit 1
fi

# 生成config文件
echo -e "chromName\tfaFile\tMapFile\treadPosFile\tbinFileNorm" > "$configFile"

# 读取file文件中的每行,并生成config文件
while IFS= read -r prefix || [ -n "$prefix" ]; do
    chrom="$prefix"
    faFile="/home/data/sde/tzy/NA12878/ucsc.hg19/chromosome/${chrom}.fasta"
    mapFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp/hg19.CRC.100mer.${chrom}.txt"
    readPosFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file/ERR174341.hg19.1X${chrom}.seq"
    binFileNorm="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/binFileNorm/${chrom}.norm.bin"

    echo -e "${chrom}\t${faFile}\t${mapFile}\t${readPosFile}\t${binFileNorm}" >> "$configFile"
done < "$file"

echo "Config file generated: $configFile"

生成的configFile文件:

chromName       faFile  MapFile readPosFile     binFileNorm
chr10   /home/data/sde/tzy/NA12878/ucsc.hg19/chromosome/chr10.fasta     /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp/hg19.CRC.100mer.chr10.txt /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file/ERR174341.hg19.1Xchr10.seq   /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/binFileNorm/chr10.norm.bin
...

官方文档对config文件的要求:

6. normalize

start=$(date +%s) # 记录脚本开始时间

cd /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script

NBICseq-norm.pl -l 100 -s 100 --tmp $file/tmp configFile.txt BICseq2-norm.txt

end=$(date +%s) # 记录脚本结束时间
runtime=$((end - start)) # 计算脚本运行时间

echo "脚本运行时间为:$runtime 秒"

参数:

        -l=<int>: read length
        -s=<int>: fragment size
        -p=<float>: a subsample percentage: default 0.0002.
        -b=<int>: bin the expected and observed as <int> bp bins; Default 100.
        --gc_bin: if specified, report the GC-content in the bins
        --NoMapBin: if specified, do NOT bin the reads according to the mappability
        --bin_only: only bin the reads without normalization
        --fig=<string>: plot the read count VS GC figure in the specified file (in pdf format)
        --title=<string>: title of the figure
        --tmp=<string>: the tmp directory;

normalize的输出BICseq2-norm.txt:

#FragmentLength 100
#ReadLen 100
#NumberOfExtendedBp 5
#Parameter estimates from the negative binomial model
# 265 lines
estimate        std     zvalue  pvalue
(Intercept)     -4.08729958461365       0.0814275642617771      -50.1955280336472       0
G_Start1        -0.0786228026868843     0.0396487170352376      -1.98298478654451       0.0473691289124137
C_Start1        -0.102595852102073      0.0406397501756252      -2.52451975365752       0.0115856493521211
...

7. 提取拷贝数

生成该步骤需要的config文件configFile2.txt,提取的是configFile.txt的第一列(染色体名称)和最后一列(normalize步骤生成的norm.bin文件),并使用分隔符\t分开:

awk -v OFS='\t' '{print $1, $NF}' configFile.txt > configFile2.txt

cd /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script
NBICseq-seg.pl --bootstrap configFile2.txt CNV.txt

语法:

BICseq2-seg.pl [options] <configFile> <output>

<configFile> stores the necessary information for BICseq2-seg to detect CNV
<output> stores the final CNV detection results.

官方文档对config文件的要求:

附:第5步的复杂版

针对不同样本想要call的染色体不同的情况:根据hg19CRG.100bp目录下的文件名,来自动生成configfile:

#!/bin/bash

# 设置file文件名和config文件名
# 删除之前存在的文件
if [ -f "/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname" ]; then
    rm /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname
fi

if [ -f "/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt" ]; then
    rm /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt
fi

# 提取染色体名
ls /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp | grep hg19.CRC.100mer* > /home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt 
filepre=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname.txt

# 提取每一行的 chr[num]
while IFS= read -r line || [ -n "$line" ]; do
#    chr=$(echo "$line" | sed 's/.*\.chr\([0-9]*\)\.txt/\1/')
    chr=$(echo "$line" | sed -n 's/.*\.chr\([1-9][0-9]*\)\.txt/\1/p')
    echo "Extracted chr: $chr"
    echo -e "chr$chr" >> "/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname"
done < "$filepre"

file=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/faname
configFile=/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/configFile.txt

# 检查file文件是否存在
if [ ! -f "$file" ]; then
    echo "File '$file' does not exist."
    exit 1
fi

# 生成config文件
echo -e "chromName\tfaFile\tMapFile\treadPosFile\tbinFileNorm" > "$configFile"

# 读取file文件中的每行,并生成config文件
while IFS= read -r prefix || [ -n "$prefix" ]; do
    chrom="$prefix"
    faFile="/home/data/sde/tzy/NA12878/ucsc.hg19/chromosome/${chrom}.fasta"
    mapFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/script/hg19CRG.100bp/hg19.CRC.100mer.${chrom}.txt"
    readPosFile="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/seq_file/ERR174341.hg19.1X${chrom}.seq"
    binFileNorm="/home/data/sde/tzy/NA12878/raw_data/ERR174341/BICseq2/binFileNorm/${chrom}.norm.bin"

    echo -e "${chrom}\t${faFile}\t${mapFile}\t${readPosFile}\t${binFileNorm}" >> "$configFile"
done < "$file"

echo "Config file generated: $configFile"

这里hg19CRG.100bp目录下,我把以下几个文件单独放置了一个文件夹drop,因为我这里分析只需要1-22条染色体:

hg19.CRC.100mer.chrM.txt  
hg19.CRC.100mer.chrX.txt  
hg19.CRC.100mer.chrY.txt

参考资料:

comments powered by Disqus