Downloading ICGC data with score-client

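  1. Open DATA REPOSITORIES on the ICGC portal and filter for the data you need. For example, I wanted to download the BAM file from PCAWG whose icgc_specimen_id is SP117655. Searching the repository for the specimen ID returns no files, but it does show the corresponding donor, so you can search again by donor ID.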


As an aside: to confirm the donor ID, or to look up other IDs such as aliquot_id, go to DCC DATA RELEASES > PCAWG > data_releases > latest > pcawg_sample_sheet.v1.4.2016-09-14.tsv on that page, or download whichever sample sheet is newest.
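Once the sample sheet is downloaded, the ID lookup can be done on the command line. The following is a minimal sketch that assumes only that the sheet is tab-separated with a header row; the function name lookup_id is made up for illustration, and no particular column names are assumed.

```shell
# lookup_id SHEET ID: print every row in which some field equals ID exactly,
# one "column=value" pair per line, with a blank line after each matching row.
# Assumption: SHEET is a tab-separated file whose first line is the header.
lookup_id() {
  sheet="$1"; id="$2"
  awk -F'\t' -v id="$id" '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
    {
      hit = 0
      for (i = 1; i <= NF; i++) if ($i == id) hit = 1
      if (hit) {
        for (i = 1; i <= NF; i++) print name[i] "=" $i
        print ""
      }
    }
  ' "$sheet"
}

# Usage:
# lookup_id pcawg_sample_sheet.v1.4.2016-09-14.tsv SP117655
```

Printing one field per line makes it easier to spot the donor and aliquot IDs in a wide table.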

If you later call variants yourself with Mutect2 instead of using the VCF files provided by ICGC, you will run into the following error:

A USER ERROR has occurred: Bad input: Sample 80ae5df80bd3e08383ed80f48e4dab58 is not in BAM header: [1e2dcbcc-771c-43c5-8c8d-e0eb77cb3494, 8205d14c-6f7e-4592-b8e1-76e3bc5f9613]

80ae5df80bd3e08383ed80f48e4dab58 is the prefix of the file we downloaded. Inspecting the BAM header shows that it is not the real sample name:

samtools view -h 8205d14c-6f7e-4592-b8e1-76e3bc5f9613.bam | head -n 90 | grep '@RG'
@RG     ID:CACR:NA096_1 PL:ILLUMINA     CN:CACR PI:170  DT:2014-04-16T00:00:00+00:00    LB:WGS:CACR:WHC107      SM:8205d14c-6f7e-4592-b8e1-76e3bc5f9613 PU:CACR:1_4     PG:fastqtobam
@RG     ID:CACR:NA096_2 PL:ILLUMINA     CN:CACR PI:170  DT:2014-04-16T00:00:00+00:00    LB:WGS:CACR:WHC107      SM:8205d14c-6f7e-4592-b8e1-76e3bc5f9613 PU:CACR:1_5     PG:fastqtobam
@RG     ID:CACR:NA096_3 PL:ILLUMINA     CN:CACR PI:170  DT:2014-04-16T00:00:00+00:00    LB:WGS:CACR:WHC107      SM:8205d14c-6f7e-4592-b8e1-76e3bc5f9613 PU:CACR:1_6     PG:fastqtobam

The file therefore needs to be renamed to the sample name given in the SM field:

mv 80ae5df80bd3e08383ed80f48e4dab58.bam 8205d14c-6f7e-4592-b8e1-76e3bc5f9613.bam
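To avoid reading the sample name off the header by eye, the SM values can be pulled out of the @RG lines directly. This is a small sketch: the function name rg_sample_names is made up for illustration, and it assumes the header fields are tab-separated, as they are in a real SAM header.

```shell
# rg_sample_names: read SAM header text on stdin and print the unique
# SM (sample name) values found on @RG lines, one per line.
rg_sample_names() {
  awk -F'\t' '
    /^@RG/ {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SM:/) print substr($i, 4)   # drop the leading "SM:"
    }
  ' | sort -u
}

# Usage (assumes samtools is installed; the file name is the one from this post):
# samtools view -H 80ae5df80bd3e08383ed80f48e4dab58.bam | rg_sample_names
```

The names this prints should correspond to the ones listed in square brackets in the Mutect2 error message, since that list is taken from the BAM header.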

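  1. Under DATA TYPE, select Aligned Reads to get the BAM files. Four BAM files are listed here: the first two are only a few hundred MB each, and the last two are several hundred GB each. The small ones are mini BAMs, which judging by their coverage may come from shallow whole-genome sequencing; the large ones are deep whole-genome sequencing results. Note that two of the four files belong to the matched normal sample. For my task I selected all four and clicked download.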

  1. Because I use score-client, I chose Collaboratory - Toronto as the Repository.

  1. Extract the downloaded archive with tar xvzf to get manifest.collaboratory.1697079740643.tsv. Its contents look like this (only the first two lines are shown):
repo_code	file_id	object_id	file_format	file_name	file_size	md5_sum	index_object_id	donor_id/donor_count	project_id/project_count	study
collaboratory	FI42053	fb96fd5f-5a6f-5fc7-baf9-2fd8f43591df	BAM	73c4a73954614dda9be097dfb8305bd8.bam	128476988498	73c4a73954614dda9be097dfb8305bd8	4634f96e-7e96-5c0d-9cc8-f08f9a305001	DO218440	BTCA-SG	PCAWG
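Because the files can be very large, it helps to check their sizes before starting the download. The sketch below reads the file_name and file_size columns (columns 5 and 6 in the header shown above) and prints each size in GiB; the function name manifest_sizes is made up for illustration.

```shell
# manifest_sizes MANIFEST: print "file_name<TAB>size in GiB" for each data row.
# Assumption: the manifest is tab-separated with the column order shown above,
# so file_name is column 5 and file_size (in bytes) is column 6.
manifest_sizes() {
  awk -F'\t' 'NR > 1 { printf "%s\t%.1f\n", $5, $6 / (1024 * 1024 * 1024) }' "$1"
}

# Usage:
# manifest_sizes manifest.collaboratory.1697079740643.tsv
```

Comparing the listed sizes with the free space on the output disk (for example with df -h) avoids a download that fails partway through.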
  1. Download, install, and configure score-client

Install Java 11:

apt-get install openjdk-11-jdk

Download the tool:

wget -O score-client.tar.gz https://artifacts.oicr.on.ca/artifactory/dcc-release/bio/overture/score-client/%5BRELEASE%5D/score-client-%5BRELEASE%5D-dist.tar.gz

tar xvzf score-client.tar.gz

cd score-client-5.9.0

To configure it, edit conf/application.properties and write in your token (for controlled data you need not only an account but also an approved data access application).

accessToken=your token
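If you prefer not to edit the file by hand, the token line can be written with a small helper. This is only a sketch: the function name set_access_token and the ICGC_TOKEN variable are made up for illustration, and the only thing taken from the tool itself is the accessToken key shown above.

```shell
# set_access_token CONF TOKEN: replace any existing accessToken line in CONF
# with the given token, or append one if none exists.
set_access_token() {
  conf="$1"; token="$2"
  tmp="$(mktemp)"
  # Keep every line except an existing accessToken entry.
  # grep exits non-zero when it prints nothing, which is fine here.
  grep -v '^accessToken=' "$conf" > "$tmp" || true
  printf 'accessToken=%s\n' "$token" >> "$tmp"
  mv "$tmp" "$conf"
}

# Usage (ICGC_TOKEN is a hypothetical environment variable holding your token):
# set_access_token conf/application.properties "$ICGC_TOKEN"
```

Reading the token from an environment variable keeps it out of your shell history and out of any notes you share.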
  1. Download the data
bin/score-client --profile collab download --manifest ./manifest.collaboratory.1697079740643.tsv --output-dir /home/data/sda/tzy/pcawg

The download then starts.
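After a multi-hundred-GB transfer, an independent checksum comparison can be reassuring. The sketch below assumes that the md5_sum column of the manifest holds the MD5 of the file contents and that each file is saved under its file_name in the output directory; the function name verify_md5 is made up for illustration. Hashing files of this size takes a while.

```shell
# verify_md5 MANIFEST DIR: compare the MD5 of each downloaded file in DIR with
# the md5_sum column of the manifest. Prints "OK name" or "MISMATCH name" and
# returns non-zero if any file is missing or differs.
# Assumption: file_name is column 5 and md5_sum is column 7, as in the header above.
verify_md5() {
  manifest="$1"; dir="$2"; status=0
  while read -r name expected; do
    [ -n "$name" ] || continue
    actual="$(md5sum "$dir/$name" 2>/dev/null | cut -d' ' -f1)"
    if [ "$actual" = "$expected" ]; then
      echo "OK $name"
    else
      echo "MISMATCH $name"
      status=1
    fi
  done <<EOF
$(awk -F'\t' 'NR > 1 { print $5, $7 }' "$manifest")
EOF
  return "$status"
}

# Usage:
# verify_md5 ./manifest.collaboratory.1697079740643.tsv /home/data/sda/tzy/pcawg
```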

Collaboratory - Toronto

For Collaboratory - Toronto, use score-client --profile collab download with the manifest, as described above.

EGA - Hinxton

For EGA - Hinxton, you can use EGA's own download method. The downloaded manifest contains the EGA file ID, like this:

file_ids="EGAF00001198675"
mapping="{EGAF00001198675=FI39832}"

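Pass the EGA file ID straight to pyega3 to download the file. There is no need to edit the downloaded manifest to fill in your username and password: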

pyega3 -c 20 -cf /home/tzy/tools/default_credential_file.json fetch EGAF00001156307
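When the manifest lists several files, the EGAF IDs can be collected and fetched in a loop. This sketch assumes only that the IDs appear in the manifest as EGAF followed by digits, as in the excerpt above; the function name ega_file_ids and the manifest file name in the usage comment are made up for illustration.

```shell
# ega_file_ids MANIFEST: print the unique EGAF file IDs mentioned in MANIFEST.
ega_file_ids() {
  grep -o 'EGAF[0-9][0-9]*' "$1" | sort -u
}

# Usage, with the same pyega3 options as in the command above
# (ega_manifest.sh is a placeholder for your downloaded manifest):
# for id in $(ega_file_ids ega_manifest.sh); do
#   pyega3 -c 20 -cf /home/tzy/tools/default_credential_file.json fetch "$id"
# done
```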

AWS

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

PDC - Chicago

Linux version:

mkdir -p ~/.gen3
echo "" >> ~/.bashrc
echo "export PATH=\$PATH:~/.gen3" >> ~/.bashrc
wget https://github.com/uc-cdis/cdis-data-client/releases/download/1.2.0/dataclient_linux.zip
unzip dataclient_linux.zip
mv gen3-client ~/.gen3/
rm dataclient_linux.zip
source ~/.bashrc

Installation is complete.

Download the manifest, for example manifest.pdc.1706164029030.sh:

#!/bin/bash
#
# ICGC PDC Script Manifest
#
# Script to download ICGC files in PDC. Generated from https://dcc.icgc.org/repositories.
# Requires AWS CLI to be installed:
#
#   https://aws.amazon.com/cli/
#
# For information on the PDC, please see:
#
#   https://bionimbus-pdc.opensciencedatacloud.org/
#
# Script assumes a valid ~/.aws/credentials file with a valid `pdc` profile:
# 
#   $ cat ~/.aws/credentials
#   [pdc]
#   aws_access_key_id=<your key id here>
#   aws_secret_access_key=<your secret here>
#

aws --profile pdc --endpoint-url https://bionimbus-objstore-cs.opensciencedatacloud.org s3 cp s3://pcawg-tcga-read-us/637cddda-6b05-58cc-90a7-1c5cdbe6845e .
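Before converting the manifest, it can be useful to see which objects it refers to. The sketch below simply extracts the s3:// URLs from the script; the function name pdc_s3_urls is made up for illustration, and it assumes each object appears as an s3:// URL on an aws ... s3 cp line like the one above.

```shell
# pdc_s3_urls MANIFEST_SH: print every s3:// URL found in the script manifest.
pdc_s3_urls() {
  grep -o 's3://[^[:space:]]*' "$1"
}

# Usage:
# pdc_s3_urls manifest.pdc.1706164029030.sh          # list the objects
# pdc_s3_urls manifest.pdc.1706164029030.sh | wc -l  # count them
```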

First convert this manifest into a format that gen3-client understands. Download the conversion script and run it:

wget https://raw.githubusercontent.com/uc-cdis/pdc_tools/1.0/dcc_manifest_conversion/dcc_to_gen3.py
python3 dcc_to_gen3.py --manifest manifest.pdc.1706164029030.sh

This generates the file gen3_manifest_manifest.pdc.1706164029030.sh.json.

Obtain an API key as described on the official website to get a credentials.json file, then register it under a profile name (here icgc):

gen3-client configure  --profile=icgc --cred=credentials.json --apiendpoint=https://icgc.bionimbus.org/

# 2024/01/25 15:31:03 Profile 'icgc' has been configured successfully.

Then you can start the download:

gen3-client download-multiple --profile=icgc --manifest=gen3_manifest_manifest.pdc.1706164029030.sh.json  --no-prompt

The download is fast.

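Note: the Linux package of the 2024-01-23 release actually contains the macOS build, and running it fails with -bash: /home/tzy/.gen3/gen3-clinet: cannot execute binary file: Exec format error. You can check whether a binary is the Linux build as follows: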

ldd /home/tzy/.gen3/gen3-clinet
# not a dynamic executable

file /home/tzy/.gen3/gen3-clinet
#/home/tzy/.gen3/gen3-clinet: Mach-O 64-bit x86_64 executable

For the Linux build, the output should look like this:

file ./gen3-client 
#./gen3-client: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=iNUQ0bctPp9-WreqEiYr/qISPwqavMmPLx4gTtDAt/Bg15GVruXZ8l7ngd0j56/FrgyY8yBc4w-JRhYoQsm, not stripped
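If the file command is not available, the same check can be done by looking at the first four bytes of the binary. This is a minimal sketch with a made-up function name, binary_format; it recognizes only the ELF magic number and the 64-bit little-endian Mach-O magic number, and reports anything else as unknown.

```shell
# binary_format PATH: classify an executable by its first four bytes.
binary_format() {
  # Dump the first 4 bytes as lowercase hex with no separators.
  magic="$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')"
  case "$magic" in
    7f454c46) echo "ELF" ;;            # 0x7f 'E' 'L' 'F': Linux executable
    cffaedfe) echo "Mach-O 64-bit" ;;  # 64-bit little-endian macOS executable
    *)        echo "unknown" ;;
  esac
}

# Usage:
# binary_format ~/.gen3/gen3-client
```

A result other than ELF on a Linux machine means the wrong build was downloaded.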
