Download fastqs from NCBI

Silas.Kieser · January 20, 2021, 10:27am

Hello,

I am trying to download 1000 FASTQ files from NCBI.
I am wondering which is the more limiting factor: the internet connection or the filesystem I/O?

Yann.Sagon · January 21, 2021, 10:34am

Hi,

You say you are trying: is something not working, or is it just slow?

1000 files at the same time? With which tool? Where are you downloading them?

If you download from login2.baobab, the bandwidth is 10Gb. After that, however, the traffic goes through the shared UNIGE network and then the shared internet connection.

Silas.Kieser · January 21, 2021, 1:55pm

I am following the official instructions here: https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
As I understand it, the first step is to download a binary file, and the second is to extract the FASTQ reads from it. The extraction step can be run in parallel, so I submit it to Slurm.

I limit the downloads to 10 files.
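
A minimal sketch of this two-step workflow, assuming `prefetch` and `fasterq-dump` from sra-tools are on the PATH and reading "10 files" as at most 10 downloads running at once. The function names, directory arguments, and thread count are hypothetical and only illustrate the idea.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: "limit the downloads to 10 files" means 10 concurrent downloads.
MAX_PARALLEL_DOWNLOADS = 10


def prefetch_cmd(accession, sra_dir):
    """Build the command that downloads the binary SRA file for one accession."""
    return ["prefetch", accession, "--output-directory", sra_dir]


def fasterq_dump_cmd(accession, fastq_dir, threads=4):
    """Build the command that extracts FASTQ reads for one accession.

    Assumption: the command is run from the directory where prefetch stored
    the accession, so fasterq-dump finds the local copy.
    """
    return [
        "fasterq-dump", accession,
        "--outdir", fastq_dir,
        "--threads", str(threads),
    ]


def download_all(accessions, sra_dir):
    """Run prefetch for every accession, at most MAX_PARALLEL_DOWNLOADS at a time."""
    def run(accession):
        return subprocess.run(prefetch_cmd(accession, sra_dir), check=True)

    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_DOWNLOADS) as pool:
        # list() forces evaluation so a failed download raises here.
        list(pool.map(run, accessions))
```

The download step would run on the login node, while each `fasterq_dump_cmd(...)` could be placed in its own Slurm job, since extraction does not need the internet connection.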

Silas.Kieser · January 21, 2021, 2:03pm

Does it make sense to submit a script to the Slurm cluster that downloads the files?

I tested it: on the Slurm cluster I sometimes get a timeout error after an hour, whereas on the login node the download of the binary file takes 5 to 15 min.

Also, the extraction creates plain-text files. Does it make sense to write them to the cluster's /scratch and then gzip them for storage on the filesystem, even though I think the next step would decompress them again and work with them in memory?
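
The "extract to scratch, then gzip for storage" idea could look roughly like the sketch below. The function name and both paths are hypothetical, and it uses only Python's standard library rather than any sra-tools feature.

```python
import gzip
import shutil
from pathlib import Path


def gzip_to_storage(fastq_path, storage_dir):
    """Compress a plain-text FASTQ file into storage_dir as <name>.gz.

    The source file (e.g. on scratch) is left untouched; removing it
    afterwards is up to the caller.
    """
    src = Path(fastq_path)
    dst = Path(storage_dir) / (src.name + ".gz")
    # Stream the file in chunks so it never has to fit in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```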

Yann.Sagon · January 21, 2021, 3:21pm

No. The bandwidth to the internet is 1Gb on a compute node vs 10Gb on the login node.

If sooner or later you need to decompress the files, store them uncompressed. If you can work directly from compressed files, store them compressed.
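
As an illustration of working directly from compressed files, here is a small sketch that counts reads in a gzip-compressed FASTQ file without writing an uncompressed copy. It assumes the common four-lines-per-record FASTQ layout; the function name is hypothetical.

```python
import gzip


def count_reads(fastq_gz_path):
    """Count records in a gzip-compressed FASTQ file.

    Assumption: every record spans exactly four lines.
    """
    # "rt" decompresses on the fly and yields text lines one at a time.
    with gzip.open(fastq_gz_path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

If the downstream tools can read `.gz` input in a similar streaming fashion, keeping the files compressed saves storage without an extra decompression step.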

Silas.Kieser · January 22, 2021, 6:20am

Thank you for your recommendations!