Using the seeker package together with docker, it’s easy to make fetching and processing of sequencing and microarray data completely reproducible. First pull the latest version of the socker image, which has seeker and its dependencies already installed.
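The image name used in the shell scripts below is ghcr.io/hugheylab/socker, so pulling the latest version should look something like this:

docker pull ghcr.io/hugheylab/socker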
The seeker package includes an example yaml file, R script, and shell script for fetching and processing a subset of an RNA-seq dataset. Here we’ll download the files from GitHub to avoid having to install the package locally:
urlBase = 'https://raw.githubusercontent.com/hugheylab/seeker/master/inst/extdata/'
for (filename in c('PRJNA600892.yml', 'run_seeker.R', 'run_seeker.sh')) {
  download.file(paste0(urlBase, filename), filename)}
PRJNA600892.yml:
study: 'PRJNA600892' # [string]

metadata:
  run: TRUE # [logical]
  bioproject: 'PRJNA600892' # [string]
  include: # [named list or NULL]
    colname: 'run_accession' # [string]
    values: ['SRR10876945', 'SRR10876946'] # [vector]
  # exclude # [named list or NULL]
    # colname # [string]
    # values # [vector]

fetch:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # overwrite # [logical or NULL]
  # keepSra # [logical or NULL]
  # prefetchCmd # [string or NULL]
  # prefetchArgs # [character vector or NULL]
  # fasterqdumpCmd # [string or NULL]
  # fasterqdumpArgs # [character vector or NULL]
  # pigzCmd # [string or NULL]
  # pigzArgs # [character vector or NULL]

trimgalore:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
  # pigzCmd # [string or NULL]

fastqc:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]

salmon:
  run: TRUE # [logical]
  indexDir: '~/refgenie_genomes/alias/mm10/salmon_partial_sa_index/default' # [string]
  # sampleColname # [string or NULL]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]

multiqc:
  run: TRUE # [logical]
  # cmd # [string or NULL]
  # args # [character vector or NULL]

tximport:
  run: TRUE # [logical]
  tx2gene: # [named list or NULL]
    organism: 'mmusculus' # [string]
    # version # [number or NULL]
    # filename # [string or NULL]
  countsFromAbundance: 'lengthScaledTPM' # [string]
  # ignoreTxVersion # [logical or NULL]
run_seeker.R:
# register a parallel backend for parallel processing
doParallel::registerDoParallel()

# command-line arguments: path to the yaml file, then the parent directory for output
cArgs = commandArgs(TRUE)
yamlPath = cArgs[1L]
parentDir = cArgs[2L]

params = yaml::read_yaml(yamlPath)
seeker::seeker(params, parentDir)
run_seeker.sh:
#!/bin/sh

docker run \
  --mount type=bind,src=`pwd`,dst=/home/rstudio/projects \
  -w /home/rstudio/projects \
  --rm \
  ghcr.io/hugheylab/socker \
  bash -c \
  "source /home/rstudio/miniconda3/etc/profile.d/conda.sh \
  && conda activate seeker \
  && refgenie pull mm10/salmon_partial_sa_index \
  && Rscript run_seeker.R PRJNA600892.yml ." \
  &> PRJNA600892_progress.log
Now simply run the shell script.
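A minimal way to do so, assuming the downloaded files are in your current working directory:

sh run_seeker.sh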
The output will appear in your working directory. You can follow seeker()’s progress using the log file. To process a different dataset, modify the yaml file and shell script accordingly. Note that this example uses “salmon_partial_sa_index” from refgenie to minimize computational requirements; for actual use, we recommend “salmon_sa_index”.
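Switching to the full index would presumably mean changing the refgenie pull line in run_seeker.sh and the indexDir in PRJNA600892.yml. The asset name and path below follow the pattern of the partial index above; they are assumptions, not verified against refgenie's asset registry:

refgenie pull mm10/salmon_sa_index   # in run_seeker.sh (assumed asset name)

indexDir: '~/refgenie_genomes/alias/mm10/salmon_sa_index/default' # in PRJNA600892.yml (assumed path)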
The seeker package also includes an example yaml file, R script, and shell script for fetching and processing a microarray dataset. Download the files to your working directory:
urlBase = 'https://raw.githubusercontent.com/hugheylab/seeker/master/inst/extdata/'
for (filename in c('GSE25585.yml', 'run_seeker_array.R', 'run_seeker_array.sh')) {
  download.file(paste0(urlBase, filename), filename)}
GSE25585.yml:
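# hypothetical sketch -- the keys mirror the arguments that run_seeker_array.R
# passes to seekerArray(); the geneIdType value here is a placeholder, not
# taken from the actual file
study: 'GSE25585' # [string]
geneIdType: 'ensembl' # [string]; placeholder value
# platform # [string or NULL]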
run_seeker_array.R:
# command-line arguments: path to the yaml file, then the parent directory for output
cArgs = commandArgs(TRUE)
params = yaml::read_yaml(cArgs[1L])
parentDir = cArgs[2L]

seeker::seekerArray(
  study = params$study, geneIdType = params$geneIdType,
  platform = params$platform, parentDir)
run_seeker_array.sh:
#!/bin/sh

docker run \
  --mount type=bind,src=`pwd`,dst=/home/rstudio/projects \
  -w /home/rstudio/projects \
  --rm \
  ghcr.io/hugheylab/socker \
  bash -c "Rscript run_seeker_array.R GSE25585.yml ." \
  &> GSE25585_progress.log
Now simply run the shell script.
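As before, assuming the downloaded files are in your current working directory:

sh run_seeker_array.sh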
The output will appear in your working directory. You can follow seekerArray()’s progress using the log file. To process a different dataset, modify the yaml file and shell script accordingly.