Title: | Simplified Fetching and Processing of Microarray and RNA-Seq Data |
---|---|
Description: | Wrapper around various existing tools and command-line interfaces, providing a standard interface, simple parallelization, and detailed logging. For microarray data, maps probe sets to standard gene IDs, building on 'GEOquery' Davis and Meltzer (2007) <doi:10.1093/bioinformatics/btm254>, 'ArrayExpress' Kauffmann et al. (2009) <doi:10.1093/bioinformatics/btp354>, Robust multi-array average 'RMA' Irizarry et al. (2003) <doi:10.1093/biostatistics/4.2.249>, and 'BrainArray' Dai et al. (2005) <doi:10.1093/nar/gni179>. For RNA-seq data, fetches metadata and raw reads from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA), performs standard adapter and quality trimming using 'TrimGalore' Krueger <https://github.com/FelixKrueger/TrimGalore>, performs quality control checks using 'FastQC' Andrews <https://github.com/s-andrews/FastQC>, quantifies transcript abundances using 'salmon' Patro et al. (2017) <doi:10.1038/nmeth.4197> and potentially 'refgenie' Stolarczyk et al. (2020) <doi:10.1093/gigascience/giz149>, aggregates the results using 'MultiQC' Ewels et al. (2016) <doi:10.1093/bioinformatics/btw354>, maps transcripts to genes using 'biomaRt' Durinck et al. (2009) <doi:10.1038/nprot.2009.97>, and summarizes transcript-level quantifications for gene-level analyses using 'tximport' Soneson et al. (2015) <doi:10.12688/f1000research.7563.2>. |
Authors: | Jake Hughey [aut, cre], Josh Schoenbachler [aut] |
Maintainer: | Jake Hughey <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.6 |
Built: | 2024-11-24 06:13:00 UTC |
Source: | https://github.com/hugheylab/seeker |
This function checks whether the command-line tools used by seeker are accessible in the expected places.
checkDefaultCommands(keepIdx = FALSE)
keepIdx |
Logical indicating whether to keep the |
A data.table with columns for command, path, and version.
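For a quick look at which tools are on the PATH without loading seeker, a rough base-R stand-in is sketched below. It is not how checkDefaultCommands() is implemented: the tool list is simply the set of default command names that appear in the usage sections of this manual, and `findTools` is a hypothetical helper that does not report versions.

```r
# Default command names taken from the usage sections of this manual.
tools = c("prefetch", "fasterq-dump", "pigz", "fastqc", "fastq_screen",
          "trim_galore", "salmon", "multiqc")

# Hypothetical helper: look up each tool on the PATH using base R only.
findTools = function(tools) {
  paths = Sys.which(tools)  # named character vector; "" means not found
  data.frame(
    command = names(paths),
    path = unname(paths),
    found = unname(!is.na(paths) & nzchar(paths)),
    stringsAsFactors = FALSE)
}

toolTable = findTools(tools)
print(toolTable)
```

Rows with `found` equal to FALSE point to tools that would need to be installed or passed explicitly through the relevant `cmd` argument.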
This function calls fastqc using system2(). To run in parallel, register a parallel backend, e.g., using doParallel::registerDoParallel().
fastqc(filepaths, outputDir = "fastqc_output", cmd = "fastqc", args = NULL)
filepaths |
Paths to fastq files. For single-end reads, each element should be a single filepath. For paired-end reads, each element can be two filepaths separated by ";". |
outputDir |
Directory in which to store output. Will be created if it doesn't exist. |
cmd |
Name or path of the command-line interface. |
args |
Additional arguments to pass to the command-line interface. |
A vector of exit codes, invisibly.
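The filepaths convention is easy to get wrong for paired-end data, so here is a minimal sketch of building the vector. The file names are made up, and the call to fastqc() is guarded so that it only runs if seeker, the fastqc command, and the files are all available.

```r
# Hypothetical paired-end files: two samples, two mates each.
r1 = c("fastq/sampleA_1.fastq.gz", "fastq/sampleB_1.fastq.gz")
r2 = c("fastq/sampleA_2.fastq.gz", "fastq/sampleB_2.fastq.gz")

# One element per sample, with the two mates joined by ";".
filepaths = paste(r1, r2, sep = ";")

# Only attempt the real call if the package, the tool, and the files exist.
allFiles = unlist(strsplit(filepaths, ";", fixed = TRUE))
if (requireNamespace("seeker", quietly = TRUE) &&
    nzchar(Sys.which("fastqc")) && all(file.exists(allFiles))) {
  exitCodes = seeker::fastqc(filepaths, outputDir = "fastqc_output")
}
```

For single-end reads the vector would simply hold one path per element, with no ";" involved.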
This function calls fastq_screen using system2(). To run in parallel, register a parallel backend, e.g., using doParallel::registerDoParallel().
fastqscreen( filepaths, outputDir = "fastqscreen_output", cmd = "fastq_screen", args = c("--threads", foreach::getDoParWorkers(), "--conf", "~/FastQ_Screen_Genomes/fastq_screen.conf") )
filepaths |
Paths to fastq files. For single-end reads, each element should be a single filepath. For paired-end reads, each element can be two filepaths separated by ";". |
outputDir |
Directory in which to store output. Will be created if it doesn't exist. |
cmd |
Name or path of the command-line interface. |
args |
Additional arguments to pass to the command-line interface. |
A vector of exit codes, invisibly.
This function uses the NCBI SRA Toolkit via system2() to download files from SRA and convert them to fastq.gz. To process files in parallel, register a parallel backend, e.g., using doParallel::registerDoParallel(). Beware that intermediate files created by fasterq-dump are uncompressed and could require hundreds of gigabytes if files are processed in parallel.
fetch( accessions, outputDir, overwrite = FALSE, keepSra = FALSE, prefetchCmd = "prefetch", prefetchArgs = NULL, fasterqdumpCmd = "fasterq-dump", fasterqdumpArgs = NULL, pigzCmd = "pigz", pigzArgs = NULL )
accessions |
Character vector of SRA run accessions. |
outputDir |
String indicating the local directory in which to save the files. Will be created if it doesn't exist. |
overwrite |
Logical indicating whether to overwrite files that already
exist in |
keepSra |
Logical indicating whether to keep the ".sra" files. |
prefetchCmd |
String indicating command for prefetch, which downloads ".sra" files. |
prefetchArgs |
Character vector indicating arguments to pass to prefetch. |
fasterqdumpCmd |
String indicating command for fasterq-dump, which uses ".sra" files to create ".fastq" files. |
fasterqdumpArgs |
Character vector indicating arguments to pass to fasterq-dump. |
pigzCmd |
String indicating command for pigz, which converts ".fastq" files to ".fastq.gz" files. |
pigzArgs |
Character vector indicating arguments to pass to pigz. |
A list. As the function runs, it updates a tab-delimited log file in outputDir called "progress.tsv".
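Because the intermediate files can be large, it may help to screen accessions and cap the number of parallel workers before calling fetch(). The sketch below assumes run accessions follow the usual SRR/ERR/DRR prefix pattern; `isRunAccession` is a hypothetical helper, the accessions are placeholders, and the download itself is switched off by default.

```r
# Hypothetical helper: does a string look like an SRA/ENA/DDBJ run accession?
isRunAccession = function(x) grepl("^[SED]RR[0-9]+$", x)

# Placeholder values; the third is a project-style accession, not a run.
accessions = c("SRR0000001", "ERR0000002", "PRJNA000003")
runAccessions = accessions[isRunAccession(accessions)]

runFetch = FALSE  # set to TRUE (with real accessions) to actually download
if (runFetch && requireNamespace("seeker", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  # Two workers keeps the uncompressed fasterq-dump intermediates manageable.
  doParallel::registerDoParallel(cores = 2)
  result = seeker::fetch(runAccessions, outputDir = "fetch_output")
}
```

The choice of two workers is arbitrary; the right number depends on available disk space rather than on CPU count alone.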
This function can use the API of the European Nucleotide Archive (recommended) or the Sequence Read Archive.
fetchMetadata( bioproject, host = c("ena", "sra"), fields = c("study_accession", "sample_accession", "secondary_sample_accession", "sample_alias", "sample_title", "experiment_accession", "run_accession", "fastq_md5", "fastq_ftp", "fastq_aspera"), file = NULL )
bioproject |
String indicating bioproject accession. |
host |
String indicating from where to fetch the metadata. |
fields |
Character vector indicating which fields to fetch, if |
file |
String indicating output file path, if not |
A data.table.
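As an illustration of working with the returned fields, the sketch below infers read layout from the fastq_ftp column. It assumes that ENA reports multiple files for a run as a single semicolon-separated string; the rows and URLs are made up, a plain data.frame stands in for the data.table, and treating exactly two files as paired-end is a simplification.

```r
# Made-up rows mimicking two of the default fields; the URLs are placeholders.
metadata = data.frame(
  run_accession = c("RUN_A", "RUN_B"),
  fastq_ftp = c("host/path/RUN_A_1.fastq.gz;host/path/RUN_A_2.fastq.gz",
                "host/path/RUN_B.fastq.gz"),
  stringsAsFactors = FALSE)

# Split the semicolon-separated URLs and count files per run.
urls = strsplit(metadata$fastq_ftp, ";", fixed = TRUE)
metadata$n_files = lengths(urls)

# Simplified rule: two files means paired-end, anything else single-end.
metadata$layout = ifelse(metadata$n_files == 2, "paired", "single")
print(metadata[, c("run_accession", "n_files", "layout")])
```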
Get supported microarray platforms
getPlatforms(type = c("cdf", "mapping"))
type |
String indicating whether to get supported platforms for processing raw Affymetrix data using custom CDF or for mapping already processed data from probes to genes. |
A data.table.
Aggregate metadata from salmon quantifications
getSalmonMetadata(inputDir, outputDir = "data")
inputDir |
Directory that contains output from salmon. |
outputDir |
Directory in which to save the result, a file named
"salmon_meta_info.csv". If |
A data.table, invisibly.
seeker(), salmon()
This function uses the biomaRt package.
getTx2gene( organism = "mmusculus", version = NULL, outputDir = "data", checkArgsOnly = FALSE )
organism |
String used to pass |
version |
Passed to |
outputDir |
Directory in which to save the result, a file named
"tx2gene.csv.gz". If |
checkArgsOnly |
Logical indicating whether to only check function arguments. Used for testing. |
If checkArgsOnly is FALSE, a data.table based on the result from biomaRt::getBM(), with an attribute "version". Otherwise 0.
Install Brainarray custom CDFs for processing raw Affymetrix data. See http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp.
installCustomCdfPackages(pkgs, ver = 25, dryRun = FALSE)
pkgs |
Character vector of package names, e.g., "hgu133ahsentrezgcdf". |
ver |
Integer version number (25 as of 5 Jan 2021). |
dryRun |
Logical indicating whether to skip the installation, so that the packages are not actually installed. |
A character vector of URLs, invisibly.
This function installs and configures the various programs required for seeker to fetch and process RNA-seq data.
installSysDeps( sraToolkitDir, minicondaDir, refgenieDir, rprofileDir, minicondaEnv = "seeker", refgenieGenomes = NULL, fastqscreenDir = NULL )
sraToolkitDir |
String indicating directory in which to install the
SRA Toolkit. Recommended to use "~",
the home directory. If |
minicondaDir |
String indicating directory in which to install
Miniconda. Recommended
to use "~", the home directory. If |
refgenieDir |
String indicating directory in which to store the
directory of genome assets from refgenie, which will be named
"refgenie_genomes". Recommended to use "~", the home directory. Only used
if |
rprofileDir |
String indicating directory in which to create or modify .Rprofile, which is run by R on startup. Common options are "~" or ".". |
minicondaEnv |
String indicating name of the Miniconda environment in which to install various conda packages (fastq-screen, fastqc, multiqc, pigz, refgenie, salmon, and trim-galore). |
refgenieGenomes |
Character vector indicating genome assets, such as
transcriptome indexes for |
fastqscreenDir |
String indicating directory in which to download the
genomes for |
NULL, invisibly.
This function calls multiqc using system2().
multiqc( parentDir = ".", outputDir = "multiqc_output", cmd = "multiqc", args = NULL )
parentDir |
Directory that contains output to be aggregated. |
outputDir |
Directory in which to store output. Will be created if it doesn't exist. |
cmd |
Name or path of the command-line interface. |
args |
Additional arguments to pass to the command-line interface. |
An exit code, invisibly.
This function calls salmon using system2(). To run in parallel, register a parallel backend, e.g., using doParallel::registerDoParallel().
salmon( filepaths, samples, indexDir, outputDir = "salmon_output", cmd = "salmon", args = c("-l A -q --seqBias --gcBias --no-version-check -p", foreach::getDoParWorkers()), compress = TRUE )
filepaths |
Paths to fastq files. For single-end reads, each element should be a single filepath. For paired-end reads, each element should be two filepaths separated by ";". |
samples |
Corresponding sample names for fastq files. |
indexDir |
Directory that contains salmon index. |
outputDir |
Directory in which to store output. Will be created if it doesn't exist. |
cmd |
Name or path of the command-line interface. |
args |
Additional arguments to pass to the command-line interface. |
compress |
Logical indicating whether to gzip the quantification file (quant.sf) from salmon. Does not affect downstream analysis. |
A vector of exit codes, invisibly.
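Since samples must line up with filepaths element by element, one way to keep them in sync is to derive the sample names from the file names. The sketch below assumes a hypothetical naming scheme ending in "_1.fq.gz" for the first mate; the paths and the index directory are placeholders, and the salmon() call is switched off by default.

```r
# Hypothetical trimmed paired-end files, mates joined by ";".
filepaths = c("trim/sampleA_1.fq.gz;trim/sampleA_2.fq.gz",
              "trim/sampleB_1.fq.gz;trim/sampleB_2.fq.gz")

# Derive one sample name per element from the first mate's file name.
firstMate = vapply(strsplit(filepaths, ";", fixed = TRUE), `[`, "", 1)
samples = sub("_1\\.fq\\.gz$", "", basename(firstMate))

# Sanity checks: one unique name per element of filepaths.
stopifnot(length(samples) == length(filepaths), !anyDuplicated(samples))

runSalmon = FALSE  # set to TRUE once real files and an index exist
if (runSalmon && requireNamespace("seeker", quietly = TRUE)) {
  exitCodes = seeker::salmon(
    filepaths, samples, indexDir = "path/to/salmon_index")
}
```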
This function selectively performs various steps to process RNA-seq data. See also the vignettes: browseVignettes('seeker').
seeker(params, parentDir = ".", dryRun = FALSE)
params |
Named list of parameters with components:
|
parentDir |
Directory in which to store the output, which will be a
directory named according to |
dryRun |
Logical indicating whether to check the validity of inputs without actually fetching or processing any data. |
Path to the output directory parentDir/params$study, invisibly.
fetchMetadata(), fetch(), trimgalore(), fastqc(), salmon(), multiqc(), tximport(), installSysDeps(), seekerArray()
## Not run:
doParallel::registerDoParallel()
params = yaml::read_yaml('my_params.yaml')
seeker(params)
## End(Not run)
This function fetches data and metadata from NCBI GEO and ArrayExpress, processes raw Affymetrix data using RMA and custom CDFs from Brainarray, and maps probes to genes. See also the vignettes: browseVignettes('seeker').
seekerArray( study, geneIdType, platform = NULL, parentDir = ".", metadataOnly = FALSE )
study |
String indicating the study accession and used to name the output
directory within |
geneIdType |
String indicating whether to map probes to gene IDs from Ensembl ("ensembl") or Entrez ("entrez"). |
platform |
String indicating the GEO-based platform accession for the raw
data. See https://www.ncbi.nlm.nih.gov/geo/browse/?view=platforms.
Only necessary if |
parentDir |
Directory in which to store the output, which will be a
directory named according to |
metadataOnly |
Logical indicating whether to only process the sample metadata, and skip processing the expression data. |
The standard output:
naive_expression_set.qs: Initial ExpressionSet generated by GEOquery::getGEO() or ArrayExpress::ae2bioc(). Should generally not be used if sample_metadata.csv and gene_expression_matrix.qs are available.
sample_metadata.csv: Table of sample metadata. Column sample_id matches colnames of the gene expression matrix.
gene_expression_matrix.qs: Rows correspond to genes, columns to samples. Expression values are log2-transformed.
custom_cdf_name.txt: Name of custom CDF package used by affy::justRMA() to process and normalize raw Affymetrix data and map probes to genes.
feature_metadata.qs: GPL object, if gene expression matrix was generated from processed data.
probe_gene_mapping.csv.gz: Table of probes and genes, if gene expression matrix was generated from processed data.
"raw" directory: Contains raw Affymetrix files.
params.yml: Parameters used to process the dataset.
session.log: R session information.
The output may include other files from NCBI GEO or ArrayExpress. Files with extension "qs" can be read into R using qs::qread().
Path to the output directory parentDir/study, invisibly.
## Not run:
seekerArray('GSE25585', 'entrez')
## End(Not run)
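Once seekerArray() has finished, the two main outputs can be loaded and cross-checked roughly as follows. This is a sketch under the file names and layout described above; `outputPaths` and `samplesMatch` are hypothetical helpers, and the loading step only runs if the files and the qs package are present.

```r
# Hypothetical helper: paths to the two main outputs of seekerArray().
outputPaths = function(study, parentDir = ".") {
  outputDir = file.path(parentDir, study)
  c(metadata = file.path(outputDir, "sample_metadata.csv"),
    expression = file.path(outputDir, "gene_expression_matrix.qs"))
}

# Hypothetical helper: do the sample_id values match the matrix columns?
samplesMatch = function(metadata, emat) {
  setequal(metadata$sample_id, colnames(emat))
}

paths = outputPaths("GSE25585")
if (all(file.exists(paths)) && requireNamespace("qs", quietly = TRUE)) {
  metadata = read.csv(paths[["metadata"]])
  emat = qs::qread(paths[["expression"]])
  stopifnot(samplesMatch(metadata, emat))
}
```

Checking the match up front avoids silently misaligned samples when the metadata is later filtered or reordered.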
This function calls trim_galore using system2(), and is only designed to handle standard adapter/quality trimming. To run in parallel, register a parallel backend, e.g., using doParallel::registerDoParallel().
trimgalore( filepaths, outputDir = "trimgalore_output", cmd = "trim_galore", args = NULL, pigzCmd = "pigz" )
filepaths |
Paths to fastq files. For single-end reads, each element should be a single filepath. For paired-end reads, each element should be two filepaths separated by ";". |
outputDir |
Directory in which to store output. Will be created if it doesn't exist. |
cmd |
Name or path of the command-line interface. |
args |
Additional arguments to pass to the command-line interface. Output files will always be compressed. Arguments "--gzip", "--cores", "-j", and "--basename" are not allowed. Arguments "-o" and "--paired" should not be specified here. |
pigzCmd |
String for pigz command, which will gzip the output files. |
A vector of exit codes, invisibly.
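The restrictions on args can be checked before calling trimgalore(). The sketch below is a simple pre-flight check, not the validation that trimgalore() itself performs; `findBadArgs` is a hypothetical helper, and it only catches flags written as separate whitespace-delimited tokens (a form such as "--cores=4" would slip through).

```r
# Flags that the args description says are not allowed or should not be given.
notAllowed = c("--gzip", "--cores", "-j", "--basename")
discouraged = c("-o", "--paired")

# Hypothetical helper: return any offending flags found in args.
findBadArgs = function(args) {
  tokens = as.character(unlist(strsplit(as.character(args), "\\s+")))
  intersect(tokens, c(notAllowed, discouraged))
}

badArgs = findBadArgs(c("--quality 20", "--cores", "4"))
print(badArgs)

noArgs = findBadArgs(NULL)  # the default, args = NULL, has nothing to flag
print(length(noArgs))
```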
This function uses the tximport package.
tximport( inputDir, tx2gene, samples = NULL, outputDir = "data", type = c("salmon", "kallisto"), countsFromAbundance = "lengthScaledTPM", ignoreTxVersion = TRUE, ... )
inputDir |
Directory that contains the quantification directories. |
tx2gene |
|
samples |
Names of quantification directories to include. |
outputDir |
Directory in which to save the result, a file named
"tximport_output.qs", using |
type |
Passed to |
countsFromAbundance |
Passed to |
ignoreTxVersion |
Passed to |
... |
Additional arguments passed to |
A list, as returned by tximport::tximport(), invisibly.