Running a MAP function
Zorn/Bascet is designed to let you run all kinds of software on your cells. The input can be the raw reads, the contigs, or any other data that you have produced. Because these operations can be computationally intensive, all of this happens through the MAP framework.
Here is an example of invoking the built-in QUAST script to produce quality metrics of each assembled genome:
BascetMapCell(
  bascetRoot,
  withfunction = "_quast",
  inputName = "skesa",
  outputName = "quast"
)
Aggregating MAP results
Once you have run your MAP function, you will most likely want to load the results into R. We call this procedure “aggregate”. In the case of QUAST, it loads all quality metrics into an R data.frame object:
quast_aggr <- MapListAsDataFrame(BascetAggregateMap(
  bascetRoot,
  inputName = "quast",
  aggr.quast
))
Arguments to MAP functions
Some scripts require additional arguments (such as a link to a database file). These are passed by setting the args argument, which sets environment variables whose contents can be picked up by the script.
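On the script side, picking up such environment variables could look roughly like the sketch below. The variable names DATABASE_DIR and MIN_CONTIG_LEN are hypothetical and only serve as illustration; use whatever names your script and your args setting agree on.

```shell
# Sketch: read settings that were handed over as environment variables.
# ASSUMPTION: DATABASE_DIR and MIN_CONTIG_LEN are made-up names for this example.
read_settings() {
    # A required setting: fail with a clear message if it is missing.
    if [ -z "${DATABASE_DIR:-}" ]; then
        echo "DATABASE_DIR is not set" >&2
        return 1
    fi

    # An optional setting: fall back to a default value.
    min_len="${MIN_CONTIG_LEN:-500}"

    echo "database=$DATABASE_DIR min_len=$min_len"
}
```

Checking required variables early gives a readable error message instead of a tool failing halfway through a cell.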
Custom MAP functions - introduction
It is easy to add new functions! The easiest way is to copy and modify the code of an existing script. You can start from either:
* QUAST, which takes contigs as input
* SKESA, which takes FASTQ as input
Once you have written your script, you invoke it with a direct path:
BascetMapCell(
  bascetRoot,
  withfunction = "/path/to/your/script.sh",
  inputName = "...",
  outputName = "..."
)
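Before pointing BascetMapCell at a new script, it can help to check it by hand. The sketch below assumes that the script file must be executable and that it prints its answers on standard output; neither is confirmed here. The helper name check_map_script is hypothetical, while the two flags it uses are the ones described in the details section further down.

```shell
# Sketch: sanity-check a MAP script from the command line.
# ASSUMPTION: the script is run directly, so it needs the executable bit.
check_map_script() {
    script="$1"

    if [ ! -x "$script" ]; then
        echo "not executable: $script (try chmod +x)" >&2
        return 1
    fi

    # Ask the script for its API version and run its dependency check.
    api=$("$script" --bascet-api) || return 1
    check=$("$script" --preflight-check) || return 1

    echo "api=$api preflight=$check"
}
```

For example, `check_map_script /path/to/your/script.sh` should print a line ending in `preflight=MAPCELL-CHECK` if the dependency check passes.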
In most cases you will want to write your own aggregate function. This function takes the output from your tool, parses it, and puts it in a sensible R object. Have a look at the existing aggregate functions for inspiration.
There is also a catch-all aggregate function, which is called in a slightly different way. The example below takes “out.txt”, generated by each tool, and stores the raw file content in a list. This is not pretty, but it may help you during debugging and development:
quast_aggr <- MapListAsDataFrame(BascetAggregateMap(
  bascetRoot,
  inputName = "..",
  aggr.raw("out.txt")
))
Custom MAP functions - details
If you look at any example MAP function, you will find that it is a BASH script that conforms to a certain pattern. In fact, it can be a script in any language, as long as it accepts the following command line arguments:
--bascet-api
The script returns the API version, also validating that it is a valid script for MAP calls
--expect-files
The script returns a list of what files to extract from the Bascet, for each cell. Here, “*” means to get everything. Asking for less means higher performance
--missing-file-mode
Tells Bascet what to do if the files are missing. “skip” means to just proceed with the next cell
--compression-mode
How to compress the output files. “default” means to compress. However, if your tool generates compressed files already, it is just a waste of time trying to do it again, in which case the script can return “uncompressed”.
--input-dir XXX
This is the directory where input files are located
--output-dir YYY
Where to store output to. This directory is already created
--num-threads ZZZ
How many threads to use for this particular process. Note that Bascet already calls multiple MAP scripts in parallel, so there is typically little benefit in making individual processes multithreaded
--recommend-threads
Return how many threads (at least 1) the job should get. This is used if the user runs mapcell but only specifies the total number of threads. Bascet will then try to allocate workers accordingly. Return 1 if your mapcell script does not support multithreading
--preflight-check
This is called only once, to check that the script has the software dependencies it needs. If so, it returns “MAPCELL-CHECK”
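Put together, the arguments above suggest a script shaped roughly like the following. This is a minimal sketch, not one of the bundled scripts. It assumes that “returns” means printing to standard output; the version string "1.0" and the file name "contigs.fa" are placeholders, and counting lines with wc stands in for a real tool. Check an existing script such as the QUAST one for the actual values.

```shell
#!/bin/sh
# Sketch of a custom MAP script. Everything marked ASSUMPTION is a placeholder.

map_main() {
    input_dir=""
    output_dir=""
    num_threads=1

    while [ "$#" -gt 0 ]; do
        case "$1" in
            # Query flags: print one answer and stop.
            --bascet-api)        echo "1.0"; return 0 ;;         # ASSUMPTION: version string
            --expect-files)      echo "contigs.fa"; return 0 ;;  # ASSUMPTION: file name; "*" asks for everything
            --missing-file-mode) echo "skip"; return 0 ;;
            --compression-mode)  echo "default"; return 0 ;;
            --recommend-threads) echo "1"; return 0 ;;           # this sketch is single-threaded
            --preflight-check)
                # Stand-in dependency check: this sketch only needs wc.
                command -v wc >/dev/null 2>&1 || return 1
                echo "MAPCELL-CHECK"
                return 0 ;;
            # Per-cell flags: each one takes a value.
            --input-dir|--output-dir|--num-threads)
                if [ "$#" -lt 2 ]; then
                    echo "missing value for $1" >&2
                    return 1
                fi
                case "$1" in
                    --input-dir)   input_dir="$2" ;;
                    --output-dir)  output_dir="$2" ;;
                    --num-threads) num_threads="$2" ;;
                esac
                shift 2 ;;
            *)
                echo "unknown argument: $1" >&2
                return 1 ;;
        esac
    done

    if [ -z "$input_dir" ] || [ -z "$output_dir" ]; then
        echo "need --input-dir and --output-dir" >&2
        return 1
    fi

    # Stand-in for the real tool: count the lines of the input file.
    # num_threads is parsed but unused in this single-threaded sketch.
    wc -l < "$input_dir/contigs.fa" > "$output_dir/out.txt"
}

# Dispatch only when arguments were given.
if [ "$#" -gt 0 ]; then
    map_main "$@"
fi
```

Writing the result to “out.txt” in the output directory matches the file name used in the catch-all aggregate example, so the output of this sketch could be inspected with aggr.raw("out.txt") during development.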