REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Last update: Jul 6, 2024

Motivation
Installation
- Requirements
- Quick start
Output
Beta options
- log counts/quantized counts
- input paired-end reads (to bcalm)
Reproduce the manuscript's results
Citation and resources
Advanced FAQ

Motivation

REINDEER builds a data structure that indexes k-mers and their abundances in a collection of datasets (raw RNA-seq or metagenomic reads, for instance). A sequence (FASTA) can then be queried for its presence and abundance in each indexed dataset. While other tools (e.g. SBT, BIGSI) were also designed for large-scale k-mer presence/absence queries, retrieving abundances was so far unsupported (except for single datasets, e.g. using k-mer counters such as KMC or Jellyfish). REINDEER combines fast queries, small index size, and low memory footprint during indexing and queries. We showed that it can index 2585 RNA-seq datasets (~4 billion k-mers) using less than 60 GB of RAM, with a final index smaller than 60 GB on disk. A REINDEER index can then either be queried on disk (low RAM usage) or be loaded in RAM for faster queries.

Note on presence/absence queries: REINDEER supports this type of query, although other data structures are better suited to this task. See for instance:

to name a few.

Installation

Requirements

GCC >= 4.8
CMAKE > 3.10.0

To install, first clone the project:

git clone --recursive https://github.com/kamimrcht/REINDEER.git

Then:

cd REINDEER

sh install.sh

or

cd REINDEER

make

Tests can be run with:

make test

Compilation tips

If compilation fails with Error: no such instruction, try replacing -march=native -mtune=native with -msse4 in the makefile. If this does not work, please file an issue.

Quick start

Have a look at the file-of-files format in test/fof_unitigs.txt. REINDEER assumes that unitig files have been created using BCALM. Provide a file of files listing the unitig files (FASTA), rather than the read files, with -f, and an output directory for the index files with -o. By default, index files are written to reindeer_index_files_ + date + a tag. Then build the index:

./Reindeer --index -f test/fof_unitigs.txt -o quick_out

To query, provide the FASTA query file (single line) to Reindeer using -q, along with the directory of index files generated during index construction (-l), and a writable path for the output (-o). By default, output is written to the query_result directory:

./Reindeer --query -q test/query_test.fa -l quick_out -o quick_out/result_test.txt

In this example, results should be in quick_out/result_test.txt.
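For larger collections, the file of files passed to -f can be generated with a short script. The sketch below is only an illustration: it assumes the file lists one unitig FASTA path per line (compare with test/fof_unitigs.txt before relying on it) and that the unitig files end in .unitigs.fa. The function name and the directory layout are hypothetical.

```python
from pathlib import Path


def write_fof(unitig_dir, fof_path, pattern="*.unitigs.fa"):
    """Write a file of files listing the unitig FASTA files in unitig_dir.

    Assumption: one absolute path per line; check test/fof_unitigs.txt
    for the layout REINDEER actually expects.
    Returns the number of files listed.
    """
    # Sort the matches so the dataset order is reproducible between runs.
    files = sorted(Path(unitig_dir).glob(pattern))
    lines = [str(f.resolve()) for f in files]
    Path(fof_path).write_text("".join(line + "\n" for line in lines))
    return len(lines)
```

For example, write_fof("unitigs", "fof_unitigs.txt") would list every matching file found in a (hypothetical) unitigs directory; the resulting file can then be passed to ./Reindeer --index with -f.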

Help:

./Reindeer --help

Output values format

When the output contains counts, four formats are available:

Format name	Count
raw (default)	k-mer abundances
sum	sum of all k-mer counts
average/mean	sum of all k-mer counts / number of k-mers
normalize	sum * 10^9 / total number of k-mers in the unitig file (BCALM)

The last 3 formats are available since version 1.4.

These formats are computed over all k-mers of the query sequence, so if length(query sequence) = k (a single k-mer), you will get sum = average.
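To make the relation between the formats concrete, here is a small illustrative sketch (not REINDEER's own code) that derives the three summary values from the raw per-k-mer abundances of one query in one dataset. The function name, the example counts, and the total of 2,000,000 k-mers are hypothetical, and the normalization factor is read as 10^9.

```python
def summarize(kmer_counts, total_kmers_in_unitig_file):
    """Derive sum, average and normalized values from raw k-mer abundances.

    kmer_counts: abundance of each k-mer of the query in one dataset
                 (hypothetical values, as in the raw format).
    total_kmers_in_unitig_file: total number of k-mers in that dataset's
                 unitig file (assumed to be known beforehand).
    """
    total = sum(kmer_counts)
    average = total / len(kmer_counts)
    # Assumption: the normalization factor in the table is 10^9.
    normalized = total * 10**9 / total_kmers_in_unitig_file
    return total, average, normalized


# A query with three k-mers, in a dataset of 2,000,000 k-mers.
print(summarize([4, 4, 6], 2_000_000))

# A query of length k holds a single k-mer, so sum equals average.
total, average, _ = summarize([5], 2_000_000)
print(total == average)  # True
```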

k-mer abundances (format raw)

Let's say that the query is 51 bases long and we look for 31-mers. There are 21 k-mers in the query, from k-mer 0 to k-mer 20. The output of REINDEER looks like:

Why can we observe different values for k-mers in a single query? I answer this question in the advanced FAQ.
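The k-mer numbering in the 51-base example above follows from sliding a window of length k along the query: a sequence of length L contains L - k + 1 k-mers, so 51 - 31 + 1 = 21. A minimal sketch, with a made-up 51-base query and a hypothetical helper name:

```python
def kmers(sequence, k):
    """Return (index, k-mer) pairs for every window of length k."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


query = "ACGT" * 12 + "ACG"  # hypothetical 51-base query
windows = kmers(query, 31)

print(len(windows))                    # 21
print(windows[0][0], windows[-1][0])   # 0 20
```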

Index and query k-mers presence/absence only (no abundances)

By default, REINDEER records k-mer abundances in each input dataset. To record k-mer presence/absence instead of abundance per indexed dataset, use the --nocount option.

./Reindeer --index -f test/fof_unitigs.txt --nocount -o index_nocount

./Reindeer --query -l index_nocount -q test/query_test.fa --nocount

Beta options

log counts/quantized counts

./Reindeer --index --log-count -f fof_unitigs.txt

./Reindeer --index --quantization -f test/fof_unitigs.txt

input paired-end reads (to bcalm)

./Reindeer --index --paired-end --bcalm -f fof.txt

Reproduce the manuscript's results

We provide a page with scripts to reproduce the results shown in our manuscript.

Citation and resources

REINDEER has been published in Bioinformatics. It was presented at the ISMB conference in 2020. Main developer: Camille Marchet.

Access to the preprint: REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets.

Presentation recorded during BiATA 2020.

Citations:

@inproceedings{marchet2020reindeer,
  title={{REINDEER}: efficient indexing of k-mer presence and abundance in sequencing datasets},
  author={Marchet, Camille and Iqbal, Zamin and Gautheret, Daniel and Salson, Mika{\"e}l and Chikhi, Rayan},
  booktitle={28th Intelligent Systems for Molecular Biology (ISMB 2020)},
  year={2020},
  doi={10.1101/2020.03.29.014159}
}
@article{marchet2020reindeer,
  title={REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets},
  author={Marchet, Camille and Iqbal, Zamin and Gautheret, Daniel and Salson, Mika{\"e}l and Chikhi, Rayan},
  journal={Bioinformatics},
  volume={36},
  number={Supplement\_1},
  pages={i177--i185},
  year={2020},
  publisher={Oxford University Press}
}

Advanced FAQ

Why can we observe different values for k-mers in a single query?

Changelog

Notes on last release (1.4.6)

Major

Default disk mode (index written on disk and disk queries)
Dependency on C++17
Output now supports different k-mer counting modes (sum, average, normalized)

Minor

Change of index file names
Correction of various bugs
Switched to object implementation
Some modifications to pass the input/output files
Implementation of code for socket mode [beta]
Allowing lowercase in query files
Multi-threading disabled in query mode

See Changelog file.

Version 1.4