RAMBO: Repeated And Merged Bloom Filter for Multiple Set Membership Testing (MSMT) in Sub-linear time.
from RUSH-LAB/RAMBO
Last update: Aug 18, 2020
Approximate set membership is a common problem with wide applications in databases, networking, and search. Given a set S and a query q, the task is to determine whether q in S. For Multiple Set Membership Testing (MSMT) problem, we are given K different sets, and for any given query q the goal is the find all of the sets containing the query element. An example of MSMT problem is gene sequence searching in multiple number of files. The state of art sequence searching process is either very slow or it has database with impractical memory requirement.
RAMBO solves this problem by achieving sublinear query time (O(\sqrt{K} log K)) in number of files with memory requirement of slightly more then the information theoretical limit.
This code is the implementation of: https://arxiv.org/abs/1910.02611 for gene sequence search.
- Automake
- g++ >= 8
To build RAMBO, first clone the repository with all submodules
git clone --recurse-submodules https://github.com/RUSH-LAB/RAMBO.git
Next, cd
into the RAMBO/
repository and run the following build commands:
./configure
make
make install
The rambo
binary will be available in bin/
The rambo
binary has 3 main subcommands, build
, insert
, and query
. For universal flags and settings such as how many threads to use, see rambo --help
rambo build
and rambo insert
both accept a list of input files as their first argument. Following the list of files, the user can specify the RAMBO parameters (see rambo build --help
for more details) as well as the output directory.
RAMBO will treat each file as an input set and will name each set by its file stem. For example, building a RAMBO index via
rambo build kmers1.txt kmers2.txt kmers3.txt -o example_kmers
will build an index in the example_kmers/
directory consiting of the 3 input files. Alternatively, the user can use a regular expression to input their list of files:
rambo build *.txt -o example_kmers
The syntax for inserting sets into an existing RAMBO index is similar. See the --help
output for more details.
The syntax for querying a RAMBO index is similar to that of build
and insert
. For example, if I wanted to query the example_kmers
index for each kmer in query.txt
, I would run
rambo query query.txt --database example_kmers
The output of the above command would be stored in query_results.txt
. Users can change this by providing an -o,--output
flag followed by the desired output prefix.
For experiment we use the Whole Genome Sequence (WGS) dataset as used by [1]. It is bacterial, viral and parasitic WGS datasets in the European Nucleotide Archive (ENA) as of December 2016. The total size of data is ~170 TB. It is divided into 100 parts and indexed randomly.
Requirements:
- GCC version >= 6.2.0
- Install latest GNU parallel OS X:
brew install parallel
Debian/Ubuntu:
sudo apt-get install parallel
RedHat/CentOS:
sudo yum install parallel
-
Install wget and bzip2
-
Install cortexpy Refer to this installation [document] (https://cortexpy.readthedocs.io/en/latest/overview.html#installation)
-
Disk memory > 2.5 TB
-
RAM > 160 GB
To download dataset run the script RAMBO/data/0/download.sh
sh download.sh
It downloads data files in RAMBO/data/0/inflated (ensure this path is present before downloading). Once it downloads the batch of 100 data files.