kcollections
Last update: Feb 10, 2020
A fast and efficient library for storing k-mers in c++ and python.
kcollections
is a python-like dict
and set
that efficiently store kmers as keys or elements of the collection.
It is designed for building/prototyping Bioinformatics tools that rely on kmers but where the included dict
and set
consume too much memory for use.
It implements the Bloom Filter Trie algorithm. This implementation differs from Guillaume et al. by allowing kmers of arbitrary size and by providing a generic dictionary/map data structure for associating arbitrary values with kmers.
kcollections
is currently only available for Python version 2.7.
We provide some pre-compiled binaries for python 2/3 and Linux and MacOS:
pip install kcollections
Alternatively, you can build from source.
Prerequisites include:
- CMake, must be installed manually in order to build the code.
- boost, at least version 1.65.1
- uint256_t, included in the repository.
- pybind11, included in the repository.
To build and install the python module from source:
git clone https://github.com/masakistan/kcollections.git
cd kcollections
# python 3
python3 setup.py install
# or
pip3 install .
# python 2
python2 setup.py install
# or
pip2 install .
Additionally, if you would like to access the functions from a C++ program, you can build static libraries with the following steps:
git clone https://github.com/masakistan/kcollections.git
cd kcollections
mkdir build
cd build
cmake ..
make
Kmers can be added one at a time with add
, but the fastest way to add kmers to a set is
to add a DNA sequence using add_seq
.
import kcollections
ks = kcollections.Kset(27)
# add single kmer
ks.add('AAACTGTCTTCCTTTATTTGTTCAGGG')
# sequence insertion
seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
ks.add_seq(seq, len(seq))
assert 'AAACTGTCTTCCTTTATTTGTTCAGGG' in ks
# iteration
for kmer in ks:
print kmer
ks.write("data.bs")
del ks
# sometime later
ks = Kcollections.Kset()
ks.read("data.bs")
The fastest way to use Kset
is to use multithreaded insertion.
Multithreaded approach is best used when all kmers are loaded upfront.
Kmers not accessible until the threads have been joined using
parallel_add_join
.
See the example below on how parallel and serial insertions can be used.
import kcollections
ks = kcollections.Kset(27)
# multithreaded sequence insertion
# nthreads must be a power of 2.
# nthreads = 4 or 16 work well
ks.parallel_add_init(16)
# insert a sequence of kmers
ks.parallel_add_seq(seq)
# insert a single kmer
ks.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')
# merge threads together
# no parallel add methods can be used after joining
ks.parallel_add_join()
# serial add is permissible after threads have joined
ks.add('AAACTGTCTTCCTTTATTTGTTCACAG')
# iteration
for kmer in ks:
print(kmer)
print len(ks)
import kcollections
kd = kcollections.Kdict(str, 27)
# insertion and value assignment
kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] = 'banana'
kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] = 'phone'
assert kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 'banana'
assert kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] == 'phone'
# iteration
for kmer, val in kd.items():
print(kmer, val)
# removal
del kd['AAACTGTCTTCCTTTATTTGTTCAGGT']
Parallel insertion for Kdict
requires the inclusion of a merging function to resolve different values for the same key.
The following snippet adds 27mers from a string of DNA using a provided lambda function to merge value conflicts.
This merge function simply keeps the newest value associated with the kmer.
More examples of merging functions with Kdict
can be found here.
kd = kcollections.Kdict(int, 27)
kd.parallel_add_init(4)
kd.set_merge_func(lambda prev_val, new_val: new_val)
kd.parallel_add_seq(dna, generate_idx(len(dna)))
kd.parallel_add_join()
Kcounter
is an implementation of the Python collection's
Counter,
but the keys must be kmers, of course!
Like Kdict
, kmers can be added to Kcounter
one at a time, but the
fastest ways to add kmers to a set is to add an DNA sequence using add_seq
(or
parallel_add_seq
for multithreaded inserts).
from kcollections import Kcounter
kc = Kcounter(27)
# add single kmer
kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1
# sequence insertion
seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
kc.add_seq(seq)
assert kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 2
# iteration
for kmer, count in kc.items():
print(kmer, count)
from kcollections import Kcounter
kc = Kcounter(27)
# multithreaded sequence insertion
# nthreads must be a power of 2.
# nthreads = 4 or 16 work well
kc.parallel_add_init(16)
# insert a sequence of kmers
kc.parallel_add_seq(seq)
# insert a single kmer
kc.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')
# merge threads together
# no parallel add methods can be used after joining
kc.parallel_add_join()
# updates can be made after the join
kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1
# iteration
for kmer, count in kc.items():
print(kmer, count)
kcollections
is quite a bit slower than the dict
or set
but is much more memory-efficient.
We measured memory usage and running time using /usr/bin/time -v
on aIntel(R) Xeon(R) E5-2650v4 @2.20GHz
with 256 GB RAM.
27mers used for testing were taken from the human genome, no repetitive kmers appear in our dataset providing a worst case scenario where no insertions or queries are pruned before traversing the entire data structure.
# kmers | kset |
set |
---|---|---|
1 million | 25.32 MB | 105.82 MB |
10 million | 63.74 MB | 906.96 MB |
100 million | 0.56 GB | 11.98 GB |
500 million | 2.42 GB | 48.54 GB |
1 billion | 4.43 GB | 97.07 GB |
1.5 billion | 6.44 GB | 191.61 GB |
2 billion | 8.44 GB | 194.14 GB |
2.4 billion | 10.08 GB | 220.06 GB |
Insertion time comparisons using built-in Python set, kcollections
serial and parallel insert.
An example read mapping algorithm and assembler are provided using kcollections
in the applications
directory.
kcollections
was built at the Computational Science Laboratory at Brigham Young University by Stanley Fujimoto (@masakistan) and Cole Lyman (@colelyman).
Funding was provided by the Utah NASA Space Grant Consortium and the BYU Graduate Research Fellowship.