kcollections
Last update: May 29, 2026
In-memory Python k-mer sets, dicts, and counters that use far less RAM than built-in set/dict.
Built for research prototyping and light production pipelines where you want normal Python APIs and can keep indexes in RAM. For billion-scale on-disk indexes, use KMC or Jellyfish instead (see docs/USAGE.md).
# macOS
brew install cmake
pip install kcollections
# Linux (Debian/Ubuntu)
sudo apt-get install -y cmake build-essential
pip install kcollections
From source: pip install . · Dev: pip install -e ".[dev]"
Requires Python 3.10+, CMake 3.18+, C++17 (no Boost).
from kcollections import Kset, Kdict, Kcounter
# Unique k-mers
ks = Kset(27)
ks.add_seq("AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA")
ks.save("index.kc")
ks2 = Kset.from_file("index.kc")
# K-mer → value
kd = Kdict(int, 27)
kd["AAACTGTCTTCCTTTATTTGTTCAGGG"] = 1
# Abundance
kc = Kcounter(10)  # k=10: the 22-base sequence below is too short to contain any 27-mers
kc.add_seq("AAACTGTCTTCAAACTGTCTTT")
print(kc.most_common(5))
More examples: examples/ · Full guide: docs/USAGE.md · Upgrading: MIGRATION.md
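To make the quick-start calls above more concrete, here is a minimal pure-Python sketch of the sliding-window decomposition that "adding a sequence" implies: a sequence of length n contains n - k + 1 overlapping k-mers, and none at all when n < k. The helper name `iter_kmers` is hypothetical and is not part of kcollections; the sketch uses only built-in `set` and `collections.Counter` as stand-ins and says nothing about how the library stores k-mers internally.

```python
from collections import Counter
from typing import Iterator


def iter_kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every length-k window of seq, left to right (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # range() is empty when len(seq) < k, so short sequences yield nothing.
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


# The 39-base quick-start sequence contains 39 - 27 + 1 = 13 windows of length 27.
seq = "AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA"
kmers = list(iter_kmers(seq, 27))
unique_kmers = set(kmers)

# Counting with a smaller k (10, chosen here so that a repeat is visible).
short_seq = "AAACTGTCTTCAAACTGTCTTT"
counts = Counter(iter_kmers(short_seq, 10))

# A sequence shorter than k contributes no k-mers.
too_short = list(iter_kmers(short_seq, 27))
```

Checking `len(seq) >= k` before indexing a sequence is a cheap way to catch inputs that would silently add nothing.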
Conda: recipe in conda-recipe/ · Releases: RELEASE.md (tag → PyPI + GitHub release).
ks = Kset(27)
ks.parallel_add_init(16) # thread count: power of 2
ks.parallel_add_seq(dna)
ks.parallel_add_join()

| Task | Tool |
|---|---|
| Python research, typed k-mer metadata | kcollections |
| RAM-heavy but < ~100M k-mers | built-in set may suffice |
| Huge on-disk k-mer DB | KMC, Jellyfish, Cuttlefish |
| MinHash / sketching | sourmash |
- Prefer save()/load() or Kset.from_file(path).
- Pin the package version; binary indexes use kcollections-v2 (see MIGRATION.md). Re-save indexes when upgrading major versions.
pip install -e ".[dev]"
pytest tests -q
Uses less memory than a Python set for large k-mer sets (27-mers, human genome; see figures in repo). Insertion is slower than with set; memory is the tradeoff this library optimizes.
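For intuition on why a built-in set of k-mer strings is memory-hungry, the sketch below compares the size of one 27-mer as a CPython `str` with the 2 bits per base that the four-letter DNA alphabet strictly needs. This is a general illustration only: the 2-bit packing shown here is an assumption made for the example and is not a description of kcollections' internal Bloom Filter Trie layout. The helper names are hypothetical.

```python
import sys

# 2-bit code per base; assumes an uppercase A/C/G/T alphabet only.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Invert pack_kmer for a k-mer of known length k."""
    bases = []
    for _ in range(k):
        bases.append(_BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


kmer = "AAACTGTCTTCCTTTATTTGTTCAGGG"       # the 27-mer from the quick start
packed = pack_kmer(kmer)
packed_bytes = (2 * len(kmer) + 7) // 8     # 54 bits round up to 7 bytes
str_bytes = sys.getsizeof(kmer)             # CPython object size, excluding the set slot
```

On CPython, `str_bytes` is several times `packed_bytes` before any per-entry set overhead is counted, which is the kind of gap compact k-mer structures aim to close.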
Bloom Filter Trie extension and library: Fujimoto & Lyman; also cite Holley et al., Bloom Filter Trie.
GPLv3 (see LICENSE).

