Software for colored k-mer sets (SOCKS)
The SOCKS interface (Software for colored k-mer sets) defines a common set of core features and standard input and output formats that software tools in computational pangenomics should implement. It aims to enhance the comparability and interoperability of these tools for the benefit of both developers and users. A small example dataset, which adopts the input format and can be used throughout PanBench, is provided below.
- build: construct an index from a set of sequences
  - input: plain text file containing the sequence file names, e.g.:
      COLOR_NAME_1: /PATH/TO/GENOME.FASTA
      COLOR_NAME_2: /PATH/TO/READ_1.FASTQ /PATH/TO/READ_2.FASTQ
  - output: index in binary or interoperable format, e.g. kmer file format
    (at least one of the options, binary or interoperable format, should be provided)
    (if both options are provided, it should be possible to switch using a parameter)
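To illustrate the build input format, the following Python sketch parses such a file into a mapping from color name to sequence file paths. It assumes one color per line, that the first colon separates the color name from the paths, and that paths contain no whitespace. The function name `parse_build_input` is hypothetical and not part of SOCKS.

```python
from typing import Dict, List


def parse_build_input(text: str) -> Dict[str, List[str]]:
    """Map each color name to its list of sequence file paths.

    Assumed layout (not mandated beyond the example in the text):
    'COLOR_NAME: PATH [PATH ...]', one color per line.
    """
    colors: Dict[str, List[str]] = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the first colon only, so later colons stay in the paths.
        name, sep, paths = line.partition(":")
        name = name.strip()
        files = paths.split()
        if not sep or not name or not files:
            raise ValueError(f"line {line_no}: expected 'COLOR_NAME: PATH ...'")
        if name in colors:
            raise ValueError(f"line {line_no}: duplicate color {name!r}")
        colors[name] = files
    return colors


example = (
    "COLOR_NAME_1: /PATH/TO/GENOME.FASTA\n"
    "COLOR_NAME_2: /PATH/TO/READ_1.FASTQ /PATH/TO/READ_2.FASTQ\n"
)
parsed = parse_build_input(example)
```

A color may list several files (for example paired-end reads), which is why each value is a list rather than a single path.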
- lookup-kmer: find the color sets for a list of k-mers
  - input: plain text file containing the k-mers, one per line, e.g.:
      ACGTACGT
      ACCTAGGT
  - output: plain text file listing the color set for each k-mer, e.g.:
    (as list of positive hits)
      ACGTACGT: COLOR_NAME_1 COLOR_NAME_4 COLOR_NAME_7 ...
      ACCTAGGT: COLOR_NAME_1 COLOR_NAME_5 COLOR_NAME_8 ...
    (as bit vector)
      ACGTACGT: 10010010...
      ACCTAGGT: 10001001...
    (if both options are provided, it should be possible to switch using a parameter)
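The two lookup-kmer output variants carry the same information once the order of the colors is fixed. The Python sketch below converts between them, assuming that the i-th character of the 0/1 string refers to the i-th color in an agreed order. This matches the example above but is an assumption, since the ordering is not specified here. The function names and the `color_order` list are illustrative.

```python
from typing import List, Sequence


def hits_to_bits(hits: Sequence[str], color_order: Sequence[str]) -> str:
    """Encode a list of positive hits as a 0/1 string over color_order."""
    present = set(hits)
    unknown = present - set(color_order)
    if unknown:
        raise ValueError(f"unknown colors: {sorted(unknown)}")
    return "".join("1" if color in present else "0" for color in color_order)


def bits_to_hits(bits: str, color_order: Sequence[str]) -> List[str]:
    """Decode a 0/1 string back into the list of colors that are set."""
    if len(bits) != len(color_order):
        raise ValueError("bit string length must equal the number of colors")
    if set(bits) - {"0", "1"}:
        raise ValueError("bit string may only contain '0' and '1'")
    return [color for color, bit in zip(color_order, bits) if bit == "1"]


# Eight hypothetical colors, truncating the '...' of the example above.
color_order = [f"COLOR_NAME_{i}" for i in range(1, 9)]

bits = hits_to_bits(["COLOR_NAME_1", "COLOR_NAME_4", "COLOR_NAME_7"], color_order)
hits = bits_to_hits("10001001", color_order)
```

With these eight colors, `bits` is `"10010010"` and `hits` is `["COLOR_NAME_1", "COLOR_NAME_5", "COLOR_NAME_8"]`, mirroring the two example k-mers.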
- lookup-color: find the k-mer sets for a list of colors
  - input: plain text file containing color names, one per line, e.g.:
      COLOR_NAME_1
      COLOR_NAME_2
  - output: plain text file listing the k-mer set for each color, e.g.:
    (as list of positive hits)
      COLOR_NAME_1: ACCTAGGT ACGTACGT CAAGCGTA ...
      COLOR_NAME_2: AGCTAGCT AGGTACCT CAGGCATA ...
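Conceptually, lookup-color is the inverse of lookup-kmer. The following Python sketch shows this on a toy in-memory mapping from k-mers to colors and formats the result like the output example above. It is not how an actual index would answer the query. The function name `lookup_color`, the dictionary layout, and the choice to sort the k-mers are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def lookup_color(kmer_colors: Dict[str, Iterable[str]],
                 queried_colors: Iterable[str]) -> List[str]:
    """Return one 'COLOR: KMER KMER ...' line per queried color."""
    # Invert the k-mer -> colors mapping into color -> k-mers.
    color_kmers = defaultdict(set)
    for kmer, colors in kmer_colors.items():
        for color in colors:
            color_kmers[color].add(kmer)

    lines = []
    for color in queried_colors:
        # Sorting is an assumption; it only makes the output deterministic.
        kmers = sorted(color_kmers.get(color, ()))
        lines.append(f"{color}: {' '.join(kmers)}".rstrip())
    return lines


toy_index = {
    "ACGTACGT": ["COLOR_NAME_1"],
    "ACCTAGGT": ["COLOR_NAME_1"],
    "AGCTAGCT": ["COLOR_NAME_2"],
}
output_lines = lookup_color(toy_index, ["COLOR_NAME_1", "COLOR_NAME_2"])
```

A color that does not occur in the index yields a line with an empty k-mer list in this sketch; how a real tool should report unknown colors is not specified here.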