Software for colored k-mer sets (SOCKS)

The SOCKS interface (Software for colored k-mer sets) defines a common set of core features and standard input and output formats that software tools in computational pangenomics should implement. It aims to make these tools easier to compare and to combine, for the benefit of both developers and users. A small example dataset, which follows the build input format and can be used throughout PanBench, is provided below.

  • build: construct an index from a set of sequences

    • input: plain text file containing the sequence file names, e.g.:
      COLOR_NAME_1: /PATH/TO/GENOME.FASTA
      COLOR_NAME_2: /PATH/TO/READ_1.FASTQ /PATH/TO/READ_2.FASTQ
    • output: index in binary or interoperable format, e.g. kmer file format
      (at least one of the options, binary or interoperable format, should be provided)
      (if both options are provided, it should be possible to switch using a parameter)

  • lookup-kmer: find the color sets for a list of k-mers

    • input: plain text file containing the k-mers, one per line, e.g.:
      ACGTACGT
      ACCTAGGT
    • output: plain text file listing the color set for each k-mer, e.g.:

      (as list of positive hits)
      ACGTACGT: COLOR_NAME_1 COLOR_NAME_4 COLOR_NAME_7 ...
      ACCTAGGT: COLOR_NAME_1 COLOR_NAME_5 COLOR_NAME_8 ...
      (or as a binary vector)
      ACGTACGT: 10010010...
      ACCTAGGT: 10001001...
      (at least one of the options, positive hits or binary vector, should be provided)
      (if both options are provided, it should be possible to switch using a parameter)

  • lookup-color: find the k-mer sets for a list of colors

    • input: plain text file containing color names, one per line, e.g.:
      COLOR_NAME_1
      COLOR_NAME_2
    • output: plain text file listing the k-mer set for each color, e.g.:

      (as list of positive hits)
      COLOR_NAME_1: ACCTAGGT ACGTACGT CAAGCGTA ...
      COLOR_NAME_2: AGCTAGCT AGGTACCT CAGGCATA ...
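
The plain-text formats above can be read and converted with a few lines of code. The following Python sketch is illustrative only and is not part of the SOCKS definition: the function names are hypothetical, and it assumes that the positions of a binary vector follow the order in which the colors appear in the build input file, which the description above does not specify.

```python
from typing import Dict, List, Tuple


def parse_build_input(text: str) -> Dict[str, List[str]]:
    """Map each color name to its list of sequence file paths."""
    colors: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # Split at the first colon only: name on the left, paths on the right.
        name, _, paths = line.partition(":")
        colors[name.strip()] = paths.split()
    return colors


def parse_lookup_line(line: str) -> Tuple[str, List[str]]:
    """Split a 'KEY: VALUE VALUE ...' line from a positive-hits listing."""
    key, _, values = line.partition(":")
    return key.strip(), values.split()


def hits_to_vector(hits: List[str], color_order: List[str]) -> str:
    """Encode a list of positive hits as a binary vector string.

    Assumption: bit i corresponds to color_order[i].
    """
    present = set(hits)
    return "".join("1" if color in present else "0" for color in color_order)


def vector_to_hits(vector: str, color_order: List[str]) -> List[str]:
    """Decode a binary vector string back into a list of color names."""
    return [color for color, bit in zip(color_order, vector) if bit == "1"]


if __name__ == "__main__":
    build_text = (
        "COLOR_NAME_1: /PATH/TO/GENOME.FASTA\n"
        "COLOR_NAME_2: /PATH/TO/READ_1.FASTQ /PATH/TO/READ_2.FASTQ\n"
    )
    colors = parse_build_input(build_text)
    order = list(colors)  # dicts preserve insertion order in Python 3.7+
    kmer, hits = parse_lookup_line("ACGTACGT: COLOR_NAME_2")
    print(kmer + ": " + hits_to_vector(hits, order))
```

A tool that offers both lookup-kmer output options could use a conversion of this kind behind its switching parameter, as long as the color order is fixed and documented.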

Example dataset