SeqOthello

 from LiuBioinfo/SeqOthello

Last update: Sep 6, 2018

SeqOthello is an ultra-fast and memory-efficient indexing structure to support arbitrary sequence query against large collections of RNA-seq experiments. Taking a sequence as query input, SeqOthello returns either the total k-mer hits of the query sequence or the detailed presence/absence information of individual k-mers across all the indexed experiments.

A preprint of the paper describing SeqOthello is available here.

SeqOthello Installation

System Requirements

SeqOthello is tested on Linux platforms with the following system settings. The performance is optimized for Intel CPUs with SSE4.2 support.

  • cmake >= 2.8.4
  • gcc >= 4.9.1
  • zlib >= 1.2.3

Ubuntu

sudo apt install cmake build-essential zlib1g-dev

Fedora

sudo yum install gcc-c++ cmake zlib-dev

Mac OS 10.12. Please install cmake using brew

brew install cmake

Installation

  1. Clone the repository:

    git clone https://github.com/LiuBioinfo/SeqOthello.git
    
  2. Build and install:

    cd SeqOthello/
    ./compile.sh
    

    After successfully build the code, you can find SeqOthello toolchain at build/bin/.

    ls build/bin
    

Manual

Build SeqOthello with an example.

The construction of SeqOthello requires the input of a list of k-mer files, each of which contains the set of k-mers extracted from reads of the corresponding RNA-Seq experiment. Currently the k-mer file is generated by Jellyfish from fasta/fastq files.

For demonstration purpose, we provide an example/ project including 10 simulated pair-end RNA-seq experiments for testing. You can use the pipeline script example/build_and_query.sh to build the SeqOthello map for the sample project from the fastq files and query the example transcripts included in example/transcripts.fa

  1. Extract k-mer count using Jellyfish

    You may need to install Jellyfish first.

    First, use Jellyfish to generate k-mer files for the experiments included in experiments_list.10.txt and save them in the temporary folder tmp/kmers. This setup may take about 10 seconds. For the given example, we provide a shell scripts for this step STEP1_Jellyfish.sh. The You may execute the script in example folder:

    cd example
    ./STEP1_Jellyfish.sh
    
  2. Convert k-mer files to SeqOthello binary format.

    The second step is to convert k-mer files to SeqOthello binary format using PreProcess. For the given example, we provide a shell scripts for this step STEP2_Binary.sh. You may execute the script in example folder

    ./STEP2_Binary.sh
    
  3. Make SeqOthello group files In the third step, binary files are grouped by the Group tool, into small subsets for further process. Generally, each group contains approximately 50 samples. Since we only have 10 samples in this example, we build two groups in tmp/grp/, where Grp_00 contains experiments specified in tmp/binary_list.part00, which is corresponds to the first 5 lines of experiments_list.10.txt. The reset 5 examples are included in Grp_01. We put these two filenames in tmp/grp_list.

    For the given example, we provide a shell scripts for this step STEP3_Group.sh. You may execute the script in example folder

    ./STEP3_Group.sh
    
  4. Build SeqOthello mapping

    Now, we can build the SeqOthello mapping between the entire set of k-mers and their experiment ids using the Build tool. The indexes groups are speicified by --flist parameter, order of group files essentially determines the orders of the experiments stored in SeqOthello. You may execute the following command in example folder

    mkdir -p map
    
    ../build/bin/Build --flist=tmp/grp_list --grp-folder=tmp/grp/ --out-folder=map/
    
    

    And then you will find the SeqOthello mapping in the map folder. The file map.xml provides the metadata of the SeqOthello structure.

  5. An example of adding new experiments to existing SeqOthello.

    In some cases the users may choose to add additional experiments to exisiting SeqOthello mapping. We recommend the users keep the group files generated during SeqOthello construction, so that it can be reused for further reconstruction.

    For example, in addition to the 10 experiments, we have 3 new kmer files generated by Jellyfish. Execute this and you will find the additional example kmer files in tmp/kmers/ STEP_additional_1.sh

    Similarly, we need to convert these k-mer files into binary files and then convert the binary files to group files. Execute these commands. STEP_additional_2.sh STEP_additional_3.sh

    mkdir -p mapa
    
     ../build/bin/Build --flist=tmp/grp_list_a --grp-folder=tmp/grp/ --out-folder=mapa/
    

Build SeqOthello with custom experiments

To build SeqOthello with custom experiments, please use Jellyfish you may use the script generator genBuildFromJellyfishKmers.sh

    ./genBuildFromJellyfishKmers.sh

For all prompted input questions, use the default value for the example data set, or, enter custom values for custom experiments. Then three scripts will be generated and you can execute them one by one.

    ./ConvertToBinary.sh   
    ./MakeGroup.sh
    ./BuildSeqOthello.sh

Transcripts Query

SeqOthello Query takes .fa files as input and can generate output in the following two format.

The file transcripts.fa contains 3 human transcript sequences and are used in the following example to demonstrate query results.

grep "ENST" transcripts.fa
>ENST00000431542
>ENST00000428263
>ENST00000628538
  1. Number of k-mer hits per sequence

    By default, for each transcript in the query, SeqOthello returns the total number of k-mer hits of each experiment in a tab-delimited table.

    ../build/bin/Query --map-folder=map/ \
    --transcript=transcripts.fa \
    --output=query.hits.txt \
    --qthread=1
    

    The query results has three rows, one transcript per row. The first column indicates the transcripts index matching the order in transcript.fa. The rest of the columns are the query results for all the experiments in the SeqOthello map. These experiments are in the order specified in the map.xml.

    cat query.hits.txt
    transcript# 0   222     214     200     150     160     215     209     220     193     233
    transcript# 1   205     205     168     210     217     202     190     211     160     196
    transcript# 2   115     115     115     115     115     115     115     115     115     115
    
  2. Detailed k-mer hit map per query

    If --detail is used, SeqOthello may return the detailed map regarding individual k-mer’s presence/absence information across all the indexed experiments. This mode is limited to one transcript per query due to the large amount of output generated.

    head -2 transcripts.fa > ENST00000431542.fa
    
    ../build/bin/Query --map-folder=map/ \
    --transcript=ENST00000431542.fa \
    --output=ENST00000431542.hits.txt \
    --detail \
    --qthread=1
    

    The output has two columns. The first column is the k-mer sequence and the following set of columns contains the detailed hit map across all indexed experiments. We use + and . sign to indicate whether a k-mer is present or absence in an indexed experiment respectively. Each column of the + and - map, from left to right, indicates the hit infomation for one experiment indexed in the SeqOthello. These experiments are in the order specified in the map.xml.

    Here is an example output showing the first 10 k-mers in transcript ENST00000431542.

    head -10 ENST00000431542.hits.txt
    GTGGGAGTCGCCACCGCACCC ..........
    CGTGGGAGTCGCCACCGCACC .........+
    CCGTGGGAGTCGCCACCGCAC .........+
    GCCGTGGGAGTCGCCACCGCA .......+.+
    CGCCGTGGGAGTCGCCACCGC .......+.+
    CCGCCGTGGGAGTCGCCACCG .......+.+
    TCCGCCGTGGGAGTCGCCACC .......+.+
    ATCCGCCGTGGGAGTCGCCAC +......+.+
    CATCCGCCGTGGGAGTCGCCA +......+.+
    CCATCCGCCGTGGGAGTCGCC +....+.+.+
    

    You may also use the gnu datamash tool to view the transposed representation of this file.

    ./transpose.sh ENST00000431542.hits.txt
    

SeqOthello Online

SeqOthello also accommodates online features for small-batch queries. Online queries preload the entire index into memory prior to querying, and can be executed in approximately 0.09 seconds per transcript.

Use the following command to start a server on the machine, (e.g., on TCP port 3322). The service will run as a deamon.

../build/bin/Query \
--map-folder=map/ \
--start-server-port 3322

Open a new terminal, run the Client program for k-mer hit query.

../build/bin/Client \
--transcript=transcripts.fa \
--output=online.query.txt \
--kmer-hit \
--port=3322

Build SeqOthello in parallel

The PreProcess and Group steps in SeqOthello construction can be easily paralleled. For example, with GNU Parallel, you run the PreProcess commands for the 10 experiments:

cat experiments_list.10.txt | \
parallel  ../build/bin/PreProcess \
--k=21 \
--cutoff=1 \
--in=tmp/kmers/{}.kmer \
--out=tmp/kmer_bins/{}.bin

You can build the 2 group files in parallel with:

parallel ../build/bin/Group \
--flist=tmp/binary_list.part{1} \
--folder=tmp/kmer_bins/ \
--output=tmp/grp/Grp_{1} \
::: 01 02

License

Please refer to LICENSE.TXT.

Citation

    SeqOthello: Query over RNA-seq experiments at scale
    Ye Yu, Jinpeng Liu, Xinan Liu, Yi Zhang, Eamonn Magner, Chen Qian, Jinze Liu
    bioRxiv 258772; doi: https://doi.org/10.1101/258772

Getting help

For questions running SeqOthello, please post to SeqOthello Google Group

Known issues

  • Linux systems often have a limit on the number of files that can be opened simutaneously. When there are a large number of experiment files, the Build program of SeqOthello may hit the limit. You may set it to larger numbers by using limit. e.g, ulimit -nS 4096

  • The maximum size of k-mers supported as of now is 31.

  • Each of binary file is associated with a xml file, the xml files should be put in the same folder with the binary files. Similarly, please put the associated xml files for the group files in the same folder with the group files.