SeqOthello
Last update: Sep 6, 2018
SeqOthello is an ultra-fast and memory-efficient indexing structure to support arbitrary sequence query against large collections of RNA-seq experiments. Taking a sequence as query input, SeqOthello returns either the total k-mer hits of the query sequence or the detailed presence/absence information of individual k-mers across all the indexed experiments.
A preprint of the paper describing SeqOthello is available here.
SeqOthello is tested on Linux platforms with the following system settings. The performance is optimized for Intel CPUs with SSE4.2 support.
- cmake >= 2.8.4
- gcc >= 4.9.1
- zlib >= 1.2.3
Ubuntu
sudo apt install cmake build-essential zlib1g-dev
Fedora
sudo yum install gcc-c++ cmake zlib-dev
Mac OS 10.12. Please install cmake using brew
brew install cmake
-
Clone the repository:
git clone https://github.com/LiuBioinfo/SeqOthello.git
-
Build and install:
cd SeqOthello/ ./compile.sh
After successfully build the code, you can find SeqOthello toolchain at
build/bin/
.ls build/bin
The construction of SeqOthello requires the input of a list of k-mer files, each of which contains the set of k-mers extracted from reads of the corresponding RNA-Seq experiment. Currently the k-mer file is generated by Jellyfish from fasta/fastq files.
For demonstration purpose, we provide an example/
project including 10 simulated pair-end RNA-seq experiments for testing. You can use the pipeline script example/build_and_query.sh
to build the SeqOthello map for the sample project from the fastq
files and query the example transcripts included in example/transcripts.fa
-
Extract k-mer count using Jellyfish
You may need to install Jellyfish first.
First, use Jellyfish to generate k-mer files for the experiments included in
experiments_list.10.txt
and save them in the temporary foldertmp/kmers
. This setup may take about 10 seconds. For the given example, we provide a shell scripts for this stepSTEP1_Jellyfish.sh
. The You may execute the script inexample
folder:cd example ./STEP1_Jellyfish.sh
-
Convert k-mer files to SeqOthello binary format.
The second step is to convert k-mer files to SeqOthello binary format using
PreProcess
. For the given example, we provide a shell scripts for this stepSTEP2_Binary.sh
. You may execute the script inexample
folder./STEP2_Binary.sh
-
Make SeqOthello group files In the third step, binary files are grouped by the
Group
tool, into small subsets for further process. Generally, each group contains approximately 50 samples. Since we only have 10 samples in this example, we build two groups in tmp/grp/, whereGrp_00
contains experiments specified intmp/binary_list.part00
, which is corresponds to the first 5 lines ofexperiments_list.10.txt
. The reset 5 examples are included inGrp_01
. We put these two filenames intmp/grp_list
.For the given example, we provide a shell scripts for this step
STEP3_Group.sh
. You may execute the script inexample
folder./STEP3_Group.sh
-
Build SeqOthello mapping
Now, we can build the SeqOthello mapping between the entire set of k-mers and their experiment ids using the
Build
tool. The indexes groups are speicified by--flist
parameter, order of group files essentially determines the orders of the experiments stored in SeqOthello. You may execute the following command inexample
foldermkdir -p map ../build/bin/Build --flist=tmp/grp_list --grp-folder=tmp/grp/ --out-folder=map/
And then you will find the SeqOthello mapping in the
map
folder. The filemap.xml
provides the metadata of the SeqOthello structure. -
An example of adding new experiments to existing SeqOthello.
In some cases the users may choose to add additional experiments to exisiting SeqOthello mapping. We recommend the users keep the group files generated during SeqOthello construction, so that it can be reused for further reconstruction.
For example, in addition to the 10 experiments, we have 3 new kmer files generated by Jellyfish. Execute this and you will find the additional example kmer files in tmp/kmers/
STEP_additional_1.sh
Similarly, we need to convert these k-mer files into binary files and then convert the binary files to group files. Execute these commands.
STEP_additional_2.sh
STEP_additional_3.sh
mkdir -p mapa ../build/bin/Build --flist=tmp/grp_list_a --grp-folder=tmp/grp/ --out-folder=mapa/
To build SeqOthello with custom experiments, please use Jellyfish you may use the script generator genBuildFromJellyfishKmers.sh
./genBuildFromJellyfishKmers.sh
For all prompted input questions, use the default value for the example data set, or, enter custom values for custom experiments. Then three scripts will be generated and you can execute them one by one.
./ConvertToBinary.sh
./MakeGroup.sh
./BuildSeqOthello.sh
SeqOthello Query
takes .fa
files as input and can generate output in the following two format.
The file transcripts.fa
contains 3 human transcript sequences and are used in the following example to demonstrate query results.
grep "ENST" transcripts.fa
>ENST00000431542
>ENST00000428263
>ENST00000628538
-
Number of k-mer hits per sequence
By default, for each transcript in the query, SeqOthello returns the total number of k-mer hits of each experiment in a tab-delimited table.
../build/bin/Query --map-folder=map/ \ --transcript=transcripts.fa \ --output=query.hits.txt \ --qthread=1
The query results has three rows, one transcript per row. The first column indicates the transcripts index matching the order in
transcript.fa
. The rest of the columns are the query results for all the experiments in the SeqOthello map. These experiments are in the order specified in themap.xml
.cat query.hits.txt transcript# 0 222 214 200 150 160 215 209 220 193 233 transcript# 1 205 205 168 210 217 202 190 211 160 196 transcript# 2 115 115 115 115 115 115 115 115 115 115
-
Detailed k-mer hit map per query
If
--detail
is used, SeqOthello may return the detailed map regarding individual k-merâs presence/absence information across all the indexed experiments. This mode is limited to one transcript per query due to the large amount of output generated.head -2 transcripts.fa > ENST00000431542.fa ../build/bin/Query --map-folder=map/ \ --transcript=ENST00000431542.fa \ --output=ENST00000431542.hits.txt \ --detail \ --qthread=1
The output has two columns. The first column is the k-mer sequence and the following set of columns contains the detailed hit map across all indexed experiments. We use
+
and.
sign to indicate whether a k-mer is present or absence in an indexed experiment respectively. Each column of the+
and-
map, from left to right, indicates the hit infomation for one experiment indexed in the SeqOthello. These experiments are in the order specified in themap.xml
.Here is an example output showing the first 10 k-mers in transcript ENST00000431542.
head -10 ENST00000431542.hits.txt GTGGGAGTCGCCACCGCACCC .......... CGTGGGAGTCGCCACCGCACC .........+ CCGTGGGAGTCGCCACCGCAC .........+ GCCGTGGGAGTCGCCACCGCA .......+.+ CGCCGTGGGAGTCGCCACCGC .......+.+ CCGCCGTGGGAGTCGCCACCG .......+.+ TCCGCCGTGGGAGTCGCCACC .......+.+ ATCCGCCGTGGGAGTCGCCAC +......+.+ CATCCGCCGTGGGAGTCGCCA +......+.+ CCATCCGCCGTGGGAGTCGCC +....+.+.+
You may also use the gnu datamash tool to view the transposed representation of this file.
./transpose.sh ENST00000431542.hits.txt
SeqOthello also accommodates online features for small-batch queries. Online queries preload the entire index into memory prior to querying, and can be executed in approximately 0.09 seconds per transcript.
Use the following command to start a server on the machine, (e.g., on TCP port 3322). The service will run as a deamon.
../build/bin/Query \
--map-folder=map/ \
--start-server-port 3322
Open a new terminal, run the Client program for k-mer hit query.
../build/bin/Client \
--transcript=transcripts.fa \
--output=online.query.txt \
--kmer-hit \
--port=3322
The PreProcess
and Group
steps in SeqOthello construction can be easily paralleled.
For example, with GNU Parallel, you run the PreProcess
commands for the 10 experiments:
cat experiments_list.10.txt | \
parallel ../build/bin/PreProcess \
--k=21 \
--cutoff=1 \
--in=tmp/kmers/{}.kmer \
--out=tmp/kmer_bins/{}.bin
You can build the 2 group files in parallel with:
parallel ../build/bin/Group \
--flist=tmp/binary_list.part{1} \
--folder=tmp/kmer_bins/ \
--output=tmp/grp/Grp_{1} \
::: 01 02
Please refer to LICENSE.TXT.
SeqOthello: Query over RNA-seq experiments at scale
Ye Yu, Jinpeng Liu, Xinan Liu, Yi Zhang, Eamonn Magner, Chen Qian, Jinze Liu
bioRxiv 258772; doi: https://doi.org/10.1101/258772
For questions running SeqOthello, please post to SeqOthello Google Group
-
Linux systems often have a limit on the number of files that can be opened simutaneously. When there are a large number of experiment files, the Build program of SeqOthello may hit the limit. You may set it to larger numbers by using
limit
. e.g,ulimit -nS 4096
-
The maximum size of k-mers supported as of now is 31.
-
Each of binary file is associated with a xml file, the xml files should be put in the same folder with the binary files. Similarly, please put the associated xml files for the group files in the same folder with the group files.