GBSC

Graph Based Sequence Clustering (GBSC) is a method for identifying and clustering short tandem repeats (STRs) by their repetitive patterns. It first scans all database sequences to identify STRs and determine their graph representations, then groups sequences that have identical graph representations. It can also search a database for sequences whose graph representation matches that of a query. The method therefore works in three modes: identify, cluster, and search.

Installation

First, download the GBSC source code. The official version is available from the project's GitHub release page. Then extract the archive using the following command:

tar zxvf gbsc-1.0.0.tar.gz

Before compiling the project, install the build dependencies. On Debian-based systems, you can do so using the following command:

sudo apt install g++ make
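On systems without apt, it can help to check whether the two build tools are already present before installing anything. The following is a minimal sketch, not part of GBSC: the helper name missing_tools is hypothetical, and the only assumption is that g++ and make (the packages named in the apt command above) are the tools the build needs.

```shell
#!/bin/sh
# Hypothetical helper (not part of GBSC): print every tool from the
# argument list that cannot be found on PATH, one name per line.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# g++ and make are the two packages named in the apt command above.
missing=$(missing_tools g++ make)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing >&2
else
    echo "All build tools found."
fi
```

If the script reports a missing tool, install it with your distribution's package manager before running ./configure.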

Enter the directory using:

cd gbsc-1.0.0

Then configure and build the software by executing the following commands:

./configure
make

To install GBSC on your system, use the following command (it may require root privileges, depending on the installation location):

make install

Usage

To identify STRs in a FASTA database, use the following syntax:

gbsc identify -i [fasta_db]

The syntax for clustering is:

gbsc cluster -i [fasta_db] --clusters-dir [output_dir]
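The clustering call above can be wrapped in a small shell function. This is only a sketch under stated assumptions: run_cluster and the GBSC_BIN variable are hypothetical names introduced here, and creating the output directory beforehand is a precaution, since GBSC may well create it itself.

```shell
#!/bin/sh
# GBSC_BIN is a hypothetical override used only by this sketch, so the
# function can be dry-run (for example with GBSC_BIN=echo). It is not a
# GBSC feature.
GBSC_BIN=${GBSC_BIN:-gbsc}

# Cluster one FASTA database into the given output directory.
# Usage: run_cluster FASTA_DB OUTPUT_DIR
run_cluster() {
    db=$1
    out_dir=$2
    # Assumption: make sure the output directory exists before the run.
    mkdir -p "$out_dir" || return 1
    "$GBSC_BIN" cluster -i "$db" --clusters-dir "$out_dir"
}
```

For example, run_cluster sequences.fasta clusters would run the cluster mode on a file named sequences.fasta (a placeholder name) and pass clusters as the output directory.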

It is also possible to search for similar STRs in a database using the following syntax:

gbsc search -i [fasta_db] --query-file [fasta_query]
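When several query files need to be searched against the same database, the search call can be repeated in a loop. The sketch below assumes one gbsc search invocation per query file; search_all and GBSC_BIN are hypothetical names introduced for illustration, not GBSC options.

```shell
#!/bin/sh
# GBSC_BIN is a hypothetical override used only by this sketch, so the
# function can be dry-run (for example with GBSC_BIN=echo).
GBSC_BIN=${GBSC_BIN:-gbsc}

# Run one search per query file against the same database.
# Usage: search_all FASTA_DB QUERY_FILE...
search_all() {
    db=$1
    shift
    for query in "$@"; do
        # Skip query files that do not exist instead of calling gbsc on them.
        if [ ! -f "$query" ]; then
            echo "skipping missing query file: $query" >&2
            continue
        fi
        "$GBSC_BIN" search -i "$db" --query-file "$query"
    done
}
```

For example, search_all database.fasta query1.fasta query2.fasta would run two searches in sequence; the file names are placeholders.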

For more help, use either:

gbsc help

or:

gbsc [command] --help

Here, [command] is one of identify, cluster, or search.