Graph Based Sequence Clustering (GBSC) is a method for identifying and clustering short tandem repeats (STRs) by their repetitive patterns. It first scans all database sequences to identify STRs and determine their graph representations. Then, it groups sequences that have identical graph representations. It can also search a database for sequences that have the same graph representation as a query. The method therefore works in three modes: identify, cluster, and search.
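The grouping idea can be illustrated with a small Python sketch. This is not GBSC's algorithm: the graph representation is not described here, so the sketch swaps in a much simpler stand-in, namely the ordered list of repeated motifs found by a regular expression. The names repeat_signature, canonical_rotation and cluster_by_signature, the motif length limit of 6, and the minimum of 3 copies are all assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Simplified stand-in for STR detection (an assumption, not GBSC's method):
# a motif of 1-6 bases repeated at least 3 times in a row.
STR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{2,}")


def canonical_rotation(motif):
    """Return the lexicographically smallest rotation of a motif,
    so that CAG, AGC and GCA are treated as the same repeat unit."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def repeat_signature(sequence):
    """Return the ordered tuple of canonical repeat motifs in a sequence."""
    return tuple(
        canonical_rotation(match.group(1))
        for match in STR_PATTERN.finditer(sequence.upper())
    )


def cluster_by_signature(sequences):
    """Group sequence names whose signatures are identical."""
    clusters = defaultdict(list)
    for name, seq in sequences.items():
        clusters[repeat_signature(seq)].append(name)
    return dict(clusters)


# Toy input: seq1 and seq2 both carry a CAG-type repeat, seq3 an AT repeat.
example = {
    "seq1": "TTCAGCAGCAGCAGGA",
    "seq2": "GGCAGCAGCAGTT",
    "seq3": "ACATATATATGC",
}
clusters = cluster_by_signature(example)
for signature, names in clusters.items():
    print(signature, names)
```

Sequences end up in the same group only when their signatures match exactly, which mirrors the "identical representation" criterion described above while leaving the actual graph construction to GBSC.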
Installation
First, download the GBSC source code. The official version is available from the project's GitHub release page. Then extract the archive using the following command:
tar zxvf gbsc-1.0.0.tar.gz
Before compiling the project, install the build dependencies. On Debian-based systems, you can do this with the following command:
sudo apt install g++ make
Enter the directory using:
cd gbsc-1.0.0
Then configure and build the software by executing the following commands:
./configure
make
To install the software on your system, run the following command:
make install
Usage
To identify STRs in a FASTA database, use the following syntax:
gbsc identify -i [fasta_db]
The syntax for clustering is:
gbsc cluster -i [fasta_db] --clusters-dir [output_dir]
It is also possible to search for similar STRs in a database using the following syntax:
gbsc search -i [fasta_db] --query-file [fasta_query]
For more help, use either:
gbsc help
or:
gbsc [command] --help
where [command] is one of identify, cluster, or search.
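If you want to drive the three modes from a script, the documented invocations can be wrapped in a few lines of Python. The following is only a sketch: it uses just the options shown above (-i, --clusters-dir, --query-file), while the helper names build_gbsc_command and run_gbsc and the file names db.fasta, query.fasta and clusters_out are hypothetical placeholders.

```python
import shutil
import subprocess

MODES = ("identify", "cluster", "search")


def build_gbsc_command(mode, fasta_db, clusters_dir=None, query_file=None):
    """Build the argument list for one of the documented gbsc modes."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    command = ["gbsc", mode, "-i", fasta_db]
    if mode == "cluster":
        if clusters_dir is None:
            raise ValueError("cluster mode needs clusters_dir")
        command += ["--clusters-dir", clusters_dir]
    elif mode == "search":
        if query_file is None:
            raise ValueError("search mode needs query_file")
        command += ["--query-file", query_file]
    return command


def run_gbsc(command):
    """Run the command if gbsc is on PATH; return None otherwise."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True, capture_output=True, text=True)


# Placeholder file and directory names, for illustration only.
commands = [
    build_gbsc_command("identify", "db.fasta"),
    build_gbsc_command("cluster", "db.fasta", clusters_dir="clusters_out"),
    build_gbsc_command("search", "db.fasta", query_file="query.fasta"),
]
for command in commands:
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Call run_gbsc on any of the built commands once gbsc is installed and the input files exist.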