stringMLST is a tool for detecting the MLST of an isolate directly from the genome sequencing reads. stringMLST predicts the ST of an isolate in a completely assembly- and alignment-free manner. The tool is designed in a lightweight, platform-independent fashion with minimal dependencies.
stringMLST is a tool, not a database; always use the most up-to-date database files available. To facilitate keeping your databases updated, stringMLST can download and build databases from pubMLST using the most recent allele and profile definitions. Please see the "Included databases and automated retrieval of databases from pubMLST" section below for instructions. The databases bundled here are for convenience only; do not rely on them being up to date.
The following links can be used to download stringMLST, the user manual, the profile definitions and allele sequences (with configuration files) for a few organisms, and example read files.
pip install stringMLST
Installation via git (Not recommended for most users)
git clone https://github.com/jordanlab/stringMLST
# Optional, download prebuilt databases
cd stringMLST
git submodule init
git submodule update
pip install stringMLST
mkdir -p stringMLST_analysis; cd stringMLST_analysis
stringMLST.py --getMLST -P neisseria/nmb --species neisseria
# Download all available databases with:
# stringMLST.py --getMLST -P mlst_dbs --species all
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_2.fastq.gz
stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz
Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174
Usage for Example Read Files (Neisseria meningitidis)
Build database:
# Add dir to path
export PATH=$PATH:$PWD
# Will connect to EBI's SRA servers
download_example_reads.sh
unzip datasets/Neisseria_spp.zip -d datasets
[loci]
abcZ	datasets/Neisseria_spp/abcZ.fa
adk	datasets/Neisseria_spp/adk.fa
aroE	datasets/Neisseria_spp/aroE.fa
fumC	datasets/Neisseria_spp/fumC.fa
gdh	datasets/Neisseria_spp/gdh.fa
pdhC	datasets/Neisseria_spp/pdhC.fa
pgm	datasets/Neisseria_spp/pgm.fa
[profile]
profile	datasets/Neisseria_spp/neisseria.txt
stringMLST.py --buildDB -c datasets/Neisseria_spp/config.txt -k 35 -P NM
Predict:
example/ERR026529_1.fastq
example/ERR026529_2.fastq
example/ERR027250_1.fastq
example/ERR027250_2.fastq
example/ERR036104_1.fastq
example/ERR036104_2.fastq
stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fastq -2 tests/fastqs/ERR026529_2.fastq -k 35 -P NM
stringMLST.py --predict -d ./tests/fastqs/ -k 35 -P NM
tests/fastqs/ERR026529_1.fastq	tests/fastqs/ERR026529_2.fastq
tests/fastqs/ERR027250_1.fastq	tests/fastqs/ERR027250_2.fastq
tests/fastqs/ERR036104_1.fastq	tests/fastqs/ERR036104_2.fastq
Run the tool as:
stringMLST.py --predict -l list_paired.txt -k 35 -P NM
stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fq.gz -2 tests/fastqs/ERR026529_2.fq.gz -p -P NM -k 35 -o ST_NM.txt
stringMLST's workflow is divided into two routines:
1) Database building and
2) ST discovery
Database building: Builds the stringMLST database, which is used for assigning STs to input sample files. This step is required once for each organism. Please note that stringMLST is capable of working with a custom, user-defined typing scheme, but its efficiency has not been tested on other typing schemes.
ST discovery: This routine takes the database created in the last step and predicts the ST of the input sample(s). Please note that the database building is required prior to this routine. stringMLST is capable of processing single-end and paired-end files. It can run in three modes:
1) Single sample mode - for running stringMLST on a single sample
2) Batch mode - for running stringMLST on all the FASTQ files present in a directory
3) List mode - for running stringMLST on all the FASTQ files provided in a list file
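The three modes differ only in how the input files are named on the command line. As a minimal sketch, the hypothetical helper below (`build_predict_command` is not part of stringMLST) assembles the argument list for each mode using only the flags documented in this README; it does not run the tool itself.

```python
from typing import List, Optional


def build_predict_command(
    prefix: str,
    k: int = 35,
    fastq1: Optional[str] = None,
    fastq2: Optional[str] = None,
    directory: Optional[str] = None,
    list_file: Optional[str] = None,
    paired: bool = True,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble a `stringMLST.py --predict` command for exactly one mode."""
    chosen = [fastq1 is not None, directory is not None, list_file is not None]
    if sum(chosen) != 1:
        raise ValueError("choose exactly one of fastq1, directory, list_file")

    cmd = ["stringMLST.py", "--predict"]
    if fastq1 is not None:
        # Single sample mode: -1 (and -2 for paired-end reads).
        cmd += ["-1", fastq1]
        if paired:
            if fastq2 is None:
                raise ValueError("paired-end single sample mode needs fastq2")
            cmd += ["-2", fastq2]
    elif directory is not None:
        # Batch mode: every FASTQ file in the directory.
        cmd += ["-d", directory]
    else:
        # List mode: files named in a list file.
        cmd += ["-l", list_file]

    cmd.append("-p" if paired else "-s")
    cmd += ["-P", prefix, "-k", str(k)]
    if output is not None:
        cmd += ["-o", output]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the database with the given prefix has been built.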
Readme for stringMLST
=============================================================================================
Usage
./stringMLST.py
[--buildDB]
[--predict]
[-1 filename_fastq1][--fastq1 filename_fastq1]
[-2 filename_fastq2][--fastq2 filename_fastq2]
[-d directory][--dir directory][--directory directory]
[-l list_file][--list list_file]
[-p][--paired]
[-s][--single]
[-c][--config]
[-P][--prefix]
[-z][--fuzzy]
[-a]
[-C][--coverage]
[-k]
[-o output_filename][--output output_filename]
[-x][--overwrite]
[-t]
[-r]
[-v]
[-h][--help]
==============================================================================================
There are two steps to predicting ST using stringMLST.
1. Create DB : stringMLST.py --buildDB
2. Predict : stringMLST.py --predict

1. stringMLST.py --buildDB

Synopsis:
stringMLST.py --buildDB -c <config file> -k <kmer length(optional)> -P <DB prefix(optional)>
  config file : a tab-delimited file that holds the information for the typing scheme, i.e. the loci, their multi-FASTA files and the profile definition file.
    Format :
      [loci]
      locus1    locusFile1
      locus2    locusFile2
      [profile]
      profile   profileFile
  kmer length : the kmer length for the db. Note that, while processing, this should be smaller than the read length.
    We suggest kmer lengths of 35, 66 depending on the read length.
  DB prefix(optional) : holds the information for the DB files to be created and their location. This module creates 3 files with this prefix.
    You can use a folder structure with the prefix to store your db at a particular location.

Required arguments
--buildDB
  Identifier for the build db module
-c,--config = <configuration file>
  Config file in the format described above.
  All the files follow the structure followed by pubMLST. Refer to the extended document for details.

Optional arguments
-k = <kmer length>
  Kmer size for which the db has to be formed (Default k = 35). Note the tool works best with a kmer length between 35 and 66 for read lengths of 55 to 150 bp. Kmer size can be increased accordingly. It is advised to keep lower kmer sizes if the quality of reads is not very good.
-P,--prefix = <prefix>
  Prefix for db and log files to be created (Default = kmer). You can also specify the folder where you want the db to be created.
-a
  File location to write the build log
-h,--help
  Prints the help manual for this application

--------------------------------------------------------------------------------------------

2. stringMLST.py --predict

stringMLST.py --predict : can run in three modes
  1) single sample (default mode)
  2) batch mode : run stringMLST for all the samples in a folder (for a particular species)
  3) list mode : run stringMLST on samples specified in a file
stringMLST can process both single-end and paired-end files. By default the program expects paired-end files.

Synopsis
stringMLST.py --predict -1 <fastq file> -2 <fastq file> -d <directory location> -l <list file> -p -s -P <DB prefix(optional)> -k <kmer length(optional)> -o <output file> -x

Required arguments
--predict
  Identifier for the predict module

Optional arguments
-1,--fastq1 = <fastq1_filename>
  Path to the first fastq file for a paired-end sample, or path to the fastq file for a single-end sample.
  Should have extension fastq or fq.
-2,--fastq2 = <fastq2_filename>
  Path to the second fastq file for a paired-end sample.
  Should have extension fastq or fq.
-d,--dir,--directory = <directory>
  BATCH MODE : Location of all the samples for batch mode.
-C,--coverage
  Calculate sequence coverage for each allele. Turns on read generation (-r) and turns off fuzzy (-z 1).
  Requires bwa, bamtools and samtools be in your path
-k = <kmer_length>
  Kmer length for which the db was created (Default k = 35). Can be verified by looking at the name of the db file.
  Can be used if the reads are of very bad quality or have a lot of N's.
-l,--list = <list_file>
  LIST MODE : Location of the list file and flag for list mode.
  The list file should have full file paths for all the samples/files.
  Each sample takes one line. For paired-end samples the 2 files should be tab separated on a single line.
-o,--output = <output_filename>
  Prints the output to a file instead of stdout.
-p,--paired
  Flag for specifying paired-end files. This is the default, so all modes behave the same if you do not specify it.
  For batch mode the paired-end samples should be differentiated by 1/2.fastq or 1/2.fq
-P,--prefix = <prefix>
  Prefix with which the db was created (Default = kmer). The location of the db can also be provided.
-r
  A separate reads file is created which has all the reads covering all the loci.
-s,--single
  Flag for specifying single-end files.
-t
  Time for each analysis will also be reported.
-v
  Prints the version of the software.
-x,--overwrite
  By default stringMLST appends the results to output_filename if the same name is used.
  This argument overwrites the previously specified output file.
-z,--fuzzy = <fuzzy threshold int>
  Threshold for reporting a fuzzy match (Default=300). For higher coverage reads this threshold should be set higher to avoid
  indicating a fuzzy match when an exact match was more likely. For lower coverage reads, a threshold of <100 is recommended
-h,--help
  Prints the help manual for this application

--------------------------------------------------------------------------------------------

3. stringMLST.py --getMLST

Synopsis:
stringMLST.py --getMLST --species= <species> [-k kmer length] [-P DB prefix]

Required arguments
--getMLST
  Identifier for the getMLST module
--species= <species name>
  Species name from the pubMLST schemes (use --schemes to get the list of available schemes)
  "all" will download and build all

Optional arguments
-k = <kmer length>
  Kmer size for which the db has to be formed (Default k = 35). Note the tool works best with a kmer length between 35 and 66 for read lengths of 55 to 150 bp. Kmer size can be increased accordingly. It is advised to keep lower kmer sizes if the quality of reads is not very good.
-P,--prefix = <prefix>
  Prefix for db and log files to be created (Default = kmer). You can also specify the folder where you want the db to be created.
  We recommend that prefix and config point to the same folder for cleanliness, but this is not required
--schemes
  Display the list of available schemes
-h,--help
  Prints the help manual for this application
stringMLST expects paired end reads to be in Illumina naming convention, minimally ending with _1.fq and _2.fq to delineate read1 and read2:
Periods (.) are disallowed delimiters except for file extensions
Illumina FASTQ files use the following naming scheme:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
For example, the following is a valid FASTQ file name:
NA10831_ATCACG_L002_R1_001.fastq.gz
Included databases and automated retrieval of databases from pubMLST
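To check file names before a run, the naming rules above can be approximated with a regular expression. This is a sketch of one reading of the convention, not the check stringMLST itself performs: the pattern and the helper name `split_read_name` are assumptions, and the pattern accepts both the minimal `_1.fq`/`_2.fq` form and the full Illumina form while rejecting periods in the sample name.

```python
import re
from typing import Optional, Tuple

# Sample name without periods, then _1/_2 or _R1/_R2, an optional
# 3-digit set number, and a fastq/fq extension (optionally gzipped).
_READ_NAME = re.compile(
    r"^(?P<sample>[^.]+)_R?(?P<read>[12])(?:_\d{3})?\.(?:fastq|fq)(?:\.gz)?$"
)


def split_read_name(filename: str) -> Optional[Tuple[str, int]]:
    """Return (sample name, read number), or None if the name does not fit."""
    match = _READ_NAME.match(filename)
    if match is None:
        return None
    return match.group("sample"), int(match.group("read"))
```

For instance, `split_read_name("NA10831_ATCACG_L002_R1_001.fastq.gz")` yields the sample name `NA10831_ATCACG_L002` with read number 1, while a name containing a stray period returns `None`.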
stringMLST includes all the pubMLST databases as of February 15, 2017, built with the default kmer (35). They can be found in the datasets/ folder.
Simply unzip the databases you need and begin using stringMLST as described below.
All the databases from pubMLST can be downloaded and prepared with your kmer choice
Getting all pubMLST schemes
stringMLST.py --getMLST -P datasets/ --species all
Individual databases from pubMLST can also be downloaded as needed, using the scheme identifiers
Downloading a scheme
# List available schemes
stringMLST.py --getMLST --schemes
# Download the Neisseria spp. scheme
stringMLST.py --getMLST -P datasets/nmb --species neisseria
Database Preparation
ST	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	clonal_complex
1	1	3	1	1	1	1	3	ST-1 complex/subgroup I/II
2	1	3	4	7	1	1	3	ST-1 complex/subgroup I/II
3	1	3	1	1	1	23	13	ST-1 complex/subgroup I/II
4	1	3	3	1	4	2	3	ST-4 complex/subgroup IV
>abcZ_1
TTTGATACTGTTGCCGA...
>abcZ_2
TTTGATACCGTTGCCGA...
>abcZ_3
TTTGATACCGTTGCGAA...
>abcZ_4
TTTGATACCGTTGCCAA...
[loci]
abcZ	/data/home/stringMLST/pubmlst/Neisseria_sp/abcZ.fa
adk	/data/home/stringMLST/pubmlst/Neisseria_sp/adk.fa
aroE	/data/home/stringMLST/pubmlst/Neisseria_sp/aroE.fa
fumC	/data/home/stringMLST/pubmlst/Neisseria_sp/fumC.fa
gdh	/data/home/stringMLST/pubmlst/Neisseria_sp/gdh.fa
pdhC	/data/home/stringMLST/pubmlst/Neisseria_sp/pdhC.fa
pgm	/data/home/stringMLST/pubmlst/Neisseria_sp/pgm.fa
[profile]
profile	/data/home/stringMLST/pubmlst/Neisseria_sp/neisseria.txt
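A profile definition maps one combination of allele numbers to one ST. As a minimal sketch of that lookup (not stringMLST's own code), the hypothetical helper `load_profiles` below reads a tab-delimited profile definition and indexes STs by their allele tuple; the four rows are the ones shown above, and tab separation is assumed.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

PROFILE_TEXT = (
    "ST\tabcZ\tadk\taroE\tfumC\tgdh\tpdhC\tpgm\tclonal_complex\n"
    "1\t1\t3\t1\t1\t1\t1\t3\tST-1 complex/subgroup I/II\n"
    "2\t1\t3\t4\t7\t1\t1\t3\tST-1 complex/subgroup I/II\n"
    "3\t1\t3\t1\t1\t1\t23\t13\tST-1 complex/subgroup I/II\n"
    "4\t1\t3\t3\t1\t4\t2\t3\tST-4 complex/subgroup IV\n"
)


def load_profiles(handle: TextIO) -> Tuple[List[str], Dict[Tuple[str, ...], str]]:
    """Return the locus names and a mapping from allele tuple to ST."""
    reader = csv.DictReader(handle, delimiter="\t")
    loci = [name for name in reader.fieldnames if name not in ("ST", "clonal_complex")]
    table = {}
    for row in reader:
        # The key is the allele number at each locus, in header order.
        table[tuple(row[locus] for locus in loci)] = row["ST"]
    return loci, table


loci, table = load_profiles(io.StringIO(PROFILE_TEXT))
# Allele numbers for abcZ, adk, aroE, fumC, gdh, pdhC, pgm:
print(table.get(("1", "3", "4", "7", "1", "1", "3")))  # prints 2
```

An allele combination that is absent from the table returns `None` from `table.get`, which corresponds to a profile with no assigned ST in the definition file.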
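Because the database is built for a chosen k-mer length, it can help to see what decomposing allele sequences into k-mers looks like. The sketch below is a conceptual illustration only and makes no claim about stringMLST's internal data structures; the function names and the toy sequences are hypothetical (the allele sequences above are truncated, so they are not reused here).

```python
from collections import defaultdict
from typing import Dict, List, Set


def kmers(sequence: str, k: int) -> List[str]:
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_kmer_index(alleles: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of allele names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, sequence in alleles.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)
    return index


# Toy alleles, far shorter than real ones, with k = 3 for readability.
toy = {"x_1": "ACGT", "x_2": "ACGA"}
index = build_kmer_index(toy, 3)
print(sorted(index["ACG"]))  # prints ['x_1', 'x_2']
print(sorted(index["CGT"]))  # prints ['x_1']
```

A sequence shorter than k produces no k-mers at all, which is one way to see why the k-mer length has to be smaller than the read length.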
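Writing the configuration file by hand is error-prone when a scheme has many loci. A minimal sketch, assuming the tab-delimited `[loci]` / `[profile]` layout described in this README; `make_config` is a hypothetical helper and the paths in the usage line are placeholders.

```python
from typing import Dict


def make_config(loci_files: Dict[str, str], profile_file: str) -> str:
    """Render a stringMLST-style config: one 'locus<TAB>path' line per locus."""
    lines = ["[loci]"]
    for locus, path in loci_files.items():
        lines.append(f"{locus}\t{path}")
    lines.append("[profile]")
    lines.append(f"profile\t{profile_file}")
    return "\n".join(lines) + "\n"


# Placeholder paths; substitute the location of your own allele files.
text = make_config(
    {"abcZ": "scheme/abcZ.fa", "adk": "scheme/adk.fa"},
    "scheme/profile.txt",
)
print(text, end="")
```

The returned text can be written to `config.txt` and passed to `--buildDB` with `-c`.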
stringMLST.py --buildDB --config <config file> -k <k-mer length> -P <prefix>
stringMLST.py --buildDB --config config.txt -k 35 -P NM
stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -o <output file name>
stringMLST.py --predict -1 <single-end file> -s --prefix <prefix for the database> -k <k-mer size> -o <output file name>
stringMLST.py --predict -d <directory for samples> -p --prefix <prefix for the database> -k <k-mer size> -o <output file name>
stringMLST.py --predict -d <directory for samples> -s --prefix <prefix for the database> -k <k-mer size> -o <output file name>
<full path of sample 1 fastq file>
<full path of sample 2 fastq file>
<full path of sample 3 fastq file>
.
.
<full path of sample n fastq file>
<full path of sample 1 fastq file 1>	<full path of sample 1 fastq file 2>
<full path of sample 2 fastq file 1>	<full path of sample 2 fastq file 2>
<full path of sample 3 fastq file 1>	<full path of sample 3 fastq file 2>
.
.
<full path of sample n fastq file 1>	<full path of sample n fastq file 2>
stringMLST.py --predict -l <full path to list file> -p --prefix <prefix for the database> -k <k-mer size> -o <output file name>
stringMLST.py --predict -l <full path to list file> -s --prefix <prefix for the database> -k <k-mer size> -o <output file name>
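For paired-end list mode, the list file can be generated instead of typed. The sketch below is one possible approach: the helper name `paired_list_lines` is hypothetical, and it assumes read files end in `_1`/`_2` followed by `.fastq` or `.fq` (optionally `.gz`). It groups paths by sample and emits one tab-separated line per complete pair.

```python
import os
import re
from collections import defaultdict
from typing import Dict, Iterable, List

_PAIR = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.(?:fastq|fq)(?:\.gz)?$")


def paired_list_lines(paths: Iterable[str]) -> List[str]:
    """Return 'read1<TAB>read2' lines, one per sample with both reads present."""
    pairs: Dict[str, Dict[str, str]] = defaultdict(dict)
    for path in paths:
        match = _PAIR.match(os.path.basename(path))
        if match:
            pairs[match.group("sample")][match.group("read")] = path

    lines = []
    for sample in sorted(pairs):
        reads = pairs[sample]
        # Skip samples where only one mate was found.
        if "1" in reads and "2" in reads:
            lines.append(f"{reads['1']}\t{reads['2']}")
    return lines
```

Writing `"\n".join(lines)` to a file gives the paired-end list format shown above; since the list file expects full paths, resolve them with `os.path.abspath` before calling the helper.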
-C|--coverage flag and -z|--fuzzy threshold option.
The -z|--fuzzy threshold (default = 300) assigns significance to the difference between supports. Much like SRST2 and Torsten Seemann's popular pubMLST script, stringMLST reports potentially new or closely supported alleles in allele* syntax. For high-coverage reads, we suggest a fuzzy threshold >500; for low-coverage reads, a fuzzy threshold of <50.
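Alleles reported in allele* syntax are worth reviewing after a run. As a rough sketch, the hypothetical helper `fuzzy_calls` below scans a results table for starred values. It assumes the output is tab-separated with a `Sample` column first and an `ST` column last, as in the quick-start example; the starred value in the demo row is invented for illustration and is not a real result.

```python
from typing import Dict, List


def fuzzy_calls(output_text: str) -> Dict[str, List[str]]:
    """Map each sample to the loci whose reported allele ends with '*'."""
    lines = [line for line in output_text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split("\t")
    flagged: Dict[str, List[str]] = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        loci = [
            name
            for name, value in row.items()
            if name not in ("Sample", "ST") and value.endswith("*")
        ]
        if loci:
            flagged[row["Sample"]] = loci
    return flagged


# Hypothetical table: 'sampleB' and its starred adk call are made up.
demo = "\n".join(
    [
        "\t".join(["Sample", "abcZ", "adk", "ST"]),
        "\t".join(["sampleA", "231", "180", "10174"]),
        "\t".join(["sampleB", "231", "180*", "0"]),
    ]
)
print(fuzzy_calls(demo))  # prints {'sampleB': ['adk']}
```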
Coverage mode requires bedtools, bwa, and samtools in your PATH and an additional Python module, pyfaidx (see the dependencies section for installation information). Coverage mode by default disables display of fuzzy alleles in favor of sequence coverage information made by mapping potential reads to the putative allele sequence. In our testing, coverage mode slightly increases prediction time (<1 sec increase per sample).
stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -t -o <output file name>
stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -r -o <output file name>