# Lesson 32: Basic Local Alignment Search Tool (BLAST)¶

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import Bio.Entrez
import Bio.SeqIO
import Bio.Blast.NCBIWWW
import Bio.SearchIO

/Users/saladi/anaconda3/lib/python3.4/site-packages/Bio/SearchIO/__init__.py:211: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
BiopythonExperimentalWarning)


## Background and Theory¶

In the process of understanding our place in the world, we often seek to ask whether DNA or proteins that we know or work with bear some resemblance to other molecules in the universe or, a perhaps a specific subset, of the sequence space. This problem is potentially enormous given the size of the sequence space (billions of letters) versus the size of the query (hundreds of letters).

The most widely used algorithm/solution to this problem is BLAST.

Given the size of the sequence space and the potential matches, we need an intelligent way to search. BLAST provides an efficient algorithm to do this explained below. See this blog post for a nice retrospective on the development of BLAST.

### Algorithm¶

1. Filter the query sequence for regions of low complexity. NCBI describes these regions here as :

Regions with low-complexity sequence have an unusual composition that can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits.

In BLAST searches performed without a filter, high scoring hits may be reported only because of the presence of a low-complexity region. Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.

For amino acid queries this compositional bias is determined by the SEG program (Wootton and Federhen, 1996). For nucleotide queries it is determined by the DustMasker program (Morgulis, et al., 2006).

2. The query sequence is transformed into subsequences of length $k$ "k-words". The default $k$ is 3 for protein queries and 11 for nucleotide queries.

3. Each k-word is compared to all of the k-words found in the database of sequences via a substitution matrix. The high-scoring database words for each k-word are organized into an efficient search tree. This is performed for each of the k-words in the query sequence.

How do we do comparisions? This is an involved discussion of substitution matrices. We may do this in class, but won't delve into it here.

4. Sequences with exact matches of high scoring words are then retrieved from the database. Since each high-scoring word is unique and hashable, you can imagine a data structure/organization like a Python dictionary can be used to retrieve these matches with low computational cost.

5. For sequences with hits of the each high scoring word, the alignment with the query sequence is extended in either direction until the overall score starts to decrease. These are called high scoring pairs (HSPs).
6. HSPs with a score above a given cutoff are retained. This cutoff is empirically determined by computing alignment scores of many random sequences.
7. The significance of each high scoring pair is computed via the Gumbel extreme value distribution. Ungapped Smith-Waterman alignment scores between two random sequences follow the Gumbel EVD; this has not been proven for the case of gapped alignments (whether step 6 uses a gapped or ungapped alignment depends on parameters specified to BLAST).

$$p \left( S \ge x \right)=1-\exp\left( -e^{ -\lambda \left( x-\mu \right) } \right)$$
where
$$\mu =\frac{ \log \left( Km'n' \right) }{\lambda }\;$$

The probability $p$ that a score $S$ is greater than $x$ is a function of parameters $K$, $\lambda$, $m'$, and $n'$. $\lambda$ and $K$ depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies) of the database and are found by fitting to the Grumbel EVD. $m'$ and $n'$ are the effective lengths of the query and HSP sequences, i.e. adjusting for the "edge effect" that an alignment at the end of a sequence is less likely to have enough letters to build an optimal alignment. $H$ is the average expected score per aligned pair of residues in an alignment of two random sequences.

$$m'\approx m-\frac{\ln Kmn }{H}$$$$n'\approx n-\frac{\ln Kmn}{H}$$

What we'd like to know, however, is how likely the alignment generated was found by chance. To do so, we want to find the expectation, or the average outcome, of the distribution. The expect score $E$ does just that; it gives the number of times that an unrelated sequence in the database of size $D$ would obtain a score $S$ higher than $x$.

$$E\approx 1-e^{-p\left( s>x \right)D}$$

When $p<0.1$, $E$ is approximated by the Poisson distribution as

$$E\approx pD$$

The expect score $E$ is often referred to as the "E-value" and the score $s$ as the "bit score". Critically:

• the score $s$ is independent of the size of the database and has everything to do with the quality of the alignment
• the expect score is completely dependent on the size of the database
8. HSPs are combined into longer alignments where possible and the scores are recomputed.
9. Finally, those alignments with an E-value less than user-specified threshold are reported.

Remember, it is the bit score that is independent of the database size, so, an important point:

When comparing hits from queries performed on different databases, always use the **bit-score** not the **E-value**!

### BLAST programs:¶

Query Database Name Description
Nucleotide Nucleotide blastn Nucleotide against nucleotide database
Protein Protein blastp Protein against protein database
Protein Protein Position-Specific Iterative BLAST (PSI-BLAST) Useful to find distant relatives of a protein. A list of closely related proteins is created via an initial BLASTp search and then aligned into a "profile" sequence, which summarises the significant features and filters out less significant ones. This profile is then run against the database again producing a larger set of proteins. This larger set is used to construct another profile. The process is "iterated". PSI-BLAST is much more sensitive in detecting up distant relationships than a standard BLASTp.
Nucleotide (6-frame translation) Protein blastx All six-frame frames of the nucleotide query sequence, i.e. +1,+2,+3,-1,-2,-3, are queried against a protein sequence database.
Nucleotide (6-frame translation) Nucleotide (6-frame translation) tblastx All translations of the query sequences are compared to all translations of the sequence database. tblastx is to useful in detecting very distant relationships between nucleotide sequences. Slowest of the BLAST family.
Protein Nucleotide (6-frame translation) tblastn Compares a protein query against all translations of a nucleotide sequence database. Good for searching a draft genome for proteins of interest.
Any (Many query sequences) Any (One database) megablast When querying large numbers of input sequences via the same sequence database, there are a number of steps that can be peformed together and decrease overall computational time. "megablast" does just this and is much faster than running standard BLAST multiple times. Simply put, it concatenates the query sequences to form a single long query, searches this new query against the BLAST database, and then analyzes the search results to generate individual alignments and scores.

### BLAST flavors:¶

Many software pacakges have been written for BLAST exploiting fine and/or coarse-grained parallelism in the algorithm. For example

• Instead of comparing one k word to those in a database sequentiallially, these operations can be processed in parallel
• Instead of comparing high scoring words sequentially, this can be done in parallel
• Instead of searching a single large database, it can be broken up and done in parallel
• Instead of aligning many similar sequences through different HSPs, similar sequences can be collapsed into a smaller database that is instead searched with the query sequence

Primarily, depending on the level/type of parallellism exploited together with special hardward architecture, several software packages have been written. Some take advantage of multiple processing cores on a single machine (PTHREADS) or multiple nodes in a cluster (MPI) and others leverage special hardware like graphical processing units (GPUs) and field-programmable gate arrays (FPGAs). See a more complete list here.

Compressively-accelerated BLAST (CaBLAST) has a pretty cool approach to increasing the speed of BLAST searches. Sequence databases are extremely redundant (think about gene homologs and genome sequencing of similar organisms or substrains of bacteria). CaBLAST restructures problems such that queries are exceuted against a precomputed, redundancy-minimized database. After the BLAST is peformed, resulting HSPs are compared against sequences within a the semi-redundant sequence cluster.

## Executing a BLAST Query through NCBI Entrez¶

Given a protein of interest, one potential question is that we might want to figure out whether it has homolog of known structure. BLAST is well-suited for this problem!

Let's look at the E. coli mechanosensitive channels to see which ones have known structures. We can query Uniprot for proteins with "mechanosensitive" in their name and from the E. coli K12 genome.

I've selected a subset here for our analysis.

Uniprot Accession Entry Name Description
P0C0S1 MSCS_ECOLI Small-conductance mechanosensitive channel
P0A742 MSCL_ECOLI Large-conductance mechanosensitive channel
P39285 MSCM_ECOLI Miniconductance mechanosensitive channel
P77338 MSCK_ECOLI Mechanosensitive channel MscK (gated by K${}^{+}$)

Let's first retrieve an example Genbank entry from NCBI corresponding to these accession number:

In [1]:
# Always tell NCBI who you are
Bio.Entrez.tool = 'BLAST-for-PDB-Entry'

id_list = ('P0C0S1', 'P0A742', 'P39285', 'P77338')
handle_list = {}

# example record retrieval
handle = Bio.Entrez.efetch(db='protein', id=id_list[0], rettype='gb', retmode='text')

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-28072d8e9a40> in <module>()
1 # Always tell NCBI who you are
3 Bio.Entrez.tool = 'BLAST-for-PDB-Entry'
4
5 id_list = ('P0C0S1', 'P0A742', 'P39285', 'P77338')

NameError: name 'Bio' is not defined

Here, let's retrieve the records for each of the identifiers.

In [5]:
record_list = {}

# Fetch all the handles, read each into a seq_record
for thisid in id_list:
handle = Bio.Entrez.efetch(db='protein', id=thisid, rettype='gb', retmode='text')
handle.close()
record_list[thisid] = seq_record

print(record_list)

{'P0A742': SeqRecord(seq=Seq('MSIIKEFREFAMRGNVVDLAVGVIIGAAFGKIVSSLVADIIMPPLGLLIGGIDF...NRS', IUPACProtein()), id='P0A742.1', name='MSCL_ECOLI', description='RecName: Full=Large-conductance mechanosensitive channel.', dbxrefs=[]), 'P77338': SeqRecord(seq=Seq('MTMFQYYKRSRHFVFSAFIAFVFVLLCQNTAFARASSNGDLPTKADLQAQLDSL...AVG', IUPACProtein()), id='P77338.1', name='MSCK_ECOLI', description='RecName: Full=Mechanosensitive channel MscK; AltName: Full=Potassium efflux system KefA; Flags: Precursor.', dbxrefs=[]), 'P39285': SeqRecord(seq=Seq('MRLIITFLMAWCLSWGAYAATAPDSKQITQELEQAKAAKPAQPEVVEALQSALN...GSL', IUPACProtein()), id='P39285.3', name='MSCM_ECOLI', description='RecName: Full=Miniconductance mechanosensitive channel MscM; Flags: Precursor.', dbxrefs=[]), 'P0C0S1': SeqRecord(seq=Seq('MEDLNVVDSINGAGSWLVANQALLLSYAVNIVAALAIIIVGLIIARMISNAVNR...KAA', IUPACProtein()), id='P0C0S1.1', name='MSCS_ECOLI', description='RecName: Full=Small-conductance mechanosensitive channel.', dbxrefs=[])}


Given the sequence, Biopython has a method to query NCBI's Web Blast service that returns an XML file. There are a huge number of options! Read more about each in the documentation here.

qblast(program, database, sequence, auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None)

If you find that Bio.Blast.NCBIWWW.qblast() does not give the same results as the NCBI BLAST website, it is likely becuase the default search parameters/options are likely different between the web version and qblast. Check parameters like gap penalties and expectation threshold.

We are only required to specify the program, database, and sequence to run a simple query.

In [6]:
result_handle = Bio.Blast.NCBIWWW.qblast('blastp', 'pdb', record_list['P77338'].seq)

In [13]:
record_list.keys()

Out[13]:
dict_keys(['P0A742', 'P77338', 'P39285', 'P0C0S1'])

I like to save the results of my BLAST query to file before I parse them in case I need to go back and start the parsing over.

In [7]:
with open('blastp-pdb-P77338.xml', 'w') as save_file:

result_handle.close()


Check that the file was created ok:

In [ ]:
!head blastp-pdb-P77338.xml


Now, let's do this for the entire set of identifiers.

In [ ]:
# perform all blast queries and write each to file
for thisid in record_list:
# perform blast query
result_handle = Bio.Blast.NCBIWWW.qblast('blastp', 'pdb', record_list[thisid].seq)

# write query to file
filename = "blast-pdb-{recordid}.xml".format(recordid=thisid)
with open(filename, 'w+') as save_file:

result_handle.close()


Biopython provides two ways to parse the output of BLAST queries. One is specalized for BLAST-XML output Bio.Blast.NCBIXML and the other, Bio.SearchIO is more general for all sorts of sequence search methods including BLAST but also for other methods, e.g. HMMER and Exonerate. While Biopython issues a warning about using Bio.SearchIO since it's "an experimental submodule", it's been stable for 2 years and any further changes should be minor.

We will use it here since the same code could be adapted for future work with other search methods.

In [ ]:
!ls *.xml

In [11]:
blast_qresult = Bio.SearchIO.read('blastp-pdb-P77338.xml', 'blast-xml')
print(blast_qresult)

Program: blastp (2.2.32+)
Query: unnamed (1120)
protein product
Target: pdb
Hits: ----  -----  ----------------------------------------------------------
#  # HSP  ID + description
----  -----  ----------------------------------------------------------
0      1  gi|873090862|pdb|5AJI|A  Chain A, Mscs D67r1 High Resol...
1      1  gi|400977297|pdb|4AGE|A  Chain A, Mtssl Spin Labeled D6...
2      1  gi|126031460|pdb|2OAU|A  Chain A, Mechanosensitive Chan...
3      1  gi|400977304|pdb|4AGF|A  Chain A, Mtssl Spin Labeled L1...
4      1  gi|195927344|pdb|2VV5|A  Chain A, The Open Structure Of...
5      1  gi|449802644|pdb|4HW9|A  Chain A, Crystal Structure Of ...
6      1  gi|410562602|pdb|3UDC|A  Chain A, Crystal Structure Of ...
7      1  gi|410562584|pdb|3T9N|A  Chain A, Crystal Structure Of ...
8      1  gi|399124994|pdb|3ZUG|A  Chain A, E268d Mutant Of Fad S...
9      1  gi|747155358|pdb|4U0V|A  Chain A, Crystal Structure Of ...
10      1  gi|283806768|pdb|2WV0|A  Chain A, Crystal Structure Of ...


At first glance, it looks like there are many hits of proteins with known structure with homology to MscK. Let's parse these hits to determine whether the hits are bonafide -- do they span the whole MscK protein sequence or only one/few domains?

To do so, let's examine the structure of Bio.SearchIO objects. SearchIO objects hold multiple hit objects each of which holds a hsp ("High Scoring Pair") objects.

In [12]:
for blast_hit in blast_qresult:
for hsp in blast_hit:
print(hsp)

      Query: unnamed protein product
Hit: gi|873090862|pdb|5AJI|A Chain A, Mscs D67r1 High Resolution
Query range: [884:1102] (0)
Hit range: [68:284] (0)
Quick stats: evalue 1.9e-17; bitscore 85.11
Fragments: 1 (224 columns)
Query - ITTILNYIIIAVGAMTVFGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFE~~~EKGDE
++ ++ Y IIA   +   G +GV    +  +  A  + +G  LQ    N  +G++++  ~~~ K D+
Hit - LSALVRYGIIAFTLIAALGRVGVQTASVIAVLGAAGLAVGLALQGSLSNLAAGVLLVMF~~~VKEDK
Query: unnamed protein product
Hit: gi|400977297|pdb|4AGE|A Chain A, Mtssl Spin Labeled D67c Mutant ...
Query range: [884:1102] (0)
Hit range: [68:284] (0)
Quick stats: evalue 2e-17; bitscore 85.11
Fragments: 1 (224 columns)
Query - ITTILNYIIIAVGAMTVFGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFE~~~EKGDE
++ ++ Y IIA   +   G +GV    +  +  A  + +G  LQ    N  +G++++  ~~~ K D+
Hit - LSALVRYGIIAFTLIAALGRVGVQTASVIAVLGAAGLAVGLALQGSLSNLAAGVLLVMF~~~VKEDK
Query: unnamed protein product
Hit: gi|126031460|pdb|2OAU|A Chain A, Mechanosensitive Channel Of Sma...
Query range: [884:1102] (0)
Hit range: [88:304] (0)
Quick stats: evalue 2.9e-17; bitscore 84.73
Fragments: 1 (224 columns)
Query - ITTILNYIIIAVGAMTVFGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFE~~~EKGDE
++ ++ Y IIA   +   G +GV    +  +  A  + +G  LQ    N  +G++++  ~~~ K D+
Hit - LSALVRYGIIAFTLIAALGRVGVQTASVIAVLGAAGLAVGLALQGSLSNLAAGVLLVMF~~~VKEDK
Query: unnamed protein product
Hit: gi|400977304|pdb|4AGF|A Chain A, Mtssl Spin Labeled L124c Mutant...
Query range: [884:1102] (0)
Hit range: [68:284] (0)
Quick stats: evalue 3.2e-17; bitscore 84.34
Fragments: 1 (224 columns)
Query - ITTILNYIIIAVGAMTVFGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFE~~~EKGDE
++ ++ Y IIA   +   G +GV    +  +  A  + +G  LQ    N  +G++ +  ~~~ K D+
Hit - LSALVRYGIIAFTLIAALGRVGVQTASVIAVLGAAGLAVGLALQGSLSNLAAGVLCVMF~~~VKEDK
Query: unnamed protein product
Hit: gi|195927344|pdb|2VV5|A Chain A, The Open Structure Of Mscs
Query range: [884:1102] (0)
Hit range: [68:284] (0)
Quick stats: evalue 6e-17; bitscore 83.57
Fragments: 1 (224 columns)
Query - ITTILNYIIIAVGAMTVFGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFE~~~EKGDE
++ ++ Y IIA   +   G +GV    +  +  A  + +G  LQ    N  +G++++  ~~~ K D+
Hit - LSALVRYGIIAFTLIAALGRVGVQTASVIAVLGAAGLVVGLALQGSLSNLAAGVLLVMF~~~VKEDK
Query: unnamed protein product
Hit: gi|449802644|pdb|4HW9|A Chain A, Crystal Structure Of Helicobact...
Query range: [901:1114] (0)
Hit range: [98:307] (0)
Quick stats: evalue 3.5e-10; bitscore 63.16
Fragments: 1 (219 columns)
Query - FGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFERPVRIGDTVTIGSFSGT~~~YKGDD
+LGV    +  +   + + +   L++   +   G+I++   P + GD + I    G ~~~YK DD
Hit - LSTLGVQTTSIITVLGTVGIAVALALKDYLSSIAGGIILIILHPFKKGDIIEISGLEGK~~~YKDDD
Query: unnamed protein product
Hit: gi|410562602|pdb|3UDC|A Chain A, Crystal Structure Of A Membrane...
Query range: [888:1018] (0)
Hit range: [71:197] (0)
Quick stats: evalue 8.1e-10; bitscore 61.62
Fragments: 1 (130 columns)
Query - LNYIIIAVGAMTVFGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFERPVR~~~DLEKV
+ YII  +   ++     +    L  +A   S+ +GFG Q +  + +SG  I+FE    ~~~D++K+
Hit - VRYIIYFLAGASILKLFNIDMTSLLAVAGIGSLAIGFGAQNLVKDMISGFFIIFEDQFS~~~DVDKI
Query: unnamed protein product
Hit: gi|410562584|pdb|3T9N|A Chain A, Crystal Structure Of A Membrane...
Query range: [888:1018] (0)
Hit range: [71:197] (0)
Quick stats: evalue 8.4e-10; bitscore 61.62
Fragments: 1 (130 columns)
Query - LNYIIIAVGAMTVFGSLGVSWDKLQWLAAALSVGLGFGLQEIFGNFVSGLIILFERPVR~~~DLEKV
+ YII  +   ++     +    L  +A   S+ +GFG Q +  + +SG  I+FE    ~~~D++K+
Hit - VRYIIYFLAGASILKLFNIDMTSLLAVAGIGSLAIGFGAQNLVKDMISGFFIIFEDQFS~~~DVDKI
Query: unnamed protein product
Hit: gi|399124994|pdb|3ZUG|A Chain A, E268d Mutant Of Fad Synthetase ...
Query range: [30:107] (0)
Hit range: [246:319] (0)
Quick stats: evalue 3.1; bitscore 31.96
Fragments: 1 (79 columns)
A+A A S G  PT  D Q  +DS  L++  DL   D  V+ +  D +  ++K D ++  ~~~R   A
Query: unnamed protein product
Hit: gi|747155358|pdb|4U0V|A Chain A, Crystal Structure Of Yvoa From ...
Query range: [989:1052] (0)
Hit range: [105:172] (0)
Quick stats: evalue 4.3; bitscore 31.19
Fragments: 1 (67 columns)
Query - RLINWSLTDTTTRLVIRLGVAYGSDLEKVRKVLLK----AATEHPRVMHEPMPEVFFTAFGASTLDH
RLI++ L D+T  L   LG  + S + K+ +V L      A E   +  E   E+  + F +S  DH
Hit - RLIDYQLIDSTEELAAILGCGHPSSIHKITRVRLANDIPMAIESSHIPFELAGELNESHFQSSIYDH
Query: unnamed protein product
Hit: gi|283806768|pdb|2WV0|A Chain A, Crystal Structure Of The Gntr-H...
Query range: [989:1052] (0)
Hit range: [102:169] (0)
Quick stats: evalue 4.5; bitscore 31.19
Fragments: 1 (67 columns)
Query - RLINWSLTDTTTRLVIRLGVAYGSDLEKVRKVLLK----AATEHPRVMHEPMPEVFFTAFGASTLDH
RLI++ L D+T  L   LG  + S + K+ +V L      A E   +  E   E+  + F +S  DH
Hit - RLIDYQLIDSTEELAAILGCGHPSSIHKITRVRLANDIPMAIESSHIPFELAGELNESHFQSSIYDH


Let's focus in on the query range, hit range, and statistics of the HSP to evaluate whether we have real BLAST hits for each of the query sequences.

In [31]:
print("ID", "length", "hit_start", "hit_end", "hit_span",
"query_start", "query_end", "query_span", sep="\t\t")

for thisid in record_list:
filename = "blastp-pdb-{recordid}.xml".format(recordid=thisid)

ID		length		hit_start		hit_end		hit_span		query_start		query_end		query_span