This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import Bio.Entrez
import Bio.SeqIO
import Bio.Blast.NCBIWWW
import Bio.SearchIO
In the process of understanding our place in the world, we often seek to ask whether DNA or proteins that we know or work with bear some resemblance to other molecules in the universe or, a perhaps a specific subset, of the sequence space. This problem is potentially enormous given the size of the sequence space (billions of letters) versus the size of the query (hundreds of letters).
The most widely used algorithm/solution to this problem is BLAST.
Given the size of the sequence space and the potential matches, we need an intelligent way to search. BLAST provides an efficient algorithm to do this explained below. See this blog post for a nice retrospective on the development of BLAST.
Filter the query sequence for regions of low complexity. NCBI describes these regions here as :
Regions with low-complexity sequence have an unusual composition that can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits.
In BLAST searches performed without a filter, high scoring hits may be reported only because of the presence of a low-complexity region. Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
For amino acid queries this compositional bias is determined by the SEG program (Wootton and Federhen, 1996). For nucleotide queries it is determined by the DustMasker program (Morgulis, et al., 2006).
The query sequence is transformed into subsequences of length $k$ "k-words". The default $k$ is 3 for protein queries and 11 for nucleotide queries.
Each k-word is compared to all of the k-words found in the database of sequences via a substitution matrix. The high-scoring database words for each k-word are organized into an efficient search tree. This is performed for each of the k-words in the query sequence.
How do we do comparisions? This is an involved discussion of substitution matrices. We may do this in class, but won't delve into it here.
Sequences with exact matches of high scoring words are then retrieved from the database. Since each high-scoring word is unique and hashable, you can imagine a data structure/organization like a Python dictionary can be used to retrieve these matches with low computational cost.
The significance of each high scoring pair is computed via the Gumbel extreme value distribution. Ungapped Smith-Waterman alignment scores between two random sequences follow the Gumbel EVD; this has not been proven for the case of gapped alignments (whether step 6 uses a gapped or ungapped alignment depends on parameters specified to BLAST).
$$p \left( S \ge x \right)=1-\exp\left( -e^{ -\lambda \left( x-\mu \right) } \right)$$where $$\mu =\frac{ \log \left( Km'n' \right) }{\lambda }\;$$The probability $p$ that a score $S$ is greater than $x$ is a function of parameters $K$, $\lambda$, $m'$, and $n'$. $\lambda$ and $K$ depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies) of the database and are found by fitting to the Grumbel EVD. $m'$ and $n'$ are the effective lengths of the query and HSP sequences, i.e. adjusting for the "edge effect" that an alignment at the end of a sequence is less likely to have enough letters to build an optimal alignment. $H$ is the average expected score per aligned pair of residues in an alignment of two random sequences.
$$m'\approx m-\frac{\ln Kmn }{H}$$$$n'\approx n-\frac{\ln Kmn}{H}$$What we'd like to know, however, is how likely the alignment generated was found by chance. To do so, we want to find the expectation, or the average outcome, of the distribution. The expect score $E$ does just that; it gives the number of times that an unrelated sequence in the database of size $D$ would obtain a score $S$ higher than $x$.
$$E\approx 1-e^{-p\left( s>x \right)D}$$When $p<0.1$, $E$ is approximated by the Poisson distribution as
$$E\approx pD$$The expect score $E$ is often referred to as the "E-value" and the score $s$ as the "bit score". Critically:
- the score $s$ is independent of the size of the database and has everything to do with the quality of the alignment
- the expect score is completely dependent on the size of the database
Remember, it is the bit score that is independent of the database size, so, an important point:
Query | Database | Name | Description |
---|---|---|---|
Nucleotide | Nucleotide | blastn | Nucleotide against nucleotide database |
Protein | Protein | blastp | Protein against protein database |
Protein | Protein | Position-Specific Iterative BLAST (PSI-BLAST) | Useful to find distant relatives of a protein. A list of closely related proteins is created via an initial BLASTp search and then aligned into a "profile" sequence, which summarises the significant features and filters out less significant ones. This profile is then run against the database again producing a larger set of proteins. This larger set is used to construct another profile. The process is "iterated". PSI-BLAST is much more sensitive in detecting up distant relationships than a standard BLASTp. |
Nucleotide (6-frame translation) | Protein | blastx | All six-frame frames of the nucleotide query sequence, i.e. +1,+2,+3,-1,-2,-3, are queried against a protein sequence database. |
Nucleotide (6-frame translation) | Nucleotide (6-frame translation) | tblastx | All translations of the query sequences are compared to all translations of the sequence database. tblastx is to useful in detecting very distant relationships between nucleotide sequences. Slowest of the BLAST family. |
Protein | Nucleotide (6-frame translation) | tblastn | Compares a protein query against all translations of a nucleotide sequence database. Good for searching a draft genome for proteins of interest. |
Any (Many query sequences) | Any (One database) | megablast | When querying large numbers of input sequences via the same sequence database, there are a number of steps that can be peformed together and decrease overall computational time. "megablast" does just this and is much faster than running standard BLAST multiple times. Simply put, it concatenates the query sequences to form a single long query, searches this new query against the BLAST database, and then analyzes the search results to generate individual alignments and scores. |
Many software pacakges have been written for BLAST exploiting fine and/or coarse-grained parallelism in the algorithm. For example
Primarily, depending on the level/type of parallellism exploited together with special hardward architecture, several software packages have been written. Some take advantage of multiple processing cores on a single machine (PTHREADS) or multiple nodes in a cluster (MPI) and others leverage special hardware like graphical processing units (GPUs) and field-programmable gate arrays (FPGAs). See a more complete list here.
Compressively-accelerated BLAST (CaBLAST) has a pretty cool approach to increasing the speed of BLAST searches. Sequence databases are extremely redundant (think about gene homologs and genome sequencing of similar organisms or substrains of bacteria). CaBLAST restructures problems such that queries are exceuted against a precomputed, redundancy-minimized database. After the BLAST is peformed, resulting HSPs are compared against sequences within a the semi-redundant sequence cluster.
Given a protein of interest, one potential question is that we might want to figure out whether it has homolog of known structure. BLAST is well-suited for this problem!
Let's look at the E. coli mechanosensitive channels to see which ones have known structures. We can query Uniprot for proteins with "mechanosensitive" in their name and from the E. coli K12 genome.
I've selected a subset here for our analysis.
Uniprot Accession | Entry Name | Description |
---|---|---|
P0C0S1 | MSCS_ECOLI | Small-conductance mechanosensitive channel |
P0A742 | MSCL_ECOLI | Large-conductance mechanosensitive channel |
P39285 | MSCM_ECOLI | Miniconductance mechanosensitive channel |
P77338 | MSCK_ECOLI | Mechanosensitive channel MscK (gated by K${}^{+}$) |
Let's first retrieve an example Genbank entry from NCBI corresponding to these accession number:
# Always tell NCBI who you are
Bio.Entrez.email = 'saladi@caltech.edu'
Bio.Entrez.tool = 'BLAST-for-PDB-Entry'
id_list = ('P0C0S1', 'P0A742', 'P39285', 'P77338')
handle_list = {}
# example record retrieval
handle = Bio.Entrez.efetch(db='protein', id=id_list[0], rettype='gb', retmode='text')
print(handle.read())
Here, let's retrieve the records for each of the identifiers.
record_list = {}
# Fetch all the handles, read each into a seq_record
for thisid in id_list:
handle = Bio.Entrez.efetch(db='protein', id=thisid, rettype='gb', retmode='text')
seq_record = Bio.SeqIO.read(handle, 'genbank')
handle.close()
record_list[thisid] = seq_record
print(record_list)
Given the sequence, Biopython has a method to query NCBI's Web Blast service that returns an XML file. There are a huge number of options! Read more about each in the documentation here.
qblast(program, database, sequence, auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None)
If you find that Bio.Blast.NCBIWWW.qblast()
does not give the same results as the NCBI BLAST website, it is likely becuase the default search parameters/options are likely different between the web version and qblast. Check parameters like gap penalties and expectation threshold.
We are only required to specify the program, database, and sequence to run a simple query.
result_handle = Bio.Blast.NCBIWWW.qblast('blastp', 'pdb', record_list['P77338'].seq)
record_list.keys()
I like to save the results of my BLAST query to file before I parse them in case I need to go back and start the parsing over.
with open('blastp-pdb-P77338.xml', 'w') as save_file:
save_file.write(result_handle.read())
result_handle.close()
Check that the file was created ok:
!head blastp-pdb-P77338.xml
Now, let's do this for the entire set of identifiers.
# perform all blast queries and write each to file
for thisid in record_list:
# perform blast query
result_handle = Bio.Blast.NCBIWWW.qblast('blastp', 'pdb', record_list[thisid].seq)
# write query to file
filename = "blast-pdb-{recordid}.xml".format(recordid=thisid)
with open(filename, 'w+') as save_file:
save_file.write(result_handle.read())
result_handle.close()
Biopython provides two ways to parse the output of BLAST queries. One is specalized for BLAST-XML output Bio.Blast.NCBIXML
and the other, Bio.SearchIO
is more general for all sorts of sequence search methods including BLAST but also for other methods, e.g. HMMER and Exonerate. While Biopython issues a warning about using Bio.SearchIO
since it's "an experimental submodule", it's been stable for 2 years and any further changes should be minor.
We will use it here since the same code could be adapted for future work with other search methods.
!ls *.xml
blast_qresult = Bio.SearchIO.read('blastp-pdb-P77338.xml', 'blast-xml')
print(blast_qresult)
At first glance, it looks like there are many hits of proteins with known structure with homology to MscK. Let's parse these hits to determine whether the hits are bonafide -- do they span the whole MscK protein sequence or only one/few domains?
To do so, let's examine the structure of Bio.SearchIO
objects. SearchIO
objects hold multiple hit
objects each of which holds a hsp
("High Scoring Pair") objects.
for blast_hit in blast_qresult:
for hsp in blast_hit:
print(hsp)
Let's focus in on the query range, hit range, and statistics of the HSP to evaluate whether we have real BLAST hits for each of the query sequences.
print("ID", "length", "hit_start", "hit_end", "hit_span",
"query_start", "query_end", "query_span", sep="\t\t")
for thisid in record_list:
filename = "blastp-pdb-{recordid}.xml".format(recordid=thisid)
thisresult = Bio.SearchIO.read(filename, 'blast-xml')
best_hsp = thisresult[0][0]
print(thisid, len(record_list[thisid]), best_hsp.hit_start, best_hsp.hit_end, best_hsp.hit_span,
best_hsp.query_start, best_hsp.query_end, best_hsp.query_span, sep="\t\t")