Exercise 3.3: RNA secondary structure validator


In this problem, we will write a function that takes an RNA sequence and an RNA secondary structure and decides whether the secondary structure is possible given the sequence. Remember, single-stranded RNA can fold back on itself and form base pairs. An RNA secondary structure is simply the list of base pairs that are present. We will represent the base pairs in dot-parentheses notation. For example, a sequence/secondary structure pair would be

0123456789
GCAUCUAUGC
(((....)))

For convenience of discussion, I have labeled the indices of the bases on the top row. In this case, base 0, a G, pairs with base 9, a C. Base 1 pairs with base 8, and base 2 pairs with base 7. Bases 3, 4, 5, and 6 are unpaired. (This structure is aptly called a “hairpin.”)

I hope the dot-parentheses notation is clear. Each open parenthesis marks a base that is paired with the base at the parenthesis that closes it. Dots mark unpaired bases.

So, the goal of our function is to check every base pair present in a secondary structure and verify that it is G-C, A-U, or (optionally) G-U.

a) Write a function to make sure that the number of closing parentheses is equal to the number of opening parentheses, a requirement for a valid secondary structure. It should return True if the parentheses are valid and False otherwise.
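A minimal sketch of one way to approach part (a) follows. The function name parens_balanced is a placeholder chosen for illustration; the exercise does not prescribe one. It does exactly what the problem statement asks and nothing more: it compares the two counts.

```python
def parens_balanced(sec_struc):
    """Return True if sec_struc has as many ')' as '(' characters.

    parens_balanced is a hypothetical name; the exercise leaves it open.
    """
    # str.count() tallies non-overlapping occurrences of the substring
    return sec_struc.count('(') == sec_struc.count(')')


print(parens_balanced('(((....)))'))  # three of each
print(parens_balanced('((.....)))'))  # two open, three close
```

Note that equal counts are necessary but not sufficient: a string such as ')(' passes this test even though nothing in it can be paired. Catching that case requires tracking the order of the parentheses, which the conversion in part (b) does naturally.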

b) Write a function that converts the dot-parentheses notation to a tuple of 2-tuples representing the base pairs. We’ll call this function dotparen_to_bp(). An example input/output of this function would be:

dotparen_to_bp('(((....)))')

((0, 9), (1, 8), (2, 7))

Hint: You should look at methods that are available for lists. You might find the append() and pop() methods useful.
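The hint suggests using a list as a stack: append() the index of each open parenthesis, and pop() the most recent one whenever a close parenthesis appears. The sketch below is one possible reading of that hint, not the only solution. Raising ValueError on unmatched parentheses and treating every character other than a parenthesis as unpaired are assumptions made here; the exercise does not say how to handle malformed input.

```python
def dotparen_to_bp(sec_struc):
    """Convert dot-parentheses notation to a tuple of (i, j) base pairs."""
    open_stack = []  # indices of '(' still waiting for a partner
    bps = []

    for i, char in enumerate(sec_struc):
        if char == '(':
            open_stack.append(i)
        elif char == ')':
            # Assumption: unmatched parentheses raise an error
            if not open_stack:
                raise ValueError(f"unmatched ')' at index {i}")
            # The most recently opened parenthesis is the partner
            bps.append((open_stack.pop(), i))
        # Assumption: any other character counts as unpaired

    if open_stack:
        raise ValueError(f"unmatched '(' at index {open_stack[-1]}")

    # Pairs are discovered innermost first; sort by opening index
    return tuple(sorted(bps))


print(dotparen_to_bp('(((....)))'))  # ((0, 9), (1, 8), (2, 7))
```

The final sort is only there to reproduce the ordering shown in the example output; the stack itself yields the pairs from the inside out.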

c) Because of steric constraints, the minimal length of a hairpin loop is three bases. A hairpin loop is a series of unpaired bases that are closed by a base pair. For example, the secondary structure (.(....).) has a single hairpin loop of length 4. By contrast, the structure ((((..)))) is not valid because it has a hairpin loop of only two bases.

Write a function that verifies that a tuple of base pairs (as output by dotparen_to_bp()) satisfies the minimal hairpin length requirement.
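One way to sketch part (c) relies on a small observation: in a properly nested structure, a base pair (i, j) that encloses fewer than three bases either closes a too-short hairpin loop itself or encloses a pair that does. So it is enough to check that every pair encloses at least three bases. The name hairpin_check and the min_loop parameter are placeholders introduced here, and the sketch assumes each pair is given as (i, j) with i < j.

```python
def hairpin_check(bps, min_loop=3):
    """Return True if every base pair encloses at least min_loop bases.

    hairpin_check and min_loop are hypothetical names. Assumes each
    pair is (i, j) with i < j and that the pairs are properly nested.
    """
    # j - i - 1 is the number of bases strictly between i and j
    return all(j - i - 1 >= min_loop for i, j in bps)


# Base pairs of '(.(....).)': innermost pair encloses four bases
print(hairpin_check(((0, 9), (2, 7))))
# Base pairs of '((((..))))': innermost pair encloses only two bases
print(hairpin_check(((0, 9), (1, 8), (2, 7), (3, 6))))
```

The first call should pass and the second should fail, matching the two structures discussed above.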

d) Now write your validator function. The function definition should look like this:

def rna_ss_validator(seq, sec_struc, wobble=True):

It should return True if the secondary structure is valid and compatible with the sequence, and False otherwise. The wobble keyword argument is True if we allow wobble pairs (G paired with U). Here are some expected results:

Returns True:

rna_ss_validator('GCAUCUAUGC', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(.(....).)')

Returns False:

rna_ss_validator('GCAUCUACGC', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(((....)))', wobble=False)
rna_ss_validator('GCAUCUAUGU', '(.(....)).')
rna_ss_validator('GCCCUUGGCA', '(.((..))).')
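To tie the pieces together, here is a hedged sketch of a validator that reproduces the expected results listed above. Purely to keep the example self-contained, it folds the parenthesis matching, hairpin length, and base identity checks into a single pass instead of calling separate helper functions. Rejecting inputs of unequal length, rejecting symbols other than parentheses and dots, and expecting an uppercase sequence are assumptions made here, not requirements stated in the exercise.

```python
def rna_ss_validator(seq, sec_struc, wobble=True):
    """Return True if sec_struc is a valid secondary structure for seq."""
    # Assumption: sequence and structure must have the same length
    if len(seq) != len(sec_struc):
        return False

    # Allowed pairs as unordered sets, so G-C and C-G are the same
    # Assumption: seq is uppercase RNA (A, C, G, U)
    ok_pairs = {frozenset('GC'), frozenset('AU')}
    if wobble:
        ok_pairs.add(frozenset('GU'))

    open_stack = []  # indices of '(' still waiting for a partner
    for j, char in enumerate(sec_struc):
        if char == '(':
            open_stack.append(j)
        elif char == ')':
            if not open_stack:
                return False  # ')' with nothing to close
            i = open_stack.pop()
            if j - i - 1 < 3:
                return False  # pair encloses fewer than three bases
            if frozenset((seq[i], seq[j])) not in ok_pairs:
                return False  # bases cannot pair
        elif char != '.':
            return False  # assumption: unknown symbols are invalid

    # Any '(' left on the stack was never closed
    return not open_stack


print(rna_ss_validator('GCAUCUAUGC', '(((....)))'))
print(rna_ss_validator('GCAUCUAUGU', '(((....)))', wobble=False))
```

The first call should return True and the second False, in line with the expected results above.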