This exercise was generated from a Jupyter notebook. You can download the notebook here.
In this problem, you will play around with the command line on your machine and get more familiar with it.
a) Let's play around with some options for the ls
command. First cd
into a directory that has some interesting files in it (like ~/bootcamp/command_line_tutorial
). Try the following.
ls -F
ls -G # Might not be as cool with Git Bash on Windows
ls -l
ls -lh
ls -lS
ls -FGLh
You should be able to infer what these different options do, but you should talk with the TAs as well.
Normally, files that begin with a dot (.
) are omitted when listing things. They are also generally omitted when you use your OS's GUI-based file handling system (like Finder on Macs). To see them, use ls -a
. So, cd
into your home directory (you remember how to do that, right?), and then do
ls -a
b) The nuclear option to delete everything in a directory is rm -rf
. The r
means to delete recursively, and the f
means to "force" deletion. I was going to give you an exercise that uses the nuclear option, but I'm not going to do that. So, just forget I said anything. For this part of the problem, I want you to discuss with your neighbor when the nuclear option might be used, and what needs to be in place before exercising it.
c) Try doing this:
ls /
What is /
? Try cd
-ing there and seeing what's in there. Do not delete anything!
d) Make a copy of the ~/bootcamp
directory called ~/bootcamp_sandbox
. Make sure to copy over all files. Why might you want do do this?
This problem more or less consisted of messing around with the command line. It is not a bad idea to have sandbox
directories around to play with where you can't do any damage. However, as soon as you do anything you might want to save or keep track of, take it out of the sandbox.
a) The trick here is to do what we did in Lesson 7, except use [::-1]
indexing instead of the reversed()
function.
def complement_base(base):
"""Returns the Watson-Crick complement of a base."""
if base == 'A' or base == 'a':
return 'T'
elif base == 'T' or base == 't':
return 'A'
elif base == 'G' or base == 'g':
return 'C'
else:
return 'G'
def reverse_complement(seq):
"""Compute reverse complement of a sequence."""
# Initialize reverse complement
rev_seq = ''
# Loop through and populate list with reverse complement
for base in seq:
rev_seq += complement_base(base)
return rev_seq[::-1]
And we'll do a quick test with the same sequence as in lesson 7.
reverse_complement('GCAGTTGCA')
Bingo!
b) We can eliminate the for
loop by using the sub()
method of strings.
def reverse_complement(seq):
"""Compute reverse complement of a sequence."""
# Initialize rev_seq to a lowercase seq
rev_seq = seq.lower()
# Substitute bases
rev_seq = rev_seq.replace('t', 'A')
rev_seq = rev_seq.replace('a', 'T')
rev_seq = rev_seq.replace('g', 'C')
rev_seq = rev_seq.replace('c', 'G')
return rev_seq[::-1]
And let's give it a test!
reverse_complement('GCAGTTGCA')
Write a function that takes two sequences and returns the longest common substring. A substring is a contiguous portion of a string. For example:
Is substring of ATGCATAT
:
TGCA
T
TAT
Is not a substring of ATGCATAT
:
AGCA # Skipped T
CCATA # Added another C
Hello, world. # Has nothing to do with the input sequence
There may be more than one longest common substring; you only need to return one of them.
This is actually an important problem, and there are clever algorithms to solve it. We will take a more brute force approach. We will generate all substrings of the shorter of the two strings, and find which ones are in the other.
def longest_common_substring(s1, s2):
"""Return one of the longest common substrings"""
# Make sure s1 is the shorter
if len(s1) > len(s2):
s1, s2 = s2, s1
# Start with the entire sequence and shorten
substr_len = len(s1)
while substr_len > 0:
# Try all substrings
for i in range(len(s1) - substr_len + 1):
if s1[i:i+substr_len] in s2:
return s1[i:i+substr_len]
substr_len -= 1
# If we haven't returned, there is no common substring
return ''
Let's try our function out.
longest_common_substring('ACGTGGAAAGCCA', 'GTACACACGTTTTGAGAGACAC')
In this problem, we will write a function that takes an RNA sequence and an RNA secondary structure and decides if the secondary structure is possible given the sequence. Remember, single stranded RNA can fold back on itself and for base pairs. An RNA secondary structure is simply the list of base pairs that are present. We will represent the base pairs in dot-parentheses notation. For example, a sequence/secondary structure pair would be
0123456789
GCATCTATGC
(((....)))
For convenience of discussion, I have labeled the indices of the bases on the top row. In this case, base 0
, a G
, pairs with base 9
, a C
. Base 1
pairs with base 8
, and base 2
pairs with base 7
. Bases 3
, 4
, 5
, and 6
are unpaired. (This structure is aptly called a "hairpin.")
So, I hope the dot-parentheses notation is clear. An open parenthesis is paired with the parenthesis that closes it. Dots are unpaired.
So, the goal of our function is to check all base pairs present in a secondary structure and see if they are with G-C
, A-U
, or (optionally) G-U
.
a) Write a function to make sure that the number of closed parentheses is equal to the number of open parentheses, a requirement for a valid secondary structure. It should return True
if the parentheses are valid and False
otherwise.
b) Write a function that converts the dot-parens notation to a tuple of 2-tuples representing the base pairs. We'll call this function dotparen_to_bp()
. An example input/output of this function would be:
dotparen_to_bp('(((...)))')
((0, 9), (1, 8), (2, 7))
Hint: You might find the pop()
method of lists useful.
c) Because of sterics, the minimal length of a hairpin loop is three bases. A hairpin loop is a series of unpaired bases that are closed by a base pair. For example, the secondary structure (.(....).)
has a single hairpin loop of length 4. So, the structure ((((..))))
is not valid because it has a hairpin loop of only two bases.
Write a function that verifies that a list of base pairs (as outputted by dotparen_to_bp()
) satisfies the hairpin requirement.
d) Now write your validator function. The function definition should look like this:
def rna_ss_validator(seq, sec_struc, wobble=True):
It should return True
if the sequence is commensurate with a valid secondary structure and False
otherwise. The wobble
keyword argument is True
if we allow wobble pairs (G
paired with U
). Here are some expected results:
Returns True
:
rna_ss_validator('GCAUCUAUGC', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(.(....).)')
Returns False
:
rna_ss_validator('GCAUCUACGC', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(((....)))', wobble=False)
rna_ss_validator('GCAUCUAUGU', '(.(....)).')
rna_ss_validator('GCCCUUGGCA', '(.((..))).')
a) The validation is simple. We just need to make sure the number of open and closed parentheses are equal.
def parens_count(struc):
"""
Ensures there are equal number of open and closed parentheses
in structure.
"""
return struc.count('(') == struc.count(')')
Let's give it a try.
print(parens_count('(((..(((...)).))))'))
print(parens_count('(((..(((...)).)))'))
b) As we scan a dot-parens structure from left to right, we can keep a list of the positions of an open parenthesis. Whenever we encounter a closed one, we have closed the last open one we added. So, we can just scan through the dot-parens string and pop out base pairs. If this procedure fails, we know that there was an error in the input structure (i.e., a closed parenthesis appeared without a corresponding open one before it).
def dot_parens_to_bp(struc):
"""
Convert a dot-parens structure to a list of base pairs.
Return False if the structure is invalid.
"""
if not parens_count(struc):
print('Error in input structure.')
return False
# Initialize list of open parens and list of base pairs
open_parens = []
bps = []
# Scan through string
for i, x in enumerate(struc):
if x == '(':
open_parens.append(i)
elif x == ')':
if len(open_parens) > 0:
bps.append((open_parens.pop(), i))
else:
print('Error in input structure.')
return False
# Return the result as a tuple
return tuple(sorted(bps))
Let's try it on some legitimate sequences and on some bad ones.
# Good structure
dot_parens_to_bp('..(((...)))(((((....)))).)..')
# Good structure
dot_parens_to_bp('(((..(((...)).))))')
# Bad structure
dot_parens_to_bp('((....)))(')
dot_parens_to_bp('())....))))')
c) It is quite easy to detect short hairpins once we have a list of base pairs. We just need to make sure the difference in index of any pair of paired bases is not less than three.
def hairpin_check(bps):
"""Check to make sure no hairpins are too short."""
for bp in bps:
if bp[1] - bp[0] < 4:
print('A hairpin is too short.')
return False
# Everything checks out
return True
d) Most everything is in place. We just need to check the sequence.
def rna_ss_validator(seq, sec_struc, wobble=True):
"""Validate and RNA structure"""
# Convert structure to base pairs
bps = dot_parens_to_bp(sec_struc)
# If this failed, the structure was invalid
if not bps:
return False
# Do the hairpin check
if not hairpin_check(bps):
return False
# Possible base pairs
if wobble:
ok_bps = ('gc', 'cg', 'au', 'ua', 'gu', 'ug')
else:
ok_bps = ('gc', 'cg', 'au', 'ua')
# Check complementarity
for bp in bps:
bp_str = (seq[bp[0]] + seq[bp[1]]).lower()
if bp_str not in ok_bps:
print('Invalid base pair.')
return False
# Everything passed
return True
Let's test it on the test cases from the problem statement.
print('Should be True:')
print(rna_ss_validator('GCAUCUAUGC', '(((....)))'))
print(rna_ss_validator('GCAUCUAUGU', '(((....)))'))
print(rna_ss_validator('GCAUCUAUGU', '(.(....).)'))
print('\nShould be False:')
print(rna_ss_validator('GCAUCUACGC', '(((....)))'), '\n')
print(rna_ss_validator('GCAUCUAUGU', '(((....)))', wobble=False), '\n')
print(rna_ss_validator('GCAUCUAUGU', '(.(....)).'), '\n')
print(rna_ss_validator('GCCCUUGGCA', '(.((..))).'),'\n')
Looks good!