(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This exercise was generated from a Jupyter notebook. You can download the notebook here.
Write a function that takes two sequences and returns the longest common substring. A substring is a contiguous portion of a string. For example:
Substrings of ATGCATAT:
TGCA
T
TAT
Not substrings of ATGCATAT:
AGCA # Skipped T
CCATA # Added another C
Hello, world. # Has nothing to do with the input sequence
There may be more than one longest common substring; you only need to return one of them.
The call signature of the function should be
longest_common_substring(s1, s2)
Here are some return values you should get.
Function call | Result |
---|---|
longest_common_substring('ATGC', 'ATGCA') | 'ATGC' |
longest_common_substring('GATGCCATGCA', 'ATGCC') | 'ATGCC' |
longest_common_substring('ACGTGGAAAGCCA', 'GTACACACGTTTTGAGAGACAC') | 'ACGT' |
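One way to approach this is sketched below. It is a brute-force search, which assumes the sequences are short enough that efficiency is not a concern: it tries candidate substrings of the shorter sequence from longest to shortest and returns the first one that also appears in the other sequence. This is only one possible approach, not necessarily the intended solution.

```python
def longest_common_substring(s1, s2):
    """Return one longest common substring of s1 and s2 (brute force)."""
    # Draw candidates from the shorter sequence to limit the search
    short, long_ = (s1, s2) if len(s1) <= len(s2) else (s2, s1)

    # Try candidate lengths from longest to shortest
    for length in range(len(short), 0, -1):
        for start in range(len(short) - length + 1):
            candidate = short[start:start + length]
            if candidate in long_:
                return candidate

    # The sequences share no characters at all
    return ''


print(longest_common_substring('GATGCCATGCA', 'ATGCC'))
```

Because the search starts with the longest candidates, the first match found is guaranteed to be a longest common substring. Returning an empty string when nothing matches is a choice made for this sketch.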
In this problem, we will write a function that takes an RNA sequence and an RNA secondary structure and decides if the secondary structure is possible given the sequence. Remember, single-stranded RNA can fold back on itself and form base pairs. An RNA secondary structure is simply the list of base pairs that are present. We will represent the base pairs in dot-parentheses notation. For example, a sequence/secondary structure pair would be
0123456789
GCAUCUAUGC
(((....)))
For convenience of discussion, I have labeled the indices of the bases on the top row. In this case, base 0, a G, pairs with base 9, a C. Base 1 pairs with base 8, and base 2 pairs with base 7. Bases 3, 4, 5, and 6 are unpaired. (This structure is aptly called a "hairpin.")
I hope the dot-parentheses notation is clear. An open parenthesis is paired with the parenthesis that closes it. Dots are unpaired.
So, the goal of our function is to check every base pair present in a secondary structure and verify that each one is a G-C, A-U, or (optionally) G-U pair.
a) Write a function to make sure that the number of closed parentheses is equal to the number of open parentheses, a requirement for a valid secondary structure. It should return True if the parentheses are valid and False otherwise.
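A minimal sketch of the counting check follows. The function name parens_balanced is hypothetical, chosen only for illustration.

```python
def parens_balanced(sec_struc):
    """Return True if sec_struc has equal numbers of '(' and ')'."""
    # str.count() tallies non-overlapping occurrences of a substring
    return sec_struc.count('(') == sec_struc.count(')')


print(parens_balanced('(((....)))'))
print(parens_balanced('((....)))'))
```

Note that counting alone does not detect a closing parenthesis that appears before its partner, as in ')('. Catching that case requires tracking the order of the parentheses as well.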
b) Write a function that converts the dot-parens notation to a tuple of 2-tuples representing the base pairs. We'll call this function dotparen_to_bp(). An example input/output of this function would be:
dotparen_to_bp('(((....)))')
((0, 9), (1, 8), (2, 7))
Hint: You should look at methods that are available for lists. You might find the append() and pop() methods useful.
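Following the hint, one possible sketch uses a list as a stack: append() stores the index of each open parenthesis, and pop() retrieves the most recent one when a closing parenthesis is encountered. The sketch assumes the structure is well-formed; an unmatched closing parenthesis would raise an IndexError.

```python
def dotparen_to_bp(sec_struc):
    """Convert dot-parens notation to a tuple of 2-tuples of base pairs."""
    open_indices = []  # stack of indices of unmatched open parentheses
    base_pairs = []

    for i, char in enumerate(sec_struc):
        if char == '(':
            open_indices.append(i)
        elif char == ')':
            # Assumption: input is well-formed, so the stack is not empty
            base_pairs.append((open_indices.pop(), i))

    # Sort so the pairs are ordered by the index of the opening base
    return tuple(sorted(base_pairs))


print(dotparen_to_bp('(((....)))'))
```

The pairs are found innermost first, so the final sort is what puts them in the order shown in the example output.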
c) Because of sterics, the minimal length of a hairpin loop is three bases. A hairpin loop is a series of unpaired bases that are closed by a base pair. For example, the secondary structure (.(....).) has a single hairpin loop of length 4. So, the structure ((((..)))) is not valid because it has a hairpin loop of only two bases.
Write a function that verifies that a list of base pairs (as output by dotparen_to_bp()) satisfies the minimal hairpin length requirement.
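A short sketch of this check follows. The name hairpin_check and the min_loop parameter are hypothetical. A base pair (i, j) encloses j - i - 1 bases, and any pair that encloses another pair necessarily spans more bases than the inner one, so for properly nested pairs it is enough to test every pair rather than first identifying which pairs close a hairpin loop.

```python
def hairpin_check(bps, min_loop=3):
    """Return True if every base pair (i, j) encloses at least min_loop bases."""
    # j - i - 1 is the number of bases strictly between positions i and j
    return all(j - i - 1 >= min_loop for i, j in bps)


print(hairpin_check(((0, 9), (2, 7))))
print(hairpin_check(((0, 9), (1, 8), (2, 7), (3, 6))))
```

The first call corresponds to the structure (.(....).) and the second to ((((..)))).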
d) Now write your validator function. The function definition should look like this:
def rna_ss_validator(seq, sec_struc, wobble=True):
It should return True if the sequence is commensurate with a valid secondary structure and False otherwise. The wobble keyword argument is True if we allow wobble pairs (G paired with U). Here are some expected results:
Returns True:
rna_ss_validator('GCAUCUAUGC', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(.(....).)')
Returns False:
rna_ss_validator('GCAUCUACGC', '(((....)))')
rna_ss_validator('GCAUCUAUGU', '(((....)))', wobble=False)
rna_ss_validator('GCAUCUAUGU', '(.(....)).')
rna_ss_validator('GCCCUUGGCA', '(.((..))).')
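For reference, here is one self-contained sketch of such a validator. Rather than calling separate helper functions, it folds the checks from parts (a) through (c) into a single pass over the structure. It assumes the sequence is given in uppercase, that the sequence and structure must have the same length, and that the structure contains only the characters (, ), and dot; none of these assumptions is stated in the problem.

```python
def rna_ss_validator(seq, sec_struc, wobble=True):
    """Return True if sec_struc is a valid secondary structure for seq."""
    # Assumption: sequence and structure must be the same length
    if len(seq) != len(sec_struc):
        return False

    # Allowed base pairs, stored in both orientations
    allowed = {('G', 'C'), ('C', 'G'), ('A', 'U'), ('U', 'A')}
    if wobble:
        allowed |= {('G', 'U'), ('U', 'G')}

    open_indices = []  # stack of indices of unmatched open parentheses
    for j, char in enumerate(sec_struc):
        if char == '(':
            open_indices.append(j)
        elif char == ')':
            if not open_indices:
                return False  # closing parenthesis with nothing to close
            i = open_indices.pop()
            if j - i - 1 < 3:
                return False  # fewer than three bases enclosed
            if (seq[i], seq[j]) not in allowed:
                return False  # these two bases cannot pair
        elif char != '.':
            return False  # assumption: only '(', ')', and '.' are valid

    # Any open parentheses left on the stack are unmatched
    return not open_indices


print(rna_ss_validator('GCAUCUAUGC', '(((....)))'))
print(rna_ss_validator('GCAUCUAUGU', '(((....)))', wobble=False))
```

Tracking the stack also catches a closing parenthesis that precedes its partner, which a simple count of parentheses would miss.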
In this problem, you will play around with the command line on your machine and get more familiar with it.
a) Let's play around with some options for the ls command. First cd into a directory that has some interesting files in it (like ~/git/bootcamp/command_line_tutorial). Try the following if you are using bash.
ls -F
ls -G # Might not be as cool with Git Bash on Windows
ls -l
ls -lh
ls -lS
ls -FGLh
You should be able to infer what these different options do, but you should talk with the TAs as well.
Normally, files that begin with a dot (.) are omitted when listing things. They are also generally omitted when you use your OS's GUI-based file handling system (like Finder on Macs). To see them, use ls -a. So, cd into your home directory (you remember how to do that, right?), and then do
ls -a
b) The nuclear option to delete everything in a directory is rm -rf. The r means to delete recursively, and the f means to "force" deletion. I was going to give you an exercise that uses the nuclear option, but I'm not going to do that. So, just forget I said anything. For this part of the problem, I want you to discuss with your neighbor when the nuclear option might be used, and what needs to be in place before exercising it.
c) Try doing this if you are using macOS or Linux:
ls /
What is /? Try cd-ing there and seeing what's in there. Do not delete anything!
Having a .bashrc file allows you to configure your bash shell how you like.
a) Create a .bashrc file in your home directory. If you already have a .bashrc file, open it up for editing using Jupyter's text editor.
b) It is often useful to alias commands to other commands. For example, I am always worried I will accidentally delete things. I therefore have the following line in my .bashrc file.
alias rm="rm -i"
You should create aliases for commands like ls based on the flags you like to always use. Do the same for rm and mv (I use the -i flag with these). To figure out what flags are available, you can look at the man pages. Asking Google will also usually give you the information you need on flags.
If you like, you can use my .bashrc file, available in ~/git/bootcamp/misc/jb_bashrc.
c) Depending on your operating system, your ~/.bashrc file may or may not be properly loaded upon opening a new bash shell. On some systems, such as newer versions of macOS, you may need to explicitly source your .bashrc file in your ~/.bash_profile file. Therefore, you should add the following to the bottom of your ~/.bash_profile file.
if [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
%load_ext watermark
%watermark -v -p jupyterlab