Lesson 38: Interval trees and efficient genomic queries

This tutorial was generated from a Jupyter notebook.

A common task when manipulating genomic data is querying genomic intervals: for example, finding all genes within a given genomic region, or identifying other transcript isoforms that overlap a gene of interest.

The naive approach of storing all features in a list and scanning every one of them for each query takes time proportional to the number of features, which becomes very inefficient for the large datasets that are common in genomics.
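To make the naive approach concrete, here is a minimal sketch of a linear scan. The function name `naive_overlap` and the tuple layout `(begin, end, name)` are assumptions made for illustration; the coordinates are a handful of the chromosome 3 genes that appear later in this lesson, treated as half-open intervals.

```python
# A few genes as (begin, end, name) tuples, treated as half-open [begin, end).
features = [
    (34459302, 34576915, "NR_015580"),
    (34537385, 34537464, "NR_035433"),
    (34548926, 34551382, "NM_011443"),
    (34598561, 34615584, "NR_038347"),
    (34670996, 34681268, "NR_110488"),
]


def naive_overlap(features, begin, end):
    """Return every feature overlapping [begin, end) by checking all of them."""
    return [f for f in features if f[0] < end and begin < f[1]]


hits = naive_overlap(features, 34548927, 34551382)
print([name for _, _, name in hits])
```

Every query touches every feature, so the cost grows linearly with the size of the annotation, no matter how small the queried region is.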

Because of the nature of genomic intervals, there are more efficient data structures that enable faster queries based on position.

The approach used here is based on a general class of data structures known as interval trees.

There is an implementation of an interval tree in Python that can be installed as follows:
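To give a feel for how such a structure avoids looking at every feature, here is a minimal sketch of one textbook variant, the centered interval tree. This is not the implementation used by any particular package; the class name `IntervalNode` and the `(begin, end, name)` tuple layout are assumptions for illustration, and the sketch assumes half-open intervals with `begin < end`.

```python
class IntervalNode:
    """One node of a centered interval tree over half-open [begin, end) tuples."""

    def __init__(self, intervals):
        # Use the start of the median interval as the center point, so that
        # interval is stored at this node and both subtrees are strictly smaller.
        ordered = sorted(intervals, key=lambda iv: iv[0])
        self.center = ordered[len(ordered) // 2][0]
        # Intervals that contain the center point stay at this node.
        self.here = [iv for iv in ordered if iv[0] <= self.center < iv[1]]
        # Intervals entirely to the left or right go into the subtrees.
        left = [iv for iv in ordered if iv[1] <= self.center]
        right = [iv for iv in ordered if iv[0] > self.center]
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def overlap(self, begin, end):
        """Return all stored intervals that overlap the query [begin, end)."""
        hits = [iv for iv in self.here if iv[0] < end and begin < iv[1]]
        # The left subtree only holds intervals ending at or before center,
        # so it can be skipped unless the query starts before center.
        if self.left is not None and begin < self.center:
            hits += self.left.overlap(begin, end)
        # The right subtree only holds intervals starting after center.
        if self.right is not None and end > self.center:
            hits += self.right.overlap(begin, end)
        return hits


genes = [
    (34459302, 34576915, "NR_015580"),
    (34537385, 34537464, "NR_035433"),
    (34548926, 34551382, "NM_011443"),
    (34598561, 34615584, "NR_038347"),
    (34670996, 34681268, "NR_110488"),
]
root = IntervalNode(genes)
found = root.overlap(34548927, 34551382)
print(sorted(name for _, _, name in found))
```

The key idea is that a query can discard a whole subtree with a single comparison against the node's center point. A production implementation would also keep the intervals stored at each node sorted, so that they do not need to be scanned in full.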

pip install intervaltree

Now that we have this installed, let's explore the functionality a bit.

In [19]:
import intervaltree

First, let's build a tree from a set of genes:

In [20]:
tree = intervaltree.IntervalTree() # Initialize an empty tree
In [21]:
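with open('chr3.genes.bed', 'r') as f:
    for line in f:
        # BED columns are tab-separated: chrom, start, end, name, ...
        tokens = line.rstrip("\n").split("\t")
        # Store the gene name as the data attached to the interval
        tree.addi(int(tokens[1]), int(tokens[2]), tokens[3])

We now have a tree that contains the positions of all genes on mouse chromosome 3.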

Let's find all genes at a specific location. We'll use the Sox2 locus, which contains the following two genes:

In [22]:
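# Note: intervaltree 3.0 removed search(); on newer versions, use
# tree.overlap(begin, end) or the slice form tree[begin:end] instead.
tree.search(34548927, 34551382)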
Out[22]:
{Interval(34459302, 34576915, 'NR_015580'),
 Interval(34548926, 34551382, 'NM_011443')}

You can also query larger regions, such as the following one, which contains many genes:

In [23]:
tree.search(26156916, 39409396)
Out[23]:
{Interval(25330762, 26230831, 'NM_138666'),
 Interval(26536552, 26826403, 'NM_027583'),
 Interval(26536552, 26882134, 'NM_029150'),
 Interval(26996143, 27052776, 'NM_001177625'),
 Interval(26996143, 27052776, 'NM_007900'),
 Interval(26996143, 27052800, 'NM_001177626'),
 Interval(27052947, 27055941, 'NR_040548'),
 Interval(27081925, 27143833, 'NM_178772'),
 Interval(27215998, 27238587, 'NM_009425'),
 Interval(27270272, 27276932, 'NM_177330'),
 Interval(27315083, 27609361, 'NM_173182'),
 Interval(27483821, 27483917, 'NR_037275'),
 Interval(27764986, 27795290, 'NM_001164437'),
 Interval(27837601, 28032284, 'NM_001164056'),
 Interval(27883092, 28032284, 'NM_008875'),
 Interval(28162135, 28565371, 'NM_001163009'),
 Interval(28162135, 28569507, 'NM_001163007'),
 Interval(28162135, 28569507, 'NM_001163008'),
 Interval(28162135, 28569507, 'NM_026910'),
 Interval(28318864, 28318940, 'NR_130342'),
 Interval(28596824, 28627282, 'NM_031197'),
 Interval(28680232, 28697768, 'NM_177586'),
 Interval(28704432, 28706337, 'NM_026517'),
 Interval(28791538, 28825646, 'NM_001033479'),
 Interval(28981498, 29589978, 'NM_001167748'),
 Interval(28981498, 29589978, 'NM_029412'),
 Interval(29315744, 29315842, 'NR_030422'),
 Interval(29535665, 29535773, 'NR_105796'),
 Interval(29789938, 29823113, 'NR_110437'),
 Interval(29850217, 29912126, 'NM_007963'),
 Interval(30135490, 30408409, 'NM_021442'),
 Interval(30495994, 30498792, 'NM_029690'),
 Interval(30501008, 30518795, 'NM_001289621'),
 Interval(30501008, 30518795, 'NM_001289622'),
 Interval(30501008, 30518795, 'NM_001289623'),
 Interval(30501008, 30518795, 'NM_030557'),
 Interval(30523188, 30546740, 'NM_027941'),
 Interval(30543428, 30571353, 'NM_026668'),
 Interval(30546807, 30571353, 'NM_001290510'),
 Interval(30577475, 30598765, 'NM_001309233'),
 Interval(30645214, 30666096, 'NM_029489'),
 Interval(30645214, 30666096, 'NR_028060'),
 Interval(30691797, 30720185, 'NM_027016'),
 Interval(30754871, 30796116, 'NM_001134385'),
 Interval(30754871, 30796116, 'NM_001134386'),
 Interval(30754871, 30796116, 'NM_001286994'),
 Interval(30754871, 30796116, 'NM_001286995'),
 Interval(30754871, 30796116, 'NM_001286996'),
 Interval(30754871, 30796116, 'NM_027965'),
 Interval(30798216, 30868337, 'NM_001165954'),
 Interval(30798216, 30868337, 'NM_001165955'),
 Interval(30798216, 30868337, 'NM_001165956'),
 Interval(30798216, 30868337, 'NM_153421'),
 Interval(30894692, 30951662, 'NM_008857'),
 Interval(30937887, 30937950, 'NR_105975'),
 Interval(30993979, 31021845, 'NM_001039090'),
 Interval(30993979, 31021845, 'NM_001271772'),
 Interval(30993979, 31021845, 'NM_011386'),
 Interval(31048841, 31063248, 'NM_008770'),
 Interval(31101777, 31209241, 'NM_172861'),
 Interval(31801624, 32099102, 'NM_028231'),
 Interval(32233715, 32264587, 'NM_009517'),
 Interval(32264410, 32266360, 'NR_027966'),
 Interval(32335072, 32367408, 'NM_008839'),
 Interval(32371242, 32390891, 'NM_001195074'),
 Interval(32409471, 32419755, 'NM_001161818'),
 Interval(32409513, 32419755, 'NM_144519'),
 Interval(32428403, 32478147, 'NM_024200'),
 Interval(32482449, 32515457, 'NM_013531'),
 Interval(32607467, 32625893, 'NM_019673'),
 Interval(32626418, 32635677, 'NM_029017'),
 Interval(32635984, 32650481, 'NM_025316'),
 Interval(32716547, 32834179, 'NM_001013024'),
 Interval(32846847, 32884001, 'NM_001163517'),
 Interval(32846847, 32981982, 'NM_001289505'),
 Interval(32846847, 32981982, 'NM_001310460'),
 Interval(32846847, 32981982, 'NM_021483'),
 Interval(32846847, 33042113, 'NM_001163516'),
 Interval(32923698, 32987529, 'NR_040595'),
 Interval(33699104, 33713782, 'NM_001290500'),
 Interval(33699104, 33713782, 'NM_001290502'),
 Interval(33699104, 33713782, 'NM_025978'),
 Interval(33699104, 33713782, 'NM_027619'),
 Interval(33711282, 33743232, 'NM_026222'),
 Interval(33919000, 33968266, 'NM_001113188'),
 Interval(33919000, 33968266, 'NM_001113189'),
 Interval(33919000, 33968266, 'NM_008053'),
 Interval(33956201, 33980276, 'NM_001026211'),
 Interval(33976206, 33980276, 'NM_001286973'),
 Interval(33976206, 33980276, 'NM_026332'),
 Interval(33976272, 33980276, 'NM_001286972'),
 Interval(34459302, 34576915, 'NR_015580'),
 Interval(34537385, 34537464, 'NR_035433'),
 Interval(34548926, 34551382, 'NM_011443'),
 Interval(34598561, 34615584, 'NR_038347'),
 Interval(34608544, 34615584, 'NR_038348'),
 Interval(34670996, 34681268, 'NR_110488'),
 Interval(34821466, 34821573, 'NR_105799'),
 Interval(35496073, 35565085, 'NR_040748'),
 Interval(35653059, 35755198, 'NM_029570'),
 Interval(35791026, 35828996, 'NM_033623'),
 Interval(35791026, 35831256, 'NM_001205362'),
 Interval(35791026, 35831888, 'NM_001205361'),
 Interval(35858230, 35899600, 'NM_023644'),
 Interval(35898834, 35907750, 'NR_029456'),
 Interval(35906168, 35952469, 'NM_178418'),
 Interval(35964921, 35991779, 'NM_172678'),
 Interval(36050001, 36069263, 'NM_001101478'),
 Interval(36078347, 36121197, 'NM_198192'),
 Interval(36347845, 36374809, 'NM_009673'),
 Interval(36374382, 36374446, 'NR_105976'),
 Interval(36374858, 36381221, 'NM_028183'),
 Interval(36398725, 36405731, 'NR_040590'),
 Interval(36413978, 36420367, 'NR_040411'),
 Interval(36451527, 36464649, 'NM_019393'),
 Interval(36463786, 36470918, 'NM_009828'),
 Interval(36472064, 36512311, 'NM_027810'),
 Interval(36519403, 36589089, 'NM_019510'),
 Interval(36762027, 36951955, 'NM_172679'),
 Interval(36962577, 37010434, 'NM_009350'),
 Interval(37019641, 37024876, 'NM_008366'),
 Interval(37121680, 37131558, 'NM_001291041'),
 Interval(37121680, 37131558, 'NM_021782'),
 Interval(37207548, 37211368, 'NM_145825'),
 Interval(37211475, 37220372, 'NM_001255992'),
 Interval(37211475, 37220373, 'NM_001008502'),
 Interval(37247574, 37303752, 'NM_008006'),
 Interval(37303903, 37318518, 'NM_001291044'),
 Interval(37303903, 37318518, 'NM_153561'),
 Interval(37318871, 37478018, 'NM_001163511'),
 Interval(37318871, 37478018, 'NM_001310473'),
 Interval(37318871, 37478018, 'NM_021343'),
 Interval(37508812, 37514198, 'NR_130162'),
 Interval(37512834, 37514198, 'NR_130161'),
 Interval(37512834, 37514198, 'NR_130163'),
 Interval(37538868, 37543521, 'NM_001305440'),
 Interval(37538868, 37543521, 'NM_001305441'),
 Interval(37538868, 37543521, 'NM_001305442'),
 Interval(37538868, 37543521, 'NM_011896'),
 Interval(37613111, 37623282, 'NM_198657'),
 Interval(37797734, 37891650, 'NR_040559'),
 Interval(38102072, 38107041, 'NR_040541'),
 Interval(38348182, 38383738, 'NM_001167883'),
 Interval(38785861, 38910905, 'NM_183221')}

The return value is a set, so any standard iteration method will work for going through the results. Keep in mind that sets are unordered, so sort the results if you need them in genomic order.
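As a small illustration of working with such a result, the sketch below sorts the two hits from the Sox2 query and reads their fields. So that it runs without the package or the BED file, `Interval` here is a stand-in namedtuple that assumes the same field names (`begin`, `end`, `data`) as the intervals printed above; with the real library, you would iterate over the returned set directly.

```python
from collections import namedtuple

# Stand-in for the library's Interval type (an assumption for illustration):
# same field names, so attribute access looks the same.
Interval = namedtuple("Interval", ["begin", "end", "data"])

# The two hits from the Sox2 query above, as an unordered set.
hits = {
    Interval(34548926, 34551382, "NM_011443"),
    Interval(34459302, 34576915, "NR_015580"),
}

# Sorting orders the intervals by begin, then end, then data.
ordered = sorted(hits)
for iv in ordered:
    print(iv.data, iv.begin, iv.end, iv.end - iv.begin)

names = [iv.data for iv in ordered]
lengths = {iv.data: iv.end - iv.begin for iv in ordered}
```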