{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 27: Numpy arrays and operations with them\n", "\n", "(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).\n", "\n", "This document was prepared at [Caltech](http://www.caltech.edu) with financial support from the [Donna and Benjamin M. Rosen Bioengineering Center](http://rosen.caltech.edu).\n", "\n", "\n", "\n", "*This tutorial was generated from a Jupyter notebook. You can download the notebook [here](l27_numpy_arrays.ipynb).*\n", "\n", "

" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import altair as alt\n", "\n", "import bootcamp_utils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We just got an introduction to NumPy and SciPy. The packages are extensive. At the center is the NumPy array data type. We will explore this data type in this tutorial. It is worth noting that, under the hood, many of the operations we do with Pandas `DataFrame`s are done with NumPy arrays. As you understand how NumPy arrays work, you will also better understand what Pandas is doing.\n", "\n", "As it is always more fun to work with a real biological application, we will populate our NumPy arrays with data. In their 2011 [paper in PLoS ONE](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0025840), Harvey and Orbidans measured the cross-sectional area of *C. elegans* eggs that came from mothers who had a high concentration of food and from mothers who had a low concentration of food. I digitized the data from their plots, and they are available in the file `data/c_elegans_egg_xa.csv` in the bootcamp repository." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting NumPy arrays from Pandas data frames\n", "\n", "NumPy has a primitive function for loading in data from text files, `np.loadtxt()`, but with Pandas's `read_csv()`, there is really no reason to ever use it. So, we will load in the (tidy) data using Pandas." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
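<table>
<tr><th></th><th>food</th><th>area (sq. um)</th></tr>
<tr><th>0</th><td>high</td><td>1683</td></tr>
<tr><th>1</th><td>high</td><td>2061</td></tr>
<tr><th>2</th><td>high</td><td>1792</td></tr>
<tr><th>3</th><td>high</td><td>1852</td></tr>
<tr><th>4</th><td>high</td><td>2091</td></tr>
</table>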
\n", "
" ], "text/plain": [ " food area (sq. um)\n", "0 high 1683\n", "1 high 2061\n", "2 high 1792\n", "3 high 1852\n", "4 high 2091" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/c_elegans_egg_xa.csv', comment='#')\n", "\n", "# Take a look\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's a good idea, since we're using Altair, to change the column name for the area to exclude the dot." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = df.rename(columns={'area (sq. um)': 'area (sq um)'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just for fun, let's make a quick plot. We can do this because we went high level first." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.vegalite.v2+json": { "$schema": "https://vega.github.io/schema/vega-lite/v2.4.3.json", "config": { "point": { "filled": true, "opacity": 0.3 }, "view": { "height": 300, "width": 400 } }, "data": { "values": [ { "__jitter": -0.18601242476385216, "area (sq um)": 1683, "food": "high" }, { "__jitter": -0.19588448167425795, "area (sq um)": 2061, "food": "high" }, { "__jitter": 0.026993988266302005, "area (sq um)": 1792, "food": "high" }, { "__jitter": -0.11169301153564214, "area (sq um)": 1852, "food": "high" }, { "__jitter": 0.06385862244341062, "area (sq um)": 2091, "food": "high" }, { "__jitter": -0.1556356232074438, "area (sq um)": 1781, "food": "high" }, { "__jitter": 0.10619050922026235, "area (sq um)": 1912, "food": "high" }, { "__jitter": -0.017110496018028443, "area (sq um)": 1802, "food": "high" }, { "__jitter": -0.10053322661339466, "area (sq um)": 1751, "food": "high" }, { "__jitter": -0.09017222573156919, "area (sq um)": 1731, "food": "high" }, { "__jitter": -0.16361610383798586, "area (sq um)": 1892, "food": "high" }, { "__jitter": -0.16898212675309107, "area (sq um)": 1951, 
"food": "high" }, { "__jitter": 0.037142301652123205, "area (sq um)": 1809, "food": "high" }, { "__jitter": -0.15555502933842577, "area (sq um)": 1683, "food": "high" }, { "__jitter": 0.04682896877557824, "area (sq um)": 1787, "food": "high" }, { "__jitter": 0.1907779229714205, "area (sq um)": 1840, "food": "high" }, { "__jitter": 0.055471547853575276, "area (sq um)": 1821, "food": "high" }, { "__jitter": 0.15381124355306797, "area (sq um)": 1910, "food": "high" }, { "__jitter": 0.13807974911153903, "area (sq um)": 1930, "food": "high" }, { "__jitter": 0.13573054296649734, "area (sq um)": 1800, "food": "high" }, { "__jitter": -0.08593586997423866, "area (sq um)": 1833, "food": "high" }, { "__jitter": 0.06217174309000273, "area (sq um)": 1683, "food": "high" }, { "__jitter": 0.07463042546469506, "area (sq um)": 1671, "food": "high" }, { "__jitter": 0.15866381119486606, "area (sq um)": 1680, "food": "high" }, { "__jitter": 0.09249148808084062, "area (sq um)": 1692, "food": "high" }, { "__jitter": 0.1456136466763019, "area (sq um)": 1800, "food": "high" }, { "__jitter": 0.13102384621278762, "area (sq um)": 1821, "food": "high" }, { "__jitter": -0.03906961417318802, "area (sq um)": 1882, "food": "high" }, { "__jitter": -0.013285077018703684, "area (sq um)": 1642, "food": "high" }, { "__jitter": 0.12461584580543833, "area (sq um)": 1749, "food": "high" }, { "__jitter": -0.11215888038197451, "area (sq um)": 1712, "food": "high" }, { "__jitter": -0.011351927711327148, "area (sq um)": 1661, "food": "high" }, { "__jitter": 0.00019029401250558742, "area (sq um)": 1701, "food": "high" }, { "__jitter": 0.0499577283654033, "area (sq um)": 2141, "food": "high" }, { "__jitter": 0.17897211408019398, "area (sq um)": 1863, "food": "high" }, { "__jitter": -0.1160808505842239, "area (sq um)": 1752, "food": "high" }, { "__jitter": -0.08213780301322618, "area (sq um)": 1740, "food": "high" }, { "__jitter": 0.009833180798598518, "area (sq um)": 1721, "food": "high" }, { "__jitter": 
-0.08225441224413794, "area (sq um)": 1660, "food": "high" }, { "__jitter": -0.05105233887508548, "area (sq um)": 1930, "food": "high" }, { "__jitter": -0.16624018530808354, "area (sq um)": 2030, "food": "high" }, { "__jitter": -0.14952530527616142, "area (sq um)": 1851, "food": "high" }, { "__jitter": 0.1970675541391319, "area (sq um)": 2131, "food": "high" }, { "__jitter": -0.10196512412359815, "area (sq um)": 1828, "food": "high" }, { "__jitter": 0.8064558535554716, "area (sq um)": 1840, "food": "low" }, { "__jitter": 1.1684307302699484, "area (sq um)": 2090, "food": "low" }, { "__jitter": 0.8677268571673435, "area (sq um)": 2169, "food": "low" }, { "__jitter": 0.8569663195478624, "area (sq um)": 1988, "food": "low" }, { "__jitter": 0.97741753873965, "area (sq um)": 2212, "food": "low" }, { "__jitter": 1.020295501724004, "area (sq um)": 2339, "food": "low" }, { "__jitter": 1.1734002060473334, "area (sq um)": 1989, "food": "low" }, { "__jitter": 1.0621946774028368, "area (sq um)": 2144, "food": "low" }, { "__jitter": 1.1406070726064168, "area (sq um)": 2290, "food": "low" }, { "__jitter": 1.1128266783559368, "area (sq um)": 1920, "food": "low" }, { "__jitter": 0.8138527249939662, "area (sq um)": 2280, "food": "low" }, { "__jitter": 1.1323514301983895, "area (sq um)": 1809, "food": "low" }, { "__jitter": 1.0982958302793537, "area (sq um)": 2158, "food": "low" }, { "__jitter": 1.046304935725333, "area (sq um)": 1800, "food": "low" }, { "__jitter": 1.0141490502411714, "area (sq um)": 2133, "food": "low" }, { "__jitter": 1.0221901318897237, "area (sq um)": 2060, "food": "low" }, { "__jitter": 0.813953533438544, "area (sq um)": 2160, "food": "low" }, { "__jitter": 1.0380859722956888, "area (sq um)": 2001, "food": "low" }, { "__jitter": 0.9497687191981147, "area (sq um)": 2030, "food": "low" }, { "__jitter": 1.0957158892465173, "area (sq um)": 2088, "food": "low" }, { "__jitter": 0.9841775978341435, "area (sq um)": 1951, "food": "low" }, { "__jitter": 
0.8370262963019697, "area (sq um)": 2460, "food": "low" }, { "__jitter": 0.9000979283210087, "area (sq um)": 2021, "food": "low" }, { "__jitter": 0.9829628882270973, "area (sq um)": 2010, "food": "low" }, { "__jitter": 1.0609658402905318, "area (sq um)": 2139, "food": "low" }, { "__jitter": 0.8000532438045769, "area (sq um)": 2160, "food": "low" }, { "__jitter": 1.1827727806471802, "area (sq um)": 2106, "food": "low" }, { "__jitter": 1.17262954323727, "area (sq um)": 2171, "food": "low" }, { "__jitter": 0.8129523305927425, "area (sq um)": 2113, "food": "low" }, { "__jitter": 1.0377930490088356, "area (sq um)": 2179, "food": "low" }, { "__jitter": 0.9868822007155247, "area (sq um)": 1890, "food": "low" }, { "__jitter": 1.160920125297981, "area (sq um)": 2179, "food": "low" }, { "__jitter": 0.9088765976538461, "area (sq um)": 2021, "food": "low" }, { "__jitter": 1.06653291796239, "area (sq um)": 1969, "food": "low" }, { "__jitter": 0.8957387041930729, "area (sq um)": 2150, "food": "low" }, { "__jitter": 1.171065913192848, "area (sq um)": 1900, "food": "low" }, { "__jitter": 1.0722209401437661, "area (sq um)": 2267, "food": "low" }, { "__jitter": 1.0018558690898236, "area (sq um)": 1711, "food": "low" }, { "__jitter": 1.0825657320822843, "area (sq um)": 1901, "food": "low" }, { "__jitter": 1.0130060072362734, "area (sq um)": 2114, "food": "low" }, { "__jitter": 0.9050381329989383, "area (sq um)": 2112, "food": "low" }, { "__jitter": 1.0987061715713733, "area (sq um)": 2361, "food": "low" }, { "__jitter": 0.9982923052667696, "area (sq um)": 2130, "food": "low" }, { "__jitter": 0.89799145904877, "area (sq um)": 2061, "food": "low" }, { "__jitter": 0.8068640364651295, "area (sq um)": 2121, "food": "low" }, { "__jitter": 0.9798778869629317, "area (sq um)": 1832, "food": "low" }, { "__jitter": 1.0207258370413532, "area (sq um)": 2210, "food": "low" }, { "__jitter": 1.0803220971224452, "area (sq um)": 2130, "food": "low" }, { "__jitter": 0.9978503031570902, "area (sq um)": 
2153, "food": "low" }, { "__jitter": 0.9265860508026453, "area (sq um)": 2009, "food": "low" }, { "__jitter": 0.8473602452019033, "area (sq um)": 2100, "food": "low" }, { "__jitter": 0.9277743488202809, "area (sq um)": 2252, "food": "low" }, { "__jitter": 0.832126283138602, "area (sq um)": 2143, "food": "low" }, { "__jitter": 1.1568353181672906, "area (sq um)": 2252, "food": "low" }, { "__jitter": 1.0105119316836175, "area (sq um)": 2222, "food": "low" }, { "__jitter": 1.119745765588489, "area (sq um)": 2121, "food": "low" }, { "__jitter": 0.8061264629267176, "area (sq um)": 2409, "food": "low" } ] }, "encoding": { "color": { "field": "food", "type": "nominal" }, "tooltip": {}, "x": { "axis": { "grid": false, "labels": false, "ticks": false, "title": null, "values": [ 0, 1 ] }, "field": "__jitter", "type": "quantitative" }, "y": { "field": "area (sq um)", "scale": { "zero": false }, "title": "area (µm²)", "type": "quantitative" } }, "height": 300, "mark": "point", "width": 150 }, "image/png": "", "text/plain": [ "\n", "\n", "If you see this message, it means the renderer has not been properly enabled\n", "for the frontend that you are using. For more information, see\n", "https://altair-viz.github.io/user_guide/troubleshooting.html\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bootcamp_utils.altair_jitter(df,\n", " encode_x=alt.X('food:N'), \n", " encode_y=alt.Y('area (sq um):Q',\n", " title='area (µm²)', \n", " scale=alt.Scale(zero=False)),\n", " height=300,\n", " width=150).configure_point(filled=True,\n", " opacity=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like worms that eat more food have smaller eggs.\n", "\n", "If we wanted to extract the measurements for worms with high food, we can do so using Boolean indexing in Pandas." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high = df.loc[df['food']=='high', 'area (sq um)']\n", "\n", "# Take a look at the data type\n", "type(xa_high)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a Pandas `Series`, which is kind of like a single-column `DataFrame`. If we want to convert this to a NumPy array, we use the `.values` attribute." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high = df.loc[df['food']=='high', 'area (sq um)'].values\n", "\n", "type(xa_high)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a NumPy array! Let's pull out the low-food cross-sectional areas as well." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "xa_low = df.loc[df['food']=='low', 'area (sq um)'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now let's take a look at these arrays." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1683, 2061, 1792, 1852, 2091, 1781, 1912, 1802, 1751, 1731, 1892,\n", " 1951, 1809, 1683, 1787, 1840, 1821, 1910, 1930, 1800, 1833, 1683,\n", " 1671, 1680, 1692, 1800, 1821, 1882, 1642, 1749, 1712, 1661, 1701,\n", " 2141, 1863, 1752, 1740, 1721, 1660, 1930, 2030, 1851, 2131, 1828])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1840, 2090, 2169, 1988, 2212, 2339, 1989, 2144, 2290, 1920, 2280,\n", " 1809, 2158, 1800, 2133, 2060, 2160, 2001, 2030, 2088, 1951, 2460,\n", " 2021, 2010, 2139, 2160, 2106, 2171, 2113, 2179, 1890, 2179, 2021,\n", " 1969, 2150, 1900, 2267, 1711, 1901, 2114, 2112, 2361, 2130, 2061,\n", " 2121, 1832, 2210, 2130, 2153, 2009, 2100, 2252, 2143, 2252, 2222,\n", " 2121, 2409])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_low" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use these arrays as examples to learn about NumPy arrays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operators and NumPy arrays\n", "\n", "We saw in the previous tutorial that NumPy arrays are a special data type. They have well-defined ways in which our familiar operators work with them. Let's learn about this by example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scalars and NumPy arrays\n", "\n", "We'll start with multiplying an array by a constant. Say we wanted to convert the units of the cross-sectional area from µm$^2$ to mm$^2$. This means we have to divide every entry by 10$^6$ (or multiply by 10$^{-6}$). Multiplication by a scalar works elementwise on NumPy arrays. Check it out." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.001683, 0.002061, 0.001792, 0.001852, 0.002091, 0.001781,\n", " 0.001912, 0.001802, 0.001751, 0.001731, 0.001892, 0.001951,\n", " 0.001809, 0.001683, 0.001787, 0.00184 , 0.001821, 0.00191 ,\n", " 0.00193 , 0.0018 , 0.001833, 0.001683, 0.001671, 0.00168 ,\n", " 0.001692, 0.0018 , 0.001821, 0.001882, 0.001642, 0.001749,\n", " 0.001712, 0.001661, 0.001701, 0.002141, 0.001863, 0.001752,\n", " 0.00174 , 0.001721, 0.00166 , 0.00193 , 0.00203 , 0.001851,\n", " 0.002131, 0.001828])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high / 1e6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `1e6` is how we represent numbers in Python in scientific notation, and that dividing the NumPy array by this number resulted in every entry in the array being divided. The `+`, `-`, and `*` operators all work in this way. For example:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2683, 3061, 2792, 2852, 3091, 2781, 2912, 2802, 2751, 2731, 2892,\n", " 2951, 2809, 2683, 2787, 2840, 2821, 2910, 2930, 2800, 2833, 2683,\n", " 2671, 2680, 2692, 2800, 2821, 2882, 2642, 2749, 2712, 2661, 2701,\n", " 3141, 2863, 2752, 2740, 2721, 2660, 2930, 3030, 2851, 3131, 2828])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high + 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Booleans with NumPy arrays and scalars\n", "\n", "Let's see what happens when we compare a NumPy array to a scalar." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, False, True, True, False, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, False, True, True,\n", " True, True, True, True, False, True, False, True])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high < 2000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get an array of Booleans! The comparison is elementwise. This is important to know because we cannot use these comparisons directly in an **`if`** clause." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mif\u001b[0m \u001b[0mxa_high\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m2000\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Nothing to print, really. This will just be an error.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()" ] } ], "source": [ "if xa_high > 2000:\n", "    print('Nothing to print, really. This will just be an error.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take the advice from the exception and use the `.any()` or `.all()` methods." 
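, "\n", "As a minimal sketch of how this reduction makes the comparison usable in an `if` clause, consider the block below. The array `areas` and the names `is_big`, `message`, `all_big`, and `n_big` are made up for illustration and are not part of the egg data set.

```python
import numpy as np

# Hypothetical cross-sectional areas, chosen only for illustration
areas = np.array([1683, 2061, 1792, 1852, 2091])

# Elementwise comparison gives a Boolean array
is_big = areas > 2000

# Reduce the Boolean array to a single bool before branching
if is_big.any():
    message = 'at least one area exceeds 2000'
else:
    message = 'no area exceeds 2000'

# all() asks whether every entry passed; sum() counts the True entries
all_big = bool(is_big.all())
n_big = int(is_big.sum())
```

Which reduction to use depends on the question being asked: `any()` for at least one, `all()` for every entry."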
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check if any values are bigger than 2000\n", "(xa_high > 2000).any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember, the expression `(xa_high > 2000)` is itself a NumPy array of Booleans. The `any()` method returns `True` if *any* of the entries in that array are `True`. Similarly, the `all()` method returns `True` if *all* entries in the array are `True`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(xa_high > 2000).all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yup! At least one cross-sectional area is greater than 2000 µm$^2$, but not all of them are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Equality checks with NumPy arrays\n", "\n", "Remember, you should never use the equality operator (`==`) with `float`s. Fortunately, NumPy offers a couple of nice functions to check if two numbers are *almost* equal. This helps deal with the numerical precision issues when comparing `float`s. The `np.isclose()` function checks to see if two numbers are close in value, and this is really useful in writing tests. It works elementwise for NumPy arrays." 
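, "\n", "To see why `==` is risky with `float`s, here is a small sketch using the classic example of adding 0.1 and 0.2. The variable names are illustrative only; the `rtol` and `atol` keyword arguments of `np.isclose()` set the relative and absolute tolerances.

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in binary floating point
total = 0.1 + 0.2

# Exact comparison fails, tolerance-based comparison succeeds
exactly_equal = (total == 0.3)
nearly_equal = bool(np.isclose(total, 0.3))

# Tightening the tolerances makes the check stricter
strict = bool(np.isclose(1.3, 1.29999999999, rtol=0, atol=1e-15))
```

With the default tolerances the two numbers count as close, while the deliberately tight tolerance in the last line rejects a difference of about 1e-11."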
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compare two numbers\n", "np.isclose(1.3, 1.29999999999)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, True, False, False, False, False, False, True, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compare an array to a scalar\n", "np.isclose(xa_high, 1800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of cross-sectional areas are 1800 µm$^2$. The `np.allclose()` function checks to see if all values in a NumPy array are close to a given value (or to the corresponding values in another array)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(xa_high, 1800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Operators with two NumPy arrays\n", "\n", "We can apply operators with two NumPy arrays. Let's give it a whirl. 
(This is meaningless in the context of the actual data contained in these arrays, but it's an operation we need to understand.)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "operands could not be broadcast together with shapes (44,) (57,) ", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mxa_high\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mxa_low\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (44,) (57,) " ] } ], "source": [ "xa_high + xa_low" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes! The exception tells us that the two arrays we are using the operator on need to have the same shape. This makes sense: if we are going to do element-by-element addition, the arrays better have the same number of elements. To continue with our operators on two arrays, we'll slice the longer NumPy array. The basic slicing syntax is the same as for strings, lists, and tuples." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Just take the first elements\n", "xa_low_slice = xa_low[:len(xa_high)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, let's try adding arrays again." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3523, 4151, 3961, 3840, 4303, 4120, 3901, 3946, 4041, 3651, 4172,\n", " 3760, 3967, 3483, 3920, 3900, 3981, 3911, 3960, 3888, 3784, 4143,\n", " 3692, 3690, 3831, 3960, 3927, 4053, 3755, 3928, 3602, 3840, 3722,\n", " 4110, 4013, 3652, 4007, 3432, 3561, 4044, 4142, 4212, 4261, 3889])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high + xa_low_slice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We get element-by-element addition. The same happens for the other operators we've discussed. `np.isclose()` also operates element-by-element." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.isclose(xa_high, xa_low_slice)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slicing NumPy arrays\n", "\n", "We already saw that we can slice NumPy arrays like lists and tuples. Here are a few examples." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1828, 2131, 1851, 2030, 1930, 1660, 1721, 1740, 1752, 1863, 2141,\n", " 1701, 1661, 1712, 1749, 1642, 1882, 1821, 1800, 1692, 1680, 1671,\n", " 1683, 1833, 1800, 1930, 1910, 1821, 1840, 1787, 1683, 1809, 1951,\n", " 1892, 1731, 1751, 1802, 1912, 1781, 2091, 1852, 1792, 2061, 1683])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reversed array\n", "xa_high[::-1]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1852, 1751, 1683, 1930, 1680, 1642, 2141, 1660, 1828])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Every 5th element, starting at index 3\n", "xa_high[3::5]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1892, 1951, 1809, 1683, 1787, 1840, 1821, 1910, 1930, 1800, 1833])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Entries 10 to 20\n", "xa_high[10:21]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fancy indexing\n", "\n", "NumPy arrays also allow **fancy indexing**, where we can slice out specific values. For example, say we wanted indices 1, 19, and 6 (in that order) from `xa_high`. We just index with a list of the indices we want." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2061, 1800, 1912])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high[[1, 19, 6]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of a list, we could also use a NumPy array." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2061, 1800, 1912])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xa_high[np.array([1, 19, 6])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a very nice feature, we can use Boolean indexing with NumPy arrays, just like with Pandas using `.loc`. Say we only want the egg cross-sectional areas that are greater than 2000 µm$^2$." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2061, 2091, 2141, 2030, 2131])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Just slice out the big ones\n", "xa_high[xa_high > 2000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to know the indices where the values are high, we can use the `np.where()` function." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([ 1, 4, 33, 40, 42]),)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.where(xa_high > 2000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NumPy arrays are mutable\n", "\n", "Yes, NumPy arrays are mutable. Let's look at some consequences." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 6, 4])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make an array\n", "my_ar = np.array([1, 2, 3, 4])\n", "\n", "# Change an element\n", "my_ar[2] = 6\n", "\n", "# See the result\n", "my_ar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's try attaching another variable to the NumPy array." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 6, 9])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Attach a new variable\n", "my_ar2 = my_ar\n", "\n", "# Set an entry using the new variable\n", "my_ar2[3] = 9\n", "\n", "# Does the original change? (yes.)\n", "my_ar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how messing with NumPy arrays in functions affects things." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.1, 0.2, 0.3, 0.4])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Re-instantiate my_ar\n", "my_ar = np.array([1, 2, 3, 4]).astype(float)\n", "\n", "# Function to normalize x (note that /= modifies the array in place)\n", "def normalize(x):\n", "    x /= np.sum(x)\n", "\n", "# Pass it through a function\n", "normalize(my_ar)\n", "\n", "# Is it normalized even though we didn't return anything? (Yes.)\n", "my_ar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, be careful when writing functions. In-place changes you make to a NumPy array inside a function persist outside of the function as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slices of NumPy arrays are **views**, not copies\n", "\n", "A very important distinction between NumPy arrays and lists is that slices of NumPy arrays are **views** into the original NumPy array, NOT copies." 
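, "\n", "One way to check whether two arrays share data is the `np.shares_memory()` function. The sketch below uses a made-up array named `original`; all variable names are illustrative. It contrasts a basic slice with fancy indexing and `np.copy()`, both of which return independent copies.

```python
import numpy as np

# Illustrative array, not from the egg data set
original = np.arange(6)

# Basic slicing returns a view that shares memory with the original
view = original[1:4]

# Fancy indexing and np.copy() return independent copies
fancy = original[[1, 2, 3]]
copied = np.copy(original[1:4])

shares_view = np.shares_memory(original, view)
shares_fancy = np.shares_memory(original, fancy)
shares_copy = np.shares_memory(original, copied)

# Writing through the view changes the original, but not the copies
view[0] = 99
```

After the last line, `original[1]` is 99, while `fancy` and `copied` still hold the old value."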
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1, 2, 3, 4]\n", "[1 9 3 4]\n" ] } ], "source": [ "# Make list and array\n", "my_list = [1, 2, 3, 4]\n", "my_ar = np.array(my_list)\n", "\n", "# Slice out of each\n", "my_list_slice = my_list[1:-1]\n", "my_ar_slice = my_ar[1:-1]\n", "\n", "# Mess with the slices\n", "my_list_slice[0] = 9\n", "my_ar_slice[0] = 9\n", "\n", "# Look at originals\n", "print(my_list)\n", "print(my_ar)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Messing with an element of a slice of a NumPy array messes with that element in the original! This is not the case with lists. Let's issue a warning.\n", "\n", "
\n", "
Slices of NumPy arrays are **views**, not copies.
\n", "
\n", "\n", "Fortunately, you can make a copy of an array using the `np.copy()` function." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make a copy\n", "xa_high_copy = np.copy(xa_high)\n", "\n", "# Mess with an entry\n", "xa_high_copy[10] = 2000\n", "\n", "# Check equality\n", "np.allclose(xa_high, xa_high_copy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, messing with an entry in the copy did not affect the original." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing 2D NumPy arrays\n", "\n", "NumPy arrays need not be one-dimensional. We'll create a two-dimensional NumPy array by reshaping our `xa_high` array from having shape `(44,)` to having shape `(11, 4)`. That is, it will become an array with 11 rows and 4 columns." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1683, 2061, 1792, 1852],\n", " [2091, 1781, 1912, 1802],\n", " [1751, 1731, 1892, 1951],\n", " [1809, 1683, 1787, 1840],\n", " [1821, 1910, 1930, 1800],\n", " [1833, 1683, 1671, 1680],\n", " [1692, 1800, 1821, 1882],\n", " [1642, 1749, 1712, 1661],\n", " [1701, 2141, 1863, 1752],\n", " [1740, 1721, 1660, 1930],\n", " [2030, 1851, 2131, 1828]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# New 2D array using the reshape() method\n", "my_ar = xa_high.reshape((11, 4))\n", "\n", "# Look at it\n", "my_ar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that it is represented as an array made out of a list of lists. 
If we had a list of lists, we would index it like this:\n", "\n", " list_of_lists[i][j]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make list of lists\n", "list_of_lists = [[1, 2], [3, 4]]\n", "\n", "# Pull out value in first row, second column\n", "list_of_lists[0][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Though this syntax works with NumPy arrays, it is *not* the idiomatic way to index them. NumPy arrays are indexed more conveniently, with comma-separated indices inside a single pair of brackets." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2061" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_ar[0,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We essentially have a tuple in the indexing brackets. Now, say we wanted the row with index 2 (the third row, since indexing starts at 0)." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1751, 1731, 1892, 1951])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_ar[2,:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use Boolean indexing as before." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2061, 2091, 2141, 2030, 2131])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_ar[my_ar > 2000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this gives a one-dimensional array of the entries greater than 2000. If we want the indices where this is the case, we can again use `np.where()`." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([ 0, 1, 8, 10, 10]), array([1, 0, 1, 0, 2]))" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.where(my_ar > 2000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tuple of NumPy arrays, one holding the row indices and the other the column indices, is what we would use to pull those values out using fancy indexing." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2061, 2091, 2141, 2030, 2131])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_ar[(np.array([ 0, 1, 8, 10, 10]), np.array([1, 0, 1, 0, 2]))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy arrays can have any number of dimensions, and these principles extend to 3D, 4D, etc., arrays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concatenating arrays\n", "\n", "Let's say we want to study all cross-sectional areas and don't care if the mother was well-fed or not. We would want to concatenate our arrays. The `np.concatenate()` function accomplishes this. We simply have to pass it a tuple containing the NumPy arrays we want to concatenate." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1683, 2061, 1792, 1852, 2091, 1781, 1912, 1802, 1751, 1731, 1892,\n", " 1951, 1809, 1683, 1787, 1840, 1821, 1910, 1930, 1800, 1833, 1683,\n", " 1671, 1680, 1692, 1800, 1821, 1882, 1642, 1749, 1712, 1661, 1701,\n", " 2141, 1863, 1752, 1740, 1721, 1660, 1930, 2030, 1851, 2131, 1828,\n", " 1840, 2090, 2169, 1988, 2212, 2339, 1989, 2144, 2290, 1920, 2280,\n", " 1809, 2158, 1800, 2133, 2060, 2160, 2001, 2030, 2088, 1951, 2460,\n", " 2021, 2010, 2139, 2160, 2106, 2171, 2113, 2179, 1890, 2179, 2021,\n", " 1969, 2150, 1900, 2267, 1711, 1901, 2114, 2112, 2361, 2130, 2061,\n", " 2121, 1832, 2210, 2130, 2153, 2009, 2100, 2252, 2143, 2252, 2222,\n", " 2121, 2409])" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined = np.concatenate((xa_high, xa_low))\n", "\n", "# Look at it\n", "combined" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }