---
layout: false

.left-column[
## Data from the web
### .orange[Yahoo finance]
]

.right-column[

[yahoo_logo]: pics/yahoo_logo.png "Yahoo finance web site"

Now let's look at some formatted data: Google's (Alphabet's) historical share price data.

Finance is nice because there is lots of nicely formatted data available. Let's go to the Yahoo finance web site and get some data.

| ![yahoo_logo] | http://finance.yahoo.com |
|---------------|--------------------------|

]

---

.left-column[
## Data from the web
### Yahoo finance
#### .orange[Google Stock]
]

.right-column[

[google_chart]: pics/google_chart.png "Chart of Google's stock price"

Now search for Google's (the parent company is called *Alphabet*) financial data. The ticker is **GOOG**.

Note the new URL in your browser. In the [HTTP](https://tools.ietf.org/html/rfc7230) specification this is called a GET request:

![google_chart]

http://finance.yahoo.com/quote/GOOG?p=GOOG

]

---

.left-column[
## Data from the web
### Yahoo finance
#### .orange[Google stock prices]
]

.right-column[

[google_download]: pics/google_data_download.png "The download interface on Yahoo"

Follow the link that says .blue[**Historical Data**].

We can now choose the *time period*, the *type* of data, and the *frequency* of observations. Let's take the defaults and follow the link [*Download data*](http://chart.finance.yahoo.com/table.csv?s=GOOG&a=7&b=8&c=2016&d=8&e=8&f=2016&g=d&ignore=.csv)

![google_download]

You'll now get a file called *table.csv* to download. Take a look at the URL it comes from:

http://chart.finance.yahoo.com/table.csv?s=GOOG&a=7&b=8&c=2016&d=8&e=8&f=2016&g=d&ignore=.csv

.blue[Question: Notice anything about it?]

We'll come back to this!

]

---

.left-column[
## Data from the web
### Yahoo finance
#### Google stock prices
##### .orange[Peek at the data]
]

.right-column[

Let's use the command line tools to quickly get a sense of the data. (Note that there are much better ways to do a more thorough job of this!)
Let's get the **minimum** value and the **maximum** value using the tools that we know.

First, we need to extract the price of the stock from our *table.csv*. Which column is it? Let's take a quick look with **head**:

```.remark
$ head table.csv
Date,Open,High,Low,Close,Volume,Adj Close
2016-09-08,778.590027,780.349976,773.580017,775.320007,1260600,775.320007
2016-09-07,780.00,782.72998,776.200012,780.349976,893700,780.349976
2016-09-06,773.450012,782.00,771.00,780.080017,1439900,780.080017
...
```

Okay, let's use the opening price **Open**.

.blue[Question: How to extract the column we want?]

]

---

.left-column[
## Data from the web
### Yahoo finance
#### Google stock prices
##### .orange[Peek at the data]
]

.right-column[

Here's one way to look at the data from **table.csv**.

```.remark
$ cut -d ',' -f2 table.csv | head -n2
Open
778.590027
```

**cut** extracts the 2nd column (-f2) after we tell it that the file is delimited by commas (-d ',')

**head** shows us the first 2 rows (-n2)

.blue[Question: How to calculate the minimum and the maximum from the column of numbers?]

]

---
name: inverse
layout: true
class: center, middle, inverse

---

---
layout: false

.left-column[
## Data from the web
### Yahoo finance
#### Google stock prices
##### .orange[Peek at the data]
]

.right-column[

Now to get the **minimum** and the **maximum** values we just chain these commands together.

```.remark
$ cut -d ',' -f2 table.csv | sort -n | head -n2 | grep -v Open
767.00
$ cut -d ',' -f2 table.csv | sort -n | tail -n1
785.00
```

Now let's get the **average** of this (dropping the **Open** header line first, so it doesn't get counted as a row).

```.remark
$ cut -d ',' -f2 table.csv | grep -v Open | awk '{a+=$1} END{print a/NR}'
```

Okay, this was a trick question. Here we are piping to a language called **_awk_** that is frequently used by shell scripters. In this case we feed it the numbers one by one and **_awk_** adds them together in a variable we call **a**. At the END of the **_awk_** script (which is just one line long!)
our variable **a** is divided by **NR**, which is **_awk_**'s count of the number of rows. In other words, we just calculated the mean!

This gets at the power of this approach!

]

---
name: inverse
layout: true
class: center, middle, inverse

---

# Step 3: Automating the boring stuff

---
layout: false

## Data from the web: .orange[Automating everything 1]

We know we can chain commands together, but it can be tedious. Let's look back at our download URL.

http://chart.finance.yahoo.com/table.csv?s=GOOG&a=7&b=8&c=2016&d=8&e=8&f=2016&g=d&ignore=.csv

.blue[
Question: See anything interesting?
]

---

## Data from the web: .orange[Automating everything 2]

The ticker symbol .blue[GOOG] is also used in the URL. This is called an [HTTP](https://tools.ietf.org/html/rfc7230) [**GET**](https://tools.ietf.org/html/rfc7230#section-2.1) request and we can take advantage of it!

http://chart.finance.yahoo.com/table.csv?s=GOOG&a=7&b=8&c=2016&d=8&e=8&f=2016&g=d&ignore=.csv

For example, we can re-use the URL and put in .blue[BBRY] for [**BlackBerry**](http://finance.yahoo.com/quote/BBRY?p=BBRY) or .blue[TD] for [**Toronto Dominion**](http://finance.yahoo.com/quote/TD?p=TD) bank.

---

## Data from the web: .orange[Automating everything 3]

Let's make use of a command called **wget** that allows us to download files from the web. We'll also introduce a simple for loop in **bash**.

```.remark
# Notice that the ticker variable is set to GOOG, TD or BBRY as it loops through
for ticker in GOOG TD BBRY
do
    wget -O $ticker.csv "http://chart.finance.yahoo.com/table.csv?s=$ticker&a=7&b=8&c=2016&d=8&e=8&f=2016&g=d&ignore=.csv"
done
```

--------------------

To find out more about any command just use the **man** command:
```.remark
$ man wget
NAME
       Wget - The non-interactive network downloader.

SYNOPSIS
       wget [option]... [URL]...

DESCRIPTION
       GNU Wget is a free utility for non-interactive download of files
       from the Web.  It supports HTTP, HTTPS, and FTP protocols, as
       well as retrieval through HTTP proxies.
...
```
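Since the ticker is just a query-string parameter in the GET request, the substitution can be captured in a small helper. A sketch, where `yahoo_url` is a hypothetical function name and the fixed date parameters are copied verbatim from the URL above:

```.remark
# Hypothetical helper: build the download URL for any ticker.
# The fixed query parameters (a-g, the date range) come straight
# from the URL above.
yahoo_url () {
    echo "http://chart.finance.yahoo.com/table.csv?s=$1&a=7&b=8&c=2016&d=8&e=8&f=2016&g=d&ignore=.csv"
}

yahoo_url TD
```

Then something like `wget -O TD.csv "$(yahoo_url TD)"` downloads any ticker without editing the URL by hand.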
.blue[
Question: What do I mean by that?
]

---

## .blue[LESSON - Make everything reproducible.]

Be able to start with your raw data and do your entire analysis from scratch.

- This means that you can always update your data source and see if the results still hold.
- We did this when we downloaded data for .blue[TD Bank] and .blue[BlackBerry].
- We were able to re-use our script.

We'll see this theme again in other aspects of our data munging.
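To see the whole idea end to end, here is a sketch that wraps the earlier **cut**/**awk** pipeline in a reusable function. `summarize_open` is a hypothetical name, and a tiny hand-made file stands in for a real download:

```.remark
# Hypothetical helper: print the min, max and mean of the Open
# column (column 2) of a table.csv-style file.
summarize_open () {
    cut -d ',' -f2 "$1" | grep -v Open |
        awk 'NR==1 {min=$1; max=$1}
             {if ($1 < min) min=$1; if ($1 > max) max=$1; a+=$1}
             END {print "min", min, "max", max, "mean", a/NR}'
}

# A tiny hand-made file standing in for a real download:
printf 'Date,Open\n2016-09-08,778\n2016-09-07,780\n2016-09-06,773\n' > sample.csv
summarize_open sample.csv    # prints: min 773 max 780 mean 777
```

With a helper like this, re-running the whole analysis on a freshly downloaded file is one command per ticker.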
```.remark
# convert everything to lower case
cat myfile | tr '[:upper:]' '[:lower:]'

# print only the lines containing "Steve-o"
grep "Steve-o" myfile

# replace "Steve-o" with "Justin"
cat myfile | sed -e "s/Steve-o/Justin/"
```
Also, be sure to try

```.remark
$ man [your command]
```
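As a quick sketch of what the **tr** and **sed** commands above do, using a hypothetical one-line *myfile*:

```.remark
# A made-up one-line input file for illustration
printf 'Steve-o says HELLO\n' > myfile

cat myfile | tr '[:upper:]' '[:lower:]'   # prints: steve-o says hello
cat myfile | sed -e "s/Steve-o/Justin/"   # prints: Justin says HELLO
```

Note that neither command changes *myfile* itself; both write the transformed text to standard output, ready to be piped onward.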