├── .gitignore ├── Dockerfile ├── LICENSE ├── README.md ├── airline └── Exploration of Airline On-Time Performance.ipynb ├── bluemix-spark-cloudant ├── 1-Streaming-Meetups-to-IBM-Cloudant-using-Spark.ipynb ├── 2-Reading-Meetups-from-IBM-Cloudant-using-Spark.ipynb └── README.md ├── elasticity ├── Elasticity Experiment.ipynb ├── LICENSE ├── README.md └── springData.txt ├── hacks ├── IPython Parallel and R.ipynb ├── Webserver in a Notebook.ipynb └── instaquery.ipynb ├── hn ├── Hacker News Runner.ipynb └── Hacker News and AlchemyAPI.ipynb ├── index.ipynb ├── mlb ├── README.md └── mlb-salaries.ipynb ├── noaa ├── README.md ├── etl │ ├── README.md │ ├── noaa_hdta_etl.ipynb │ ├── noaa_hdta_etl_csv_tools.ipynb │ └── noaa_hdta_etl_hdf_tools.ipynb ├── hdtadash │ ├── README.md │ ├── data │ │ ├── ghcnd-stations.txt │ │ └── station_summaries.tar │ ├── dev_weather_dashboard.ipynb │ ├── folium_map.ipynb │ ├── urth-core-watch.html │ ├── urth-raw-html.html │ ├── urth_env_test.ipynb │ └── weather_dashboard.ipynb └── tmaxfreq │ ├── README.md │ ├── ghcnd-stations.txt │ ├── noaaquery_tmaxfreq.ipynb │ └── noaaquery_tmaxfreq_tools.ipynb ├── requirements.txt ├── scikit-learn └── sklearn_cookbook.ipynb ├── tax-maps ├── Interactive Data Maps.ipynb └── data │ └── us-states-10m.json └── united-nations ├── README.md └── senegal_population_trends.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled source # 2 | ################### 3 | *.com 4 | *.class 5 | *.dll 6 | *.exe 7 | *.o 8 | *.so 9 | 10 | # Packages # 11 | ############ 12 | # it's better to unpack these files and commit the raw source 13 | # git has its own built in compression methods 14 | *.7z 15 | *.dmg 16 | *.gz 17 | *.iso 18 | *.jar 19 | *.rar 20 | *.zip 21 | 22 | # Logs and databases # 23 | ###################### 24 | *.log 25 | *.sql 26 | *.sqlite 27 | 28 | # OS generated files # 29 | ###################### 30 | .DS_Store 31 | .DS_Store? 32 | ._* 33 | .Spotlight-V100 34 | .Trashes 35 | ehthumbs.db 36 | Thumbs.db 37 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM andrewosh/binder-base 2 | 3 | USER root 4 | 5 | RUN curl -sL https://deb.nodesource.com/setup_0.12 | bash - && \ 6 | apt-get install -y nodejs && \ 7 | npm install -g bower 8 | 9 | USER main 10 | 11 | COPY requirements.txt /tmp/requirements.txt 12 | RUN cd /tmp && \ 13 | pip install -r requirements.txt && \ 14 | bash -c "source activate python3 && \ 15 | pip install -r requirements.txt" 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015 IBM Corporation 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 
12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Jupyter Samples 2 | 3 | [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org/repo/ibm-et/jupyter-samples) 4 | 5 | This repository contains sample [IPython](http://ipython.org) / [Jupyter](http://jupyter.org/) notebooks ranging from tutorials on using popular open source repositories to sample analyses on public data sets to neat notebook hacks. 6 | 7 | ## View the Notebooks 8 | 9 | You can view the notebooks here on GitHub. See the [index](index.ipynb) for the complete list. 10 | 11 | ## Try the Notebooks 12 | 13 | You can run these notebooks yourself in a [Binder](https://mybinder.org). Click the *Launch Binder* badge above to get your own Jupyter Notebook server with all the prereqs installed. 14 | 15 | ## License 16 | 17 | Notebooks are copyright (c) 2015 IBM Corporation under the MIT license. See LICENSE for details. 18 | 19 | Sample data files, libraries, techniques, external publications, etc. are cited in the notebooks in which they are used. Those works remain under the copyright of their respective owners. 20 | -------------------------------------------------------------------------------- /bluemix-spark-cloudant/1-Streaming-Meetups-to-IBM-Cloudant-using-Spark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# (1) Streaming Meetups to IBM Cloudant using Spark" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this notebook we are going to utilize the [IBM Apache Spark](https://console.ng.bluemix.net/catalog/services/apache-spark/) and [IBM Cloudant](https://console.ng.bluemix.net/catalog/services/cloudant-nosql-db/) [Bluemix](https://console.ng.bluemix.net/) services to process and persist data from the [Meetup rsvp stream](http://www.meetup.com/meetup_api/docs/stream/2/rsvps/). On the backend the IBM Apache Spark service will be using the the [Spark Kernel](https://github.com/ibm-et/spark-kernel). The Spark Kernel provides an interface that allows clients to interact with a Spark Cluster. Clients can send libraries and snippets of code that are interpreted and run against a Spark context. This notebook should be run on IBM Bluemix using the IBM Apache Spark and IBM Cloudant services available in the [IBM Bluemix Catalog](https://console.ng.bluemix.net/catalog/)." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Prerequisites\n", 22 | "* IBM Bluemix account - [Sign up for IBM Bluemix](https://console.ng.bluemix.net/registration/)\n", 23 | "* IBM Spark service instance - [Create an IBM Apache Spark Service](https://console.ng.bluemix.net/catalog/apache-spark/). 
Note once the IBM Spark service is created you can then create a Notebook on top of the IBM Spark service.\n", 24 | "* IBM Cloudant service instance - [Create a IBM Cloudant NoSQL DB Service](https://console.ng.bluemix.net/catalog/cloudant-nosql-db/). Note once the IBM Cloudant service is created you can then create a meetup_group database by launching the IBM Cloudant service, navigating to the databases tab within the dashboard, and selecting create database.\n", 25 | "* IBM Cloudant service credentials - Once in a notebook use the Data Source option on the right Palete to Add a data source. After the data source is configured and linked to the created notebook you can use the Insert to code link which will add metadata regarding the datasource to your notebook. You will want to keep track of the hostname, username, and password to be used for configuration in step (2) below." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "\n", 33 | "## 1. Add Dependencies/Jars\n", 34 | "In order to run this demonstration notebook we are using the cloudant-spark and scalawebsocket libraries. These scala dependencies/jars are added to our environment using the AddJar magic from the Spark Kernel, which adds the specified jars to the Spark Kernel and Spark cluster.\n", 35 | "* scalawebsocket - Used for streaming data\n", 36 | "* async-http-client - Used for streaming data\n", 37 | "* cloudant-spark - Allows us to perform Spark SQL queries against RDDs backed by IBM Cloudant\n", 38 | "* scalalogging-log4j - Logging mechanism" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 1, 44 | "metadata": { 45 | "collapsed": false 46 | }, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "Using cached version of scalawebsocket_2.10-0.1.1.jar\n", 53 | "Using cached version of cloudant-spark.jar\n", 54 | "Using cached version of async-http-client-1.8.0.jar\n", 55 | "Using cached version of scalalogging-log4j_2.10-1.1.0.jar\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "%AddJar http://central.maven.org/maven2/eu/piotrbuda/scalawebsocket_2.10/0.1.1/scalawebsocket_2.10-0.1.1.jar\n", 61 | "%AddJar https://dl.dropboxusercontent.com/u/19043899/cloudant-spark.jar\n", 62 | "%AddJar http://central.maven.org/maven2/com/ning/async-http-client/1.8.0/async-http-client-1.8.0.jar\n", 63 | "%AddJar http://central.maven.org/maven2/com/typesafe/scalalogging-log4j_2.10/1.1.0/scalalogging-log4j_2.10-1.1.0.jar" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "## 2. Initialize Spark context with IBM Cloudant configurations\n", 71 | "The IBM Bluemix Apache Spark service notebook comes with a Spark context ready to use, but we are going to have to modify this one to configure built in support for IBM Cloudant. Note for the demo purposes we are setting the Spark master to run in local mode, but by default the IBM Spark service will run in cluster mode. Update the HOST, USERNAME, and PASSWORD below with the credentials to connect to your IBM Bluemix Cloudant service which our demo depends on. You can get these credentials from the Palette on the right by clicking on the Data Source option. If your data source does not exist add it using the Add Source button or if it already does you can use the \"Insert to code\" button to add the information to the notebook." 
72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "import org.apache.spark.rdd.RDD\n", 83 | "import org.apache.spark.sql.{DataFrame, SQLContext}\n", 84 | "import org.apache.spark.storage.StorageLevel\n", 85 | "import org.apache.spark.streaming.{Time, Seconds, StreamingContext}\n", 86 | "import org.apache.spark.{SparkConf, SparkContext}\n", 87 | "\n", 88 | "val conf = sc.getConf\n", 89 | "conf.set(\"cloudant.host\",\"HOST\")\n", 90 | "conf.set(\"cloudant.username\", \"USERNAME\")\n", 91 | "conf.set(\"cloudant.password\",\"PASSWORD\")\n", 92 | "\n", 93 | "conf.setJars(ClassLoader.getSystemClassLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSet.toSeq ++ kernel.interpreter.classLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSeq)\n", 94 | "conf.set(\"spark.driver.allowMultipleContexts\", \"true\")\n", 95 | "conf.set(\"spark.master\",\"local[*]\")\n", 96 | "val scCloudant = new SparkContext(conf)\n", 97 | "scCloudant.getConf.getAll.foreach(println)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "## 3. Write to the IBM Cloudant Bluemix service\n", 105 | "Using the cloudant-spark library we are able to seemlessly interact with our IBM Cloudant Bluemix Service meetup_group database through the abstraction of Spark Dataframes." 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 3, 111 | "metadata": { 112 | "collapsed": true 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "def writeToDatabse(databaseName: String, df: DataFrame) = {\n", 117 | " df.write.format(\"com.cloudant.spark\").save(databaseName)\n", 118 | "}" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "## 4. Create the WebSocketReceiver for our Streaming Context\n", 126 | "First we must create a custom WebSocketReceiver that extends [Receiver](http://spark.apache.org/docs/latest/streaming-custom-receivers.html \"Spark Streaming Custom Receivers\") and implements the required onStart and onStop functions." 
127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 4, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "import org.apache.spark.storage.StorageLevel\n", 138 | "import scalawebsocket.WebSocket\n", 139 | "import org.apache.spark.streaming.receiver.Receiver\n", 140 | "import org.apache.spark.Logging\n", 141 | "\n", 142 | "import org.json4s._\n", 143 | "import org.json4s.DefaultFormats\n", 144 | "import org.json4s.jackson.JsonMethods._\n", 145 | "\n", 146 | "case class MeetupEvent(event_id: String,event_name: Option[String],event_url: Option[String],time: Long)\n", 147 | "case class MeetupGroupTopics(topic_name: Option[String],urlkey: Option[String])\n", 148 | "case class MeetupGroup( group_id: Long, group_name: String,group_city: Option[String],group_country: Option[String], group_state: Option[String], group_urlname: Option[String], group_lat: Option[String],group_lon: Option[String],group_topics: List[MeetupGroupTopics]) \n", 149 | "case class MeetupMember( member_id: Long, member_name: Option[String],other_services: Option[String],photo: Option[String])\n", 150 | "case class MeetupVenue(venue_id: Long, venue_name: Option[String],lat: Option[String], lon: Option[String])\n", 151 | "case class MeetupRsvp(rsvp_id: Long,response: String,guests: Int, mtime: Long, visibility : String, event: MeetupEvent, group: MeetupGroup, member: MeetupMember, venue: MeetupVenue)\n", 152 | "\n", 153 | "\n", 154 | "class WebSocketReceiver(url: String, storageLevel: StorageLevel) extends Receiver[MeetupRsvp](storageLevel) with Logging {\n", 155 | " private var webSocket: WebSocket = _\n", 156 | " \n", 157 | " def onStart() {\n", 158 | " try{\n", 159 | " logInfo(\"Connecting to WebSocket: \" + url)\n", 160 | " val newWebSocket = WebSocket().open(url).onTextMessage({ msg: String => parseJson(msg) })\n", 161 | " setWebSocket(newWebSocket)\n", 162 | " logInfo(\"Connected to: WebSocket\" + url)\n", 163 | " } catch {\n", 164 | " case e: Exception => restart(\"Error starting WebSocket stream\", e)\n", 165 | " }\n", 166 | " }\n", 167 | "\n", 168 | " def onStop() {\n", 169 | " setWebSocket(null)\n", 170 | " logInfo(\"WebSocket receiver stopped\")\n", 171 | " }\n", 172 | "\n", 173 | " private def setWebSocket(newWebSocket: WebSocket) = synchronized {\n", 174 | " if (webSocket != null) {\n", 175 | " webSocket.shutdown()\n", 176 | " }\n", 177 | " webSocket = newWebSocket\n", 178 | " }\n", 179 | "\n", 180 | " private def parseJson(jsonStr: String): Unit =\n", 181 | " {\n", 182 | " try {\n", 183 | " implicit lazy val formats = DefaultFormats\n", 184 | " val json = parse(jsonStr)\n", 185 | " val rsvp = json.extract[MeetupRsvp]\n", 186 | " store(rsvp)\n", 187 | " } catch {\n", 188 | " case e: Exception => e.getMessage()\n", 189 | " }\n", 190 | " }\n", 191 | "}" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## 5. Persist the Meetup stream to IBM Cloudant\n", 199 | "This function creates a streaming context that operates on a 10 second window. Note due to our implementation in the custom WebSocketReceiver we are able to transform the text content of the websocket to JSON and extract the MeetupRsvp from it. Then for each MeetupRsvp RDD in our stream we are able to filter the stream where the group_state equals Texas. Lastly we convert the MeetupRsvp to a dataframe and utilize our writeToDatabase function to persist the instance to IBM Cloudant. 
As this is a quick demo we set the timeout of the streaming context to be one minute so we can move onto our analysis of the data in IBM Cloudant." 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 5, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "Use dbName=meetup_group, indexName=null, jsonstore.rdd.partitions=5, jsonstore.rdd.maxInPartition=-1, jsonstore.rdd.minInPartition=10, jsonstore.rdd.requestTimeout=100000,jsonstore.rdd.concurrentSave=-1,jsonstore.rdd.bulkSize=1\n", 214 | "+--------+--------------------+----------+-------------+-----------+--------------------+---------+---------+--------------------+\n", 215 | "|group_id| group_name|group_city|group_country|group_state| group_urlname|group_lat|group_lon| group_topics|\n", 216 | "+--------+--------------------+----------+-------------+-----------+--------------------+---------+---------+--------------------+\n", 217 | "| 3448252|Network After Wor...| Austin| us| TX|NetworkAfterWork-...| 30.27| -97.74|List([Professiona...|\n", 218 | "+--------+--------------------+----------+-------------+-----------+--------------------+---------+---------+--------------------+\n", 219 | "\n" 220 | ] 221 | }, 222 | { 223 | "data": { 224 | "text/plain": [ 225 | "false" 226 | ] 227 | }, 228 | "execution_count": 5, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | }, 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "Use dbName=meetup_group, indexName=null, jsonstore.rdd.partitions=5, jsonstore.rdd.maxInPartition=-1, jsonstore.rdd.minInPartition=10, jsonstore.rdd.requestTimeout=100000,jsonstore.rdd.concurrentSave=-1,jsonstore.rdd.bulkSize=1\n", 237 | "+--------+--------------------+----------+-------------+-----------+----------------+---------+---------+--------------------+\n", 238 | "|group_id| group_name|group_city|group_country|group_state| group_urlname|group_lat|group_lon| group_topics|\n", 239 | "+--------+--------------------+----------+-------------+-----------+----------------+---------+---------+--------------------+\n", 240 | "| 2672242|OpenStack Austin ...| Austin| us| TX|OpenStack-Austin| 30.4| -97.75|List([Virtualizat...|\n", 241 | "+--------+--------------------+----------+-------------+-----------+----------------+---------+---------+--------------------+\n", 242 | "\n", 243 | "Use dbName=meetup_group, indexName=null, jsonstore.rdd.partitions=5, jsonstore.rdd.maxInPartition=-1, jsonstore.rdd.minInPartition=10, jsonstore.rdd.requestTimeout=100000,jsonstore.rdd.concurrentSave=-1,jsonstore.rdd.bulkSize=1\n", 244 | "+--------+--------------------+----------+-------------+-----------+--------------------+---------+---------+--------------------+\n", 245 | "|group_id| group_name|group_city|group_country|group_state| group_urlname|group_lat|group_lon| group_topics|\n", 246 | "+--------+--------------------+----------+-------------+-----------+--------------------+---------+---------+--------------------+\n", 247 | "|11106032|Holistic & Energy...|Richardson| us| TX|Holistic-Energy-H...| 32.96| -96.75|List([Alternative...|\n", 248 | "+--------+--------------------+----------+-------------+-----------+--------------------+---------+---------+--------------------+\n", 249 | "\n", 250 | "Use dbName=meetup_group, indexName=null, jsonstore.rdd.partitions=5, jsonstore.rdd.maxInPartition=-1, jsonstore.rdd.minInPartition=10, 
jsonstore.rdd.requestTimeout=100000,jsonstore.rdd.concurrentSave=-1,jsonstore.rdd.bulkSize=1\n", 251 | "+--------+--------------------+----------+-------------+-----------+-------------+---------+---------+--------------------+\n", 252 | "|group_id| group_name|group_city|group_country|group_state|group_urlname|group_lat|group_lon| group_topics|\n", 253 | "+--------+--------------------+----------+-------------+-----------+-------------+---------+---------+--------------------+\n", 254 | "| 3282852|Dallas Agile Lead...| Irving| us| TX| Dallas-ALN| 32.85| -96.96|List([Self-Improv...|\n", 255 | "+--------+--------------------+----------+-------------+-----------+-------------+---------+---------+--------------------+\n", 256 | "\n" 257 | ] 258 | } 259 | ], 260 | "source": [ 261 | "def persistStream(conf: SparkConf) = {\n", 262 | " val ssc = new StreamingContext(conf, Seconds(10))\n", 263 | " val stream = ssc.receiverStream[MeetupRsvp](new WebSocketReceiver(\"ws://stream.meetup.com/2/rsvps\", StorageLevel.MEMORY_ONLY_SER))\n", 264 | " stream.foreachRDD((rdd: RDD[MeetupRsvp], time: Time) => {\n", 265 | " val sqlContext = new SQLContext(rdd.sparkContext) \n", 266 | " import sqlContext.implicits._\n", 267 | " val df = rdd.map(data => {\n", 268 | " data.group\n", 269 | " }).filter(_.group_state.getOrElse(\"\").equals(\"TX\")).toDF()\n", 270 | " if(df.collect().length > 0) {\n", 271 | " writeToDatabse(\"meetup_group\", df)\n", 272 | " df.show()\n", 273 | " }\n", 274 | " })\n", 275 | " ssc.start()\n", 276 | " ssc.awaitTerminationOrTimeout(60000)\n", 277 | "}\n", 278 | "\n", 279 | "persistStream(conf)" 280 | ] 281 | } 282 | ], 283 | "metadata": { 284 | "kernelspec": { 285 | "display_name": "Scala 2.10", 286 | "language": "scala", 287 | "name": "spark" 288 | }, 289 | "language_info": { 290 | "name": "scala" 291 | } 292 | }, 293 | "nbformat": 4, 294 | "nbformat_minor": 0 295 | } -------------------------------------------------------------------------------- /bluemix-spark-cloudant/2-Reading-Meetups-from-IBM-Cloudant-using-Spark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# (2) Reading Meetups from IBM Cloudant using Spark" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook will be the second in our series, we are going to utilize the [Apache Spark](http://spark.apache.org/) and [Cloudant](https://cloudant.com/) [Bluemix](https://console.ng.bluemix.net/) services to read data into Spark Dataframes from our IBM Cloudant Bluemix service." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Please reference the first notebook in our series, [Streaming Meetups to IBM Cloudant using Spark](https://github.com/ibm-et/jupyter-samples/blob/master/bluemix-spark-cloudant/1-Streaming-Meetups-to-IBM-Cloudant-using-Spark.ipynb), for a detailed list of prerequisites to get up and running." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "\n", 29 | "## 1. Add Dependencies/Jars\n", 30 | "In order to run this demonstration notebook we are using the cloudant-spark library. 
These scala dependencies/jars are added to our environment using the AddJar magic from the Spark Kernel, which adds the specified jars to the Spark Kernel and Spark cluster.\n", 31 | "* cloudant-spark - Allows us to perform Spark SQL queries against RDDs backed by Cloudant" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [ 41 | { 42 | "name": "stdout", 43 | "output_type": "stream", 44 | "text": [ 45 | "Using cached version of cloudant-spark.jar\n" 46 | ] 47 | } 48 | ], 49 | "source": [ 50 | "%AddJar https://dl.dropboxusercontent.com/u/19043899/cloudant-spark.jar" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "## 2. Initialize spark context with cloudant configurations\n", 58 | "The Bluemix Apache Spark service notebook comes with a spark context ready to use, but we are going to have to modify this one to configure built in support for cloudant. Note for the demo purposes we are setting the spark master to run in local mode, but by default the Spark service will run in cluster mode. Update the HOST, USERNAME, and PASSWORD below with the credentials to connect to your Bluemix Cloudant service which our demo depends on. You can get these credentials from the Palette on the right by clicking on the Data Source option. If your data source does not exist add it using the Add Source button or if it already does you can use the \"Insert to code\" button to add the information to the notebook." 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": { 65 | "collapsed": false 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "import org.apache.spark.rdd.RDD\n", 70 | "import org.apache.spark.sql.{DataFrame, SQLContext}\n", 71 | "import org.apache.spark.storage.StorageLevel\n", 72 | "import org.apache.spark.streaming.{Time, Seconds, StreamingContext}\n", 73 | "import org.apache.spark.{SparkConf, SparkContext}\n", 74 | "\n", 75 | "val conf = sc.getConf\n", 76 | "conf.set(\"cloudant.host\",\"HOST\")\n", 77 | "conf.set(\"cloudant.username\", \"USERNAME\")\n", 78 | "conf.set(\"cloudant.password\",\"PASSWORD\")\n", 79 | "\n", 80 | "conf.setJars(ClassLoader.getSystemClassLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSet.toSeq ++ kernel.interpreter.classLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSeq)\n", 81 | "conf.set(\"spark.driver.allowMultipleContexts\", \"true\")\n", 82 | "conf.set(\"spark.master\",\"local[*]\")\n", 83 | "val scCloudant = new SparkContext(conf)\n", 84 | "scCloudant.getConf.getAll.foreach(println)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "## 3. Read from the IBM Cloudant Bluemix service\n", 92 | "Using the cloudant-spark library we are able to seemlessly interact with our IBM Cloudant Bluemix Service meetup_group database through the abstraction of Spark Dataframes." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 3, 98 | "metadata": { 99 | "collapsed": true 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "def readFromDatabase(sqlContext: SQLContext, databaseName: String) = {\n", 104 | " val df = sqlContext.read.format(\"com.cloudant.spark\").load(databaseName)\n", 105 | " df\n", 106 | "}" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## 4. 
Read from IBM Cloudant and filter on our dataframe\n", 114 | "First we must create an SQL context from our Spark context we created in step 2. We can then simply use our readFromDatabase function previously defined to perform Spark Dataframe operations such as filtering on fields." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 4, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Use dbName=meetup_group, indexName=null, jsonstore.rdd.partitions=5, jsonstore.rdd.maxInPartition=-1, jsonstore.rdd.minInPartition=10, jsonstore.rdd.requestTimeout=100000,jsonstore.rdd.concurrentSave=-1,jsonstore.rdd.bulkSize=1\n", 129 | "+--------------------+--------------------+----------+-------------+--------+---------+---------+--------------------+-----------+--------------------+--------------------+\n", 130 | "| _id| _rev|group_city|group_country|group_id|group_lat|group_lon| group_name|group_state| group_topics| group_urlname|\n", 131 | "+--------------------+--------------------+----------+-------------+--------+---------+---------+--------------------+-----------+--------------------+--------------------+\n", 132 | "|1d335ff5eb23f7f00...|1-acea197c65b80be...|Richardson| us|15196832| 32.96| -96.75|Richardson/Plano ...| TX|List([Referral Ma...|RichardsonPlanoNe...|\n", 133 | "|1d335ff5eb23f7f00...|1-72d2c8a62057bcd...| Austin| us|18179233| 30.24| -97.76|Doodle Crew USA :...| TX|List([Cartoonists...| Doodle-Dudes-TX|\n", 134 | "|2c9d4bb8ad5eaa34f...|1-bbc4604e3e53946...| Spring| us|18652297| 30.14| -95.47|Spring/Woodlands ...| TX|List([Dining Out,...|Spring-Woodlands-...|\n", 135 | "|2c9d4bb8ad5eaa34f...|1-ec2d7e34afb7173...| Austin| us| 5256562| 30.24| -97.76|Austin Associatio...| TX|List([Investing,i...|Austin-Associatio...|\n", 136 | "|2c9d4bb8ad5eaa34f...|1-acea197c65b80be...|Richardson| us|15196832| 32.96| -96.75|Richardson/Plano ...| TX|List([Referral Ma...|RichardsonPlanoNe...|\n", 137 | "+--------------------+--------------------+----------+-------------+--------+---------+---------+--------------------+-----------+--------------------+--------------------+\n", 138 | "\n", 139 | "+--------------------+--------------------+----------+-------------+--------+---------+---------+--------------------+-----------+--------------------+--------------------+\n", 140 | "| _id| _rev|group_city|group_country|group_id|group_lat|group_lon| group_name|group_state| group_topics| group_urlname|\n", 141 | "+--------------------+--------------------+----------+-------------+--------+---------+---------+--------------------+-----------+--------------------+--------------------+\n", 142 | "|1d335ff5eb23f7f00...|1-72d2c8a62057bcd...| Austin| us|18179233| 30.24| -97.76|Doodle Crew USA :...| TX|List([Cartoonists...| Doodle-Dudes-TX|\n", 143 | "|2c9d4bb8ad5eaa34f...|1-ec2d7e34afb7173...| Austin| us| 5256562| 30.24| -97.76|Austin Associatio...| TX|List([Investing,i...|Austin-Associatio...|\n", 144 | "|2c9d4bb8ad5eaa34f...|1-b07f0f605694ffc...| Austin| us| 7894862| 30.26| -97.87| Dance Walk! Austin| TX|List([New In Town...| Dance-Walk-Austin|\n", 145 | "|39cc2df929299bb06...|1-f1b6882f0b1912a...| Austin| us| 3648022| 30.27| -97.74|Austin Geeks and ...| TX|List([Doctor Who,...|AustinGeeksandGamers|\n", 146 | "|39cc2df929299bb06...|1-ee5bc44c0904ece...| Austin| us| 1758659| 30.27| -97.74|AWESOME AUSTIN! 
I...| TX|List([Dining Out,...| AustinTexas|\n", 147 | "+--------------------+--------------------+----------+-------------+--------+---------+---------+--------------------+-----------+--------------------+--------------------+\n", 148 | "\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "val sqlContext = new SQLContext(scCloudant)\n", 154 | "import sqlContext.implicits._\n", 155 | "val df = readFromDatabase(sqlContext, \"meetup_group\")\n", 156 | "df.show(5)\n", 157 | "df.filter(df(\"group_city\")===\"Austin\").show(5)" 158 | ] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Scala 2.10", 164 | "language": "scala", 165 | "name": "spark" 166 | }, 167 | "language_info": { 168 | "name": "scala" 169 | } 170 | }, 171 | "nbformat": 4, 172 | "nbformat_minor": 0 173 | } -------------------------------------------------------------------------------- /bluemix-spark-cloudant/README.md: -------------------------------------------------------------------------------- 1 | ## Notebooks 2 | 3 | These notebooks are written assuming that you are running them on [IBM Bluemix](https://console.ng.bluemix.net/), using the [IBM Apache Spark](https://console.ng.bluemix.net/catalog/services/apache-spark/) and [IBM Cloudant](https://console.ng.bluemix.net/catalog/services/cloudant-nosql-db/) Bluemix services. 4 | 5 | The notebooks utilize the [IBM Apache Spark](https://console.ng.bluemix.net/catalog/services/apache-spark/) and [IBM Cloudant](https://console.ng.bluemix.net/catalog/services/cloudant-nosql-db/) Bluemix services to process and persist data from the [Meetup rsvp stream](http://www.meetup.com/meetup_api/docs/stream/2/rsvps/). 6 | 7 | * [(1) Streaming Meetups to IBM Cloudant using Spark](https://github.com/ibm-et/jupyter-samples/blob/master/bluemix-spark-cloudant/1-Streaming-Meetups-to-IBM-Cloudant-using-Spark.ipynb) 8 | * [(2) Reading Meetups from IBM Cloudant using Spark](https://github.com/ibm-et/jupyter-samples/blob/master/bluemix-spark-cloudant/2-Reading-Meetups-from-IBM-Cloudant-using-Spark.ipynb) 9 | 10 | 11 | ## Prerequisites 12 | * IBM Bluemix account - [Sign up for IBM Bluemix](https://console.ng.bluemix.net/registration/) 13 | * IBM Spark service instance - [Create an IBM Apache Spark Service](https://console.ng.bluemix.net/catalog/apache-spark/). Note once the IBM Spark service is created you can then use the create notebook option and upload the IBM Cloudant Spark notebooks which are ready to run. 14 | * IBM Cloudant service instance - [Create a IBM Cloudant NoSQL DB Service](https://console.ng.bluemix.net/catalog/cloudant-nosql-db/). Note once the IBM Cloudant service is created you can then create a meetup_group database by launching the IBM Cloudant service, navigating to the databases tab within the dashboard, and selecting create database. 15 | * IBM Cloudant service credentials - Once in a notebook use the Data Source option on the right Palete to Add a data source. After the data source is configured and linked to the created notebook you can use the Insert to code link which will add metadata regarding the datasource to your notebook. You will want to keep track of the hostname, username, and password to be used for configuration. 16 | 17 | ## License 18 | 19 | Notebooks are copyright (c) 2015 IBM Corporation under the MIT license. See LICENSE for details. 20 | 21 | Sample data files, libraries, techniques, external publications, etc. are cited in the notebooks in which they are used. 
Those works remain under the copyright of their respective owners. 22 | -------------------------------------------------------------------------------- /elasticity/Elasticity Experiment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:fbfe9280e01693debbb46dc82549e79d2c406a48f71181d301a4147f3c8ed598" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# Hooke's Law of Elasticity\n", 16 | "This sample notebook is based on the experiment outlined in Chapter 15 of the book [Computation and Programming in Python](http://mitpress.mit.edu/books/introduction-computation-and-programming-using-python-0).\n", 17 | "\n", 18 | "## Overview\n", 19 | "Hooke's Law suggests that the force associated with a spring when it is released is linearly related to the distance it has been compressed. \n", 20 | "\n", 21 | "An experiment was conducted to capture data for a number of springs across various compression lengths. The intent of the experiment is to demonstrate Hooke's Law by showing that the results of the experiment lie on a straight line. Yet experimental data tends not to be perfect, so we expect the results to lie around the line, not necessarily on it. \n", 22 | "\n", 23 | "Could we use the results to fit a model that will allow us to depict the linear premise posited by Hooke? Can we use a linear regression to solve the problem?" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Get Data\n", 31 | "\n", 32 | "1. [Download](http://mitpress.mit.edu/sites/all/modules/pubdlcnt/pubdlcnt.php?file=/sites/default/files/titles/content/9780262525008_Code.zip&nid=205426) the code that accompanies the aforementioned book. \n", 33 | "2. Unzip the downloaded source code\n", 34 | "3. Drag and Drop the file /codeForWebSite/Chapter 15/springData.txt into the KnowledgeAnyhow Workbench.\n", 35 | "4. Tag the resulting data item with \"samples, linear regression\"" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "## Extract, Transform and Load\n", 43 | "Load the test results from a text file, \"springData.txt\", and establish an array containing distances and masses." 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "collapsed": false, 49 | "input": [ 50 | "def getData(fileName):\n", 51 | " dataFile = open(fileName, 'r')\n", 52 | " distances = []\n", 53 | " masses = []\n", 54 | " discardHeader = dataFile.readline()\n", 55 | " for line in dataFile:\n", 56 | " d, m = line.split(' ')\n", 57 | " distances.append(float(d))\n", 58 | " masses.append(float(m))\n", 59 | " dataFile.close()\n", 60 | " return (masses, distances)\n", 61 | "\n", 62 | "getData('/resources/springData.txt')" 63 | ], 64 | "language": "python", 65 | "metadata": {}, 66 | "outputs": [] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "## Data Exploration\n", 73 | "As mentioned, experimental data typically does not result in a straight line. We should plot our sample data to establish a baseline of test results."
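A brief aside on the arithmetic used in the plotting and fitting cells that follow (standard physics, stated here only for clarity; the symbols F, m, g, k and x are introduced for illustration and do not appear in the original notebook): each hanging mass m exerts a gravitational force F = mg with g ≈ 9.81 m/s², and Hooke's Law models the spring's displacement x as proportional to that force, F = kx. Plotting distance against force should therefore give a line of slope 1/k, which is why the code below computes forces = masses*9.81 and reports the spring constant as k = 1.0/a, where a is the fitted slope.

$$F = mg \quad (g \approx 9.81\ \mathrm{m/s^2}), \qquad F = kx \;\Rightarrow\; x = \tfrac{1}{k}\,F$$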
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "collapsed": false, 79 | "input": [ 80 | "%matplotlib inline" 81 | ], 82 | "language": "python", 83 | "metadata": {}, 84 | "outputs": [] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "collapsed": false, 89 | "input": [ 90 | "import pylab, random\n", 91 | "\n", 92 | "def plotData(inputFile):\n", 93 | " masses, distances = getData(inputFile)\n", 94 | " masses = pylab.array(masses)\n", 95 | " distances = pylab.array(distances)\n", 96 | " forces = masses*9.81\n", 97 | " pylab.plot(forces, distances, 'bo',\n", 98 | " label = 'Measured displacements')\n", 99 | " pylab.title('Measured Displacement of Spring')\n", 100 | " pylab.xlabel('|Force| (Newtons)')\n", 101 | " pylab.ylabel('Distance (meters)')\n", 102 | "\n", 103 | "plotData('/resources/springData.txt')" 104 | ], 105 | "language": "python", 106 | "metadata": {}, 107 | "outputs": [] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "### Observations\n", 114 | "As expected, the results of our experimental data do not depict a perfect slope." 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Predictive Analysis\n", 122 | "How do we determine the best fit line (or curve) that most accurately represents our data while assuming no measurement error? \n", 123 | "\n", 124 | "### Fit Data\n", 125 | "A common approach to this problem is to use a **least squares** function to predict the optimal fit for our data." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "collapsed": false, 131 | "input": [ 132 | "def fitData(inputFile):\n", 133 | " masses, distances = getData(inputFile)\n", 134 | " distances = pylab.array(distances)\n", 135 | " masses = pylab.array(masses)\n", 136 | " forces = masses*9.81\n", 137 | " pylab.plot(forces, distances, 'bo',\n", 138 | " label = 'Measured displacements')\n", 139 | " pylab.title('Measured Displacement of Spring')\n", 140 | " pylab.xlabel('|Force| (Newtons)')\n", 141 | " pylab.ylabel('Distance (meters)')\n", 142 | " #find linear fit\n", 143 | " a,b = pylab.polyfit(forces, distances, 1)\n", 144 | " predictedDistances = a*pylab.array(forces) + b\n", 145 | " k = 1.0/a\n", 146 | " pylab.plot(forces, predictedDistances,\n", 147 | " label = 'Displacements predicted by\\nlinear fit, k = '\n", 148 | " + str(round(k, 5)))\n", 149 | " pylab.legend(loc = 'best')\n", 150 | "\n", 151 | "fitData('/resources/springData.txt')" 152 | ], 153 | "language": "python", 154 | "metadata": {}, 155 | "outputs": [] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### Fit Data Observations\n", 162 | "\n", 163 | "* Only one test result actually lies on the predicted slope.\n", 164 | "* This is possible because polyfit() does not attempt to maximize the number of points on the line.\n", 165 | "\n", 166 | "### Compare Fit Data\n", 167 | "\n", 168 | "Could we improve our predictive slope by using a cubic fit function?\n" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "collapsed": false, 174 | "input": [ 175 | "def compareFitData(inputFile):\n", 176 | " masses, distances = getData(inputFile)\n", 177 | " distances = pylab.array(distances)\n", 178 | " masses = pylab.array(masses)\n", 179 | " forces = masses*9.81\n", 180 | " pylab.plot(forces, distances, 'bo',\n", 181 | " label = 'Measured displacements')\n", 182 | " pylab.title('Measured Displacement of Spring')\n", 183 | " pylab.xlabel('|Force| (Newtons)')\n", 184 | " pylab.ylabel('Distance 
(meters)')\n", 185 | " #find linear fit\n", 186 | " a,b = pylab.polyfit(forces, distances, 1)\n", 187 | " predictedDistances = a*pylab.array(forces) + b\n", 188 | " k = 1.0/a\n", 189 | " pylab.plot(forces, predictedDistances,\n", 190 | " label = 'Displacements predicted by\\nlinear fit, k = '\n", 191 | " + str(round(k, 5)))\n", 192 | " #find cubic fit\n", 193 | " a,b,c,d = pylab.polyfit(forces, distances, 3)\n", 194 | " predictedDistances = a*(forces**3) + b*forces**2 + c*forces + d\n", 195 | " pylab.plot(forces, predictedDistances, 'b:', label = 'cubic fit')\n", 196 | " pylab.legend(loc = 'best')\n", 197 | "\n", 198 | "compareFitData('/resources/springData.txt')\n", 199 | "\n" 200 | ], 201 | "language": "python", 202 | "metadata": {}, 203 | "outputs": [] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "### Compare Fit Data Observations\n", 210 | "\n", 211 | "Does the fitted cubic curve more accurately represent Hooke's Law which suggested a linear regression? Probably not!\n", 212 | "\n", 213 | "### Reduce to Improve\n", 214 | "\n", 215 | "Can we improve the model by eliminating results, such as the last 6 points?\n" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "collapsed": false, 221 | "input": [ 222 | "def reduceToImprove(inputFile):\n", 223 | " masses, distances = getData(inputFile)\n", 224 | " distances = pylab.array(distances[:-6])\n", 225 | " masses = pylab.array(masses[:-6])\n", 226 | " forces = masses*9.81\n", 227 | " pylab.plot(forces, distances, 'bo',\n", 228 | " label = 'Measured displacements')\n", 229 | " pylab.title('Measured Displacement of Spring')\n", 230 | " pylab.xlabel('|Force| (Newtons)')\n", 231 | " pylab.ylabel('Distance (meters)')\n", 232 | " #find linear fit\n", 233 | " a,b = pylab.polyfit(forces, distances, 1)\n", 234 | " predictedDistances = a*pylab.array(forces) + b\n", 235 | " k = 1.0/a\n", 236 | " pylab.plot(forces, predictedDistances,\n", 237 | " label = 'Displacements predicted by\\nlinear fit, k = '\n", 238 | " + str(round(k, 5)))\n", 239 | " #find cubic fit\n", 240 | " a,b,c,d = pylab.polyfit(forces, distances, 3)\n", 241 | " predictedDistances = a*(forces**3) + b*forces**2 + c*forces + d\n", 242 | " pylab.plot(forces, predictedDistances, 'b:', label = 'cubic fit')\n", 243 | " pylab.legend(loc = 'best')\n", 244 | "\n", 245 | "reduceToImprove('/resources/springData.txt')" 246 | ], 247 | "language": "python", 248 | "metadata": {}, 249 | "outputs": [] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "### Reduce to Improve Observations\n", 256 | "\n", 257 | "Yes the rendering is improved but eliminating data is not a justified action." 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "## Conclusion\n", 265 | "Based on the predictive modeling explored herein, we could conclude that we lack a sufficient sample set of experimental data to validate Hooke's Law.\n", 266 | "\n", 267 | "The reader is encouraged to expand on this data exploration lesson by implementing additional modeling techniques discussed in Chapter 15 of the book [Computation and Programming in Python](http://mitpress.mit.edu/books/introduction-computation-and-programming-using-python-0)." 
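To put a number on the linear-versus-cubic comparison above, a goodness-of-fit statistic such as the coefficient of determination (R²) can be computed for both fits. The sketch below is illustrative only: it assumes the getData helper and the /resources/springData.txt path defined earlier in this notebook, and the rSquared function here is the standard textbook definition rather than code taken from the book.

```python
import pylab

def rSquared(measured, predicted):
    # Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
    residuals = ((predicted - measured)**2).sum()
    mean = measured.sum()/float(len(measured))
    total = ((measured - mean)**2).sum()
    return 1 - residuals/total

masses, distances = getData('/resources/springData.txt')
distances = pylab.array(distances)
forces = pylab.array(masses)*9.81

for degree in (1, 3):
    coeffs = pylab.polyfit(forces, distances, degree)   # least squares fit of the chosen degree
    predicted = pylab.polyval(coeffs, forces)            # evaluate the fitted polynomial at each force
    print('degree %d fit: R^2 = %.4f' % (degree, rSquared(distances, predicted)))
```

Keep in mind that a higher-degree polynomial always fits the data it was trained on at least as well, so a larger R² for the cubic fit does not by itself contradict the conclusion above that the linear model is the appropriate one for Hooke's Law.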
268 | ] 269 | } 270 | ], 271 | "metadata": {} 272 | } 273 | ] 274 | } -------------------------------------------------------------------------------- /elasticity/LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 
179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | 203 | -------------------------------------------------------------------------------- /elasticity/README.md: -------------------------------------------------------------------------------- 1 | # Hooke's Law of Elasticity 2 | 3 | ## Overview 4 | Hooke's Law suggests that the force associated with a spring when it is released is linearly related to the distance it has been compressed. 5 | 6 | An experiment was conducted to capture data for a number of springs across various compression lengths. The intent of the experiment is to demonstrate Hooke's Law by showing that the results of the experiment lie on a straight line. Yet experimental data tends not to be perfect, so we expect the results to lie around the line, not necessarily on it. 7 | 8 | Could we use the results to fit a model that will allow us to depict the linear premise posited by Hooke? Can we use a linear regression to solve the problem? 9 | 10 | ## Analysis 11 | This sample notebook is based on the experiment outlined in Chapter 15 of the book [Computation and Programming in Python](http://mitpress.mit.edu/books/introduction-computation-and-programming-using-python-0). 12 | 13 | ## Acknowledgements 14 | This notebook is compatible with the IBM Knowledge Anyhow Workbench (KAWB), which is a derivative of the [IPython](http://ipython.org/) interactive environment. 15 | -------------------------------------------------------------------------------- /elasticity/springData.txt: -------------------------------------------------------------------------------- 1 | Distance (m) Mass (kg) 2 | 0.0865 0.1 3 | 0.1015 0.15 4 | 0.1106 0.2 5 | 0.1279 0.25 6 | 0.1892 0.3 7 | 0.2695 0.35 8 | 0.2888 0.4 9 | 0.2425 0.45 10 | 0.3465 0.5 11 | 0.3225 0.55 12 | 0.3764 0.6 13 | 0.4263 0.65 14 | 0.4562 0.7 15 | 0.4502 0.75 16 | 0.4499 0.8 17 | 0.4534 0.85 18 | 0.4416 0.9 19 | 0.4304 0.95 20 | 0.437 1.0 21 | -------------------------------------------------------------------------------- /hacks/IPython Parallel and R.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat_minor": 0, "cells": [{"source": "# IPy Parallel and R\n\nIn this notebook, we'll use IPython.parallel (IPP) and rpy2 as a quick-and-dirty way of parallelizing work in R. We'll use a cluster of IPP engines running on the same VM as the notebook server to demonstrate. 
We'll also need to install [rpy2](http://rpy.sourceforge.net/) before we can start.\n\n`!pip install rpy2`", "cell_type": "markdown", "metadata": {}}, {"source": "## Start Local IPP Engines\n\nFirst we must start a cluster of IPP engines. We can do this using the *Cluster* tab of the Jupyter dashboard. Or we can do it programmatically in the notebook.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 20, "cell_type": "code", "source": "from IPython.html.services.clusters.clustermanager import ClusterManager", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 21, "cell_type": "code", "source": "cm = ClusterManager()", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "We have to list the profiles before we can start anything, even if we know the profile name.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 60, "cell_type": "code", "source": "cm.list_profiles()", "outputs": [{"execution_count": 60, "output_type": "execute_result", "data": {"text/plain": "[{'profile': u'default',\n 'profile_dir': u'/home/notebook/.ipython/profile_default',\n 'status': 'stopped'},\n {'profile': u'remote',\n 'profile_dir': u'/home/notebook/.ipython/profile_remote',\n 'status': 'stopped'}]"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "For our demo purposes, we'll just use the default profile which starts a cluster on the local machine for us.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 61, "cell_type": "code", "source": "cm.start_cluster('default')", "outputs": [{"execution_count": 61, "output_type": "execute_result", "data": {"text/plain": "{'n': 8,\n 'profile': 'default',\n 'profile_dir': u'/home/notebook/.ipython/profile_default',\n 'status': 'running'}"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "After running the command above, we need to pause for a few moments to let all the workers come up. (Breathe and count 10 ... 9 ... 8 ...) \n\nNow we can continue to create a DirectView that can talk to all of the workers. (If you get an error, breathe, count so more, and try again in a few.)", "cell_type": "markdown", "metadata": {}}, {"execution_count": 27, "cell_type": "code", "source": "import IPython.parallel", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 62, "cell_type": "code", "source": "client = IPython.parallel.Client()", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 63, "cell_type": "code", "source": "dv = client[:]", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "In my case, I have 8 CPUs so I get 8 workers by default. 
Your number will likely differ.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 72, "cell_type": "code", "source": "len(dv)", "outputs": [{"execution_count": 72, "output_type": "execute_result", "data": {"text/plain": "8"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "To ensure the workers are functioning, we can ask each one to run the bash command `echo $$` to print a PID.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 64, "cell_type": "code", "source": "%%px\n!echo $$", "outputs": [{"output_type": "stream", "name": "stdout", "text": "[stdout:0] 12973\r\n[stdout:1] 12974\r\n[stdout:2] 12978\r\n[stdout:3] 12980\r\n[stdout:4] 12977\r\n[stdout:5] 12975\r\n[stdout:6] 12976\r\n[stdout:7] 12979\r\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Use R on the Engines\n\nNext, we'll tell each engine to load the `rpy2.ipython` extension. In our local cluster, this step is easy because all of the workers are running in the same environment as the notebook server. If the engines were remote, we'd have many more installs to do.", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "%%px\n%load_ext rpy2.ipython", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "Now we can tell every engine to run R code using the `%%R` (or `%R`) magic. Let's sample 50 random numbers from a normal distribution.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 77, "cell_type": "code", "source": "%%px\n%%R \nx <- rnorm(50)\nsummary(x)", "outputs": [{"output_type": "display_data", "data": {"text/plain": "[output:0]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. \n-2.03500 -0.77970 0.05572 -0.07040 0.49570 2.31900 \n"}, "metadata": {"engine": 0}}, {"output_type": "display_data", "data": {"text/plain": "[output:1]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. \n-2.00000 -0.74080 0.02746 -0.02496 0.53140 2.38700 \n"}, "metadata": {"engine": 1}}, {"output_type": "display_data", "data": {"text/plain": "[output:2]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. \n-1.96800 -0.80070 0.07342 -0.05425 0.61380 2.01700 \n"}, "metadata": {"engine": 2}}, {"output_type": "display_data", "data": {"text/plain": "[output:3]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. \n-2.81100 -0.44310 0.01515 -0.04424 0.49470 1.90200 \n"}, "metadata": {"engine": 3}}, {"output_type": "display_data", "data": {"text/plain": "[output:4]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. \n-1.45400 -0.36850 0.04397 0.04665 0.42520 1.43300 \n"}, "metadata": {"engine": 4}}, {"output_type": "display_data", "data": {"text/plain": "[output:5]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. \n-2.32200 -0.75890 0.01176 0.05932 0.87360 2.35100 \n"}, "metadata": {"engine": 5}}, {"output_type": "display_data", "data": {"text/plain": "[output:6]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n-1.8590 -0.2639 0.1336 0.1777 0.8915 3.2200 \n"}, "metadata": {"engine": 6}}, {"output_type": "display_data", "data": {"text/plain": "[output:7]"}, "metadata": {}}, {"output_type": "display_data", "data": {"text/plain": " Min. 1st Qu. Median Mean 3rd Qu. Max. \n-3.5150 -0.9433 -0.1412 -0.2161 0.5414 2.4960 \n"}, "metadata": {"engine": 7}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Pull it Back to Python\n\nWith our hack, we can't simply pull the R vectors back to the local notebook. (IPP can't pickle them.) But we can convert them to Python and pull the resulting objects back.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 78, "cell_type": "code", "source": "%%px\n%Rpull x\nx = list(x)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 79, "cell_type": "code", "source": "x = dv.gather('x', block=True)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "We should get 50 elements per engine.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 80, "cell_type": "code", "source": "assert len(x) == 50 * len(dv)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Clean Up the Engines\n\nWhen we're done, we can clean up any engines started using the code at the top of this notebook with the following call.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 81, "cell_type": "code", "source": "cm.stop_cluster('default')", "outputs": [{"execution_count": 81, "output_type": "execute_result", "data": {"text/plain": "{'profile': u'default',\n 'profile_dir': u'/home/notebook/.ipython/profile_default',\n 'status': 'stopped'}"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "
\n
\n
\n
This notebook was created using IBM Knowledge Anyhow Workbench. To learn more, visit us at https://knowledgeanyhow.org.
\n
\n
", "cell_type": "markdown", "metadata": {}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 2", "name": "python2", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "2.7.6", "name": "python", "file_extension": ".py", "pygments_lexer": "ipython2", "codemirror_mode": {"version": 2, "name": "ipython"}}}} -------------------------------------------------------------------------------- /hn/Hacker News Runner.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat_minor": 0, "cells": [{"source": "# Hacker News Daily Runner\n\nExecutes the `Hacker News and AlchemyAPI.ipynb` notebook to aggregate data over time. Meant to be left running indefinitely in the background so that the job scheduler can execute once a day.\n\nYou can use this pattern of running notebooks on a pre-defined schedules in your own notebooks that collect data or compute analysis over time.\n\n## Prerequisites\n\n```\n!pip install schedule\n!pip install runipy\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "import schedule\nimport time\nimport subprocess", "outputs": [], "metadata": {"collapsed": false, "trusted": false}}, {"execution_count": null, "cell_type": "code", "source": "NOTEBOOK_FILENAME = 'Hacker News and AlchemyAPI.ipynb'", "outputs": [], "metadata": {"collapsed": false, "trusted": false}}, {"execution_count": null, "cell_type": "code", "source": "def run_notebook():\n subprocess.check_call(['runipy', '-o', NOTEBOOK_FILENAME])", "outputs": [], "metadata": {"collapsed": false, "trusted": false}}, {"execution_count": null, "cell_type": "code", "source": "def aggregate_hacker_news_data():\n try:\n run_notebook()\n except Exception:\n import traceback\n traceback.print_exc()", "outputs": [], "metadata": {"collapsed": false, "trusted": false}}, {"execution_count": null, "cell_type": "code", "source": "schedule.every().hour.do(aggregate_hacker_news_data)\n#schedule.every().minute.do(aggregate_hacker_news_data)", "outputs": [], "metadata": {"collapsed": false, "trusted": false}}, {"execution_count": null, "cell_type": "code", "source": "while True:\n schedule.run_pending()\n time.sleep(30)", "outputs": [], "metadata": {"collapsed": false, "trusted": false}}, {"execution_count": null, "cell_type": "code", "source": "", "outputs": [], "metadata": {"collapsed": true, "trusted": false}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 2", "name": "python2", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "2.7.6", "name": "python", "file_extension": ".py", "pygments_lexer": "ipython2", "codemirror_mode": {"version": 2, "name": "ipython"}}}} -------------------------------------------------------------------------------- /index.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Jupyter Samples\n", 8 | "\n", 9 | "This repository contains sample Jupyter notebooks ranging from tutorials on using popular open source repositories to sample analyses on public data sets to neat notebook hacks.\n", 10 | "\n", 11 | "* [Exploration of Airline On-Time Performance](airline/Exploration%20of%20Airline%20On-Time%20Performance.ipynb)\n", 12 | "* [Hacker News and AlchemyAPI](hn/)\n", 13 | "* [Hooke's Law of Elasticity](Elasticity)\n", 14 | "* 
NOAA Climatology Data\n", 15 | " * [NOAA Climatology Data Processing](noaa/etl/)\n", 16 | " * [High Temperature Record Frequency Notebook](noaa/tmaxfreq/)\n", 17 | " * [Temperature Record Frequency Dashboard](noaa/hdtadash/)\n", 18 | "* [Instaquery](hacks/instaquery.ipynb)\n", 19 | "* [Interactive Data Maps](tax-maps/Interactive%20Data%20Maps.ipynb)\n", 20 | "* [IPy Parallel and R](hacks/IPython%20Parallel%20and%20R.ipynb)\n", 21 | "* [MLB Modern Era Salaries](mlb/mlb-salaries.ipynb)\n", 22 | "* [Population Growth Estimates](united-nations/senegal_population_trends.ipynb)\n", 23 | "* [scikit-learn Recipes](scikit-learn/sklearn_cookbook.ipynb)\n", 24 | "* [Web Server in a Notebook](hacks/Webserver%20in%20a%20Notebook.ipynb)\n", 25 | "\n", 26 | "\n", 27 | "## License\n", 28 | "\n", 29 | "Notebooks are Copyright (c) IBM Corporation 2015 under the MIT license. See [LICENSE](LICENSE) for details.\n", 30 | "\n", 31 | "Sample data files, libraries, techniques, external publications, etc. are cited in the notebooks in which they are used. Those works remain under the copyright of their respective owners." 32 | ] 33 | } 34 | ], 35 | "metadata": { 36 | "kernelspec": { 37 | "display_name": "Python 3", 38 | "language": "python", 39 | "name": "python3" 40 | }, 41 | "language_info": { 42 | "codemirror_mode": { 43 | "name": "ipython", 44 | "version": 3 45 | }, 46 | "file_extension": ".py", 47 | "mimetype": "text/x-python", 48 | "name": "python", 49 | "nbconvert_exporter": "python", 50 | "pygments_lexer": "ipython3", 51 | "version": "3.4.3" 52 | } 53 | }, 54 | "nbformat": 4, 55 | "nbformat_minor": 0 56 | } 57 | -------------------------------------------------------------------------------- /mlb/README.md: -------------------------------------------------------------------------------- 1 | # MLB Modern Era Salary Analysis 2 | 3 | [Lahman’s Baseball Database](http://seanlahman.com/baseball-archive/statistics/) contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. 4 | 5 | ## Objective 6 | 7 | One of the topics covered in **Lahman's Baseball Database** is MLB annual salaries. This notebook provides a sample analysis that explores trends in player salaries. Several questions are addressed: 8 | 9 | 1. What is the average salary increase per league since 1985? 10 | 2. What is the average salary increase per league since 1985? 11 | 3. Can we predict future average salary increase per league? 12 | 13 | ## Prepare Environment 14 | Bootstrap notebook with necessary notebook and library dependencies. 15 | 16 | ### Prerequisites 17 | This notebook requires the installation of the following software dependencies: 18 | ``` 19 | !pip install statsmodels 20 | ``` 21 | 22 | > This notebook was created using a *Technology Preview* call IBM Knowledge Anyhow Workbench, which is based on Project Jupyter. -------------------------------------------------------------------------------- /noaa/README.md: -------------------------------------------------------------------------------- 1 | # NOAA Weather Analysis 2 | 3 | This project is an exercise data exploration based on public weather data from the [NOAA National Climatic Data Center](http://www.ncdc.noaa.gov/). 
4 | 5 | ## Data Access 6 | 7 | On 9.Mar.2015 an inquiry was made to NOAA regarding access to historic weather station data: 8 | 9 | >We are seeking access to raw data or a web service that would enable the analysis of the historic daily temperature records for all weather stations in the US from earliest date available to present. 10 | 11 | >We seek access to the data in raw CSV or JSON format. 12 | 13 | >While the " Surface Data Monthly Extremes - U.S." product is interesting it does not allow individuals to explore the data and derive independent insights. 14 | 15 | >Any help would be appreciated. 16 | 17 | On 12.Mar.2015 [William Brown](mailto:william.brown@noaa.gov), Meteorologist, NOAA Climate Services and Monitoring Division provided the following advice: 18 | 19 | >Your best option is to mine our digital daily summary data base. You can access the ftp directory, including data and documentation, from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ where access is unlimited and free of charge. 20 | 21 | This implied that if individuals did not want to work from static pre-cleansed reports then a fair amount of data munging would be required to **explode** the **text file based, fixed record machine-readable format** into human readable datasets. 22 | 23 | ## Subprojects 24 | 25 | * [Data Munging](https://github.com/ibm-et/jupyter-samples/tree/master/noaa/etl) 26 | * Historical Daily Temperature Analysis 27 | * [High Temperature Record Frequency Notebook](https://github.com/ibm-et/jupyter-samples/tree/master/noaa/tmaxfreq) 28 | * [Temperature Record Frequency Dashboard](https://github.com/ibm-et/jupyter-samples/tree/master/noaa/hdtadash) 29 | 30 | -------------------------------------------------------------------------------- /noaa/etl/README.md: -------------------------------------------------------------------------------- 1 | # NOAA Climatology Data 2 | 3 | This project is a [data munging](http://en.wikipedia.org/wiki/Data_wrangling) exercise whereby we convert NOAA formatted **weather station data** into easily consumable formats that allow for more convenient access and analysis. 4 | 5 | The focus of this notebook is to perform the compute intensive activities that will explode NOAA's **text file based, fixed record machine-readable database** into the data schemas required by [related projects](https://github.com/ibm-et/jupyter-samples/tree/master/noaa). 6 | 7 | ## Usage 8 | Open the ```noaa_hdta_etl.ipynb``` notebook and follow the **Getting Started** instructions. 9 | 10 | ## Data Assessment 11 | 12 | The Global Historical Climatology Network (GHCN) - [Daily dataset](http://gis.ncdc.noaa.gov/all-records/catalog/search/resource/details.page?id=gov.noaa.ncdc:C00861) integrates daily climate observations from approximately 30 different data sources. Over 25,000 worldwide weather stations are regularly updated with observations from within roughly the last month. The dataset is also routinely reconstructed (usually every week) from its roughly 30 data sources to ensure that GHCN-Daily is generally in sync with it's growing list of constituent sources. During this process, quality assurance checks are applied to the full dataset. Where possible, GHCN-Daily station data are also updated daily from a variety of data streams. Station values for each daily update also undergo a suite of quality checks. 
13 | 14 | ### Ideal datasets 15 | 16 | NOAA's [National Climatic Data Center](http://www.ncdc.noaa.gov/about-ncdc)(NCDC) is responsible for preserving, monitoring, assessing, and providing public access to the USA's climate and historical weather data and information. Since weather is something that can be observed at varying intervals, the process used by NCDC is the best that we have yet it is far from ideal. Optimally, weather metrics should be observed, streamed, stored and analyzed in real-time. Such an approach could offer the data as a service associated with a data lake. 17 | 18 | > [Data lakes](http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/features/data-lakes.jhtml) that can scale at the pace of the cloud remove integration barriers and clear a path for more timely and informed business decisions. 19 | 20 | Access to cloud-based data services that front-end a data lake would help to reduce the possibility of human error and divorce us from downstream processing that alters the data from it's native state. 21 | 22 | ### Available datasets 23 | 24 | NOAA NCDC provides public FTP access to the **GHCN-Daily dataset**, which contains a file for each US weather station. Each file contains historical daily data for the given station since that station began to observe and record. Here are some details about the available data: 25 | 26 | * The data is delivered in a **text file based, fixed record machine-readable format**. 27 | * Version 3 of the GHCN-Daily dataset was released in September 2012. 28 | * Changes to the processing system associated with the Version 3 release also allowed for updates to occur 7 days a week rather than only on most weekdays. 29 | * Version 3 contains station-based measurements from well over 90,000 land-based stations worldwide, about two thirds of which are for precipitation measurement only. 30 | * Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. 31 | 32 | ### Desired Data Schemas 33 | While this notebook is focused on the generation of new datasets pertinent to daily temperature data, we could imagine future work associated with other observation types like snow accumulations and precipitation. To support a variety of weather station related analytic endeavors we need to extract from the machine-readable content new information and store it in a human-readable format. Essentially, we seek to explode (decompress) the data for general consumption and [normalize](http://en.wikipedia.org/wiki/Database_normalization#Normal_forms) it into new datasets that may be informative to users. 34 | 35 | #### Historical Daily Summary 36 | The goal here is to capture summary information about a given day throughout history at a specific weather station in the US. This dataset contains 365 rows where each row depicts the aggregated low and high temperature data for a specific day throughout the history of the weather station. 37 | 38 | Column | Description 39 | --- | --- 40 | Station ID | Name of the US Weather Station 41 | Month |Month of the observations 42 | Day | Day of the observations 43 | FirstYearOfRecord | First year that this weather station started collecting data for this day in in history. 44 | TMin | Current record low temperature (F) for this day in history. 45 | TMinRecordYear | Year in which current record low temperature (F) occurred. 46 | TMax | Current record high temperature (F) for this day in history. 
47 | TMaxRecordYear | Year in which current record high temperature occurred. 48 | CurTMinMaxDelta | Difference in degrees F between record high and low records for this day in history. 49 | CurTMinRecordDur | LifeSpan of curent low record temperature. 50 | CurTMaxRecordDur | LifeSpan of current high record temperature. 51 | MaxDurTMinRecord | Maximum years a low temperature record was held for this day in history. 52 | MinDurTMinRecord | Minimum years a low temperature record was held for this day in history. 53 | MaxDurTMaxRecord | Maximum years a high temperature record was held for this day in history. 54 | MinDurTMaxRecord | Minimum years a high temperature record was held for this day in history. 55 | TMinRecordCount | Total number of TMin records set on this day (does not include first since that may not be a record). 56 | TMaxRecordCount | Total number of TMax records set on this day (does not include first since that may not be a record). 57 | 58 | #### Historical Daily Detail 59 | The goal here is to capture details for each year that a record has changed for a specific weather station in the US. During the processing of the Historical Daily Summary dataset, we will log each occurrence of a new temperature record. Information in this file can be used to drill-down into and/or validate the summary file. 60 | 61 | Column | Description 62 | --- | --- 63 | Station ID | Name of the US Weather Station 64 | Year | Year of the observation 65 | Month |Month of the observation 66 | Day | Day of the observation 67 | Type | Type of temperature record (Low = *TMin*, High = *TMax*) 68 | OldTemp | Temperature (F) prior to change. 69 | NewTemp | New temperature (F) record for this day. 70 | TDelta | Delta between old and new temperatures. 71 | 72 | #### Historical Daily Missing Record Detail 73 | The goal here is to capture details pertaining to missing data. Each record in this dataset represents a day in history that a specific weather station in the USA failed to observe a temperature reading. 74 | 75 | Column | Description 76 | --- | --- 77 | Station ID | Name of the US Weather Station 78 | Year | Year of the missing observation 79 | Month |Month of the missing observation 80 | Day | Day of the missing observation 81 | Type | Type of temperature missing (Low = *TMin*, High = *TMax*) 82 | 83 | #### Historical Raw Detail 84 | 85 | The goal here is to capture raw daily details. Each record in this dataset represents a specific temperature observation for a day in history for a specific that a specific weather station. 86 | 87 | Column | Description 88 | --- | --- 89 | Station ID | Name of the US Weather Station 90 | Year | Year of the observation 91 | Month |Month of the observation 92 | Day | Day of the observation 93 | Type | Type of temperature reading (Low = *TMin*, High = *TMax*) 94 | FahrenheitTemp | Fahrenheit Temperature 95 | 96 | ### Derived Datasets 97 | 98 | This project will support the generation of new datasets in several formats: 99 | 100 | * [HDF5](http://docs.h5py.org/en/latest/) 101 | * CSV 102 | 103 | ## Citation Information 104 | 105 | * [GHCN-Daily journal article](doi:10.1175/JTECH-D-11-00103.1): Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910. 106 | * Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E.Gleason, and T.G. 
Houston, 2012: [Global Historical Climatology Network - Daily (GHCN-Daily)](http://doi.org/10.7289/V5D21VHZ), [Version 3.20-upd-2015031605], NOAA National Climatic Data Center [March 16, 2015]. 107 | 108 | -------------------------------------------------------------------------------- /noaa/etl/noaa_hdta_etl.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat_minor": 0, "cells": [{"source": "# Historical Daily Temperature Analysis \n\n## Objective\n\n>If there is scientific evidence of extreme fluctuations in our weather patterns due to human impact to the environment then we should be able to identify factual examples of increase in the frequency in extreme temperatures.\n\nThere has been a great deal of discussion around climate change and global warming. Since NOAA has made their data public, let us explore the data ourselves and see what insights we can discover. \n\n1. How many weather stations in US?\n2. For US weather stations, what is the average years of record keeping?\n3. For each US weather station, on each day of the year, identify the frequency at which daily High and Low temperature records are broken.\n4. Does the historical frequency of daily temperature records (High or Low) in the US provide statistical evidence of dramatic climate change?\n5. What is the average life-span of a daily temperature record (High or Low) in the US?\n\nThis analytical notebook is a component of a package of notebooks. The package is intended to serve as an exercise in the applicability of IPython/Juypter Notebooks to public weather data for DIY Analytics.", "cell_type": "markdown", "metadata": {}}, {"source": "
\n WARNING: This notebook requires a minimum of 12GB of free disk space to Run All cells for all processing phases. The time required to run all phases is **1 hr 50 mins**. The notebook supports two approaches for data generation (CSV and HDF). It is recommended that you choose to run either Approach 1 or Approach 2 and avoid running all cells.\n
", "cell_type": "markdown", "metadata": {}}, {"source": "## Assumptions\n\n1. We will observe and report everything in Fahrenheit.\n2. The data we extract from NOAA may be something to republish as it may require extensive data munging.\n\n### Data\n\nThe Global Historical Climatology Network (GHCN) - [Daily dataset](http://gis.ncdc.noaa.gov/all-records/catalog/search/resource/details.page?id=gov.noaa.ncdc:C00861) integrates daily climate observations from approximately 30 different data sources. Over 25,000 worldwide weather stations are regularly updated with observations from within roughly the last month. The dataset is also routinely reconstructed (usually every week) from its roughly 30 data sources to ensure that GHCN-Daily is generally in sync with its growing list of constituent sources. During this process, quality assurance checks are applied to the full dataset. Where possible, GHCN-Daily station data are also updated daily from a variety of data streams. Station values for each daily update also undergo a suite of quality checks.", "cell_type": "markdown", "metadata": {}}, {"source": "
\nThis notebook was developed using a March 16, 2015 snapshot of USA-Only daily temperature readings from the Global Historical Climatology Network. The emphasis herein is on the generation of human readable data. Data exploration and analytical exercises are deferred to separate but related notebooks.\n
\n\n#### Ideal datasets \n\nNOAA's [National Climatic Data Center](http://www.ncdc.noaa.gov/about-ncdc)(NCDC) is responsible for preserving, monitoring, assessing, and providing public access to the USA's climate and historical weather data and information. Since weather is something that can be observed at varying intervals, the process used by NCDC is the best that we have yet it is far from ideal. Optimally, weather metrics should be observed, streamed, stored and analyzed in real-time. Such an approach could offer the data as a service associated with a data lake.\n\n> [Data lakes](http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/features/data-lakes.jhtml) that can scale at the pace of the cloud remove integration barriers and clear a path for more timely and informed business decisions.\n\nAccess to cloud-based data services that front-end a data lake would help to reduce the possibility of human error and divorce us from downstream processing that alters the data from it's native state. \n\n#### Available datasets\n\nNOAA NCDC provides public FTP access to the **GHCN-Daily dataset**, which contains a file for each US weather station. Each file contains historical daily data for the given station since that station began to observe and record. Here are some details about the available data:\n\n* The data is delivered in a **text file based, fixed record machine-readable format**. \n* Version 3 of the GHCN-Daily dataset was released in September 2012. \n* Changes to the processing system associated with the Version 3 release also allowed for updates to occur 7 days a week rather than only on most weekdays. \n* Version 3 contains station-based measurements from well over 90,000 land-based stations worldwide, about two thirds of which are for precipitation measurement only. \n* Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. \n\n##### Citation Information\n\n* [GHCN-Daily journal article](doi:10.1175/JTECH-D-11-00103.1): Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910.\n* Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: [Global Historical Climatology Network - Daily (GHCN-Daily)](http://doi.org/10.7289/V5D21VHZ), [Version 3.20-upd-2015031605], NOAA National Climatic Data Center [March 16, 2015].", "cell_type": "markdown", "metadata": {}}, {"source": "## Getting Started\n\n### Project Setup\n\nThis project is comprised of several notebooks that address various stages of the analytical process. A common theme for the project is the enablement of reproducible research. This notebook will focus on the creation of new datasets that will be used for downstream analytics.\n\n#### Analytical Workbench\n\nThis notebook is compatible with [Project Jupyter](http://jupyter.org/). ", "cell_type": "markdown", "metadata": {}}, {"source": "
\n
\n
\n
Execution of this notebook depends on one or more features found in [IBM Knowledge Anyhow Workbench (KAWB)](https://marketplace.ibmcloud.com/apps/2149#!overview). To request a free trial account, visit us at https://knowledgeanyhow.org. You can, however, load it and view it on nbviewer or in IPython / Jupyter.
\n
\n
", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "# Import special helper functions for the IBM Knowledge Anyhow Workbench.\nimport kawb", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "##### Workarea Folder Structure\n\nThe project will be comprised of several file artifacts. This project's file subfolder structure is: \n\n```\n noaa_hdta_*.ipynb - Notebooks\n noaa-hdta/data/ghcnd_hcn.tar.gz - Obtained from NCDC FTP site.\n noaa-hdta/data/usa-daily/15mar2015/*.dly - Daily weather station files\n noaa-hdta/data/hdf5/15mar2015/*.h5 - Hierarchical Data Format files \n noaa-hdta/data/derived/15mar2015/*.h5 - Comma delimited files \n```\n**Notes**:\n\n1. The initial project research used the 15-March-2015 snapshot of the GHCN Dataset. The folder structure allows for other snapshots to be used.\n2. This notebook can be used to generate both Hierarchical Data Format and comma delimited files. It is recommended to pick one or the other as disk space requirements can be as large as:\n\n * HDF5: 8GB\n * CSV: 4GB", "cell_type": "markdown", "metadata": {}}, {"source": "
\n
\n
\n
In IBM Knowledge Anyhow Workbench, the user workarea is located under the \"resources\" folder.
\n
\n
", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "%%bash\n# Create folder structure\nmkdir -p noaa-hdta/data noaa-hdta/data/usa-daily/15mar2015\nmkdir -p noaa-hdta/data/hdf5 \nmkdir -p noaa-hdta/data/derived/15mar2015/missing\nmkdir -p noaa-hdta/data/derived/15mar2015/summaries\nmkdir -p noaa-hdta/data/derived/15mar2015/raw\nmkdir -p noaa-hdta/data/derived/15mar2015/station_details\n# List all project realted files and folders\n!ls -laR noaa-*", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "##### Tags\nKAWB allows all files to be tagged for project organization and search. This project will use the following tags.\n\n* ```noaa_data```: Used to tag data files (.dly, .h5, .csv)\n* ```noaa_hdta```: Used to tag project notebooks (.ipynb)\n\nThe following inline code can be used throughout the project to tag project files.", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "import glob\n\ndata_tagdetail = ['noaa_data',\n ['/resources/noaa-hdta/data/',\n '/resources/noaa-hdta/data/usa-daily/15mar2015/',\n '/resources/noaa-hdta/data/hdf5/15mar2015/',\n '/resources/noaa-hdta/data/derived/15mar2015/'\n ]]\nnb_tagdetail = ['noaa_hdta',['/resources/noaa_hdta_*.ipynb']]\n\ndef tag_files(tagdetail):\n pathnames = tagdetail[1]\n for path in pathnames:\n for fname in glob.glob(path):\n kawb.set_tag(fname, tagdetail[0])\n\n# Tag Project files\ntag_files(data_tagdetail)\ntag_files(nb_tagdetail)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#### Technical Awareness\n\nIf the intended use of this notebook is to generate *Hierarchical Data Formatted* files, then the user of this notebook should be familiar with the following software concepts:\n\n* [HDF5 Files](http://docs.h5py.org/en/latest/)\n", "cell_type": "markdown", "metadata": {}}, {"source": "### Obtain the data\n\n1. Copy URL below to a new browser tab and then download the ```noaa-hdta/data/ghcnd_hcn.tar.gz``` file.\n```\n ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily\n``` \n2. Upload the same file to your workbench. \n3. Move and unpack the tarball into the designated project folders\n```\n %%bash\n mv /resources/ghcnd_hcn.tar.gz /resources/noaa-hdta/data/ghcnd_hcn.tar.gz\n cd /resources/noaa-hdta/data\n tar -xzf ghcnd_hcn.tar.gz\n mv ./ghcnd_hcn/*.dly ./usa-daily/15mar2015/\n rm -R ghcnd_hcn\n ls -la\n```", "cell_type": "markdown", "metadata": {}}, {"source": "
\n
\n
\n
In IBM Knowledge Anyhow Workbench, you can drag/drop the file on your workbench browser tab to simplify the uploading process.
\n
", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "%%bash\n# Provide the inline code for obtaining the data. See step 3 above.", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "### Dependencies\n\nIf the intended use of this notebook is to generate *Hierarchical Data Formatted* files, then this notebook requires the installation of the following software dependencies:\n\n```\n $ pip install h5py\n\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "# Provide the inline code necessary for loading any required libraries", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "## Tidy Data Check\n\nUpon review of the NOAA NCDC GHCN Dataset, the data **fails** to meet the requirements of [Tidy Data](http://www.jstatsoft.org/v59/i10/paper). A dataset is tidy if rows, columns and tables are matched up with observations, variables and types. In tidy data:\n1. Each variable forms a column.\n2. Each observation forms a row.\n3. Each type of observational unit forms a table.\nMessy data is any other arrangement of the data.\n\nIn the case of the GHCN Dataset, we are presented with datasets that contain observations for each day in a month for a given year. Each \".dly\" file contains data for one station. The name of the file corresponds to a station's identification code. Each record in a file contains one month of daily data. Each row contains observations for more than 20 different element types. The variables on each line include the following:\n\n```\n------------------------------\nVariable Columns Type\n------------------------------\nID 1-11 Character\nYEAR 12-15 Integer\nMONTH 16-17 Integer\nELEMENT 18-21 Character\nVALUE1 22-26 Integer\nMFLAG1 27-27 Character\nQFLAG1 28-28 Character\nSFLAG1 29-29 Character\nVALUE2 30-34 Integer\nMFLAG2 35-35 Character\nQFLAG2 36-36 Character\nSFLAG2 37-37 Character\n . . .\n . . .\n . . .\nVALUE31 262-266 Integer\nMFLAG31 267-267 Character\nQFLAG31 268-268 Character\nSFLAG31 269-269 Character\n------------------------------\n```\nA more detailed interpretation of the format of the data is outlined in ```readme.txt``` here ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily.\n\n### Sample Preview\n\nThe variables of interest to this project have the following definitions:\n\n* **ID** is the station identification code. See \"ghcnd-stations.txt\" for a complete list of stations and their metadata.\n* **YEAR** is the year of the record.\n* **MONTH** is the month of the record.\n* **ELEMENT** is the element type. There are more than 20 possible elements. The five core elements are:\n * *PRCP* = Precipitation (tenths of mm)\n * *SNOW* = Snowfall (mm)\n\t* *SNWD* = Snow depth (mm)\n * *TMAX* = Maximum temperature (tenths of degrees C)\n * *TMIN* = Minimum temperature (tenths of degrees C)\n* **VALUE(n)** is element value on the nth day of the month (missing = -9999).\n* **NAPAD(n)** contains non-applicable fields of interest.\n\nwhere **n** denotes the day of the month (1-31) If the month has less than 31 days, then the remaining variables are set to missing (e.g., for April, VALUE31 = -9999, NAPAD31 = {MFLAG31 = blank, QFLAG31 = blank, SFLAG31 = blank}). 
\n\nHere is a snippet depicting how the data is represented:\n```\nUSC00011084201409TMAX 350 H 350 H 344 H 339 H 306 H 333 H 328 H 339 H 339 H 322 H 339 H 339 H 339 H 333 H 339 H 333 H 339 H 328 H 322 H 328 H 283 H 317 H 317 H 272 H 283 H 272 H 272 H 272 H-9999 -9999 -9999 \nUSC00011084201409TMIN 217 H 217 H 228 H 222 H 217 H 217 H 222 H 233 H 233 H 228 H 222 H 222 H 217 H 211 H 217 H 217 H 211 H 206 H 200 H 189 H 172 H 178 H 122 H 139 H 144 H 139 H 161 H 206 H-9999 -9999 -9999 \nUSC00011084201409TOBS 217 H 256 H 233 H 222 H 217 H 233 H 239 H 239 H 233 H 278 H 294 H 256 H 250 H 228 H 222 H 222 H 211 H 206 H 211 H 194 H 217 H 194 H 139 H 161 H 144 H 194 H 217 H 228 H-9999 -9999 -9999 \nUSC00011084201409PRCP 0 H 0 H 0 H 13 H 25 H 8 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 25 H 178 H 0 H 0 H 56 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H 0 H-9999 -9999 -9999\n```\n\n\n### Remove noise\n\nThe NOAA NCDC GHCN Dataset includes ```-9999``` as an indicator of missing observation data. We will take this into consideration as we transform the data into a usable format for the project.", "cell_type": "markdown", "metadata": {}}, {"source": "## Data Processing Decision\n\nBefore we can attempt to answer the questions driving this project, we must first map the data into a more reasonable format. As a result, the focus of this notebook will be the creation of new datasets that can be consumed by other notebooks in the project. \n\n> [Data munging](http://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form into another format that allows for more convenient consumption of the data.\"\n\nThis data munging endeavor is an undertaking unto itself since our goal here is to extract from the machine-readable content new information and store it in a human-readable format. Essentially, we seek to explode (decompress) the data for general consumption and [normalize](http://en.wikipedia.org/wiki/Database_normalization#Normal_forms) it into a relational model that may be informative to users.\n\n### Desired Data Schema\n\n#### Historical Daily Summary\nThe goal here is to capture summary information about a given day throughout history at a specific weather station in the US. This dataset contains 365 rows where each row depicts the aggregated low and high temperature data for a specific day throughout the history of the weather station. \n\nColumn | Description \n--- | --- \nStation ID | Name of the US Weather Station\nMonth |Month of the observations\nDay | Day of the observations\nFirstYearOfRecord | First year that this weather station started collecting data for this day in in history.\nTMin | Current record low temperature (F) for this day in history.\nTMinRecordYear | Year in which current record low temperature (F) occurred. \nTMax | Current record high temperature (F) for this day in history. \nTMaxRecordYear | Year in which current record high temperature occurred. 
\nCurTMinMaxDelta | Difference in degrees F between record high and low records for this day in history.\nCurTMinRecordDur | LifeSpan of curent low record temperature.\nCurTMaxRecordDur | LifeSpan of current high record temperature.\nMaxDurTMinRecord | Maximum years a low temperature record was held for this day in history.\nMinDurTMinRecord | Minimum years a low temperature record was held for this day in history.\nMaxDurTMaxRecord | Maximum years a high temperature record was held for this day in history.\nMinDurTMaxRecord | Minimum years a high temperature record was held for this day in history.\nTMinRecordCount | Total number of TMin records set on this day (does not include first since that may not be a record).\nTMaxRecordCount | Total number of TMax records set on this day (does not include first since that may not be a record).\n\n#### Historical Daily Detail\nThe goal here is to capture details for each year that a record has changed for a specific weather station in the US. During the processing of the Historical Daily Summary dataset, we will log each occurrence of a new temperature record. Information in this file can be used to drill-down into and/or validate the summary file. \n\nColumn | Description \n--- | --- \nStation ID | Name of the US Weather Station\nYear | Year of the observation\nMonth |Month of the observation\nDay | Day of the observation\nType | Type of temperature record (Low = *TMin*, High = *TMax*)\nOldTemp | Temperature (F) prior to change.\nNewTemp | New temperature (F) record for this day.\nTDelta | Delta between old and new temperatures.\n\n#### Historical Daily Missing Record Detail\nThe goal here is to capture details pertaining to missing data. Each record in this dataset represents a day in history that a specific weather station in the USA failed to observe a temperature reading.\n\nColumn | Description \n--- | --- \nStation ID | Name of the US Weather Station\nYear | Year of the missing observation\nMonth |Month of the missing observation\nDay | Day of the missing observation\nType | Type of temperature missing (Low = *TMin*, High = *TMax*)\n\n#### Historical Raw Detail\n\nThe goal here is to capture raw daily details. Each record in this dataset represents a specific temperature observation for a day in history for a specific that a specific weather station.\n\nColumn | Description \n--- | --- \nStation ID | Name of the US Weather Station\nYear | Year of the observation\nMonth |Month of the observation\nDay | Day of the observation\nType | Type of temperature reading (Low = *TMin*, High = *TMax*)\nFahrenheitTemp | Fahrenheit Temperature\n\n### Derived Datasets\n\nWhile this notebook is focused on daily temperature data, we could imagine future work associated with other observation types like snow accumulations and precipitation. Therefore, the format we choose to capture and store our desired data should also allow us to organize and append future datasets.\n\nThe [HDF5 Python Library](http://docs.h5py.org/en/latest/) provides support for the standard Hierarchical Data Format. This library will allow us to:\n\n1. Create, save and publish derived datasets for reuse.\n2. Organize our datasets into groups (folders). \n3. Create datasets that can be easily converted to/from dataframes. \n\nHowever, HDF5 files can be very large which could be a problem if we want to share the data. 
Alternatively, we could store the information in new collections of CSV files where each ```.csv``` contained weather station specific content for one of our target schemas.\n", "cell_type": "markdown", "metadata": {}}, {"source": "# Extract, Transform and Load (ETL)\n\nThe focus of this notebook is to perform the to compute intensive activities that will **explode** the **text file based, fixed record machine-readable format** into the derived data schemas we have specified.", "cell_type": "markdown", "metadata": {}}, {"source": "### Gather Phase 1\n\nThe purpose of this phase will be to do the following:\n\n* For each daily data file (*.dly) provided by NOAA:\n * For each record where ELEMENT == TMAX or TMIN\n * Identify missing daily temperature readings, write them to the missing dataset.\n * Convert each daily temperature reading from celcius to fahrenheit, write each daily reading to the raw dataset.\n \n#### Approach 1: Comma delimited files\n\nUse the code below to layout your target project environment (if it should differ from what is described herein) and then run the process for *Gather Phase 1*. This will take about 20 minutes to process the **1218** or more weather station files. You should expect to see output like this:\n\n```\n>> Processing file 0: USC00207812\nExtracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.\n>> Processing Complete: 7024 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.\n>> Elapsed file execution time 0:00:00\n>> Processing file 1: USC00164700\nExtracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.\n>> Processing Complete: 9715 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.\n.\n.\n.\n>> Processing file 1217: USC00200230\nExtracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.\n>> Processing Complete: 10112 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.\n>> Elapsed file execution time 0:00:00\n>> Processing Complete.\n>> Elapsed corpus execution time 0:19:37\n```", "cell_type": "markdown", "metadata": {}}, {"source": "
\n
\n
\n
[IBM Knowledge Anyhow Workbench (KAWB)](https://marketplace.ibmcloud.com/apps/2149#!overview) provides support for importable notebooks that address the challenges of code reuse. It also supports the concept of code injection from reusable cookbooks. For more details, please refer to the Share and Reuse tutorials that come with your instance of KAWB.
\n
\n
", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "import mywb.noaa_hdta_etl_csv_tools as csvtools", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": null, "cell_type": "code", "source": "csvtools.help()", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "%inject csvtools.noaa_run_phase1_approach1", "cell_type": "markdown", "metadata": {"collapsed": false}}, {"execution_count": null, "cell_type": "code", "source": "# Approach 1 Content Layout for Gather Phases 1 and 2\nhtda_approach1_content_layout = {\n 'Content_Version': '15mar2015',\n 'Daily_Input_Files': '/resources/noaa-hdta/data/usa-daily/15mar2015/*.dly',\n 'Raw_Details': '/resources/noaa-hdta/data/derived/15mar2015/raw',\n 'Missing_Details': '/resources/noaa-hdta/data/derived/15mar2015/missing',\n 'Station_Summary': '/resources/noaa-hdta/data/derived/15mar2015/summaries', \n 'Station_Details': '/resources/noaa-hdta/data/derived/15mar2015/station_details',\n }", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": null, "cell_type": "code", "source": "# Run Gather Phase 1 for all 1218 files using approach 1 (CSV)\ncsvtools.noaa_run_phase1_approach1(htda_approach1_content_layout)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "##### Observe Output \n\nYou can compute the disk capacity of your *Gather Phase 1* results.\n```\n96M\t/resources/noaa-hdta/data/derived/15mar2015/missing\n24M\t/resources/noaa-hdta/data/derived/15mar2015/summaries\n3.2G\t/resources/noaa-hdta/data/derived/15mar2015/raw\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "%%bash\n# Compute size of output folders\ndu -h --max-depth=1 /resources/noaa-hdta/data/derived/15mar2015/", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#### Approach 2: HDF5 files\n\nUse the code below to layout your target project environment (if it should differ from what is described herein) and then run the process for *Gather Phase 1*. It will take less than 20 minutes to process the **1218** or more weather station files. **Note**: You will need to have room for about **6.5GB**. 
You should expect to see output like this:\n\n```\n>> Processing file 0: USC00207812\nExtracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.\n>> Processing Complete: 7024 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00207812.dly.\n>> Elapsed file execution time 0:00:00\n>> Processing file 1: USC00164700\nExtracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.\n>> Processing Complete: 9715 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00164700.dly.\n.\n.\n.\n>> Processing file 1217: USC00200230\nExtracting content from file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.\n>> Processing Complete: 10112 lines of file /resources/noaa-hdta/data/usa-daily/15mar2015/USC00200230.dly.\n>> Elapsed file execution time 0:00:00\n>> Processing Complete.\n>> Elapsed corpus execution time 0:17:43\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "import mywb.noaa_hdta_etl_hdf_tools as hdftools", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": null, "cell_type": "code", "source": "hdftools.help()", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "%inject hdftools.noaa_run_phase1_approach2", "cell_type": "markdown", "metadata": {"collapsed": false}}, {"execution_count": null, "cell_type": "code", "source": "# Approach 2 Content Layout for Gather Phases 1 and 2\nhtda_approach2_content_layout = {\n 'Content_Version': '15mar2015',\n 'Daily_Input_Files': '/resources/noaa-hdta/data/usa-daily/15mar2015/*.dly',\n 'Raw_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_raw_details.h5',\n 'Missing_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_missing_details.h5',\n 'Station_Summary': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_summaries.h5', \n 'Station_Details': '/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_details.h5',\n }", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": null, "cell_type": "code", "source": "# Run Gather Phase 1 for all 1218 files using approach 2 (HDF5)\nhdftools.noaa_run_phase1_approach2(htda_approach2_content_layout)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": null, "cell_type": "code", "source": "%%bash\n# Compute size of output folders\ndu -h --max-depth=1 /resources/noaa-hdta/data/hdf5/15mar2015/", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "### Gather Phase 2\n\nThe purpose of this phase will be to do the following:\n\n* For each daily data file (*.dly) provided by NOAA:\n * For each record where ELEMENT == TMAX or TMIN\n * Identify missing daily temperature readings, write them to the missing dataset.\n * Convert each daily temperature reading from celcius to fahrenheit, write each daily reading to the raw dataset.\n \n* For each raw dataset per weather station that was generated in *Gather Phase 1*:\n * For each tuple in a dataset, identify when a new temperature record has occurred (TMIN or TMAX):\n * Create a Station Detail tuple and add to list of tuples\n * Update the Summary Detail list of tuples for this day in history for this station\n * Store the lists to disk\n \n#### Approach 1: Comma delimited files\n\nUse the code below to layout your target project environment (if it should differ from what is described herein) and then run the process for *Gather Phase 2*. 
This will take about 40 minutes to process the **1218** or more raw weather station files. You should expect to see output like this:\n\n```\nProcessing dataset 0 - 1218: USC00011084\nProcessing dataset 1 - 1218: USC00012813\n.\n.\n.\nProcessing dataset 1216: USC00130133\nProcessing dataset 1217: USC00204090\n>> Processing Complete.\n>> Elapsed corpus execution time 0:38:47\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "# Decide if we need to generate station detail files.\ncsvtools.noaa_run_phase2_approach1.help()", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": null, "cell_type": "code", "source": "# Run Gather Phase 2 for all 1218 raw files using approach 1 (CSV)\ncsvtools.noaa_run_phase2_approach1(htda_approach1_content_layout, create_details=True)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "##### Observe Output \n\nYou can compute the disk capacity of your *Gather Phase 2* results.\n```\n96M\t/resources/noaa-hdta/data/derived/15mar2015/missing\n24M\t/resources/noaa-hdta/data/derived/15mar2015/summaries\n3.2G\t/resources/noaa-hdta/data/derived/15mar2015/raw\n129M\t/resources/noaa-hdta/data/derived/15mar2015/station_details\n3.4G\t/resources/noaa-hdta/data/derived/15mar2015/\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "%%bash\n# Compute size of output folders\ndu -h --max-depth=1 /resources/noaa-hdta/data/derived/15mar2015/", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#### Approach 2: HDF5 files\n\nUse the code below to layout your target project environment (if it should differ from what is described herein) and then run the process for *Gather Phase 2*. This will take about 30 minutes to process the **1218** or more raw weather station files. **Note**: You will need to have room for about **6.5GB**. You should expect to see output like this:\n\n```\nFetching keys for type = raw_detail\n>> Fetch Complete.\n>> Elapsed key-fetch execution time 0:00:09\nProcessing dataset 0 - 1218: USC00011084\nProcessing dataset 1 - 1218: USC00012813\n.\n.\n.\nProcessing dataset 1216 - 1218: USW00094794\nProcessing dataset 1217 - 1218: USW00094967\n>> Processing Complete.\n>> Elapsed corpus execution time 0:28:48\n```", "cell_type": "markdown", "metadata": {}}, {"source": "%inject hdftools.noaa_run_phase2_approach2", "cell_type": "markdown", "metadata": {}}, {"source": "### noaa_run_phase2_approach2\nTakes a dictionary of project folder details to drive the processing of *Gather Phase 2 Approach 2* using **HDF files**.\n\n#### Disk Storage Requirements\n\n* This function creates a **Station Summaries** dataset that requires ~2GB of free space. \n* This function can also create a **Station Details** dataset. If you require this dataset to be generated, modify the call to ```noaa_run_phase2_approach2()``` with ```create_details=True```. You will need additional free space to support this feature. 
Estimated requirement: **5GB**", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "# Run Gather Phase 2 for all 1218 files using approach 2 (HDF)\nhdftools.noaa_run_phase2_approach2(htda_approach2_content_layout)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "##### Observe Output\n\nYou can compute the disk capacity of your *Gather Phase 2* results.\n```\nHDF File Usage (Phases 1 & 2) - Per File and Total\n4.9G\t/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_raw_details.h5\n1.4G\t/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_missing_details.h5\n1.3G\t/resources/noaa-hdta/data/hdf5/15mar2015/noaa_ncdc_station_summaries.h5\n7.4G\t/resources/noaa-hdta/data/hdf5/15mar2015/\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "%%bash\n# Compute size of output folders\necho \"HDF File Usage (Phases 1 & 2) - Per File and Total\"\ndu -ah /resources/noaa-hdta/data/hdf5/15mar2015/", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "# References\n\n* [Data Analysis Workflow Navigation repository](https://github.com/vinomaster/dawn): This notebook outline was derived from the **Research Analysis Navigation Template**.\n* [NOAA National Climatic Data Center](http://www.ncdc.noaa.gov/)\n* [NOAA Data Fraud News](https://stevengoddard.wordpress.com/2013/01/11/noaa-temperature-fraud-expands-part-1/)\n* [HDF5](http://www.h5py.org/)\n* [HDF5 v. Database](http://nbviewer.ipython.org/github/dolaameng/tutorials/blob/master/ml-tutorials/BASIC_pandas_io%28specially%20hdf5%29.ipynb)\n\n# Summary\n\nThis notebook provides two approaches to the creation of human-readable datasets for historical daily temperature analytics. This analytical notebook is a component of a package of notebooks. The tasks addressed herein focused on data munging activities to produce a desired set of datasets for several predefined schemas. These datasets can now be used in other package notebooks for data exploration activities. \n\nThis notebook has embraced the concepts of reproducible research and can be shared with others so that they can recreate the data locally.", "cell_type": "markdown", "metadata": {}}, {"source": "# Future Investigations \n\n1. Fix the ordering of the record lifespan calculations. See code and comments for clarity. \n2. Fix the FTP link so that it can be embedded into the notebook.\n3. Explore multi-level importable notebooks to allow the tools files to share common code. ", "cell_type": "markdown", "metadata": {}}, {"source": "
\n
\n
\n
This notebook was created using [IBM Knowledge Anyhow Workbench](https://knowledgeanyhow.org). To learn more, visit us at https://knowledgeanyhow.org.
\n
\n
", "cell_type": "markdown", "metadata": {"collapsed": true}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 2", "name": "python2", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "2.7.6", "name": "python", "file_extension": ".py", "pygments_lexer": "ipython2", "codemirror_mode": {"version": 2, "name": "ipython"}}}} -------------------------------------------------------------------------------- /noaa/etl/noaa_hdta_etl_csv_tools.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Tools for CSV FIle Processing" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Gather Phase Tools\n", 15 | "\n", 16 | "This importable notebook provides the tooling necessary to handle the processing for the **Gather Phases** in the ETL process for the NOAA HDTA project. This tooling supports Approaches 1 and 2 using **CSV files**. \n", 17 | "\n", 18 | "Each of the process phases require a dictionary to drive the workflow. \n", 19 | "\n", 20 | "```\n", 21 | "project_layout = { \n", 22 | " 'Content_Version': '',\n", 23 | " 'Daily_Input_Files': '',\n", 24 | " 'Raw_Details': '',\n", 25 | " 'Missing_Details': '',\n", 26 | " 'Station_Summary': '', \n", 27 | " 'Station_Details': '',\n", 28 | " }\n", 29 | "\n", 30 | "```\n", 31 | "Process Phase | Function to run \n", 32 | "--- | --- \n", 33 | "Phase 1 Approach 1 | noaa_run_phase1_approach1(project_layout)\n", 34 | "Phase 2 Approach 1 | noaa_run_phase2_approach1(project_layout)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 8, 40 | "metadata": { 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "# " 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 9, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "# \n", 57 | "\n", 58 | "import os\n", 59 | "import time\n", 60 | "import glob\n", 61 | "import struct\n", 62 | "import collections\n", 63 | "import pandas as pd\n", 64 | "\n", 65 | "# Create a collection for returning multiple lists of tuples\n", 66 | "approach1_bag = collections.namedtuple('GatherBag', ['raw', 'missing'])\n", 67 | "\n", 68 | "# Historical Raw Daily Detail\n", 69 | "raw_daily_detail_rec_template = {'StationID': \"\",\n", 70 | " 'Year': \"\",\n", 71 | " 'Month': \"\",\n", 72 | " 'Day': \"\",\n", 73 | " 'Type': \"\",\n", 74 | " 'FahrenheitTemp': \"\",\n", 75 | " }\n", 76 | "\n", 77 | "# Historical Daily Missing Record Detail\n", 78 | "missing_detail_rec_template = {'StationID': \"\",\n", 79 | " 'Year': \"\",\n", 80 | " 'Month': \"\",\n", 81 | " 'Day': \"\",\n", 82 | " 'Type': \"\",\n", 83 | " }\n", 84 | "\n", 85 | "def get_filename(pathname):\n", 86 | " '''Fetch filename portion of pathname.'''\n", 87 | " plist = pathname.split('/')\n", 88 | " fname, fext = os.path.splitext(plist[len(plist)-1])\n", 89 | " return fname\n", 90 | "\n", 91 | "def elapsed_time(secs):\n", 92 | " '''Compute formated time stamp given seconds elapsed. 
'''\n", 93 | " m, s = divmod(secs, 60)\n", 94 | " h, m = divmod(m, 60)\n", 95 | " et = \"%d:%02d:%02d\" % (h, m, s)\n", 96 | " return et\n", 97 | "\n", 98 | "def noaa_convert_c2f(noaa_temp):\n", 99 | " '''Returns Fahrenheit temperature value from a NOAA temperature (tenths of degrees C) '''\n", 100 | " celsius = int(noaa_temp)/10\n", 101 | " fahrenheit = 9.0/5.0 * celsius + 32\n", 102 | " return round(fahrenheit,1)\n", 103 | "\n", 104 | "def noaa_gather_lines(lines):\n", 105 | " '''Return dataframes for raw and missing detail from list of lines.'''\n", 106 | " # Create list of tuples \n", 107 | " raw_list = []\n", 108 | " missing_list = []\n", 109 | " for index, line in enumerate(lines):\n", 110 | " #print(\"Processing line {0}.\").format(index)\n", 111 | " r = noaa_gather_daily_detail(line)\n", 112 | " raw_list += r.raw\n", 113 | " missing_list += r.missing\n", 114 | " # Construct dataframes\n", 115 | " df_raw = pd.DataFrame(raw_list)\n", 116 | " df_missing = pd.DataFrame(missing_list)\n", 117 | " return approach1_bag(df_raw, df_missing)\n", 118 | "\n", 119 | "def noaa_gather_daily_detail(line):\n", 120 | " '''Extract content from daily record, create raw and missing tuples.'''\n", 121 | " station_time_element = struct.unpack('11s4s2s4s', line[0:21])\n", 122 | " raw_tuple_list = []\n", 123 | " missing_tuple_list = []\n", 124 | " if station_time_element[3] == 'TMIN' or station_time_element[3] == 'TMAX':\n", 125 | " values = line[21:]\n", 126 | " day_of_month = 0\n", 127 | " while(len(values) > 7):\n", 128 | " day_of_month = day_of_month + 1\n", 129 | " day_measure = struct.unpack('5ssss', values[0:8])\n", 130 | " if day_measure[0] != '-9999':\n", 131 | " raw_tuple = dict(raw_daily_detail_rec_template)\n", 132 | " # Compute degrees fahrenheit\n", 133 | " fahrenheit = noaa_convert_c2f(int(day_measure[0]))\n", 134 | " # Construct raw detail record\n", 135 | " raw_tuple['StationID'] = station_time_element[0]\n", 136 | " raw_tuple['Year'] = station_time_element[1]\n", 137 | " raw_tuple['Month']= station_time_element[2]\n", 138 | " raw_tuple['Day'] = day_of_month\n", 139 | " raw_tuple['Type'] = station_time_element[3]\n", 140 | " raw_tuple['FahrenheitTemp'] = fahrenheit\n", 141 | " raw_tuple_list.append(raw_tuple)\n", 142 | " else:\n", 143 | " # Construct missing detail record\n", 144 | " missing_tuple = dict(missing_detail_rec_template)\n", 145 | " missing_tuple['StationID'] = station_time_element[0]\n", 146 | " missing_tuple['Year'] = station_time_element[1]\n", 147 | " missing_tuple['Month']= station_time_element[2]\n", 148 | " missing_tuple['Day'] = day_of_month\n", 149 | " missing_tuple['Type'] = station_time_element[3]\n", 150 | " missing_tuple_list.append(missing_tuple)\n", 151 | " # Adjust offest for next day\n", 152 | " values = values[8:] \n", 153 | " # Return new tuples\n", 154 | " return approach1_bag(raw_tuple_list, missing_tuple_list) \n", 155 | "\n", 156 | "def noaa_process_hcn_daily_file(fname):\n", 157 | " '''Return dataframes for raw and missing detail from lines in file.'''\n", 158 | " print(\"Extracting content from file {0}.\").format(fname)\n", 159 | " x = 0\n", 160 | " raw_cols = ['StationID', 'Year', 'Month', 'Day', 'Type', 'FahrenheitTemp']\n", 161 | " missing_cols = ['StationID', 'Year', 'Month', 'Day', 'Type']\n", 162 | " # Create list of tuples \n", 163 | " raw_list = []\n", 164 | " missing_list = []\n", 165 | " # Start Timer\n", 166 | " start_time = time.time()\n", 167 | " with open(fname,'r') as f:\n", 168 | " lines = f.readlines()\n", 169 | " # Changed next 2 lines 
only.\n", 170 | " for line in lines:\n", 171 | " x += 1\n", 172 | " #print(\" .... Processing line {0}.\").format(x)\n", 173 | " r = noaa_gather_daily_detail(line)\n", 174 | " raw_list += r.raw\n", 175 | " missing_list += r.missing\n", 176 | " f.close() \n", 177 | " seconds = (time.time() - start_time)\n", 178 | " print(\">> Processing Complete: {0} lines of file {1}.\").format(x, fname)\n", 179 | " print(\">> Elapsed file execution time {0}\").format(elapsed_time(seconds))\n", 180 | " # Capture and Sort Results in DataFrames\n", 181 | " df_raw = pd.DataFrame(raw_list)\n", 182 | " df_missing = pd.DataFrame(missing_list)\n", 183 | " r = df_raw.sort(raw_cols).reindex(columns=raw_cols)\n", 184 | " m = df_missing.sort(missing_cols).reindex(columns=missing_cols)\n", 185 | " return approach1_bag(r, m)\n", 186 | "\n", 187 | "def noaa_run_phase1_approach1(project_layout):\n", 188 | " '''Process corpus of daily files and store results in CSV files.'''\n", 189 | " try:\n", 190 | " if not project_layout['Daily_Input_Files']:\n", 191 | " raise Exception(\"Incomplete or missing dictionary of project folder details.\")\n", 192 | " print(\">> Processing Started ...\")\n", 193 | " # Start Timer\n", 194 | " start_time = time.time()\n", 195 | " for index, fname in enumerate(glob.glob(project_layout['Daily_Input_Files'])):\n", 196 | " station_name = get_filename(fname)\n", 197 | " print(\">> Processing file {0}: {1}\").format(index, station_name)\n", 198 | " raw_file = os.path.join(project_layout['Raw_Details'], '', station_name + '_raw.csv')\n", 199 | " missing_file = os.path.join(project_layout['Missing_Details'], '', station_name + '_mis.csv')\n", 200 | " r = noaa_process_hcn_daily_file(fname)\n", 201 | " r.raw.to_csv(raw_file)\n", 202 | " r.missing.to_csv(missing_file)\n", 203 | " seconds = (time.time() - start_time)\n", 204 | " print(\">> Processing Complete.\")\n", 205 | " print(\">> Elapsed corpus execution time {0}\").format(elapsed_time(seconds))\n", 206 | " except Exception as e:\n", 207 | " print(\">> Processing Failed: Error {0}\").format(e.message)\n", 208 | " " 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### noaa_run_phase1_approach1\n", 216 | "Takes a dictionary of project folder details to drive the processing of *Gather Phase 1 Approach 1* using **CSV files**." 
217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "collapsed": false 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "# \n", 228 | "project_layout = { \n", 229 | " 'Content_Version': '',\n", 230 | " 'Daily_Input_Files': '',\n", 231 | " 'Raw_Details': '',\n", 232 | " 'Missing_Details': '',\n", 233 | " 'Station_Summary': '', \n", 234 | " 'Station_Details': '',\n", 235 | " }\n", 236 | "noaa_run_phase1_approach1(project_layout)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 1, 242 | "metadata": { 243 | "collapsed": true 244 | }, 245 | "outputs": [], 246 | "source": [ 247 | "# \n", 248 | "\n", 249 | "import os\n", 250 | "import glob\n", 251 | "import time\n", 252 | "import datetime\n", 253 | "import collections\n", 254 | "import pandas as pd\n", 255 | "import traceback\n", 256 | "\n", 257 | "# Create a collection for returning multiple lists of tuples\n", 258 | "approach2_bag = collections.namedtuple('GatherBag', ['DailySummary', 'DailyDetail'])\n", 259 | "\n", 260 | "# Historical Daily Summary\n", 261 | "summary_template = {'StationID': \"\", \n", 262 | " 'Month': \"\", \n", 263 | " 'Day': \"\",\n", 264 | " 'FirstYearOfRecord': \"\", \n", 265 | " 'TMin': \"\", \n", 266 | " 'TMinRecordYear': \"\", \n", 267 | " 'TMax': \"\", \n", 268 | " 'TMaxRecordYear': \"\", \n", 269 | " 'CurTMinMaxDelta': \"\", \n", 270 | " 'CurTMinRecordDur': \"\", \n", 271 | " 'CurTMaxRecordDur': \"\", \n", 272 | " 'MaxDurTMinRecord': \"\", \n", 273 | " 'MinDurTMinRecord': \"\", \n", 274 | " 'MaxDurTMaxRecord': \"\", \n", 275 | " 'MinDurTMaxRecord': \"\", \n", 276 | " 'TMinRecordCount': \"\", \n", 277 | " 'TMaxRecordCount': \"\" \n", 278 | " }\n", 279 | "\n", 280 | "summary_cols = ['StationID', 'Month', 'Day', 'FirstYearOfRecord', \n", 281 | " 'TMin', 'TMinRecordYear', 'TMax', 'TMaxRecordYear',\n", 282 | " 'CurTMinMaxDelta', 'CurTMinRecordDur','CurTMaxRecordDur',\n", 283 | " 'MaxDurTMinRecord', 'MinDurTMinRecord',\n", 284 | " 'MaxDurTMaxRecord', 'MinDurTMaxRecord',\n", 285 | " 'TMinRecordCount', 'TMaxRecordCount' \n", 286 | " ]\n", 287 | "\n", 288 | "# Historical Daily Detail\n", 289 | "detail_template = {'StationID': \"\",\n", 290 | " 'Year': \"\",\n", 291 | " 'Month': \"\",\n", 292 | " 'Day': \"\",\n", 293 | " 'Type': \"\",\n", 294 | " 'OldTemp': \"\",\n", 295 | " 'NewTemp': \"\",\n", 296 | " 'TDelta': \"\"\n", 297 | " }\n", 298 | "\n", 299 | "detail_cols = ['StationID', 'Year', 'Month', 'Day', 'Type', \n", 300 | " 'NewTemp', 'OldTemp', 'TDelta'\n", 301 | " ]\n", 302 | "\n", 303 | "def get_filename(pathname):\n", 304 | " '''Fetch filename portion of pathname.'''\n", 305 | " plist = pathname.split('/')\n", 306 | " fname, fext = os.path.splitext(plist[len(plist)-1])\n", 307 | " return fname\n", 308 | "\n", 309 | "def elapsed_time(secs):\n", 310 | " '''Compute formated time stamp given seconds elapsed. 
'''\n", 311 | " m, s = divmod(secs, 60)\n", 312 | " h, m = divmod(m, 60)\n", 313 | " et = \"%d:%02d:%02d\" % (h, m, s)\n", 314 | " return et\n", 315 | "\n", 316 | "def get_key_list(hdf5file, type='raw_detail'):\n", 317 | " '''Return a list of keys for requested type from specified HDF file.'''\n", 318 | " print(\"Fetching keys for type = {0}\").format(type)\n", 319 | " keylist = []\n", 320 | " store = None\n", 321 | " try:\n", 322 | " store = pd.HDFStore(hdf5file,'r')\n", 323 | " h5keys = store.keys()\n", 324 | " store.close()\n", 325 | " for k in h5keys:\n", 326 | " if k.find(type) > -1:\n", 327 | " keylist.append(k)\n", 328 | " except:\n", 329 | " if store:\n", 330 | " store.close()\n", 331 | " raise\n", 332 | " return keylist\n", 333 | "\n", 334 | "def cleans_invalid_days(df):\n", 335 | " '''Return a dataframe void of invalid days'''\n", 336 | " ShortMths = {4,6,9,11}\n", 337 | " df_clean = df.query('(((Month not in @ShortMths) & (Day != 31)) and ((Month != 2) or (Day < 30)) )')\n", 338 | " return df_clean\n", 339 | "\n", 340 | "def noaa_process_phase2_records(raw_tuples):\n", 341 | " '''Compute formated time stamp given seconds elapsed. '''\n", 342 | " # Sample Tuple:\n", 343 | " # (0, 'USC00011084', '1926', '01', 21, 'TMAX', 73.400000000000006)\n", 344 | " #\n", 345 | " # Create several 12x31 matrices to store daily detail per metric of interest.\n", 346 | " fyr_online_for_day = [[9999 for x in range(32)] for x in range(13)] \n", 347 | " tmin_history = [[-99 for x in range(32)] for x in range(13)] \n", 348 | " tmax_history = [[-99 for x in range(32)] for x in range(13)] \n", 349 | " tminyr_history = [[-99 for x in range(32)] for x in range(13)] \n", 350 | " tmaxyr_history = [[-99 for x in range(32)] for x in range(13)] \n", 351 | " tminrc_history = [[0 for x in range(32)] for x in range(13)] \n", 352 | " tmaxrc_history = [[0 for x in range(32)] for x in range(13)] \n", 353 | " tmax_max_life = [[0 for x in range(32)] for x in range(13)] \n", 354 | " tmax_min_life = [[9999 for x in range(32)] for x in range(13)] \n", 355 | " tmin_max_life = [[0 for x in range(32)] for x in range(13)] \n", 356 | " tmin_min_life = [[9999 for x in range(32)] for x in range(13)] \n", 357 | " # Capture Station ID (all raw-tuples are per station)\n", 358 | " station_ID = raw_tuples[0][1]\n", 359 | " # Process each raw daily tuple: create daily retail tuples while updating matrices.\n", 360 | " detail_list = []\n", 361 | " for t in raw_tuples:\n", 362 | " detail_row = dict(detail_template)\n", 363 | " detail_row['StationID'] = t[1]\n", 364 | " detail_row['Year'] = t[2]\n", 365 | " detail_row['Month'] = t[3]\n", 366 | " detail_row['Day'] = str(t[4])\n", 367 | " month = int(t[3])-1\n", 368 | " day = t[4]-1\n", 369 | " # For this day, what was the first year in which this station was operational?\n", 370 | " if fyr_online_for_day[month][day] > int(t[2]):\n", 371 | " fyr_online_for_day[month][day] = int(t[2])\n", 372 | " # Handle TMAX\n", 373 | " if (t[5] == 'TMAX'):\n", 374 | " # Handle TMAX for first record\n", 375 | " if (tmax_history[month][day] == -99):\n", 376 | " # Handle TMAX for first \n", 377 | " detail_row['Type'] = 'TMAX'\n", 378 | " detail_row['OldTemp'] = round(t[6],1)\n", 379 | " detail_row['NewTemp'] = round(t[6],1)\n", 380 | " detail_row['TDelta'] = 0\n", 381 | " tmax_history[month][day] = round(t[6],1)\n", 382 | " tmaxyr_history[month][day] = int(t[2])\n", 383 | " tmaxrc_history[month][day] = 1\n", 384 | " tmax_min_life[month][day] = 0\n", 385 | " tmax_max_life[month][day] = 0\n", 386 | " # Add 
new daily detail row\n", 387 | " detail_list.append(detail_row)\n", 388 | " # Handle TMAX for new daily record\n", 389 | " elif (round(t[6],1) > tmax_history[month][day]):\n", 390 | " detail_row['Type'] = 'TMAX'\n", 391 | " detail_row['OldTemp'] = tmax_history[month][day]\n", 392 | " detail_row['NewTemp'] = round(t[6],1)\n", 393 | " detail_row['TDelta'] = round(t[6],1) - tmax_history[month][day]\n", 394 | " current_tmin_duration = int(t[2]) - tminyr_history[month][day]\n", 395 | " current_tmax_duration = int(t[2]) - tmaxyr_history[month][day]\n", 396 | " if tmin_max_life[month][day] == 0:\n", 397 | " tmin_max_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 398 | " if tmax_max_life[month][day] == 0:\n", 399 | " tmax_max_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 400 | " if current_tmax_duration > tmax_max_life[month][day]:\n", 401 | " tmax_max_life[month][day] = current_tmax_duration\n", 402 | " if current_tmin_duration < tmin_max_life[month][day]:\n", 403 | " tmin_max_life[month][day] = current_tmax_duration\n", 404 | " tmax_history[month][day] = round(t[6],1)\n", 405 | " tmaxyr_history[month][day] = int(t[2])\n", 406 | " tmaxrc_history[month][day] += 1\n", 407 | " # Add new daily detail row\n", 408 | " detail_list.append(detail_row)\n", 409 | " if (t[5] == 'TMIN'):\n", 410 | " # Handle TMIN for first record\n", 411 | " if (tmin_history[month][day] == -99):\n", 412 | " # Handle TMIN for first \n", 413 | " detail_row['Type'] = 'TMIN'\n", 414 | " detail_row['OldTemp'] = round(t[6],1)\n", 415 | " detail_row['NewTemp'] = round(t[6],1)\n", 416 | " detail_row['TDelta'] = 0\n", 417 | " tmin_history[month][day] = round(t[6],1)\n", 418 | " tminyr_history[month][day] = int(t[2])\n", 419 | " tminrc_history[month][day] = 1\n", 420 | " tmin_min_life[month][day] = 0\n", 421 | " tmin_max_life[month][day] = 0\n", 422 | " # Add new daily detail row\n", 423 | " detail_list.append(detail_row)\n", 424 | " # Handle TMIN for new daily record\n", 425 | " elif (round(t[6],1) < tmin_history[month][day]):\n", 426 | " detail_row['Type'] = 'TMIN'\n", 427 | " detail_row['OldTemp'] = tmin_history[month][day]\n", 428 | " detail_row['NewTemp'] = round(t[6],1)\n", 429 | " detail_row['TDelta'] = tmin_history[month][day] - round(t[6],1)\n", 430 | " current_tmin_duration = int(t[2]) - tminyr_history[month][day]\n", 431 | " current_tmax_duration = int(t[2]) - tmaxyr_history[month][day]\n", 432 | " if tmax_min_life[month][day] == 0:\n", 433 | " tmax_min_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 434 | " if tmin_min_life[month][day] == 0:\n", 435 | " tmin_min_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 436 | " if current_tmax_duration > tmax_min_life[month][day]:\n", 437 | " tmax_min_life[month][day] = current_tmin_duration\n", 438 | " if current_tmin_duration < tmin_min_life[month][day]:\n", 439 | " tmin_min_life[month][day] = current_tmin_duration\n", 440 | " tmin_history[month][day] = round(t[6],1)\n", 441 | " tminyr_history[month][day] = int(t[2])\n", 442 | " tminrc_history[month][day] += 1\n", 443 | " # Add new daily detail row\n", 444 | " detail_list.append(detail_row)\n", 445 | " # Create a daily summary record for each day of the year using our matrices.\n", 446 | " summary_list = []\n", 447 | " now = datetime.datetime.now()\n", 448 | " for mth in xrange(1,13):\n", 449 | " for day in xrange(1,32):\n", 450 | " m = mth-1\n", 451 | " d= day-1\n", 452 | " summary_row = dict(summary_template)\n", 453 | " summary_row['StationID'] = 
station_ID\n", 454 | " summary_row['Month'] = mth\n", 455 | " summary_row['Day'] = day\n", 456 | " summary_row['FirstYearOfRecord'] = fyr_online_for_day[m][d]\n", 457 | " summary_row['TMin'] = tmin_history[m][d]\n", 458 | " summary_row['TMinRecordYear'] = tminyr_history[m][d]\n", 459 | " summary_row['TMax'] = tmax_history[m][d] \n", 460 | " summary_row['TMaxRecordYear'] = tmaxyr_history[m][d]\n", 461 | " summary_row['CurTMinMaxDelta'] = summary_row['TMax'] - summary_row['TMin']\n", 462 | " summary_row['CurTMinRecordDur'] = int(now.year) - summary_row['TMinRecordYear']\n", 463 | " summary_row['CurTMaxRecordDur'] = int(now.year) - summary_row['TMaxRecordYear']\n", 464 | " summary_row['MaxDurTMinRecord'] = tmax_min_life[m][d] # Can not explain\n", 465 | " summary_row['MinDurTMinRecord'] = tmin_min_life[m][d]\n", 466 | " summary_row['MaxDurTMaxRecord'] = tmax_max_life[m][d]\n", 467 | " summary_row['MinDurTMaxRecord'] = tmin_max_life[m][d] # Can not explain \n", 468 | " summary_row['TMinRecordCount'] = tminrc_history[m][d]\n", 469 | " summary_row['TMaxRecordCount'] = tmaxrc_history[m][d]\n", 470 | " # Add new daily summary row\n", 471 | " summary_list.append(summary_row)\n", 472 | " return approach2_bag(summary_list, detail_list)\n", 473 | "\n", 474 | "def noaa_run_phase2_approach1(project_layout,create_details=False):\n", 475 | " '''Parse raw CVS files to create derived datasets.'''\n", 476 | " summary_list = []\n", 477 | " detail_list = []\n", 478 | " try: \n", 479 | " # Start Key Processing Timer\n", 480 | " start_time = time.time()\n", 481 | " raw_files = os.path.join(project_layout['Raw_Details'], '', '*_raw.csv')\n", 482 | " for index, fname in enumerate(glob.glob(raw_files)):\n", 483 | " f = get_filename(fname).split('_')[0]\n", 484 | " print(\"Processing dataset {0}: {1}\").format(index,f)\n", 485 | " summary_file = os.path.join(project_layout['Station_Summary'], '', f + '_sum.csv')\n", 486 | " detail_file = os.path.join(project_layout['Station_Details'], '', f + '_std.csv')\n", 487 | " dataset = pd.DataFrame.from_csv(fname)\n", 488 | " raw_tuples = list(dataset.itertuples())\n", 489 | " r = noaa_process_phase2_records(raw_tuples)\n", 490 | " df_summary = pd.DataFrame(r.DailySummary).sort(summary_cols).reindex(columns=summary_cols)\n", 491 | " df_cleaned_summary = cleans_invalid_days(df_summary)\n", 492 | " df_cleaned_summary.to_csv(summary_file)\n", 493 | " df_detail = pd.DataFrame(r.DailyDetail).sort(detail_cols).reindex(columns=detail_cols)\n", 494 | " df_cleaned_detail = cleans_invalid_days(df_detail)\n", 495 | " df_cleaned_detail.to_csv(detail_file)\n", 496 | " seconds = (time.time() - start_time)\n", 497 | " print(\">> Processing Complete.\")\n", 498 | " print(\">> Elapsed corpus execution time {0}\").format(elapsed_time(seconds)) \n", 499 | " except Exception as e:\n", 500 | " var = traceback.format_exc()\n", 501 | " print var\n", 502 | " print(\">> Processing Failed: Error {0}\").format(e.message)" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "### noaa_run_phase2_approach1\n", 510 | "Takes a dictionary of project folder details to drive the processing of *Gather Phase 2 Approach 1* using **CSV files**.\n", 511 | "\n", 512 | "#### Disk Storage Requirements\n", 513 | "\n", 514 | "* This function creates a **Station Summaries** dataset that requires 25MB of free space. \n", 515 | "* This function can also create a **Station Details** dataset. 
If you require this dataset to be generated, modify the call to ```noaa_run_phase2_approach2()``` with ```create_details=True```. You will need additional free space to support this feature. Estimated requirement: **150MB**" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": { 522 | "collapsed": false 523 | }, 524 | "outputs": [], 525 | "source": [ 526 | "# \n", 527 | "project_layout = { \n", 528 | " 'Content_Version': '',\n", 529 | " 'Daily_Input_Files': '',\n", 530 | " 'Raw_Details': '',\n", 531 | " 'Missing_Details': '',\n", 532 | " 'Station_Summary': '', \n", 533 | " 'Station_Details': '',\n", 534 | " }\n", 535 | "noaa_run_phase2_approach1(project_layout)" 536 | ] 537 | } 538 | ], 539 | "metadata": { 540 | "kernelspec": { 541 | "display_name": "Python 2", 542 | "language": "python", 543 | "name": "python2" 544 | }, 545 | "language_info": { 546 | "codemirror_mode": { 547 | "name": "ipython", 548 | "version": 2 549 | }, 550 | "file_extension": ".py", 551 | "mimetype": "text/x-python", 552 | "name": "python", 553 | "nbconvert_exporter": "python", 554 | "pygments_lexer": "ipython2", 555 | "version": "2.7.6" 556 | } 557 | }, 558 | "nbformat": 4, 559 | "nbformat_minor": 0 560 | } 561 | -------------------------------------------------------------------------------- /noaa/etl/noaa_hdta_etl_hdf_tools.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Tools for HDF FIle Processing" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Gather Phase Tools\n", 15 | "\n", 16 | "This importable notebook provides the tooling necessary to handle the processing for the **Gather Phases** in the ETL process for the NOAA HDTA project. This tooling supports Approaches 1 and 2 using **HDF files**. \n", 17 | "\n", 18 | "Each of the process phases require a dictionary to drive the workflow. 
\n", 19 | "\n", 20 | "```\n", 21 | "project_layout = { \n", 22 | " 'Content_Version': '',\n", 23 | " 'Daily_Input_Files': '',\n", 24 | " 'Raw_Details': '',\n", 25 | " 'Missing_Details': '',\n", 26 | " 'Station_Summary': '', \n", 27 | " 'Station_Details': '',\n", 28 | " }\n", 29 | "\n", 30 | "```\n", 31 | "Process Phase | Function to run \n", 32 | "--- | --- \n", 33 | "Phase 1 Approach 2 | noaa_run_phase1_approach2(project_layout)\n", 34 | "Phase 2 Approach 2 | noaa_run_phase2_approach2(project_layout)\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 8, 40 | "metadata": { 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "# " 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "# \n", 57 | "\n", 58 | "import os\n", 59 | "import time\n", 60 | "import glob\n", 61 | "import struct\n", 62 | "import collections\n", 63 | "import pandas as pd\n", 64 | "\n", 65 | "# Create a collection for returning multiple lists of tuples\n", 66 | "approach1_bag = collections.namedtuple('GatherBag', ['raw', 'missing'])\n", 67 | "\n", 68 | "# Historical Raw Daily Detail\n", 69 | "raw_daily_detail_rec_template = {'StationID': \"\",\n", 70 | " 'Year': \"\",\n", 71 | " 'Month': \"\",\n", 72 | " 'Day': \"\",\n", 73 | " 'Type': \"\",\n", 74 | " 'FahrenheitTemp': \"\",\n", 75 | " }\n", 76 | "\n", 77 | "# Historical Daily Missing Record Detail\n", 78 | "missing_detail_rec_template = {'StationID': \"\",\n", 79 | " 'Year': \"\",\n", 80 | " 'Month': \"\",\n", 81 | " 'Day': \"\",\n", 82 | " 'Type': \"\",\n", 83 | " }\n", 84 | "\n", 85 | "def get_filename(pathname):\n", 86 | " '''Fetch filename portion of pathname.'''\n", 87 | " plist = pathname.split('/')\n", 88 | " fname, fext = os.path.splitext(plist[len(plist)-1])\n", 89 | " return fname\n", 90 | "\n", 91 | "def elapsed_time(secs):\n", 92 | " '''Compute formated time stamp given seconds elapsed. 
'''\n", 93 | " m, s = divmod(secs, 60)\n", 94 | " h, m = divmod(m, 60)\n", 95 | " et = \"%d:%02d:%02d\" % (h, m, s)\n", 96 | " return et\n", 97 | "\n", 98 | "def noaa_convert_c2f(noaa_temp):\n", 99 | " '''Returns Fahrenheit temperature value from a NOAA temperature (tenths of degrees C) '''\n", 100 | " celsius = int(noaa_temp)/10\n", 101 | " fahrenheit = 9.0/5.0 * celsius + 32\n", 102 | " return round(fahrenheit,1)\n", 103 | "\n", 104 | "def noaa_gather_lines(lines):\n", 105 | " '''Return dataframes for raw and missing detail from list of lines.'''\n", 106 | " # Create list of tuples \n", 107 | " raw_list = []\n", 108 | " missing_list = []\n", 109 | " for index, line in enumerate(lines):\n", 110 | " #print(\"Processing line {0}.\").format(index)\n", 111 | " r = noaa_gather_daily_detail(line)\n", 112 | " raw_list += r.raw\n", 113 | " missing_list += r.missing\n", 114 | " # Construct dataframes\n", 115 | " df_raw = pd.DataFrame(raw_list)\n", 116 | " df_missing = pd.DataFrame(missing_list)\n", 117 | " return approach1_bag(df_raw, df_missing)\n", 118 | "\n", 119 | "def noaa_gather_daily_detail(line):\n", 120 | " '''Extract content from daily record, create raw and missing tuples.'''\n", 121 | " station_time_element = struct.unpack('11s4s2s4s', line[0:21])\n", 122 | " raw_tuple_list = []\n", 123 | " missing_tuple_list = []\n", 124 | " if station_time_element[3] == 'TMIN' or station_time_element[3] == 'TMAX':\n", 125 | " values = line[21:]\n", 126 | " day_of_month = 0\n", 127 | " while(len(values) > 7):\n", 128 | " day_of_month = day_of_month + 1\n", 129 | " day_measure = struct.unpack('5ssss', values[0:8])\n", 130 | " if day_measure[0] != '-9999':\n", 131 | " raw_tuple = dict(raw_daily_detail_rec_template)\n", 132 | " # Compute degrees fahrenheit\n", 133 | " fahrenheit = noaa_convert_c2f(int(day_measure[0]))\n", 134 | " # Construct raw detail record\n", 135 | " raw_tuple['StationID'] = station_time_element[0]\n", 136 | " raw_tuple['Year'] = station_time_element[1]\n", 137 | " raw_tuple['Month']= station_time_element[2]\n", 138 | " raw_tuple['Day'] = day_of_month\n", 139 | " raw_tuple['Type'] = station_time_element[3]\n", 140 | " raw_tuple['FahrenheitTemp'] = fahrenheit\n", 141 | " raw_tuple_list.append(raw_tuple)\n", 142 | " else:\n", 143 | " # Construct missing detail record\n", 144 | " missing_tuple = dict(missing_detail_rec_template)\n", 145 | " missing_tuple['StationID'] = station_time_element[0]\n", 146 | " missing_tuple['Year'] = station_time_element[1]\n", 147 | " missing_tuple['Month']= station_time_element[2]\n", 148 | " missing_tuple['Day'] = day_of_month\n", 149 | " missing_tuple['Type'] = station_time_element[3]\n", 150 | " missing_tuple_list.append(missing_tuple)\n", 151 | " # Adjust offest for next day\n", 152 | " values = values[8:] \n", 153 | " # Return new tuples\n", 154 | " return approach1_bag(raw_tuple_list, missing_tuple_list) \n", 155 | "\n", 156 | "def noaa_process_hcn_daily_file(fname):\n", 157 | " '''Return dataframes for raw and missing detail from lines in file.'''\n", 158 | " print(\"Extracting content from file {0}.\").format(fname)\n", 159 | " x = 0\n", 160 | " raw_cols = ['StationID', 'Year', 'Month', 'Day', 'Type', 'FahrenheitTemp']\n", 161 | " missing_cols = ['StationID', 'Year', 'Month', 'Day', 'Type']\n", 162 | " # Create list of tuples \n", 163 | " raw_list = []\n", 164 | " missing_list = []\n", 165 | " # Start Timer\n", 166 | " start_time = time.time()\n", 167 | " with open(fname,'r') as f:\n", 168 | " lines = f.readlines()\n", 169 | " # Changed next 2 lines 
only.\n", 170 | " for line in lines:\n", 171 | " x += 1\n", 172 | " r = noaa_gather_daily_detail(line)\n", 173 | " raw_list += r.raw\n", 174 | " missing_list += r.missing\n", 175 | " f.close() \n", 176 | " seconds = (time.time() - start_time)\n", 177 | " print(\">> Processing Complete: {0} lines of file {1}.\").format(x, fname)\n", 178 | " print(\">> Elapsed file execution time {0}\").format(elapsed_time(seconds))\n", 179 | " # Capture and Sort Results in DataFrames\n", 180 | " df_raw = pd.DataFrame(raw_list)\n", 181 | " df_missing = pd.DataFrame(missing_list)\n", 182 | " r = df_raw.sort(raw_cols).reindex(columns=raw_cols)\n", 183 | " m = df_missing.sort(missing_cols).reindex(columns=missing_cols)\n", 184 | " return approach1_bag(r, m)\n", 185 | "\n", 186 | "def noaa_run_phase1_approach2(project_layout):\n", 187 | " '''Process corpus of daily files and store results in HDF file.'''\n", 188 | " raw_store = None\n", 189 | " missing_store = None\n", 190 | " try:\n", 191 | " if os.path.isfile(project_layout['Raw_Details']):\n", 192 | " raise Exception(\"Raw Details file already exists.\")\n", 193 | " if os.path.isfile(project_layout['Missing_Details']):\n", 194 | " raise Exception(\"Missing Details file already exists.\")\n", 195 | " # Start Timer\n", 196 | " start_time = time.time()\n", 197 | " raw_store = pd.HDFStore(project_layout['Raw_Details'],'w')\n", 198 | " missing_store = pd.HDFStore(project_layout['Missing_Details'],'w')\n", 199 | " for index, fname in enumerate(glob.glob(project_layout['Daily_Input_Files'])):\n", 200 | " f = get_filename(fname)\n", 201 | " print(\">> Processing file {0}: {1}\").format(index, f)\n", 202 | " r = noaa_process_hcn_daily_file(fname)\n", 203 | " raw_store.put('noaa_hdta_raw_detail/' + \n", 204 | " project_layout['Content_Version'] + \n", 205 | " '/' + f, r.raw\n", 206 | " )\n", 207 | " missing_store.put('noaa_hdta_missing_detail/' +\n", 208 | " project_layout['Content_Version'] + '/'\n", 209 | " + f, r.missing\n", 210 | " )\n", 211 | " raw_store.close()\n", 212 | " missing_store.close()\n", 213 | " seconds = (time.time() - start_time)\n", 214 | " print(\">> Processing Complete.\")\n", 215 | " print(\">> Elapsed corpus execution time {0}\").format(elapsed_time(seconds))\n", 216 | " except Exception as e:\n", 217 | " if raw_store:\n", 218 | " raw_store.close()\n", 219 | " if missing_store:\n", 220 | " missing_store.close()\n", 221 | " print(\">> Processing Failed: Error {0}\").format(e.message)\n", 222 | " " 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "### noaa_run_phase1_approach2\n", 230 | "Takes a dictionary of project folder details to drive the processing of *Gather Phase 1 Approach 2* using **HDF files**." 
231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "# \n", 242 | "project_layout = { \n", 243 | " 'Content_Version': '',\n", 244 | " 'Daily_Input_Files': '',\n", 245 | " 'Raw_Details': '',\n", 246 | " 'Missing_Details': '',\n", 247 | " 'Station_Summary': '', \n", 248 | " 'Station_Details': '',\n", 249 | " }\n", 250 | "noaa_run_phase1_approach2(project_layout)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 3, 256 | "metadata": { 257 | "collapsed": true 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "# \n", 262 | "\n", 263 | "import os\n", 264 | "import time\n", 265 | "import datetime\n", 266 | "import collections\n", 267 | "import pandas as pd\n", 268 | "import traceback\n", 269 | "\n", 270 | "# Create a collection for returning multiple lists of tuples\n", 271 | "approach2_bag = collections.namedtuple('GatherBag', ['DailySummary', 'DailyDetail'])\n", 272 | "\n", 273 | "# Historical Daily Summary\n", 274 | "summary_template = {'StationID': \"\", \n", 275 | " 'Month': \"\", \n", 276 | " 'Day': \"\",\n", 277 | " 'FirstYearOfRecord': \"\", \n", 278 | " 'TMin': \"\", \n", 279 | " 'TMinRecordYear': \"\", \n", 280 | " 'TMax': \"\", \n", 281 | " 'TMaxRecordYear': \"\", \n", 282 | " 'CurTMinMaxDelta': \"\", \n", 283 | " 'CurTMinRecordDur': \"\", \n", 284 | " 'CurTMaxRecordDur': \"\", \n", 285 | " 'MaxDurTMinRecord': \"\", \n", 286 | " 'MinDurTMinRecord': \"\", \n", 287 | " 'MaxDurTMaxRecord': \"\", \n", 288 | " 'MinDurTMaxRecord': \"\", \n", 289 | " 'TMinRecordCount': \"\", \n", 290 | " 'TMaxRecordCount': \"\" \n", 291 | " }\n", 292 | "\n", 293 | "summary_cols = ['StationID', 'Month', 'Day', 'FirstYearOfRecord', \n", 294 | " 'TMin', 'TMinRecordYear', 'TMax', 'TMaxRecordYear',\n", 295 | " 'CurTMinMaxDelta', 'CurTMinRecordDur','CurTMaxRecordDur',\n", 296 | " 'MaxDurTMinRecord', 'MinDurTMinRecord',\n", 297 | " 'MaxDurTMaxRecord', 'MinDurTMaxRecord',\n", 298 | " 'TMinRecordCount', 'TMaxRecordCount' \n", 299 | " ]\n", 300 | "\n", 301 | "# Historical Daily Detail\n", 302 | "detail_template = {'StationID': \"\",\n", 303 | " 'Year': \"\",\n", 304 | " 'Month': \"\",\n", 305 | " 'Day': \"\",\n", 306 | " 'Type': \"\",\n", 307 | " 'OldTemp': \"\",\n", 308 | " 'NewTemp': \"\",\n", 309 | " 'TDelta': \"\"\n", 310 | " }\n", 311 | "\n", 312 | "detail_cols = ['StationID', 'Year', 'Month', 'Day', 'Type', \n", 313 | " 'NewTemp', 'OldTemp', 'TDelta'\n", 314 | " ]\n", 315 | "\n", 316 | "def get_filename(pathname):\n", 317 | " '''Fetch filename portion of pathname.'''\n", 318 | " plist = pathname.split('/')\n", 319 | " fname, fext = os.path.splitext(plist[len(plist)-1])\n", 320 | " return fname\n", 321 | "\n", 322 | "def elapsed_time(secs):\n", 323 | " '''Compute formated time stamp given seconds elapsed. 
'''\n", 324 | " m, s = divmod(secs, 60)\n", 325 | " h, m = divmod(m, 60)\n", 326 | " et = \"%d:%02d:%02d\" % (h, m, s)\n", 327 | " return et\n", 328 | "\n", 329 | "def get_key_list(hdf5file, type='raw_detail'):\n", 330 | " '''Return a list of keys for requested type from specified HDF file.'''\n", 331 | " print(\"Fetching keys for type = {0}\").format(type)\n", 332 | " keylist = []\n", 333 | " store = None\n", 334 | " try:\n", 335 | " store = pd.HDFStore(hdf5file,'r')\n", 336 | " h5keys = store.keys()\n", 337 | " store.close()\n", 338 | " for k in h5keys:\n", 339 | " if k.find(type) > -1:\n", 340 | " keylist.append(k)\n", 341 | " except:\n", 342 | " if store:\n", 343 | " store.close()\n", 344 | " raise\n", 345 | " return keylist\n", 346 | "\n", 347 | "def cleans_invalid_days(df):\n", 348 | " '''Return a dataframe void of invalid days'''\n", 349 | " ShortMths = {4,6,9,11}\n", 350 | " df_clean = df.query('(((Month not in @ShortMths) & (Day != 31)) and ((Month != 2) or (Day < 30)) )')\n", 351 | " return df_clean\n", 352 | "\n", 353 | "def noaa_gather2_process_records(raw_tuples):\n", 354 | " '''Compute formated time stamp given seconds elapsed. '''\n", 355 | " # Sample Tuple:\n", 356 | " # (0, 'USC00011084', '1926', '01', 21, 'TMAX', 73.400000000000006)\n", 357 | " #\n", 358 | " # Create several 12x31 matrices to store daily detail per metric of interest.\n", 359 | " fyr_online_for_day = [[9999 for x in range(32)] for x in range(13)] \n", 360 | " tmin_history = [[-99 for x in range(32)] for x in range(13)] \n", 361 | " tmax_history = [[-99 for x in range(32)] for x in range(13)] \n", 362 | " tminyr_history = [[-99 for x in range(32)] for x in range(13)] \n", 363 | " tmaxyr_history = [[-99 for x in range(32)] for x in range(13)] \n", 364 | " tminrc_history = [[0 for x in range(32)] for x in range(13)] \n", 365 | " tmaxrc_history = [[0 for x in range(32)] for x in range(13)] \n", 366 | " tmax_max_life = [[0 for x in range(32)] for x in range(13)] \n", 367 | " tmax_min_life = [[9999 for x in range(32)] for x in range(13)] \n", 368 | " tmin_max_life = [[0 for x in range(32)] for x in range(13)] \n", 369 | " tmin_min_life = [[9999 for x in range(32)] for x in range(13)] \n", 370 | " # Capture Station ID (all raw-tuples are per station)\n", 371 | " station_ID = raw_tuples[0][1]\n", 372 | " # Process each raw daily tuple: create daily retail tuples while updating matrices.\n", 373 | " detail_list = []\n", 374 | " for t in raw_tuples:\n", 375 | " detail_row = dict(detail_template)\n", 376 | " detail_row['StationID'] = t[1]\n", 377 | " detail_row['Year'] = t[2]\n", 378 | " detail_row['Month'] = t[3]\n", 379 | " detail_row['Day'] = str(t[4])\n", 380 | " month = int(t[3])-1\n", 381 | " day = t[4]-1\n", 382 | " # For this day, what was the first year in which this station was operational?\n", 383 | " if fyr_online_for_day[month][day] > int(t[2]):\n", 384 | " fyr_online_for_day[month][day] = int(t[2])\n", 385 | " # Handle TMAX\n", 386 | " if (t[5] == 'TMAX'):\n", 387 | " # Handle TMAX for first record\n", 388 | " if (tmax_history[month][day] == -99):\n", 389 | " # Handle TMAX for first \n", 390 | " detail_row['Type'] = 'TMAX'\n", 391 | " detail_row['OldTemp'] = round(t[6],1)\n", 392 | " detail_row['NewTemp'] = round(t[6],1)\n", 393 | " detail_row['TDelta'] = 0\n", 394 | " tmax_history[month][day] = round(t[6],1)\n", 395 | " tmaxyr_history[month][day] = int(t[2])\n", 396 | " tmaxrc_history[month][day] = 1\n", 397 | " tmax_min_life[month][day] = 0\n", 398 | " tmax_max_life[month][day] = 0\n", 399 | " # Add 
new daily detail row\n", 400 | " detail_list.append(detail_row)\n", 401 | " # Handle TMAX for new daily record\n", 402 | " elif (round(t[6],1) > tmax_history[month][day]):\n", 403 | " detail_row['Type'] = 'TMAX'\n", 404 | " detail_row['OldTemp'] = tmax_history[month][day]\n", 405 | " detail_row['NewTemp'] = round(t[6],1)\n", 406 | " detail_row['TDelta'] = round(t[6],1) - tmax_history[month][day]\n", 407 | " current_tmin_duration = int(t[2]) - tminyr_history[month][day]\n", 408 | " current_tmax_duration = int(t[2]) - tmaxyr_history[month][day]\n", 409 | " if tmin_max_life[month][day] == 0:\n", 410 | " tmin_max_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 411 | " if tmax_max_life[month][day] == 0:\n", 412 | " tmax_max_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 413 | " if current_tmax_duration > tmax_max_life[month][day]:\n", 414 | " tmax_max_life[month][day] = current_tmax_duration\n", 415 | " if current_tmin_duration < tmin_max_life[month][day]:\n", 416 | " tmin_max_life[month][day] = current_tmax_duration\n", 417 | " tmax_history[month][day] = round(t[6],1)\n", 418 | " tmaxyr_history[month][day] = int(t[2])\n", 419 | " tmaxrc_history[month][day] += 1\n", 420 | " # Add new daily detail row\n", 421 | " detail_list.append(detail_row)\n", 422 | " if (t[5] == 'TMIN'):\n", 423 | " # Handle TMIN for first record\n", 424 | " if (tmin_history[month][day] == -99):\n", 425 | " # Handle TMIN for first \n", 426 | " detail_row['Type'] = 'TMIN'\n", 427 | " detail_row['OldTemp'] = round(t[6],1)\n", 428 | " detail_row['NewTemp'] = round(t[6],1)\n", 429 | " detail_row['TDelta'] = 0\n", 430 | " tmin_history[month][day] = round(t[6],1)\n", 431 | " tminyr_history[month][day] = int(t[2])\n", 432 | " tminrc_history[month][day] = 1\n", 433 | " tmin_min_life[month][day] = 0\n", 434 | " tmin_max_life[month][day] = 0\n", 435 | " # Add new daily detail row\n", 436 | " detail_list.append(detail_row)\n", 437 | " # Handle TMIN for new daily record\n", 438 | " elif (round(t[6],1) < tmin_history[month][day]):\n", 439 | " detail_row['Type'] = 'TMIN'\n", 440 | " detail_row['OldTemp'] = tmin_history[month][day]\n", 441 | " detail_row['NewTemp'] = round(t[6],1)\n", 442 | " detail_row['TDelta'] = tmin_history[month][day] - round(t[6],1)\n", 443 | " current_tmin_duration = int(t[2]) - tminyr_history[month][day]\n", 444 | " current_tmax_duration = int(t[2]) - tmaxyr_history[month][day]\n", 445 | " if tmax_min_life[month][day] == 0:\n", 446 | " tmax_min_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 447 | " if tmin_min_life[month][day] == 0:\n", 448 | " tmin_min_life[month][day] = int(t[2]) - fyr_online_for_day[month][day]\n", 449 | " if current_tmax_duration > tmax_min_life[month][day]:\n", 450 | " tmax_min_life[month][day] = current_tmin_duration\n", 451 | " if current_tmin_duration < tmin_min_life[month][day]:\n", 452 | " tmin_min_life[month][day] = current_tmin_duration\n", 453 | " tmin_history[month][day] = round(t[6],1)\n", 454 | " tminyr_history[month][day] = int(t[2])\n", 455 | " tminrc_history[month][day] += 1\n", 456 | " # Add new daily detail row\n", 457 | " detail_list.append(detail_row)\n", 458 | " # Create a daily summary record for each day of the year using our matrices.\n", 459 | " summary_list = []\n", 460 | " now = datetime.datetime.now()\n", 461 | " for mth in xrange(1,13):\n", 462 | " for day in xrange(1,32):\n", 463 | " m = mth-1\n", 464 | " d= day-1\n", 465 | " summary_row = dict(summary_template)\n", 466 | " summary_row['StationID'] = 
station_ID\n", 467 | " summary_row['Month'] = mth\n", 468 | " summary_row['Day'] = day\n", 469 | " summary_row['FirstYearOfRecord'] = fyr_online_for_day[m][d]\n", 470 | " summary_row['TMin'] = tmin_history[m][d]\n", 471 | " summary_row['TMinRecordYear'] = tminyr_history[m][d]\n", 472 | " summary_row['TMax'] = tmax_history[m][d] \n", 473 | " summary_row['TMaxRecordYear'] = tmaxyr_history[m][d]\n", 474 | " summary_row['CurTMinMaxDelta'] = summary_row['TMax'] - summary_row['TMin']\n", 475 | " summary_row['CurTMinRecordDur'] = int(now.year) - summary_row['TMinRecordYear']\n", 476 | " summary_row['CurTMaxRecordDur'] = int(now.year) - summary_row['TMaxRecordYear']\n", 477 | " summary_row['MaxDurTMinRecord'] = tmax_min_life[m][d] # Can not explain\n", 478 | " summary_row['MinDurTMinRecord'] = tmin_min_life[m][d]\n", 479 | " summary_row['MaxDurTMaxRecord'] = tmax_max_life[m][d]\n", 480 | " summary_row['MinDurTMaxRecord'] = tmin_max_life[m][d] # Can not explain \n", 481 | " summary_row['TMinRecordCount'] = tminrc_history[m][d]\n", 482 | " summary_row['TMaxRecordCount'] = tmaxrc_history[m][d]\n", 483 | " # Add new daily summary row\n", 484 | " summary_list.append(summary_row)\n", 485 | " return approach2_bag(summary_list, detail_list)\n", 486 | "\n", 487 | "def noaa_run_phase2_approach2(project_layout,create_details=False):\n", 488 | " '''Parse H5 dataset to create derived datasets.'''\n", 489 | " raw_store = None\n", 490 | " summary_store = None\n", 491 | " detail_store = None\n", 492 | " try:\n", 493 | " if not os.path.isfile(project_layout['Raw_Details']):\n", 494 | " raise Exception(\"Raw Details file does not exist.\")\n", 495 | " if os.path.isfile(project_layout['Station_Summary']):\n", 496 | " raise Exception(\"Station Summary file already exists.\")\n", 497 | " if create_details and os.path.isfile(project_layout['Station_Details']):\n", 498 | " raise Exception(\"Station Details file already exists.\")\n", 499 | " # Start Key Fetch Timer\n", 500 | " start_time = time.time()\n", 501 | " keys = get_key_list(project_layout['Raw_Details'])\n", 502 | " seconds = (time.time() - start_time)\n", 503 | " print(\">> Fetch Complete.\")\n", 504 | " print(\">> Elapsed key-fetch execution time {0}\").format(elapsed_time(seconds)) \n", 505 | " # Start Key Processing Timer\n", 506 | " start_time = time.time()\n", 507 | " summary_store = pd.HDFStore(project_layout['Station_Summary'],'w')\n", 508 | " if create_details:\n", 509 | " detail_store = pd.HDFStore(project_layout['Station_Details'],'w')\n", 510 | " raw_store = pd.HDFStore(project_layout['Raw_Details'],'r')\n", 511 | " for index, k in enumerate(keys):\n", 512 | " f = get_filename(k)\n", 513 | " print(\"Processing dataset {0} - {1}: {2}\").format(index,len(keys),f)\n", 514 | " ds = raw_store.get(k)\n", 515 | " raw_tuples = list(ds.itertuples())\n", 516 | " r = noaa_gather2_process_records(raw_tuples)\n", 517 | " # Capture results and store dataframes in H5 files\n", 518 | " df_summary = pd.DataFrame(r.DailySummary).sort(summary_cols).reindex(columns=summary_cols)\n", 519 | " df_cleaned_summary = cleans_invalid_days(df_summary)\n", 520 | " summary_store.put('noaa_hdta_station_summary/' + \n", 521 | " project_layout['Content_Version'] + \n", 522 | " '/' + f, df_cleaned_summary\n", 523 | " )\n", 524 | " if create_details:\n", 525 | " df_detail = pd.DataFrame(r.DailyDetail).sort(detail_cols).reindex(columns=detail_cols)\n", 526 | " df_cleaned_detail = cleans_invalid_days(df_detail)\n", 527 | " detail_store.put('noaa_hdta_station_daily_detail/' +\n", 528 | " 
project_layout['Content_Version'] + '/'\n", 529 | " + f, df_cleaned_detail\n", 530 | " )\n", 531 | " raw_store.close()\n", 532 | " summary_store.close()\n", 533 | " if create_details:\n", 534 | " detail_store.close()\n", 535 | " seconds = (time.time() - start_time)\n", 536 | " print(\">> Processing Complete.\")\n", 537 | " print(\">> Elapsed corpus execution time {0}\").format(elapsed_time(seconds)) \n", 538 | " except Exception as e:\n", 539 | " if raw_store:\n", 540 | " raw_store.close()\n", 541 | " if summary_store:\n", 542 | " summary_store.close()\n", 543 | " if detail_store:\n", 544 | " detail_store.close()\n", 545 | " var = traceback.format_exc()\n", 546 | " print var\n", 547 | " print(\">> Processing Failed: Error {0}\").format(e.message)" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "### noaa_run_phase2_approach2\n", 555 | "Takes a dictionary of project folder details to drive the processing of *Gather Phase 2 Approach 2* using **HDF files**.\n", 556 | "\n", 557 | "#### Disk Storage Requirements\n", 558 | "\n", 559 | "* This function creates a **Station Summaries** dataset that requires ~2GB of free space. \n", 560 | "* This function can also create a **Station Details** dataset. If you require this dataset to be generated, modify the call to ```noaa_run_phase2_approach2()``` with ```create_details=True```. You will need additional free space to support this feature. Estimated requirement: **5GB**\n" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": null, 566 | "metadata": { 567 | "collapsed": false 568 | }, 569 | "outputs": [], 570 | "source": [ 571 | "# \n", 572 | "project_layout = { \n", 573 | " 'Content_Version': '',\n", 574 | " 'Daily_Input_Files': '',\n", 575 | " 'Raw_Details': '',\n", 576 | " 'Missing_Details': '',\n", 577 | " 'Station_Summary': '', \n", 578 | " 'Station_Details': '',\n", 579 | " }\n", 580 | "noaa_run_phase2_approach2(project_layout)" 581 | ] 582 | } 583 | ], 584 | "metadata": { 585 | "kernelspec": { 586 | "display_name": "Python 2", 587 | "language": "python", 588 | "name": "python2" 589 | }, 590 | "language_info": { 591 | "codemirror_mode": { 592 | "name": "ipython", 593 | "version": 2 594 | }, 595 | "file_extension": ".py", 596 | "mimetype": "text/x-python", 597 | "name": "python", 598 | "nbconvert_exporter": "python", 599 | "pygments_lexer": "ipython2", 600 | "version": "2.7.6" 601 | } 602 | }, 603 | "nbformat": 4, 604 | "nbformat_minor": 0 605 | } 606 | -------------------------------------------------------------------------------- /noaa/hdtadash/README.md: -------------------------------------------------------------------------------- 1 | # Temperature Record Frequency Dashboard 2 | 3 | >STATUS NOTES: 4 | > 5 | 1. Used ```urth-core-watch patch``` and a few hacks. 6 | 2. ToDo: revisit code after v0.1.1 is release of ```urth`` components. 7 | 8 | This analytical notebook is a component of a [package of notebooks](https://github.com/ibm-et/jupyter-samples/tree/master/noaa). The package is intended to serve as an exercise in the applicability of Juypter Notebooks to public weather data for DIY Analytics. 
9 | 10 | ## Demo Concepts 11 | This notebook makes use of the following Project Jupyter features: 12 | 13 | * [jupyter-incubator/declarativewidgets](https://github.com/jupyter-incubator/declarativewidgets) 14 | * [jupyter-incubator/dashboards](https://github.com/jupyter-incubator/dashboards) 15 | * [Polymer Widgets](https://www.polymer-project.org/1.0/) 16 | 17 | ## Objective 18 | 19 | There has been a great deal of discussion around climate change and global warming. Since NOAA has made their data public, let us explore the data ourselves and see what insights we can discover. 20 | 21 | 1. How many weather stations are there in the US? 22 | 2. For US weather stations, what is the average number of years of record keeping? 23 | 3. For each US weather station, on each day of the year, identify the frequency at which daily High and Low temperature records are broken. 24 | 4. Does the historical frequency of daily temperature records (High or Low) in the US provide statistical evidence of dramatic climate change? 25 | 5. What is the average life-span of a daily temperature record (High or Low) in the US? 26 | 27 | >If there is scientific evidence of extreme fluctuations in our weather patterns due to human impact on the environment, then we should be able to identify factual examples of an increase in the frequency of extreme temperatures. 28 | 29 | ## Data 30 | This notebook was developed using a **March 16, 2015** snapshot of USA-Only daily temperature readings from the Global Historical Climatology Network. The [Data Munging](https://github.com/ibm-et/jupyter-samples/tree/master/noaa/etl) project was used to generate datasets in CSV format. 31 | 32 | ## Usage 33 | 34 | ### Data Preparation Options 35 | 36 | 1. Use the [NOAA Data Munging](https://github.com/ibm-et/jupyter-samples/tree/master/noaa/etl) project to generate CSV files for the latest NOAA data. 37 | 2. Use the sample **March 16, 2015** snapshot provided in this repo. Open a terminal session and run these commands: 38 | ``` 39 | $ cd /home/main/notebooks/noaa/hdtadash/data/ 40 | $ tar -xvf station_summaries.tar 41 | ``` 42 | 43 | ### Run Dashboard 44 | 45 | 1. Open the ```weather_dashboard.ipynb``` notebook 46 | 2. Run all cells 47 | 3. Change the view to dashboard mode 48 | 49 | ## Citation Information 50 | 51 | * [GHCN-Daily journal article](http://doi.org/10.1175/JTECH-D-11-00103.1): Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910. 52 | * Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: [Global Historical Climatology Network - Daily (GHCN-Daily)](http://doi.org/10.7289/V5D21VHZ), [Version 3.20-upd-2015031605], NOAA National Climatic Data Center [March 16, 2015]. 
53 | 54 | -------------------------------------------------------------------------------- /noaa/hdtadash/folium_map.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "!pip install folium" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "%matplotlib inline" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "collapsed": false 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "%%html\n", 34 | "\n", 35 | "\n", 36 | "\n", 37 | "\n", 38 | "\n", 39 | "\n", 40 | "\n", 41 | "\n", 42 | "\n", 43 | "\n", 44 | "\n", 45 | "" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "import os\n", 57 | "import struct\n", 58 | "import glob\n", 59 | "import pandas as pd\n", 60 | "import datetime as dt\n", 61 | "\n", 62 | "# Use this global variable to specify the path for station summary files.\n", 63 | "NOAA_STATION_SUMMARY_PATH = \"/home/jovyan/work/noaa/data/\"\n", 64 | "\n", 65 | "# Use this global variable to specify the path for the GHCND Station Directory\n", 66 | "STATION_DETAIL_FILE = '/home/jovyan/work/noaa/data/ghcnd-stations.txt'\n", 67 | "\n", 68 | "# Station detail structures for building station lists\n", 69 | "station_detail_colnames = ['StationID','State','Name',\n", 70 | " 'Latitude','Longitude','QueryTag']\n", 71 | "\n", 72 | "station_detail_rec_template = {'StationID': \"\",\n", 73 | " 'State': \"\",\n", 74 | " 'Name': \"\",\n", 75 | " 'Latitude': \"\",\n", 76 | " 'Longitude': \"\",\n", 77 | " 'QueryTag': \"\"\n", 78 | " }\n", 79 | "\n", 80 | "# -----------------------------------\n", 81 | "# Station Detail Processing\n", 82 | "# -----------------------------------\n", 83 | "def get_filename(pathname):\n", 84 | " '''Fetch filename portion of pathname.'''\n", 85 | " plist = pathname.split('/')\n", 86 | " fname, fext = os.path.splitext(plist[len(plist)-1])\n", 87 | " return fname\n", 88 | "\n", 89 | "def fetch_station_list():\n", 90 | " '''Return list of available stations given collection of summary files on disk.'''\n", 91 | " station_list = []\n", 92 | " raw_files = os.path.join(NOAA_STATION_SUMMARY_PATH,'','*_sum.csv')\n", 93 | " for index, fname in enumerate(glob.glob(raw_files)):\n", 94 | " f = get_filename(fname).split('_')[0]\n", 95 | " station_list.append(str(f))\n", 96 | " return station_list\n", 97 | "\n", 98 | "USA_STATION_LIST = fetch_station_list()\n", 99 | "\n", 100 | "def gather_states(fname,stations): \n", 101 | " '''Return a list of unique State abbreviations. 
Weather station data exists for these states.'''\n", 102 | " state_list = []\n", 103 | " with open(fname, 'r', encoding='utf-8') as f:\n", 104 | " lines = f.readlines()\n", 105 | " f.close()\n", 106 | " for line in lines:\n", 107 | " r = noaa_gather_station_detail(line,stations)\n", 108 | " state_list += r\n", 109 | " df_unique_states = pd.DataFrame(state_list,columns=station_detail_colnames).sort('State').State.unique()\n", 110 | " return df_unique_states.tolist()\n", 111 | "\n", 112 | "def noaa_gather_station_detail(line,slist):\n", 113 | " '''Build a list of station tuples for stations in the USA.'''\n", 114 | " station_tuple_list = []\n", 115 | " station_id_key = line[0:3]\n", 116 | " if station_id_key == 'USC' or station_id_key == 'USW': \n", 117 | " fields = struct.unpack('12s9s10s7s2s30s', line[0:70].encode())\n", 118 | " if fields[0].decode().strip() in slist:\n", 119 | " station_tuple = dict(station_detail_rec_template)\n", 120 | " station_tuple['StationID'] = fields[0].decode().strip()\n", 121 | " station_tuple['State'] = fields[4].decode().strip()\n", 122 | " station_tuple['Name'] = fields[5].decode().strip()\n", 123 | " station_tuple['Latitude'] = fields[1].decode().strip()\n", 124 | " station_tuple['Longitude'] = fields[2].decode().strip()\n", 125 | " qt = \"{0} at {1} in {2}\".format(fields[0].decode().strip(),fields[5].decode().strip(),fields[4].decode().strip())\n", 126 | " station_tuple['QueryTag'] = qt\n", 127 | " station_tuple_list.append(station_tuple)\n", 128 | " return station_tuple_list\n", 129 | "\n", 130 | "USA_STATES_WITH_STATIONS = gather_states(STATION_DETAIL_FILE,USA_STATION_LIST)\n", 131 | "\n", 132 | "def process_station_detail_for_state(fname,stations,statecode): \n", 133 | " '''Return dataframe of station detail for specified state.'''\n", 134 | " station_list = []\n", 135 | " with open(fname, 'r', encoding='utf-8') as f:\n", 136 | " lines = f.readlines()\n", 137 | " f.close()\n", 138 | " for line in lines:\n", 139 | " r = noaa_build_station_detail_for_state(line,stations,statecode)\n", 140 | " station_list += r\n", 141 | " return pd.DataFrame(station_list,columns=station_detail_colnames)\n", 142 | "\n", 143 | "def noaa_build_station_detail_for_state(line,slist,statecode):\n", 144 | " '''Build a list of station tuples for the specified state in the USA.'''\n", 145 | " station_tuple_list = []\n", 146 | " station_id_key = line[0:3]\n", 147 | " if station_id_key == 'USC' or station_id_key == 'USW':\n", 148 | " fields = struct.unpack('12s9s10s7s2s30s', line[0:70].encode())\n", 149 | " if ((fields[0].decode().strip() in slist) and (fields[4].decode().strip() == statecode)): \n", 150 | " station_tuple = dict(station_detail_rec_template)\n", 151 | " station_tuple['StationID'] = fields[0].decode().strip()\n", 152 | " station_tuple['State'] = fields[4].decode().strip()\n", 153 | " station_tuple['Name'] = fields[5].decode().strip()\n", 154 | " station_tuple['Latitude'] = fields[1].decode().strip()\n", 155 | " station_tuple['Longitude'] = fields[2].decode().strip()\n", 156 | " qt = \"Station {0} in {1} at {2}\".format(fields[0].decode().strip(),fields[4].decode().strip(),fields[5].decode().strip())\n", 157 | " station_tuple['QueryTag'] = qt\n", 158 | " station_tuple_list.append(station_tuple)\n", 159 | " return station_tuple_list\n", 160 | "\n", 161 | "df = process_station_detail_for_state(STATION_DETAIL_FILE,USA_STATION_LIST,\"NY\")\n", 162 | "df.tail()" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | 
"collapsed": false 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "import numpy as np\n", 174 | "import folium\n", 175 | "from IPython.display import HTML\n", 176 | "\n", 177 | "def display_map(m, height=500):\n", 178 | " \"\"\"Takes a folium instance and embed HTML.\"\"\"\n", 179 | " m._build_map()\n", 180 | " srcdoc = m.HTML.replace('\"', '"')\n", 181 | " embed = ''.format(srcdoc, height)\n", 182 | " return embed\n", 183 | "\n", 184 | "def render_map(df,height=500):\n", 185 | " centerpoint_latitude = np.mean(df.Latitude.astype(float))\n", 186 | " centerpoint_longitude = np.mean(df.Longitude.astype(float))\n", 187 | " map_obj = folium.Map(location=[centerpoint_latitude, centerpoint_longitude],zoom_start=6)\n", 188 | " for index, row in df.iterrows():\n", 189 | " map_obj.simple_marker([row.Latitude, row.Longitude], popup=row.QueryTag)\n", 190 | " return display_map(map_obj)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": false 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "map_doc = render_map(df)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "collapsed": true 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "from urth.widgets.widget_channels import channel\n", 213 | "channel(\"noaaquery\").set(\"theMap\", map_doc)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": false 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "%%html\n", 225 | "" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": { 236 | "collapsed": false 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "map_doc" 241 | ] 242 | } 243 | ], 244 | "metadata": { 245 | "kernelspec": { 246 | "display_name": "Python 3", 247 | "language": "python", 248 | "name": "python3" 249 | }, 250 | "language_info": { 251 | "codemirror_mode": { 252 | "name": "ipython", 253 | "version": 3 254 | }, 255 | "file_extension": ".py", 256 | "mimetype": "text/x-python", 257 | "name": "python", 258 | "nbconvert_exporter": "python", 259 | "pygments_lexer": "ipython3", 260 | "version": "3.4.3" 261 | } 262 | }, 263 | "nbformat": 4, 264 | "nbformat_minor": 0 265 | } 266 | -------------------------------------------------------------------------------- /noaa/hdtadash/urth-core-watch.html: -------------------------------------------------------------------------------- 1 | 2 | 5 | 29 | -------------------------------------------------------------------------------- /noaa/hdtadash/urth-raw-html.html: -------------------------------------------------------------------------------- 1 | 2 | 5 | 20 | -------------------------------------------------------------------------------- /noaa/hdtadash/urth_env_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "data": { 12 | "text/html": [ 13 | "\n", 14 | "\n", 15 | "\n", 16 | "\n", 17 | "\n", 18 | "\n", 19 | "\n", 20 | "\n", 21 | "\n", 22 | "\n", 23 | "" 24 | ], 25 | "text/plain": [ 26 | "" 27 | ] 28 | }, 29 | "metadata": {}, 30 | "output_type": "display_data" 31 | } 32 | ], 33 | "source": [ 34 | "%%html\n", 35 | "\n", 36 | "\n", 37 | "\n", 38 | "\n", 39 | "\n", 40 | "\n", 41 | "\n", 42 | "\n", 43 | "\n", 44 | "\n", 45 | "" 46 | ] 47 | }, 48 | { 49 
| "cell_type": "code", 50 | "execution_count": 4, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [ 55 | { 56 | "data": { 57 | "text/html": [ 58 | "
\n", 59 | "\n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | "
FirstLast NameRoleAmountBigger NumberWebsite
0JohnJohnsonWeb Developer131234325431http://javi.er
1JaneDoeSoftware Engineer4561434215411http://www.ibm.us
2JoeSmithRockstar Dev45261237328421http://cooldevs.org/xavier
\n", 101 | "
" 102 | ], 103 | "text/plain": [ 104 | " First Last Name Role Amount Bigger Number \\\n", 105 | "0 John Johnson Web Developer 13 1234325431 \n", 106 | "1 Jane Doe Software Engineer 456 1434215411 \n", 107 | "2 Joe Smith Rockstar Dev 4526 1237328421 \n", 108 | "\n", 109 | " Website \n", 110 | "0 http://javi.er \n", 111 | "1 http://www.ibm.us \n", 112 | "2 http://cooldevs.org/xavier " 113 | ] 114 | }, 115 | "execution_count": 4, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "import pandas as pd\n", 122 | "aDataFrame = pd.DataFrame([\n", 123 | " [\"John\", \"Johnson\",\"Web Developer\", \"13\", \"1234325431\", \"http://javi.er\"], \n", 124 | " [\"Jane\", \"Doe\",\"Software Engineer\", \"456\", \"1434215411\", \"http://www.ibm.us\"],\n", 125 | " [\"Joe\", \"Smith\",\"Rockstar Dev\", \"4526\", \"1237328421\", \"http://cooldevs.org/xavier\"]\n", 126 | " ], columns=[\"First \", \"Last Name\", \"Role\", \"Amount\", \"Bigger Number\", \"Website\"]\n", 127 | ")\n", 128 | "aDataFrame" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 14, 134 | "metadata": { 135 | "collapsed": true 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "from urth.widgets.widget_channels import channel\n", 140 | "channel(\"urthenv\").set(\"showMoreInfo\", \"\")" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "#### Run code below and then select a row." 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 15, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/html": [ 160 | "" 183 | ], 184 | "text/plain": [ 185 | "" 186 | ] 187 | }, 188 | "metadata": {}, 189 | "output_type": "display_data" 190 | } 191 | ], 192 | "source": [ 193 | "%%html\n", 194 | "" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "#### Now run the code below to see how it effects the table widget." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 17, 229 | "metadata": { 230 | "collapsed": true 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "channel(\"urthenv\").set(\"showMoreInfo\", True)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 18, 240 | "metadata": { 241 | "collapsed": true 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "channel(\"urthenv\").set(\"showMoreInfo\", \"\")" 246 | ] 247 | } 248 | ], 249 | "metadata": { 250 | "kernelspec": { 251 | "display_name": "Python 3", 252 | "language": "python", 253 | "name": "python3" 254 | }, 255 | "language_info": { 256 | "codemirror_mode": { 257 | "name": "ipython", 258 | "version": 3 259 | }, 260 | "file_extension": ".py", 261 | "mimetype": "text/x-python", 262 | "name": "python", 263 | "nbconvert_exporter": "python", 264 | "pygments_lexer": "ipython3", 265 | "version": "3.4.3" 266 | } 267 | }, 268 | "nbformat": 4, 269 | "nbformat_minor": 0 270 | } 271 | -------------------------------------------------------------------------------- /noaa/tmaxfreq/README.md: -------------------------------------------------------------------------------- 1 | # High Temperature Record Frequency 2 | 3 | This analytical notebook is a component of a [package of notebooks](https://github.com/ibm-et/jupyter-samples/tree/master/noaa). The package is intended to serve as an exercise in the applicability of Juypter Notebooks to public weather data for DIY Analytics. 
4 | 5 | 6 | >This notebook was created using a technology preview based on Project Jupyter called [IBM Knowledge Anyhow Workbench](https://knowledgeanyhow.org). 7 | > While the technology preview pilot has concluded, several special helper functions used by this notebook (e.g., reusable notebooks) can now be found in [jupyter-incubator/contentmanagement](https://github.com/jupyter-incubator/contentmanagement). 8 | 9 | ## Demo Concepts 10 | This notebook makes use of the following Project Jupyter features: 11 | 12 | * [Python Widgets](https://github.com/ipython/ipywidgets) 13 | 14 | ## Objective 15 | 16 | There has been a great deal of discussion around climate change and global warming. Since NOAA has made its data public, let us explore the data ourselves and see what insights we can discover. 17 | 18 | 1. How many weather stations are there in the US? 19 | 2. For US weather stations, what is the average number of years of record keeping? 20 | 3. For each US weather station, on each day of the year, identify the frequency at which daily High and Low temperature records are broken. 21 | 4. Does the historical frequency of daily temperature records (High or Low) in the US provide statistical evidence of dramatic climate change? 22 | 5. What is the average life-span of a daily temperature record (High or Low) in the US? 23 | 24 | >If there is scientific evidence of extreme fluctuations in our weather patterns due to human impact on the environment, then we should be able to identify factual examples of an increase in the frequency of extreme temperatures. 25 | 26 | ## Data 27 | This notebook was developed using a **March 16, 2015** snapshot of USA-only daily temperature readings from the Global Historical Climatology Network. The [Data Munging](https://github.com/ibm-et/jupyter-samples/tree/master/noaa/etl) project was used to generate datasets in CSV format. 28 | 29 | ## Usage 30 | 31 | >WARNING: This notebook requires modifications if it is to work outside of [IBM Knowledge Anyhow Workbench](https://knowledgeanyhow.org). A sketch of the minimal changes appears at the end of this README. 32 | 33 | ### Prepare Data 34 | 35 | 1. Use the [Data Munging](https://github.com/ibm-et/jupyter-samples/tree/master/noaa/etl) project to generate CSV files. 36 | 2. Modify the ```NOAA_STATION_SUMMARY_PATH``` in ```noaaquery_tmaxfreq_tools.ipynb```. 37 | 38 | 39 | ### Run Analysis 40 | Open the ```noaaquery_tmaxfreq.ipynb``` notebook and follow the instructions. 41 | 42 | ## Citation Information 43 | 44 | * [GHCN-Daily journal article](http://doi.org/10.1175/JTECH-D-11-00103.1): Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910. 45 | * Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: [Global Historical Climatology Network - Daily (GHCN-Daily)](http://doi.org/10.7289/V5D21VHZ), [Version 3.20-upd-2015031605], NOAA National Climatic Data Center [March 16, 2015].
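## Example: Running Outside IBM Knowledge Anyhow Workbench

The sketch below illustrates the kind of minimal adjustments the warning above refers to. It assumes the code cells of ```noaaquery_tmaxfreq_tools.ipynb``` have already been executed in the current Python 2 session (for example by copying them in, since the KAW reusable-notebook helpers are not available elsewhere). The two paths are placeholders that must point at your own generated data, and the station ID passed to ```explore_tmaxfreq``` is likewise only an example.

```
# Placeholder paths -- point these at your generated summary CSVs and at the
# GHCND station directory file before querying.
NOAA_STATION_SUMMARY_PATH = "/path/to/noaa/summaries/"
STATION_DETAIL = "/path/to/ghcnd-stations.txt"

# Interactive exploration: builds the station list and renders the
# interact_manual widget defined in the tools notebook.
noaaquery()

# Non-interactive alternative: analyze a single station, looking for days on
# which the daily high record has been broken more than 5 times.
explore_tmaxfreq("USW00014739", "5")  # example station ID; hirec is passed as a string
```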
46 | 47 | 48 | 49 | -------------------------------------------------------------------------------- /noaa/tmaxfreq/noaaquery_tmaxfreq_tools.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Tools for NOAA TMAX Record Frequency Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 3, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "# " 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "name": "stderr", 30 | "output_type": "stream", 31 | "text": [ 32 | ":0: FutureWarning: IPython widgets are experimental and may change in the future.\n" 33 | ] 34 | } 35 | ], 36 | "source": [ 37 | "# \n", 38 | "from IPython.display import display, Image\n", 39 | "from IPython.html.widgets import interact_manual\n", 40 | "import matplotlib.pyplot as plt\n", 41 | "import os\n", 42 | "import struct\n", 43 | "import glob\n", 44 | "import seaborn as sns\n", 45 | "import pandas as pd\n", 46 | "import datetime as dt\n", 47 | "\n", 48 | "## Use this global variable to specify the path for station summary files.\n", 49 | "NOAA_STATION_SUMMARY_PATH = \"/resources/noaa-hdta/data/derived/15mar2015/summaries/\"\n", 50 | "\n", 51 | "## Build Station List\n", 52 | "station_detail_colnames = ['StationID','State','Name',\n", 53 | " 'Latitude','Longitude','QueryTag']\n", 54 | "\n", 55 | "station_detail_rec_template = {'StationID': \"\",\n", 56 | " 'State': \"\",\n", 57 | " 'Name': \"\",\n", 58 | " 'Latitude': \"\",\n", 59 | " 'Longitude': \"\",\n", 60 | " 'QueryTag': \"\"\n", 61 | " }\n", 62 | "\n", 63 | "STATION_DETAIL = '/resources/ghcnd-stations.txt'\n", 64 | "\n", 65 | "def get_filename(pathname):\n", 66 | " '''Fetch filename portion of pathname.'''\n", 67 | " plist = pathname.split('/')\n", 68 | " fname, fext = os.path.splitext(plist[len(plist)-1])\n", 69 | " return fname\n", 70 | "\n", 71 | "def fetch_station_list():\n", 72 | " station_list = []\n", 73 | " raw_files = os.path.join(NOAA_STATION_SUMMARY_PATH,'','*_sum.csv')\n", 74 | " for index, fname in enumerate(glob.glob(raw_files)):\n", 75 | " f = get_filename(fname).split('_')[0]\n", 76 | " station_list.append(str(f))\n", 77 | " return station_list\n", 78 | "\n", 79 | "def process_station_detail(fname,stations): \n", 80 | " '''Return dataframe of station detail.'''\n", 81 | " station_list = []\n", 82 | " with open(fname,'r') as f:\n", 83 | " lines = f.readlines()\n", 84 | " f.close()\n", 85 | " for line in lines:\n", 86 | " r = noaa_gather_station_detail(line,stations)\n", 87 | " station_list += r\n", 88 | " return pd.DataFrame(station_list,columns=station_detail_colnames)\n", 89 | " \n", 90 | "def noaa_gather_station_detail(line,slist):\n", 91 | " '''Build a list of stattion tuples.'''\n", 92 | " station_tuple_list = []\n", 93 | " station_id_key = line[0:3]\n", 94 | " if station_id_key == 'USC' or station_id_key == 'USW': \n", 95 | " fields = struct.unpack('12s9s10s7s2s30s', line[0:70])\n", 96 | " if fields[0].strip() in slist:\n", 97 | " station_tuple = dict(station_detail_rec_template)\n", 98 | " station_tuple['StationID'] = fields[0].strip()\n", 99 | " station_tuple['State'] = fields[4].strip()\n", 100 | " station_tuple['Name'] = fields[5].strip()\n", 101 | " station_tuple['Latitude'] = fields[1].strip()\n", 102 | " station_tuple['Longitude'] = 
fields[2].strip()\n", 103 | " qt = \"{0} at {1} in {2}\".format(fields[0].strip(),fields[5].strip(),fields[4].strip())\n", 104 | " station_tuple['QueryTag'] = qt\n", 105 | " station_tuple_list.append(station_tuple)\n", 106 | " return station_tuple_list\n", 107 | "\n", 108 | "# Exploration Widget\n", 109 | "month_abbrev = { 1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr',\n", 110 | " 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug',\n", 111 | " 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'\n", 112 | " }\n", 113 | "\n", 114 | "def compute_years_of_station_data(df):\n", 115 | " yrs = dt.date.today().year-min(df['FirstYearOfRecord'])\n", 116 | " print(\"This weather station has been in service and collecting data for {0} years.\").format(yrs)\n", 117 | " return\n", 118 | " \n", 119 | "def compute_tmax_record_quantity(df,freq):\n", 120 | " threshold = int(freq)\n", 121 | " df_result = df.query('(TMaxRecordCount > @threshold)')\n", 122 | " return df_result\n", 123 | " \n", 124 | "def fetch_station_data(stationid):\n", 125 | " fname = os.path.join(NOAA_STATION_SUMMARY_PATH,'',stationid+'_sum.csv')\n", 126 | " return pd.DataFrame.from_csv(fname)\n", 127 | "\n", 128 | "def create_day_identifier(month,day):\n", 129 | " return str(day)+'-'+month_abbrev[int(month)]\n", 130 | " \n", 131 | "def create_date_list(mlist,dlist):\n", 132 | " mv = mlist.values()\n", 133 | " dv = dlist.values()\n", 134 | " new_list = []\n", 135 | " for index, value in enumerate(mv):\n", 136 | " new_list.append(create_day_identifier(value,dv[index]))\n", 137 | " return new_list\n", 138 | "\n", 139 | "def create_record_date_list(mlist,dlist,ylist):\n", 140 | " mv = mlist.values()\n", 141 | " dv = dlist.values()\n", 142 | " yv = ylist.values()\n", 143 | " new_list = []\n", 144 | " for index, value in enumerate(mv): \n", 145 | " new_list.append(dt.date(yv[index],value,dv[index]))\n", 146 | " return new_list\n", 147 | "\n", 148 | "def compute_max_record_durations(df):\n", 149 | " dates = create_date_list(df['Month'].to_dict(),df['Day'].to_dict())\n", 150 | " s_dates = pd.Series(dates)\n", 151 | " s_values = pd.Series(df['MaxDurTMaxRecord'].to_dict().values())\n", 152 | " df_new = pd.concat([s_dates, s_values], axis=1)\n", 153 | " df_new.columns = {\"Duration\",\"Date\"}\n", 154 | " return df_new\n", 155 | "\n", 156 | "def plot_tmax_record_results(df):\n", 157 | " dates = create_record_date_list(df['Month'].to_dict(),\n", 158 | " df['Day'].to_dict(),\n", 159 | " df['TMaxRecordYear'].to_dict()\n", 160 | " )\n", 161 | " s_dates = pd.Series(dates)\n", 162 | " s_tempvalues = pd.Series(df['TMax'].to_dict().values())\n", 163 | " df_new = pd.concat([s_dates,s_tempvalues], axis=1)\n", 164 | " df_new.columns = {\"RecordDate\",\"RecordHighTemp\"}\n", 165 | " plt.figure(figsize = (9,9), dpi = 72)\n", 166 | " plt.xticks(rotation=90)\n", 167 | " sns.pointplot(df_new[\"RecordDate\"],df_new[\"RecordHighTemp\"])\n", 168 | " return df_new\n", 169 | "\n", 170 | "def plot_duration_results(df):\n", 171 | " plt.figure(figsize = (9,9), dpi = 72)\n", 172 | " plt.xlabel('Day')\n", 173 | " plt.ylabel('Record Duration in Years')\n", 174 | " plt.title('Maximum Duration for TMax Records')\n", 175 | " ax = plt.gca()\n", 176 | " colors= ['r', 'b']\n", 177 | " df.plot(kind='bar',color=colors, alpha=0.75, ax=ax)\n", 178 | " ax.xaxis.set_ticklabels( ['%s' % i for i in df.Date.values] )\n", 179 | " plt.grid(b=True, which='major', linewidth=1.0)\n", 180 | " plt.grid(b=True, which='minor')\n", 181 | " return\n", 182 | "\n", 183 | "def explore_tmaxfreq(station,hirec):\n", 184 | " 
df_station_detail = fetch_station_data(station)\n", 185 | " df_station_address_detail = process_station_detail(STATION_DETAIL,fetch_station_list())\n", 186 | " df_station_name = df_station_address_detail.query(\"(StationID == @station)\")\n", 187 | " qt = df_station_name.iloc[0][\"QueryTag\"]\n", 188 | " print(\"Historical high temperature record analysis for weather station {0}.\").format(qt)\n", 189 | " display(df_station_name)\n", 190 | " print(\"Station detail, quick glimpse.\")\n", 191 | " display(df_station_detail.head()) \n", 192 | " compute_years_of_station_data(df_station_detail)\n", 193 | " df_record_days = compute_tmax_record_quantity(df_station_detail,hirec)\n", 194 | " if not df_record_days.empty:\n", 195 | " print(\"This station has experienced {0} days of new record highs where a new record has been set more than {1} times throughout the operation of the station.\").format(len(df_record_days),hirec)\n", 196 | " display(df_record_days.head(10))\n", 197 | " print(\"Displayed above are the details for up to the first 10 new record high events. All records are ploted below.\")\n", 198 | " df_rec_results = plot_tmax_record_results(df_record_days)\n", 199 | " df_durations = compute_max_record_durations(df_record_days)\n", 200 | " plot_duration_results(df_durations)\n", 201 | " else:\n", 202 | " print(\"This weather station has not experienced any days with greater than {0} new record highs.\").format(hirec)\n", 203 | " return \n", 204 | " \n", 205 | "def noaaquery(renderer=lambda station,hirec : explore_tmaxfreq(station,hirec)):\n", 206 | " '''\n", 207 | " Creates an interactive query widget with an optional custom renderer.\n", 208 | " \n", 209 | " station: Weather Station ID\n", 210 | " hirec: Query indicator for the maximum number of TMAX records for a given day.\n", 211 | " '''\n", 212 | " df_station_detail = process_station_detail(STATION_DETAIL,fetch_station_list())\n", 213 | " station_vals = tuple(df_station_detail.StationID)\n", 214 | " hirec_vals = tuple(map(str, range(1,51)))\n", 215 | "\n", 216 | " @interact_manual(station=station_vals, hirec=hirec_vals)\n", 217 | " def noaaquery(station, hirec):\n", 218 | " '''Inner function that gets called when the user interacts with the widgets.'''\n", 219 | " try:\n", 220 | " station_id = station.strip()\n", 221 | " high_rec_freq = hirec\n", 222 | " except Exception as e:\n", 223 | " print(\"Widget Error: {0}\").format(e.message)\n", 224 | " renderer(station_id, high_rec_freq)" 225 | ] 226 | } 227 | ], 228 | "metadata": { 229 | "celltoolbar": "Dashboard", 230 | "kernelspec": { 231 | "display_name": "Python 2", 232 | "language": "python", 233 | "name": "python2" 234 | }, 235 | "language_info": { 236 | "codemirror_mode": { 237 | "name": "ipython", 238 | "version": 2 239 | }, 240 | "file_extension": ".py", 241 | "mimetype": "text/x-python", 242 | "name": "python", 243 | "nbconvert_exporter": "python", 244 | "pygments_lexer": "ipython2", 245 | "version": "2.7.6" 246 | } 247 | }, 248 | "nbformat": 4, 249 | "nbformat_minor": 0 250 | } 251 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | pandas 3 | matplotlib 4 | folium 5 | xlrd 6 | seaborn 7 | cloudant 8 | statsmodels 9 | jupyter_dashboards==0.1.0 10 | jupyter_declarativewidgets==0.1.0 11 | jupyter_cms==0.1.2 12 | -------------------------------------------------------------------------------- /united-nations/README.md: 
-------------------------------------------------------------------------------- 1 | # Population Growth Estimates 2 | A sample exploratory exercise using public data from the Population Division of the **United Nations** Department of Economic and Social Affairs. 3 | 4 | ## Objective 5 | 6 | Provide an introductory analysis of the growth rates within Senegal due to migration trends. 7 | 8 | > [Senegal](http://en.wikipedia.org/wiki/Senegal) has a population of over 13.5 million, about 42 percent of whom live in rural areas. Density in these areas varies from about 77 inhabitants per square kilometre (200/sq mi) in the west-central region to 2 per square kilometre (5.2/sq mi) in the arid eastern section. 9 | 10 | ## References 11 | 12 | * [Demographics of Senegal](http://en.wikipedia.org/wiki/Demographics_of_Senegal) 13 | * [United Nations World Population Prospects](http://esa.un.org/unpd/wpp/index.htm) --------------------------------------------------------------------------------