├── Lab_1
│   ├── .gitignore
│   ├── README.md
│   ├── Lab_1.ipynb
│   └── Lab_1-Solutions.ipynb
├── Lab_3
│   ├── orig_ip_bytes.log.zip
│   ├── README.md
│   └── Lab_3.ipynb
├── Lab_4
│   ├── README.md
│   └── Lab_4.ipynb
├── Lab_2
│   ├── README.md
│   ├── Lab_2.ipynb
│   └── Lab_2.1.ipynb
└── README.md

--------------------------------------------------------------------------------
/Lab_1/.gitignore:
--------------------------------------------------------------------------------
conn.log*
conn_sample*
.ipynb_checkpoints/*
*.swp

--------------------------------------------------------------------------------
/Lab_3/orig_ip_bytes.log.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/7h3rAm/Security-Data-Analysis/master/Lab_3/orig_ip_bytes.log.zip

--------------------------------------------------------------------------------
/Lab_1/README.md:
--------------------------------------------------------------------------------
# Goal
This lab is an introduction to IPython and Pandas. Hopefully you'll get comfortable enough with these basic building blocks to carry you through the rest of the labs.

# Requirements
* IPython
* Pandas

# Data
Full dataset available [here](http://www.secrepo.com/Security-Data-Analysis/Lab_1/conn.log.zip). This is the *conn.log* referenced in this lab (Lab 1).

--------------------------------------------------------------------------------
/Lab_4/README.md:
--------------------------------------------------------------------------------
# Goal
In the first part of this lab you'll get to explore some real live threat data(tm)*. All sorts of fun and exciting things will be yours to experiment with. You'll start by reading and cleaning up data (surprise), and you'll finish by exploring K-Means clustering and PCA!

*Real live threat data is domains and IPs

# Requirements
* IPython
* Pandas
* NumPy
* Matplotlib/Pylab
* SciPy
* scikit-learn

# Data
* host_detections.csv - Included in this repo
* mal_domains.csv - Included in this repo

--------------------------------------------------------------------------------
/Lab_3/README.md:
--------------------------------------------------------------------------------
# Goal
This lab aims to show basic statistical techniques in Python, including how to get the mean, median, mode, and standard deviation of a data set. In addition to looking at data distributions, some commonly used plot types for data exploration and summarization are covered, along with a brief introduction to some IPython magic functions.

# Requirements
* IPython
* Pandas
* NumPy
* Matplotlib/Pylab
* SciPy
* scikit-learn

# Data
* Uses conn_sample.log from Lab_1
* orig_ip_bytes.log is provided in this folder, and should be unzipped prior to use.

--------------------------------------------------------------------------------
/Lab_2/README.md:
--------------------------------------------------------------------------------
# Goal
In this lab, the techniques from the first lab are built on to do basic timeseries analysis (graphing and summarization). The first part of the lab uses some Bro data (http.log) and also examines the groupby() capability of Pandas. The second part of the lab (2.1) uses honeypot data to demonstrate looking for patterns over time; more timeseries graphing is covered, as well as a technique for determining whether two variables are correlated over time.
Make sure to let us know if you find anything interesting in the data!

# Requirements
* IPython
* Pandas
* NumPy
* Matplotlib
* GeoIP

# Data
* Full dataset available [here](http://www.secrepo.com/Security-Data-Analysis/Lab_2/http.log.zip). This is the *http.log* referenced in this lab (Lab 2).
* Honeypot data is available [here](http://www.secrepo.com/honeypot/honeypot.json.zip). This is the *honeypot.json* referenced in this lab (Lab 2.1). You must unzip honeypot.json.zip prior to running through the lab.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Security Data Analysis

## Introduction

This is a two day class I, [Mike Sconzo](https://twitter.com/sooshie), put together a while back with some help from [David Dorsey](https://twitter.com/trogdorsey). I gave this workshop at BSidesDFW and it seemed to go ok. However, it's being re-written, so I'm open sourcing this version for everybody to enjoy, and hopefully learn from.

The goal over all five labs is to get comfortable doing data analysis as well as machine learning in Python. There is some hand-holding initially, and as things progress the labs begin to assume you remember the techniques covered in prior labs. However, if you have questions/problems/fixes feel free to open issues and pull requests. Or, worst case, contact me via Twitter.

Good luck and enjoy!

## Labs
* Lab 1 - Introduction to IPython and Pandas
* Lab 2 - Introduction to time series analysis
* Lab 3 - Introduction to basic statistics in Python
* Lab 4 - Exploring threat data with K-Means clustering and PCA

## License
All of these labs are released under the [Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).

--------------------------------------------------------------------------------
/Lab_1/Lab_1.ipynb:
--------------------------------------------------------------------------------
{
 "metadata": {
  "name": "",
  "signature": "sha256:9ce307aa0ff6a7d080281b4fa4583cbd9f9bc5cd5d2c9d4508c4d739745fc2a7"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Lab 1\n",
      "\n",
      "## Introduction\n",
      "This is a basic introduction to IPython and pandas functionality. Pandas (Python Data Analysis Library) \"is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\" It (pandas) provides dataframe functionality for reading/accessing/manipulating data in memory. You can think of a dataframe as a table of indexed values.\n",
      "\n",
      "What you're currently looking at is an IPython Notebook; it acts as a way to interactively use the Python interpreter as well as a way to display graphs/charts/images/markdown alongside code. IPython is commonly used in scientific computing due to its flexibility. Much more information is available on the IPython website.\n",
      "\n",
      "Often data is stored in files, and the first goal is to get that information off of disk and into a dataframe. Since we're working with limited resources in this VM, we'll have to use samples of some of the files. Don't worry though, the same techniques apply if you're not sampling the files for exploration.\n",
      "\n",
      "## Tip\n",
      "If you ever want to know the various keyboard shortcuts, just click on a (non-code) cell or the text \"In []\" to the left of the cell, and press the *H* key. Or select *Help* from the menu above, and then *Keyboard Shortcuts*.\n",
      "___"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Exercises\n",
      "\n",
      "### File sampling\n",
      "First off, let's take a look at a log file generated by Bro (this log is similar to netflow logs as well). However, this log file is rather large and doesn't fit in memory.\n",
      "\n",
      "As part of the first exercise, figure out what the variable **sample_percent** should be set to in order to read in between 200,000 and 300,000 (randomly selected) lines from the file. After changing the variable, either click the *play* button above (it's the arrow) or hit the *[Shift]+[Enter]* keys at the same time."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import random\n",
      "logfile = 'conn.log'\n",
      "sample_percent = .01\n",
      "# count the lines in the file, then randomly pick sample_percent of them\n",
      "num_lines = sum(1 for line in open(logfile))\n",
      "slines = set(sorted(random.sample(xrange(num_lines), int(num_lines * sample_percent))))\n",
      "print \"%s lines in %s, using a sample of %s lines\" %(num_lines, logfile, len(slines))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### File Creation\n",
      "Awesome! Now that you have a subset of lines to work with, let's write them to another file so we'll have something to practice reading in. Simply hit *[Shift]+[Enter]* below to run the code in the cell and create a new file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "outfile = 'conn_sample.log'\n",
      "f = open(outfile, 'w+')\n",
      "i = open(logfile, 'r+')\n",
      "linecount = 0\n",
      "# write out only the randomly sampled line numbers\n",
      "for line in i:\n",
      "    if linecount in slines:\n",
      "        f.write(line)\n",
      "    linecount += 1\n",
      "f.close()\n",
      "i.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### File Input (CSV)\n",
      "This next cell does a couple of things: first it imports pandas so we can create a dataframe, and then it reads our newly created file from above into memory. You can see the separator is specified as \"\\t\" because Bro produces tab-delimited files by default. In this case we've also specified what we should call the columns in the dataframe.",
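      "\n",
      "One practical note, offered as an assumption about your copy of the data rather than something this lab requires: if your *conn.log* still contains Bro's commented header/footer lines (they start with *#*), a sketch like the following would skip them while reading:\n",
      "\n",
      "```python\n",
      "# hypothetical variant: ignore '#'-prefixed Bro header/footer lines while parsing\n",
      "conn_df = pd.read_csv(outfile, sep=\"\\t\", header=None, comment='#',\n",
      "                      names=['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','proto','service','duration','orig_bytes','resp_bytes','conn_state','local_orig','missed_bytes','history','orig_pkts','orig_ip_bytes','resp_pkts','resp_ip_bytes','tunnel_parents','threat','sample'])\n",
      "```"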
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "conn_df = pd.read_csv(outfile, sep=\"\\t\", header=None, names=['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','proto','service','duration','orig_bytes','resp_bytes','conn_state','local_orig','missed_bytes','history','orig_pkts','orig_ip_bytes','resp_pkts','resp_ip_bytes','tunnel_parents','threat','sample'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Verifying Input\n",
      "Now (in theory) the contents of the file should be in a nicely laid-out dataframe.\n",
      "\n",
      "For this next exercise, experiment with calling the **head()** and **tail()** methods to see the values at the beginning and end of the dataframe. You can also pass a number to **head()** and **tail()** to specify the number of lines you want to see. Remember to click *play* or press *[Shift]+[Enter]* to execute the code in the cell after you change it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Data Summarization\n",
      "Now create a new cell below this one. This can be accomplished by clicking on this cell once, and then clicking the *+* icon towards the top or selecting *Insert* from above and then selecting *Insert Cell Below*. After creating the new cell, it's time to learn about the **describe()** method that can be called on dataframes. This will give you a numeric summarization of all columns that contain numbers.\n",
      "\n",
      "Try it out!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df.describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Data Types\n",
      "Wait a second, isn't the ts column supposed to be a timestamp? Perhaps this column would be better suited as a time data type vs. a number.\n",
      "\n",
      "Run the cell below to see what type of information Python stored in each column."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df.dtypes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Converting Column Types\n",
      "Time to change the ts column to a datetime object! The cell below converts each value in the ts column (what should be a time stamp) with **datetime.fromtimestamp()**, and then re-assigns the result back to the dataframe in the same place. A new timestamp column could have been added to the dataframe instead, so that both the float value and the datetime object columns are present.\n",
      "\n",
      "Run the cell below to convert the column type.",
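      "\n",
      "As a minimal sketch of that alternative, using pandas' own **to_datetime()** (the *ts_dt* column name here is just an illustration, not something later cells rely on):\n",
      "\n",
      "```python\n",
      "# keep the raw epoch floats in 'ts' and add a separate datetime column\n",
      "conn_df['ts_dt'] = pd.to_datetime(conn_df['ts'], unit='s')\n",
      "```"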
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from datetime import datetime\n",
      "conn_df['ts'] = [datetime.fromtimestamp(float(date)) for date in conn_df['ts'].values]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df.dtypes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Data Value Exploration\n",
      "Verify that the conversion was successful. What is the datatype of the column now?\n",
      "\n",
      "Scroll back up the page and note where you ran the **describe()** function. You'll see under the threat and sample columns there is likely the value *NaN*. This stands for Not a Number and is a special value assigned to empty column values. There are a few ways to explore what values a column has. Two of these are **value_counts()** and **unique()**. \n",
      "\n",
      "Try them below on different columns. You can create new cells, or, if you want more than the last command's worth of output, you can put a print statement in front. \n",
      "\n",
      "What happens when you run them on a column with IPs (*id.orig_h, id.resp_h*)? What about sample or threat?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df['sample'].unique()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Remove Columns\n",
      "Another useful operation on a dataframe is removing and adding columns. Since the threat and sample columns contain only *NaNs*, we can safely remove them and not impact any analysis that may be performed. \n",
      "\n",
      "Below, the sample column is removed (dropped); add a similar line to drop the *threat* column and use a method from above to verify they are no longer in the dataframe."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df.drop('sample', axis=1, inplace=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Can you think of other columns to remove? Select a few and remove them as well. What does your dataframe look like now? (Insert additional cells as needed)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df.drop('threat', axis=1, inplace=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Row Selection\n",
      "\n",
      "You can use column values to select rows from the dataframes (and even view only specific columns). First, select all rows that contain *SSL* traffic by running the cell below.",
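      "\n",
      "Under the hood this is boolean indexing: the comparison produces a True/False Series, and passing that Series back into the dataframe keeps only the True rows. A small sketch of the same selection, split into steps:\n",
      "\n",
      "```python\n",
      "# the comparison yields a boolean Series aligned with conn_df's rows\n",
      "mask = conn_df['service'] == 'ssl'\n",
      "# indexing with the mask keeps only the rows where the mask is True\n",
      "conn_df[mask].head()\n",
      "```"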
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conn_df[conn_df['service'] == 'ssl'].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we can assign that result to a dataframe, and then look at all the *SSL* connections that happen over ports other than 443."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ssl_df = conn_df[conn_df['service'] == 'ssl']\n",
      "ssl_df[ssl_df['id.resp_p'] != 443].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can see the individual column selections above, eg: *conn_df['service']* and *ssl_df['id.resp_p']* respectively. You can use these to view output of specific columns. \n",
      "\n",
      "For example, run the cell below to see all the individual values of originator bytes associated with an *SSL* connection over port 443."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ssl_df[ssl_df['id.resp_p'] == 443][['orig_bytes','proto']].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Final Exercise\n",
      "Use all of the techniques above to display the unique ports and originator IPs (bonus points for the number of connections of each) associated with all *HTTP* connections **NOT** over port 80. (Hint: create a new dataframe for easier manipulation)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}

--------------------------------------------------------------------------------
/Lab_2/Lab_2.ipynb:
--------------------------------------------------------------------------------
{
 "metadata": {
  "name": "",
  "signature": "sha256:07f1250410d997c0e0c4f4c222283b13aee9623461a8009e0eeaa5ccbbf35917"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Lab 2\n",
      "\n",
      "## Introduction\n",
      "In this lab, data grouping and graphing will be explored. There is always a great deal of information you can gather by grouping and comparing various columns within a dataframe. In addition, graphing is an important tool for data summarization.\n",
      "\n",
      "HTTP data will be used for this exercise. The data was generated from various PCAPs that have been collected, containing both legitimate traffic as well as traffic relating to exploit kits. While no malicious payloads are contained within the log file, there are malicious domains and URLs (it's recommended you don't visit them). While this traffic was generated by running Bro over a series of PCAPs, similar data can be obtained from various Web Proxies, so this is a nice crossover example of what is possible with your own data.\n",
      "\n",
      "Some goals will be to understand when the data was generated, which systems generated it, high-level stats about the traffic, and the types of data transferred within the connections.\n",
      "\n",
      "___"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Exercises\n",
      "\n",
      "### File Input\n",
      "Using what you learned in the last lab, read in the log (csv) file provided for you.\n",
      "\n",
      "#### Hints\n",
      "* The file is in the current directory and is called *http.log*\n",
      "* There is no header to the file\n",
      "* It's *[TAB]* separated\n",
      "* The fields are: 'ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'trans_depth', 'method', 'host', 'uri', 'referrer', 'user_agent', 'request_body_len', 'response_body_len', 'status_code', 'status_msg', 'info_code', 'info_msg', 'filename', 'tags', 'username', 'password', 'proxied', 'orig_fuids', 'orig_mime_types', 'resp_fuids', 'resp_mime_types', 'sample'"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "# fill in the names list using the fields from the hints above\n",
      "http_df = pd.read_csv(\"./http.log\", header=None, sep=\"\\t\", names=[])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Clean-up the timestamp\n",
      "Now that you've got the data imported, clean up the timestamp column *ts*. Don't forget to re-assign back to the *ts* column."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from datetime import datetime"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the next cell the timestamp column is set to be the new index of the dataframe. By default dataframes are indexed by the row number, and by indexing by timestamp it's easier to perform various types of time series analysis. After the assignment a quick **head()** is performed, and in the output you'll see that the *ts* heading has moved \"down and to the left\". The new location of the *ts* heading indicates that it is now the index, and has replaced the default.",
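      "\n",
      "One assumption worth calling out: the date-string slicing used below only works if *ts* was actually converted to datetimes in the cleanup step above. A quick, optional way to double-check before indexing:\n",
      "\n",
      "```python\n",
      "# should show something like datetime64[ns] (or object containing datetimes),\n",
      "# not float64 -- if it's still float64, redo the timestamp cleanup first\n",
      "print http_df['ts'].dtype\n",
      "```"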
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df = http_df.set_index('ts')\n",
      "http_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Selecting Based on Index Values\n",
      "\n",
      "With the default indexing in dataframes you can select elements or slices of elements based on numbers (similarly to how lists work in Python). With a time-indexed dataframe, various parts of dates, or whole dates, can be used to select rows.\n",
      "\n",
      "A year can be used, eg: 2012, a year and month '2012-02', or a year, month and day '2012-02-20'."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df['2012-02-20':'2012-02-23'][http_df.columns.tolist()[2:7]].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df['2012-02-20':'2012-02-23']['id.orig_h'].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(http_df.index)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Time Resampling\n",
      "Time indexed information can be resampled and summarized.\n",
      "\n",
      "Below is a resampling on day *D* that will count up the number of occurrences per day. Try various date selections to get a feel for how the sampling works and how it can be used to summarize data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "temp_df = http_df['2012-02-20':'2012-02-25']\n",
      "%time temp_df.resample(\"D\", how='count').head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Try the resample example from above with some of the other options for resample. Replace the *D* with some other values to get different frequencies.\n",
      "\n",
      "| Alias | Description\n",
      "| ------| -----------\n",
      "| B | business day frequency\n",
      "| C | custom business day frequency (experimental)\n",
      "| D | calendar day frequency\n",
      "| W | weekly frequency\n",
      "| M | month end frequency\n",
      "| BM | business month end frequency\n",
      "| CBM | custom business month end frequency\n",
      "| MS | month start frequency\n",
      "| BMS | business month start frequency\n",
      "| CBMS | custom business month start frequency\n",
      "| Q | quarter end frequency\n",
      "| BQ | business quarter end frequency\n",
      "| QS | quarter start frequency\n",
      "| BQS | business quarter start frequency\n",
      "| A | year end frequency\n",
      "| BA | business year end frequency\n",
      "| AS | year start frequency\n",
      "| BAS | business year start frequency\n",
      "| H | hourly frequency\n",
      "| T | minutely frequency\n",
      "| S | secondly frequency\n",
      "| L | milliseconds\n",
      "| U | microseconds\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df['2012-02-20':'2012-02-25'].resample(\"H\", how='count').head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Graphing Time Series Data\n",
      "After getting a grasp on the various ways to look at time indexed data, it's useful to display it visually. This section follows the same progression as above.\n",
      "\n",
      "The cell below will generate a time series graph of both the request and response bodies, summed up over their timestamps."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "plt.rcParams['figure.figsize'] = (16.0, 5.0)\n",
      "\n",
      "df = http_df[['request_body_len','response_body_len']]\n",
      "df.plot();"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A small copy of the HTTP dataframe **http_df** is stored in **df**; it only contains 2 columns, *[request_body_len, response_body_len]*, which enables the comparison of the request and response body lengths.\n",
      "\n",
      "Below is another way to graph the resampled data. In this case the data is resampled on *month*, and when no **how** parameter is passed to **resample()** it defaults to *mean*.\n",
      "\n",
      "#### Hint\n",
      "If you're going to use anything but *'count'* as the parameter to **how**, you need to make sure it's numeric data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "resamp = df.resample(\"M\")\n",
      "resamp.plot(style='g--')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Graphing Multiple Views (Time Series)\n",
      "It's possible to graph multiple **how** methods as well. This can help identify different patterns in the data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "resamp = df.resample(\"M\", how=['mean', 'count', 'sum'])\n",
      "resamp.plot(subplots=True)\n",
      "resamp.plot()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Above, the response_body_len sum really sticks out in the bottom graph, but viewing the smaller graphs above it's possible to see that it's not because of an increase in the number of requests (it's simply because more information was received, in aggregate, over all the connections).\n",
      "\n",
      "___\n",
      "\n",
      "Below, the *count* column was added for you; try doing the same type of graph as above and incorporate **np.min** and **np.max** into your resampling.\n",
      "\n",
      "#### Hint\n",
      "Do not put single quotes around **np.min** or **np.max**."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df['count'] = 1\n",
      "df['count'] = df['count'].cumsum()\n",
      "# fill in the how list, e.g. with np.min and np.max\n",
      "resamp = df.resample(\"M\", how=[])\n",
      "resamp.plot(subplots=True)\n",
      "resamp.plot()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Dataframe Grouping\n",
      "Another useful way to look at data is how different groups of values look with one another. For this pandas offers the **groupby()** command. It allows you to specify an arbitrary list of columns in your dataframe that are evaluated left-to-right in terms of grouping.\n",
      "\n",
      "The example below shows how *resp_mime_types* (filetype returned by the server) breaks down, and then per-filetype what *user_agents* requested those files. You can see the number of entries per-column per-value in the table."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df.groupby(['resp_mime_types','user_agent']).count()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It's also possible to select rows (as learned in Lab 1) and then do a **groupby()** to look at various sub-views of the data.\n",
      "\n",
      "This looks at all rows associated with those 2 different *user_agent*s, and then shows how many entries are in the data per-user-agent per-filetype."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df[http_df['user_agent'].isin(\n",
      "    ['Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',\n",
      "     'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)'])].groupby(['resp_mime_types','user_agent']).count()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Use the pre-defined filetypes below to come up with a couple of interesting views on the data that involve the **groupby()** function.",
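      "\n",
      "As one possible starting point (run the next cell first so the sets exist; this is just a sketch, not the only interesting view):\n",
      "\n",
      "```python\n",
      "# which user agents fetched executable-looking content?\n",
      "http_df[http_df['resp_mime_types'].isin(list(executable_types))].groupby(['resp_mime_types','user_agent']).count()\n",
      "```"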
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "executable_types = set(['application/x-dosexec', 'application/octet-stream', 'binary', 'application/vnd.ms-cab-compressed'])\n",
      "common_exploit_types = set(['application/x-java-applet','application/pdf','application/zip','application/jar','application/x-shockwave-flash'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Graphing Groupby Data\n",
      "One advantage of the **groupby()** command is being able to graph the output to get another view into the data. One popular way to do this is via a bar graph.\n",
      "\n",
      "In the next cell there are a couple of things going on. First, a column named *count* is created and every row in that column is assigned the value of 1. This creates a column that pandas can sum on, since it has a value of 1 for each row.\n",
      "\n",
      "In the second line, the dataframe is grouped by *resp_mime_types* and then only the *count* column is viewed/returned from the **groupby()** command. The result is then passed to the **sum()** function; this sums the *count* column, which due to the trick above has a value of 1 for each row. Combined, this gets the number of files of each filetype. The result is simply plotted with **plot()**.\n",
      "\n",
      "Any surprising results? What could they possibly indicate?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df['count'] = 1\n",
      "http_df.groupby('resp_mime_types')['count'].sum().plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The technique above can be used to look at the number of samples associated with each IP in the dataset.\n",
      "\n",
      "What kinds of conclusions can you draw based on the graph below?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df.groupby('id.orig_h')['count'].sum().order(ascending=False).plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "That's a lot of IP addresses! Bonus question: How many different source IP addresses are in the data set?\n",
      "\n",
      "Remember above when we said you could access elements in a dataframe that weren't indexed by a timestamp like a regular Python array? Well, it's possible to do the same with dataframes produced by **groupby()**. By slicing around the dataset below, what can you learn about the IP addresses that wasn't possible to see because of the resolution of the graph above?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "http_df.groupby('id.orig_h')['count'].sum().order(ascending=False)[10:20].plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Stacked Groupby Graphs With Bonus Colors\n",
      "Once you learn the basic bar graph, it's time to kick it up a notch by looking at the relationship between two different columns. This can be done with stacked bar charts.\n",
      "\n",
      "In this case the relationship between filetype and HTTP method can be explored. Perhaps one or two HTTP methods are more responsible for specific filetypes than others. Custom colors can be created with the values in the **colors** list, and passed into the graph with **color=colors**. Similar to the examples above, we're using our added *count* column to get the counts. In this example multiple tiers are used for **groupby()**, since the breakdown of methods by filetype is being explored. The **unstack()** function \"expands\" the grouped columns, and **fillna(0)** fills all non-values with zero (since these won't impact the sum).\n",
      "\n",
      "What happens when you remove the custom color labels?\n",
      "\n",
      "Bonus: Create a new cell and take a look at what the dataframe looks like without the **plot()** or other commands stacked on top of one another. This is useful to do to understand the output of each function in the chain."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "colors = [(x/10.0, x/40.0, 0.75) for x in range(len(http_df['method'].unique().tolist()))]\n",
      "http_df.groupby(['resp_mime_types','method'])['count'].sum().unstack('method').fillna(0).plot(\n",
      "    color=colors, kind='bar', stacked=True, grid=False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Final Exercise\n",
      "The final challenge is to use all the techniques from above to create a stacked bar chart that shows, for destination ports 81, 88, and 8080 only, how many of each HTTP method is associated with each port. The **plot()** command is present to show another way to specify different custom colors."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# build your grouped/unstacked dataframe here, then plot it\n",
      "http_df.plot(colormap='GnBu', kind='bar', stacked=True, grid=False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}

--------------------------------------------------------------------------------
/Lab_3/Lab_3.ipynb:
--------------------------------------------------------------------------------
{
 "metadata": {
  "name": "",
  "signature": "sha256:bc38999154088d6217bc42f5480ddf9348305b1d450ed62b2967fdf95dd066d7"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Lab 3\n",
      "\n",
      "## Introduction\n",
      "Since the previous labs have provided a (hopefully) good foundation in the various tools that will be used, this lab will explore some of the statistics functions available for analysis. Overall this should be a gentle introduction (or reminder) about basic statistical analysis. This lab picks up the dataset used in Lab 1; building on your knowledge of dataframes, it gives you the opportunity to explore what types of functions they export for data analysis.\n",
      "\n",
      "Some goals will be to quickly summarize data, know how to get at specific values/features of data, understand how the data looks (statistically), and understand the layout of the data.\n",
      "\n",
      "### Useful Terminology\n",
      "**Mean (mu)** - The average; the sum of the numbers divided by the number of numbers.\n",
      "\n",
      "**Mode** - The number that occurs most frequently.\n",
      "\n",
      "**Median** - The middle number when the numbers are sorted, or, with an even number of numbers, the average of the two middle numbers.\n",
      "\n",
      "**Standard Deviation (sigma)** - The dispersion from the mean. The larger the standard deviation, the more spread out the numbers are.\n",
      "\n",
      "**Variance** - How spread out the numbers are. A variance of zero means all numbers are the same. Similar to standard deviation (it's the standard deviation squared).\n",
      "___"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Exercises\n",
      "\n",
      "### File Input\n",
      "Using what you learned in the last lab, read in the log (csv) file provided for you.\n",
      "\n",
      "#### Hints\n",
      "* The file is in the *../Lab_1/* directory and is called *conn_sample.log*\n",
      "* There is no header to the file\n",
      "* It's *[TAB]* separated\n",
      "* The fields are: 'ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'missed_bytes', 'history', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents', 'threat', 'sample'"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "\n",
      "df = pd.read_csv('../Lab_1/conn_sample.log', sep=\"\\t\", header=None, names=['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','proto','service','duration','orig_bytes','resp_bytes','conn_state','local_orig','missed_bytes','history','orig_pkts','orig_ip_bytes','resp_pkts','resp_ip_bytes','tunnel_parents','threat','sample'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "After running the above cell, run this one to verify that you've got data in the dataframe, and that it looks \"correct enough\"."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once again, time for data cleanup!\n",
      "\n",
      "The cell below will, if you remember, fill all NaN-valued cells with 0. The assumption here is that if Bro didn't fill in a value it's safe to set that value to zero. After that, let's see what pandas determined the columns to be.",
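      "\n",
      "If you'd like to see how much you're about to change before filling, one quick (optional) check is to count the empty cells per column:\n",
      "\n",
      "```python\n",
      "# number of NaN values in each column, before they get replaced with 0\n",
      "print df.isnull().sum()\n",
      "```"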
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = df.fillna(0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.dtypes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### More Data Cleanup\n",
      "\n",
      "In data produced by Bro, it will often put a *-* if it can't determine a value or one wasn't seen. It's likely you saw quite a few of these in the **head()** command above. These have a couple of different effects on the data: they can cause pandas to not recognize the column as being purely numeric, and because of that it won't compute data statistics for us.\n",
      "\n",
      "\n",
      "#### Value substitution\n",
      "\n",
      "The following columns need to be cleaned up this way: *orig_bytes*, *duration*, and *resp_bytes*.\n",
      "\n",
      "It's important to understand the changes that you make to the underlying data by substituting values. First let's take a look at some of the differences, then make the changes to the rest of the columns.\n",
      "\n",
      "First, make all Bro unknowns *-* into numpy unknowns *np.nan*, and see what that does."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "df['orig_bytes'].apply(lambda x: np.nan if x == '-' else x).astype(np.float64).describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then a similar change to make all Bro unknowns *-* into zeros, and check the output. \n",
      "\n",
      "What are some of the differences?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df['orig_bytes'].apply(lambda x: 0 if x == '-' else x).astype(np.float64).describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Pick one method, come up with your justification, and do the assignment for all the columns listed above. This can be done using a lambda function inside the **apply()** function. A lambda (function) is an anonymous function, or one that is not bound to a specific name.\n",
      "\n",
      "A partial sample has been provided."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# partial sample (using np.nan; swap in 0 if that was your choice) --\n",
      "# repeat for the remaining columns: 'duration' and 'resp_bytes'\n",
      "df['orig_bytes'] = df['orig_bytes'].apply(lambda x: np.nan if x == '-' else x).astype(np.float64)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Remove un-needed columns\n",
      "\n",
      "The columns *ts*, *uid*, *proto*, *service*, *conn_state*, *local_orig*, *history*, *tunnel_parents*, *threat*, and *sample* aren't needed for this lab. Let's get rid of them.",
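      "\n",
      "If you get stuck, one possible call (just filling the list with the columns named above) looks like:\n",
      "\n",
      "```python\n",
      "# drop every column this lab won't use; inplace=True modifies df directly\n",
      "df.drop(['ts','uid','proto','service','conn_state','local_orig','history','tunnel_parents','threat','sample'], axis=1, inplace=True)\n",
      "```"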
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# fill in the list of columns to drop\n",
      "df.drop([], axis=1, inplace=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Change column types\n",
      "\n",
      "Once again, in order to do analysis, columns should be set to the correct data type.\n",
      "\n",
      "Use the information/examples above to set orig_pkts, resp_pkts, and missed_bytes to *float64*, and id.orig_p and id.resp_p to *object*."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Last but not least, the value of (empty) can creep into the data as well (it was seen in orig_ip_bytes and resp_ip_bytes); substitute these out for your chosen value above. Also, because later on we'll do some division, set all zeros to **np.nan** in the *resp_ip_bytes* column.\n",
      "\n",
      "Make sure to verify that all your changes have held! \n",
      "\n",
      "*Hint: use the dtypes property*"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Feature Engineering for More Stats!\n",
      "\n",
      "A simple exercise on how you can combine (now 2 numeric) features to create yet another numeric feature that can give you more insight into the data.\n",
      "\n",
      "You can perform mathematical operations on columns and assign the result to a new column. Run the cell below to see how it's done, and check out some of the initial values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df['out_in_ratio'] = df['orig_ip_bytes']/df['resp_ip_bytes']\n",
      "df['out_in_ratio'].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Statistical Summarization\n",
      "\n",
      "As we've explored before, the **describe()** function can be used to summarize all the numerical columns in a dataframe. What happens when this is run on our modified (and nicely cleaned up) dataframe?\n",
      "\n",
      "Did the newly created column appear as well? Anything interesting about it?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There are specific functions that allow for computing the various values one at a time. \n",
      "\n",
      "Try computing the Standard Deviation of *orig_ip_bytes* using the **std()** function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.orig_ip_bytes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Try computing the Variance of *orig_ip_bytes* using the **var()** function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.orig_ip_bytes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What were the results? Why do you think that's the case?\n",
      "\n",
      "*Hint: Wikipedia has a great example of how to compute the Standard Deviation, and remember the variance is just the standard deviation squared*\n",
      "\n",
      "Below, the scipy stats module can be used to compute the mode. Any surprises with the result?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from scipy.stats.mstats import mode\n",
      "f = lambda x: mode(x, axis=None)[0]\n",
      "# [value, count] returned by mode()\n",
      "mode(df.orig_ip_bytes)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Box Plots\n",
      "\n",
      "Also known as box-and-whisker plots, these can be used to get a good feel for how distributed the data looks. By default the box will cover the upper and lower quartiles (eg. the 25th - 75th percentile), and a red line will be at the 50th percentile. Whiskers (lines) will extend out to show the rest of the data, with (occasionally) filler points to show outliers.\n",
      "\n",
      "Here's how to create a simple boxplot to summarize the column that was added above."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "df.boxplot(column='out_in_ratio')\n",
      "plt.ylabel('Out-In Byte Ratio')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Slicing data\n",
      "\n",
      "It's possible to run these functions on slices or sub-selections of the data.\n",
      "\n",
      "Below, what happens when you run the **describe()** function on the set of numbers in *orig_ip_bytes* that are less than 200?\n",
      "\n",
      "What about if you pass the option **percentiles=[.3,.5,.7]** to the **describe()** function?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df[df.orig_ip_bytes < 200]['orig_ip_bytes'].describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Box plots on slices\n",
      "\n",
      "Run the **boxplot()** function on the *orig_ip_bytes* column, after selecting all of the values from *orig_ip_bytes* that are less than 200 (like above). \n",
      "\n",
      "How does the plot look different from the one above?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Data Distribution\n",
      "\n",
      "It's useful to know the layout of your data for several reasons. One of them is that some algorithms make underlying assumptions about having continuous values, or about the data following a Gaussian (normal) distribution. Using some of the techniques and functions from above you can begin to see how the data might be laid out. However, it's useful to compare it to known data sets (that follow a specific distribution).\n",
      "\n",
      "The following examples are based on some really nice code that The Glowing Python put together, which we've hacked up to suit our specific use case.\n",
      "\n",
      "In the first example, the numbers in *orig_ip_bytes* are scaled (recentered around the mean) with scikit-learn (more on this later), so they end up with a mean of 0 and unit variance. This is a common cleaning step for Machine Learning algorithms. The scaled numbers are then compared to a randomly generated list of numbers that has the same number of elements, and is bounded by the same min and max. The list of generated values is computed with the fitted standard deviation and mean, as well as with the defaults (to show how close the generated list is to \"ideal\"). Both samples are compared against the numbers in the scaled version of *orig_ip_bytes*.\n",
      "\n",
      "What happens to the graph when you remove **scale()** from around the **df.orig_ip_bytes.tolist()** section?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# original code from: http://glowingpython.blogspot.com/2012/07/distribution-fitting-with-scipy.html\n",
      "from scipy.stats import norm\n",
      "from numpy import linspace\n",
      "from pylab import plot,show,hist,figure,title\n",
      "from sklearn.preprocessing import scale\n",
      "\n",
      "samp = scale(df.orig_ip_bytes.tolist())\n",
      "\n",
      "param = norm.fit(samp) # distribution fitting\n",
      "\n",
      "# now, param[0] and param[1] are the mean and \n",
      "# the standard deviation of the fitted distribution\n",
      "x = linspace(min(samp),max(samp),len(samp))\n",
      "# fitted distribution\n",
      "pdf_fitted = norm.pdf(x,loc=param[0],scale=param[1])\n",
      "# original distribution\n",
      "pdf = norm.pdf(x)\n",
      "\n",
      "title('Normal distribution vs. Bytes')\n",
      "plot(x,pdf_fitted,'r-')\n",
      "plot(x,pdf,'b-')\n",
      "hist(samp,normed=1,alpha=.3)\n",
      "show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Same as above, except only looking at the first 100 entries in the list (to get a prettier graph).\n",
      "\n",
      "Do you get a better insight into what happens when you remove **scale()**? What happens?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "samp = scale(df.orig_ip_bytes.tolist()[:100])\n",
      "\n",
      "param = norm.fit(samp) # distribution fitting\n",
      "\n",
      "# now, param[0] and param[1] are the mean and \n",
      "# the standard deviation of the fitted distribution\n",
      "x = linspace(min(samp),max(samp),len(samp))\n",
      "# fitted distribution\n",
      "pdf_fitted = norm.pdf(x,loc=param[0],scale=param[1])\n",
      "# original distribution\n",
      "pdf = norm.pdf(x)\n",
      "\n",
      "title('Normal distribution vs. Bytes')\n",
      "plot(x,pdf_fitted,'r-')\n",
      "plot(x,pdf,'b-')\n",
      "hist(samp,normed=1,alpha=.3)\n",
      "show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### More Data\n",
      "\n",
      "Run through the exercises above (up until the Box Plot section) on the full list of numbers in the *orig_ip_bytes* column. \n",
      "\n",
      "What are some of the differences? How good was the random sample that was taken in the first lab?\n",
      "\n",
      "*Hint: the file you want to read in is \"./orig_ip_bytes.log\". This file only contains one column, so take that into account when reading the file in. Also, there's no need to add the out_in_ratio column, since the other columns aren't present.*\n",
      "\n",
      "However, before you begin there's one last thing that will be useful to know. IPython supports magic functions. You can get a list of them by creating a cell and executing **%lsmagic** in it. \n",
      "\n",
      "First create a new cell and run *%reset out* in it (don't forget to hit 'y'). This will clear all the output, and free up a bit of memory for this next set.\n",
      "\n",
      "Since this is a bigger dataset, don't worry when some of the steps require waiting for a couple of minutes."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%lsmagic"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}

--------------------------------------------------------------------------------
/Lab_2/Lab_2.1.ipynb:
--------------------------------------------------------------------------------
{
 "metadata": {
  "name": "",
  "signature": "sha256:f8268f3bf79e9e08ce73a92ccbe58e0dff61ad19868cfb5f2de67a03b38cb1b8"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Lab 2.1\n",
      "\n",
      "## Introduction\n",
Snort is an intrusion detection system that looks for patterns in network traffic, and can be run alongside the other honeypots. Amun is a low-interaction honeypot that listens on several ports and records connections to those ports. Glastopf is another low-interaction honeypot that runs a web server and records client requests.\n", 19 | "\n", 20 | "Timeseries graphs and other exploration techniques will be used to understand the types and frequency of scans/attacks against the honeypot infrastructure.\n", 21 | "___" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Exercises\n", 29 | "\n", 30 | "### File Input\n", 31 | "Instead of parsing a CSV file, the JSON output from *mongoexport* will be used." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "collapsed": false, 37 | "input": [ 38 | "import pandas as pd\n", 39 | "import json" 40 | ], 41 | "language": "python", 42 | "metadata": {}, 43 | "outputs": [] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "Execute the following cell to read in one JSON entry from *mongoexport*." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "collapsed": false, 55 | "input": [ 56 | "df = pd.read_json(\"./1.json\")" 57 | ], 58 | "language": "python", 59 | "metadata": {}, 60 | "outputs": [] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "What does the data look like? Is it in a usable format?" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "collapsed": false, 72 | "input": [ 73 | "df" 74 | ], 75 | "language": "python", 76 | "metadata": {}, 77 | "outputs": [] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "Since the data doesn't quite look ready to use, IPython's feature of loading an external Python script can be used to fire up some parsing code. In the following cell use the **%load** magic function; the file you'll want to load is *readhoneydata.py*." 
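To see why a plain *pd.read_json()* struggles with this data, below is a minimal sketch of what a *mongoexport*-style line looks like and how it unpacks with *json.loads()*. The record here is fabricated for illustration; the field names mirror the parsing script that follows, but the values (oid, ident, IP, port) are made up.

```python
import json

# A made-up, minimal mongoexport-style line; the real honeypot.json records
# carry more fields (see the parsing cell below for the full set).
line = ('{"_id": {"$oid": "54266afce1382329a8090d8d"}, "ident": "hp1",'
        ' "normalized": true, "timestamp": {"$date": "2014-09-27T04:23:59Z"},'
        ' "channel": "amun.events",'
        ' "payload": "{\\"attackerIP\\": \\"203.0.113.7\\", \\"attackerPort\\": 445}"}')

j = json.loads(line)
print(j["channel"])                 # amun.events
payload = json.loads(j["payload"])  # the payload is itself a JSON string
print(payload["attackerIP"])        # 203.0.113.7
```

Note the double parse: the outer record is JSON, and its *payload* field is a JSON string that has to be decoded a second time, which is exactly what the parsing script does.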
84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "collapsed": false, 89 | "input": [ 90 | "%load readhoneydata.py" 91 | ], 92 | "language": "python", 93 | "metadata": {}, 94 | "outputs": [] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "collapsed": false, 99 | "input": [ 100 | "f = open('./honeypot.json', 'r')\n", 101 | "count = 0\n", 102 | "glastopf = []\n", 103 | "amun = []\n", 104 | "snort = []\n", 105 | "for line in f:\n", 106 | " j = json.loads(line)\n", 107 | " temp = []\n", 108 | " temp.append(j[\"_id\"][\"$oid\"])\n", 109 | " temp.append(j[\"ident\"])\n", 110 | " temp.append(j[\"normalized\"])\n", 111 | " temp.append(j[\"timestamp\"][\"$date\"])\n", 112 | " temp.append(j[\"channel\"])\n", 113 | " payload = json.loads(j[\"payload\"])\n", 114 | " if j[\"channel\"] == \"glastopf.events\":\n", 115 | " temp.append(payload[\"pattern\"])\n", 116 | " temp.append(payload[\"filename\"])\n", 117 | " temp.append(payload[\"request_raw\"])\n", 118 | " temp.append(payload[\"request_url\"])\n", 119 | " temp.append(payload[\"source\"][0])\n", 120 | " temp.append(payload[\"source\"][1])\n", 121 | " glastopf.append(temp)\n", 122 | " elif j[\"channel\"] == \"amun.events\":\n", 123 | " temp.append(payload[\"attackerIP\"])\n", 124 | " temp.append(payload[\"attackerPort\"])\n", 125 | " temp.append(payload[\"victimIP\"])\n", 126 | " temp.append(payload[\"victimPort\"])\n", 127 | " temp.append(payload[\"connectionType\"])\n", 128 | " amun.append(temp)\n", 129 | " elif j[\"channel\"] == \"snort.alerts\":\n", 130 | " temp.append(payload[\"source_ip\"])\n", 131 | " if \"source_port\" in payload:\n", 132 | " temp.append(payload[\"source_port\"])\n", 133 | " else:\n", 134 | " temp.append(\"0\")\n", 135 | " temp.append(payload[\"destination_ip\"])\n", 136 | " if \"destination_port\" in payload:\n", 137 | " temp.append(payload[\"destination_port\"])\n", 138 | " else:\n", 139 | " temp.append(\"0\")\n", 140 | " temp.append(payload[\"signature\"])\n", 141 | " temp.append(payload[\"classification\"])\n", 142 | " temp.append(payload[\"proto\"])\n", 143 | " snort.append(temp)\n", 144 | " else:\n", 145 | " print j\n", 146 | "f.close()\n" 147 | ], 148 | "language": "python", 149 | "metadata": {}, 150 | "outputs": [] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "Quickly build the dataframes from the lists of lists." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "collapsed": false, 162 | "input": [ 163 | "amun_df = pd.DataFrame(amun, columns=['id','ident','normalized','timestamp','channel','attackerIP','attackerPort','victimIP','victimPort','connectionType'])\n", 164 | "glastopf_df = pd.DataFrame(glastopf, columns=['id','ident','normalized','timestamp','channel','pattern','filename','request_raw','request_url','attackerIP','attackerPort'])\n", 165 | "snort_df = pd.DataFrame(snort, columns=['id','ident','normalized','timestamp','channel','attackerIP','attackerPort','victimIP','victimPort','signature','classification','proto'])" 166 | ], 167 | "language": "python", 168 | "metadata": {}, 169 | "outputs": [] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "Check out the dataframes (amun_df, glastopf_df, and snort_df) and get a quick feel to see the types of data in them.\n", 176 | "\n", 177 | "**Hint**: Try running the **head()** and **dtypes** functions on the dataframes." 
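If **head()** and **dtypes** are new to you, here's a tiny self-contained sketch on a toy frame (all values fabricated) showing what to expect before you try them on the real dataframes:

```python
import pandas as pd

# Toy frame standing in for amun_df (fabricated values).
toy = pd.DataFrame({'attackerIP': ['203.0.113.7', '198.51.100.2'],
                    'victimPort': [445, 1433],
                    'timestamp': ['2014-09-27T04:23:59', '2014-09-27T05:01:12']})
print(toy.head())    # the first rows; head(2) and friends limit the count
print(toy.dtypes)    # per-column types; note timestamp is still a plain 'object'
```

The *timestamp* column showing up as a plain object type is the tell that motivates the cleanup step coming next.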
178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "collapsed": false, 183 | "input": [], 184 | "language": "python", 185 | "metadata": {}, 186 | "outputs": [] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "### Data Cleanup\n", 193 | "**SPOILER ALERT**\n", 194 | "\n", 195 | "Since the timestamp column isn't a datetime data type, we need to fix that. Below is an example that shows what had to be done to the amun dataframe; add in the glastopf and snort ones as well." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "collapsed": false, 201 | "input": [ 202 | "amun_df['timestamp'] = amun_df['timestamp'].apply(lambda x: str(x).replace('T', 'T '))" 203 | ], 204 | "language": "python", 205 | "metadata": {}, 206 | "outputs": [] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "collapsed": false, 211 | "input": [ 212 | "amun_df['timestamp'] = pd.to_datetime(amun_df['timestamp'])" 213 | ], 214 | "language": "python", 215 | "metadata": {}, 216 | "outputs": [] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "Don't forget to double check that the timestamp column is now a datetime object type.\n", 223 | "\n", 224 | "### Data Augmentation\n", 225 | "1. Add a column to *glastopf_df* called *victimPort* and assign it the value **80**.\n", 226 | "2. Add country name to all three dataframes by using the GeoIP module; one example has already been provided.\n", 227 | "\n", 228 | "This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com." 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "collapsed": false, 234 | "input": [ 235 | "import GeoIP\n", 236 | "\n", 237 | "gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)\n", 238 | "\n", 239 | "amun_df['attackerCountry'] = amun_df['attackerIP'].apply(lambda x: gi.country_name_by_addr(x))" 240 | ], 241 | "language": "python", 242 | "metadata": {}, 243 | "outputs": [] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "Create a new dataframe that has some common information from the other three dataframes." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "collapsed": false, 255 | "input": [ 256 | "cols = ['channel','timestamp','attackerIP','victimPort','attackerCountry']\n", 257 | "attacker_df = pd.DataFrame()\n", 258 | "attacker_df = attacker_df.append(snort_df[cols], ignore_index=True)\n", 259 | "attacker_df = attacker_df.append(amun_df[cols], ignore_index=True)\n", 260 | "attacker_df = attacker_df.append(glastopf_df[cols], ignore_index=True)" 261 | ], 262 | "language": "python", 263 | "metadata": {}, 264 | "outputs": [] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "### Reindex\n", 271 | "Using what you learned in the first part of the lab, set the index for the *attacker_df* to the *timestamp* column." 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "collapsed": false, 277 | "input": [], 278 | "language": "python", 279 | "metadata": {}, 280 | "outputs": [] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "### Basic Exploration\n", 287 | "What are the top 10 most active IPs in the *attacker_df*? What honeypot type picked up this attacker, and what port(s) was this attacker especially fond of?\n", 288 | "\n", 289 | "**Hint** In case you forgot, the honeypot type is stored in the *channel* column, and the port(s) are stored in the *victimPort* column." 
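If you're stuck on where to start, here's a minimal sketch of the **value_counts()** pattern on a fabricated stand-in for *attacker_df* (IPs, channels, and ports below are all made up):

```python
import pandas as pd

# Fabricated stand-in for attacker_df.
toy = pd.DataFrame({'attackerIP': ['a', 'a', 'a', 'b', 'c'],
                    'channel': ['amun.events', 'amun.events', 'snort.alerts',
                                'glastopf.events', 'amun.events'],
                    'victimPort': [445, 445, 139, 80, 1433]})

print(toy['attackerIP'].value_counts()[:10])   # most active IPs, descending
top = toy['attackerIP'].value_counts().index[0]
print(toy[toy['attackerIP'] == top]['channel'].value_counts())     # honeypot(s) that saw it
print(toy[toy['attackerIP'] == top]['victimPort'].value_counts())  # its favorite ports
```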
290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "collapsed": false, 295 | "input": [], 296 | "language": "python", 297 | "metadata": {}, 298 | "outputs": [] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "collapsed": false, 303 | "input": [], 304 | "language": "python", 305 | "metadata": {}, 306 | "outputs": [] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "This is one way that values can be pulled out from other column values. In this instance a new column called *user-agent* is created from the header captured in the Glastopf honeypot." 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "collapsed": false, 318 | "input": [ 319 | "import re\n", 320 | "\n", 321 | "regex = re.compile('.*[Uu][Ss][Ee][Rr]-[Aa][Gg][Ee][Nn][Tt]:(.*?)(?:\\\\r|$)')\n", 322 | "glastopf_df['user-agent'] = glastopf_df['request_raw'].apply(lambda x: re.search(regex, x).group(1) if re.search(regex, x) else None)" 323 | ], 324 | "language": "python", 325 | "metadata": {}, 326 | "outputs": [] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "What are some of the more popular user-agent strings? Did you find any interesting patterns?" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "collapsed": false, 338 | "input": [ 339 | "glastopf_df['user-agent'].value_counts()" 340 | ], 341 | "language": "python", 342 | "metadata": {}, 343 | "outputs": [] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "When you've found some patterns worth exploring, what else can you come up with?\n", 350 | "\n", 351 | "You can use the **str.contains()** function to see what rows contain a specific substring. One example has been provided; the query in the cell below is an easy way to find all entries that may contain a shellshock exploit attempt." 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "collapsed": false, 357 | "input": [ 358 | "glastopf_df[glastopf_df['request_raw'].str.contains('{ :;}')]['request_raw'].value_counts()" 359 | ], 360 | "language": "python", 361 | "metadata": {}, 362 | "outputs": [] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "collapsed": false, 367 | "input": [], 368 | "language": "python", 369 | "metadata": {}, 370 | "outputs": [] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "collapsed": false, 375 | "input": [], 376 | "language": "python", 377 | "metadata": {}, 378 | "outputs": [] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "collapsed": false, 383 | "input": [], 384 | "language": "python", 385 | "metadata": {}, 386 | "outputs": [] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "collapsed": false, 391 | "input": [], 392 | "language": "python", 393 | "metadata": {}, 394 | "outputs": [] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### Timeseries Graphs (again)\n", 401 | "\n", 402 | "1. Check out some of the timeseries graphs below and see if you can find any interesting patterns in the graphs/data.\n", 403 | "2. Re-run the graphs and see what happens when you filter out the top talker from above.\n", 404 | "\n", 405 | "**Hint** remove the top talker from above by using the following code: *attacker_df = attacker_df[attacker_df['attackerIP'] != '']*." 
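The hint above is plain boolean filtering. Here's a minimal self-contained sketch of the same pattern on fabricated data, so you can see the shape of the filter before applying it (fill in the real top-talker IP yourself):

```python
import pandas as pd

# Fabricated events; 'a' is the top talker here.
toy = pd.DataFrame({'attackerIP': ['a', 'b', 'a', 'c']})
top = toy['attackerIP'].value_counts().index[0]
filtered = toy[toy['attackerIP'] != top]   # same shape of filter as the hint
print(filtered)
```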
406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "collapsed": false, 411 | "input": [ 412 | "import matplotlib.pyplot as plt\n", 413 | "\n", 414 | "plt.plot(attacker_df['attackerIP'].resample(\"D\", how='count'), label=\"Total Events\")\n", 415 | "plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) \n", 416 | "plt.show()" 417 | ], 418 | "language": "python", 419 | "metadata": {}, 420 | "outputs": [] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "collapsed": false, 425 | "input": [ 426 | "for port in attacker_df['victimPort'].value_counts().index:\n", 427 | "    if port < 10000:\n", 428 | "        plt.plot(attacker_df[attacker_df['victimPort'] == port]['victimPort'].resample(\"D\", how='count'), label=port)\n", 429 | "plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) \n", 430 | "plt.show()" 431 | ], 432 | "language": "python", 433 | "metadata": {}, 434 | "outputs": [] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "collapsed": false, 439 | "input": [ 440 | "tempdf = attacker_df[attacker_df['channel'] != 'amun.events']\n", 441 | "for port in tempdf['victimPort'].value_counts().index:\n", 442 | "    plt.plot(tempdf[tempdf['victimPort'] == port]['victimPort'].resample(\"D\", how='count'), label=port)\n", 443 | "plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) \n", 444 | "plt.show()" 445 | ], 446 | "language": "python", 447 | "metadata": {}, 448 | "outputs": [] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "collapsed": false, 453 | "input": [ 454 | "for channel in attacker_df['channel'].value_counts().index:\n", 455 | "    plt.plot(attacker_df[attacker_df['channel'] == channel]['channel'].resample(\"D\", how='count'), label=channel)\n", 456 | "plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) \n", 457 | "plt.show()" 458 | ], 459 | "language": "python", 460 | "metadata": {}, 461 | "outputs": [] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "collapsed": false, 466 | "input": [ 467 | "for channel in attacker_df['channel'].value_counts().index:\n", 468 | "    if channel != \"amun.events\":\n", 469 | "        plt.plot(attacker_df[attacker_df['channel'] == channel]['channel'].resample(\"D\", how='count'), label=channel)\n", 470 | "plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) \n", 471 | "plt.show()" 472 | ], 473 | "language": "python", 474 | "metadata": {}, 475 | "outputs": [] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "### Further Exploration\n", 482 | "Not only is it possible to look at the top 20 countries hitting the honeypots, but other queries can also be combined with the GeoIP info to get a different view of how the information is laid out." 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "collapsed": false, 488 | "input": [ 489 | "attacker_df['attackerCountry'].value_counts()[:20]" 490 | ], 491 | "language": "python", 492 | "metadata": {}, 493 | "outputs": [] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "Below is a snapshot of all the countries that hit the honeypots with shellshock requests. What other types of queries can you think of?" 
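Since the graphs above lean heavily on **resample()**, here's a tiny self-contained sketch of the daily-count idea on fabricated, datetime-indexed data. Note it uses the pandas-0.x *how='count'* signature that this lab uses throughout; newer pandas spells it *...resample('D').count()* instead.

```python
import pandas as pd

# Fabricated, datetime-indexed events standing in for attacker_df after the reindex.
idx = pd.to_datetime(['2014-09-27 04:00', '2014-09-27 09:30', '2014-09-28 11:00'])
toy = pd.DataFrame({'victimPort': [445, 445, 80]}, index=idx)

# Daily counts for one port (old-pandas resample signature, as in the lab).
daily = toy[toy['victimPort'] == 445]['victimPort'].resample('D', how='count')
print(daily)   # 2014-09-27 -> 2
```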
500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "collapsed": false, 505 | "input": [ 506 | "glastopf_df[glastopf_df['request_raw'].str.contains('};')]['attackerCountry'].value_counts()" 507 | ], 508 | "language": "python", 509 | "metadata": {}, 510 | "outputs": [] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "### You can learn a lot from a URL\n", 517 | "One of the things you can learn from a URL is the types of vulnerabilities people are scanning for.\n", 518 | "\n", 519 | "1. What types of vulnerabilities are scanners looking for?\n", 520 | "2. How many requests for phpMyAdmin were there, and who's making them?" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "collapsed": false, 526 | "input": [ 527 | "glastopf_df['request_url'].value_counts()" 528 | ], 529 | "language": "python", 530 | "metadata": {}, 531 | "outputs": [] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "collapsed": false, 536 | "input": [ 537 | "len(glastopf_df[glastopf_df['request_raw'].str.contains('phpMyAdmin')]['channel'].tolist())" 538 | ], 539 | "language": "python", 540 | "metadata": {}, 541 | "outputs": [] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "collapsed": false, 546 | "input": [ 547 | "glastopf_df[glastopf_df['request_raw'].str.contains('phpMyAdmin')]['attackerIP'].unique()" 548 | ], 549 | "language": "python", 550 | "metadata": {}, 551 | "outputs": [] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "collapsed": false, 556 | "input": [ 557 | "for ip in glastopf_df[glastopf_df['request_raw'].str.contains('phpMyAdmin')]['attackerIP'].unique().tolist():\n", 558 | "    print \"%s - %s\" %(ip, glastopf_df[glastopf_df['attackerIP'] == ip]['attackerCountry'].unique())" 559 | ], 560 | "language": "python", 561 | "metadata": {}, 562 | "outputs": [] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "### Correlation over time\n", 569 | "This is a technique to determine if multiple countries are active across all the honeypots at/around the same time.\n", 570 | "\n", 571 | "Since we're just interested in *attackerCountry*, a dataframe that contains just that data will be created for clarity." 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "collapsed": false, 577 | "input": [ 578 | "plt.rcParams['figure.figsize'] = (10, 10)\n", 579 | "subset = attacker_df[['attackerCountry']]\n", 580 | "subset['count'] = 1\n", 581 | "pivot = pd.pivot_table(subset, values='count', index=subset.index, columns=['attackerCountry'], fill_value=0)" 582 | ], 583 | "language": "python", 584 | "metadata": {}, 585 | "outputs": [] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": {}, 590 | "source": [ 591 | "Let's take a look at the 20 most active countries. Make sure to change the cell below to reflect this." 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "collapsed": false, 597 | "input": [ 598 | "topN = subset['attackerCountry'].value_counts()[:20].index" 599 | ], 600 | "language": "python", 601 | "metadata": {}, 602 | "outputs": [] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "The cell below illustrates how the correlation matrix can be graphed. The X and Y axes are sorted for ease of understanding/viewing of the data." 
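Before running the real thing, here's a tiny self-contained sketch of what **corr()** produces (all counts fabricated): columns whose daily counts rise and fall together score close to 1, while unrelated ones hover near 0 or go negative.

```python
import pandas as pd

# Tiny fabricated pivot: rows are days, columns are countries, values are counts.
counts = pd.DataFrame({'A': [10, 12, 30, 9],
                       'B': [11, 13, 28, 10],   # rises and falls with A
                       'C': [5, 40, 2, 33]})    # moves independently
print(counts.corr())   # pairwise Pearson correlations, each in [-1, 1]
```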
609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "collapsed": false, 614 | "input": [ 615 | "grouped = pivot.groupby([(lambda x: x.month), (lambda x: x.day)]).sum()\n", 616 | "corr_df = grouped[topN].corr()\n", 617 | "\n", 618 | "import statsmodels.api as sm\n", 619 | "\n", 620 | "corr_df.sort(axis=0, inplace=True)\n", 621 | "corr_df.sort(axis=1, inplace=True)\n", 622 | "corr_matrix = corr_df.as_matrix()\n", 623 | "sm.graphics.plot_corr(corr_matrix, ynames=corr_df.index.tolist(), xnames=corr_df.columns.tolist(), cmap='binary')\n", 624 | "plt.show()" 625 | ], 626 | "language": "python", 627 | "metadata": {}, 628 | "outputs": [] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": {}, 633 | "source": [ 634 | "Next up, dig into the correlations a bit more. We can look at the behavior across honeypots over time by using the grouped information, and plotting it.\n", 635 | "\n", 636 | "Pick 4 countries from above that appear to be highly correlated (the squares where the countries meet are closer to black) and graph them." 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "collapsed": false, 642 | "input": [ 643 | "print grouped[['France','Germany','Netherlands']].corr()\n", 644 | "grouped[['France','Germany','Netherlands']].plot()\n", 645 | "pylab.ylabel('Probes')\n", 646 | "pylab.xlabel('Date Scanned')" 647 | ], 648 | "language": "python", 649 | "metadata": {}, 650 | "outputs": [] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "Pick some of the ones that appear not to be highly correlated; what does their graph look like?" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "collapsed": false, 662 | "input": [ 663 | "print grouped[['France','Poland','Taiwan']].corr()\n", 664 | "grouped[['France','Poland','Taiwan']].plot()\n", 665 | "pylab.ylabel('Probes')\n", 666 | "pylab.xlabel('Date Scanned')" 667 | ], 668 | "language": "python", 669 | "metadata": {}, 670 | "outputs": [] 671 | } 672 | ], 673 | "metadata": {} 674 | } 675 | ] 676 | } -------------------------------------------------------------------------------- /Lab_4/Lab_4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:477fc219d113063a9afe3a0e9a83e8d6972877145d084807f217be7d385fe5da" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# Lab 4\n", 16 | "\n", 17 | "## Introduction\n", 18 | "\n", 19 | "### PCA/Clustering\n", 20 | "This marks the first unsupervised learning lab. There are several aspects to unsupervised learning:\n", 21 | "\n", 22 | "* Data has no labels\n", 23 | "* The goal is to find structure\n", 24 | "* The most \"popular\" aspect is clustering\n", 25 | "* It also includes dimensionality reduction and feature extraction\n", 26 | "\n", 27 | "This lab will focus on dimensionality reduction via PCA (Principal Component Analysis), as well as an introduction to K-Means clustering.\n", 28 | "\n", 29 | "### Lab\n", 30 | "In this lab you, as an analyst, have a list of domains and the related blacklists they appear on. In addition, some of these domains were responsible for sending a file to the client. These files have been run through VirusTotal and the AV results are also available with the domains. 
The goal is to explore the data, find some structure, and attempt to find a way to gain confidence in which domains are more malicious as a means of prioritization. As with any type of data exploration, it's not a silver bullet, but perhaps you'll gain an understanding of the data.\n", 31 | "___\n", 32 | "## Exercises\n", 33 | "### File Input - Blacklist Data\n", 34 | "The data for the lab is contained in *host_detections.csv* and has columns: *host*, *detections*, and *detection_count*." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "collapsed": false, 40 | "input": [ 41 | "import pandas as pd\n" 42 | ], 43 | "language": "python", 44 | "metadata": {}, 45 | "outputs": [] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### Cleanup - Blacklist Data\n", 52 | "Drop the duplicates in the *df* dataframe for the column *host*." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "collapsed": false, 58 | "input": [ 59 | "df.drop_duplicates(subset='host', inplace=True)" 60 | ], 61 | "language": "python", 62 | "metadata": {}, 63 | "outputs": [] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "This next section cleans up the detections column. It removes the text formatting, puts the information into a Python list, and places the Python list back into the dataframe in place of the text. It also creates a multi-dimensional list that represents the various blacklists and whether there was a hit for the domain (*1*) or not (*0*). " 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "collapsed": false, 75 | "input": [ 76 | "black_list_sources = set()\n", 77 | "def get_list(x):\n", 78 | "    detections = []\n", 79 | "    if not (len(x) == 1 and int(x) == 0):\n", 80 | "        x = x.replace(\" \", \"\")\n", 81 | "        x = x.replace(\"u'\", \"\")\n", 82 | "        x = x.replace(\"'\", \"\")\n", 83 | "        x = x.replace(\"[\", \"\")\n", 84 | "        x = x.replace(\"]\", \"\")\n", 85 | "        [black_list_sources.add(i) for i in x.split(',') if len(i) > 1]\n", 86 | "        [detections.append(i) for i in x.split(',') if len(i) > 1]\n", 87 | "    return detections\n", 88 | "df.detections = df.detections.apply(lambda x: get_list(x))" 89 | ], 90 | "language": "python", 91 | "metadata": {}, 92 | "outputs": [] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "Join the resulting multi-dimensional list to the \"side\" of the existing dataframe. \n", 99 | "\n", 100 | "You can see the host **02b123c.netsolhost.com** has 0 detections, and has *0*s in place for all of the blacklist values, whereas **0lilioo0l0o00lilil.info** has 7 detections and a *1* in place for each of its detections (e.g. hpHosts)." 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "collapsed": false, 106 | "input": [ 107 | "df = df.join(pd.DataFrame(index=df.index, columns=black_list_sources))\n", 108 | "df = df.fillna(0)\n", 109 | "for i in df.index:\n", 110 | "    for x in df.xs(i)['detections']:\n", 111 | "        df.ix[i, x] = 1\n", 112 | "df.head()" 113 | ], 114 | "language": "python", 115 | "metadata": {}, 116 | "outputs": [] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### File Input - VirusTotal\n", 123 | "The data is in a file named *mal_domains.csv* and has columns: *host*, *count*, and *detections*. This data has been pre-processed to save some pain on parsing and assembling massive amounts of JSON data." 
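The indicator-column expansion described above is worth seeing in miniature. Here's a self-contained sketch of the same idea on fabricated hosts (hpHosts appears in the lab data; the other names and hosts are made up):

```python
import pandas as pd

# Toy stand-in for the expansion above; hosts and list contents are fabricated.
toy = pd.DataFrame({'host': ['a.example', 'b.example'],
                    'detections': [['hpHosts', 'SomeList'], []]})
sources = set()
for hits in toy['detections']:
    sources.update(hits)                # collect every blacklist name seen
for s in sources:                       # one indicator column per blacklist
    toy[s] = toy['detections'].apply(lambda hits: 1 if s in hits else 0)
print(toy)                              # 1 where the host was on that list, else 0
```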
124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "collapsed": false, 129 | "input": [], 130 | "language": "python", 131 | "metadata": {}, 132 | "outputs": [] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "### Cleanup - VirusTotal\n", 139 | "Similar to the above, we clean up the detections column." 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "collapsed": false, 145 | "input": [ 146 | "av_list_sources = set()\n", 147 | "def get_list(x):\n", 148 | "    detections = []\n", 149 | "    if not (len(x) == 1 and int(x) == 0):\n", 150 | "        x = x.replace(\" \", \"\")\n", 151 | "        x = x.replace(\"u'\", \"\")\n", 152 | "        x = x.replace(\"'\", \"\")\n", 153 | "        x = x.replace(\"[\", \"\")\n", 154 | "        x = x.replace(\"]\", \"\")\n", 155 | "        [av_list_sources.add(i) for i in x.split(',') if len(i) > 1]\n", 156 | "        [detections.append(i) for i in x.split(',') if len(i) > 1]\n", 157 | "    return detections\n", 158 | "av_domains.detections = av_domains.detections.apply(lambda x: get_list(x))\n", 159 | "av_domains.head()" 160 | ], 161 | "language": "python", 162 | "metadata": {}, 163 | "outputs": [] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "A little massaging is necessary here because there are blacklists and AV engines that have the same name. This renames the columns by adding an *av_* prefix to each name, ensuring there are no duplicates, and has the extra advantage of allowing easy distinction during analysis.\n", 170 | "\n", 171 | "Also, join the AV dataframe to the blacklist one created above." 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "collapsed": false, 177 | "input": [ 178 | "new_cols = av_domains.columns - ['host']\n", 179 | "new_cols = ['av_' + x for x in new_cols.tolist()]\n", 180 | "df = df.join(pd.DataFrame(index=df.index, columns=new_cols))" 181 | ], 182 | "language": "python", 183 | "metadata": {}, 184 | "outputs": [] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "This is where the expansion happens, followed by the filling in of values: *1* for a detection and *0* for no detection." 
191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "collapsed": false, 196 | "input": [ 197 | "for i in df.index:\n", 198 | "    host = df.xs(i)['host']\n", 199 | "    avs = av_domains[av_domains['host'] == host]['detections']\n", 200 | "    if len(avs) > 0:\n", 201 | "        for a in avs.values.tolist()[0]:\n", 202 | "            df.ix[i, 'av_' + a] = 1\n", 203 | "        df.ix[i, 'av_count'] = av_domains[av_domains['host'] == host]['count'].values[0]\n", 204 | "        df.ix[i, 'av_detections'] = av_domains[av_domains['host'] == host]['detections'].values" 205 | ], 206 | "language": "python", 207 | "metadata": {}, 208 | "outputs": [] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "collapsed": false, 213 | "input": [ 214 | "df.av_detections = df.av_detections.apply(lambda x: [] if isinstance(x, float) or len(x) < 1 else x[0])" 215 | ], 216 | "language": "python", 217 | "metadata": {}, 218 | "outputs": [] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "collapsed": false, 223 | "input": [ 224 | "df = df.fillna(0)\n", 225 | "#del df['None']" 226 | ], 227 | "language": "python", 228 | "metadata": {}, 229 | "outputs": [] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "For consistency's sake, set all of the columns but *host*, *detections*, and *av_detections* to type **int**" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "collapsed": false, 241 | "input": [ 242 | "int_cols = list(df.columns - ['host','detections','av_detections'])\n", 243 | "df[int_cols] = df[int_cols].astype(int)" 244 | ], 245 | "language": "python", 246 | "metadata": {}, 247 | "outputs": [] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "Take a look at the resulting dataframe; you'll see a structure similar to the one above.\n", 254 | "\n", 255 | "The cell below shows how to print the dimensions of the dataframe; in this case it has 346 rows and 97 columns (i.e. dimensions). This is due to the selection clause; it looks for domains that have zero AV results against them and at least one blacklist hit.\n", 256 | "\n", 257 | "Try reversing the query: *av_count* > 0 and *detection_count* == 0." 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "collapsed": false, 263 | "input": [ 264 | "print df[(df['av_count'] == 0) & (df['detection_count'] > 0)].shape\n", 265 | "df[(df['av_count'] == 0) & (df['detection_count'] > 0)].head()" 266 | ], 267 | "language": "python", 268 | "metadata": {}, 269 | "outputs": [] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "collapsed": false, 274 | "input": [], 275 | "language": "python", 276 | "metadata": {}, 277 | "outputs": [] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "In your exploration you might have run across an IP address or two; let's split these up into two different dataframes. This will allow an apples-to-apples comparison." 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "collapsed": false, 289 | "input": [ 290 | "domains = df[~df.host.str.contains(\"^\\d+\\.\\d+\\.\\d+\\.\\d+$\")]\n", 291 | "ips = df[df.host.str.contains(\"^\\d+\\.\\d+\\.\\d+\\.\\d+$\")]" 292 | ], 293 | "language": "python", 294 | "metadata": {}, 295 | "outputs": [] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "How many elements (rows) are in each dataframe (*domains*, *ips*)?" 
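The split above hinges on a dotted-quad regex. Here's a minimal self-contained sketch of the same pattern on fabricated hostnames, showing how **str.contains()** plus the *~* negation divides a Series in two:

```python
import pandas as pd

# Fabricated hostnames; 203.0.113.7 is a documentation-range IP.
hosts = pd.Series(['example.com', '203.0.113.7', 'sub.example.org'])
is_ip = hosts.str.contains(r'^\d+\.\d+\.\d+\.\d+$')  # dotted-quad pattern
print(hosts[~is_ip].tolist())   # ['example.com', 'sub.example.org']
print(hosts[is_ip].tolist())    # ['203.0.113.7']
```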
302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "collapsed": false, 307 | "input": [ 308 | "domains.shape" 309 | ], 310 | "language": "python", 311 | "metadata": {}, 312 | "outputs": [] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "collapsed": false, 317 | "input": [ 318 | "ips.shape" 319 | ], 320 | "language": "python", 321 | "metadata": {}, 322 | "outputs": [] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "### Analysis\n", 329 | "The cell below pulls out the list of features that we want to use; it drops the columns that don't (or appear not to) add any value to the analysis. The hostname is what is being analyzed, the *detections* and *av_detections* columns are sparse text that can't be used in this lab, and the counts should be summed up/accounted for by the presence or lack of a qualifying detection event (AV or blacklist)." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "collapsed": false, 335 | "input": [ 336 | "cols = list(domains.columns - ['host','detections','av_detections','av_count','detection_count'])" 337 | ], 338 | "language": "python", 339 | "metadata": {}, 340 | "outputs": [] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "### K-Means Clustering\n", 347 | "K-Means works on a fairly simple idea. You provide the algorithm with **K**, the number of clusters you think are in the dataset. The algorithm will attempt to find **K** centroids, assigning each point to the centroid it is closest to; the centroids dictate the centers of the clusters.\n", 348 | "\n", 349 | "Below, the **K** for K-Means was set to two. There are many ways to determine an optimal K, but for this exercise we're only interested in two labels, good and bad. By doing this we can guide the algorithm into picking two centers and giving us a \"good\" group and a \"bad\" group of domains.\n", 350 | "\n", 351 | "The data is clustered twice: once with both the blacklist and AV features, and once with just the blacklist features. The labels for the clusters are stored in *bl_vt_labels* and *bl_labels* respectively. This allows an easy way to reference the labels without re-clustering the data later on.\n", 352 | "\n", 353 | "You should add a third cluster section that stores the labels in *vt_labels*, and clusters only the columns from the AV set. Remember, the AV results are prefixed with *av_*, making the columns easy to pick out." 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "collapsed": false, 359 | "input": [ 360 | "#Initial labeling of the data with 2 different datasets (URLS + VT, and just URLS)\n", 361 | "from sklearn.cluster import KMeans\n", 362 | "from sklearn.preprocessing import scale\n", 363 | "\n", 364 | "X = domains.as_matrix(cols)\n", 365 | "\n", 366 | "k_clusters = 2\n", 367 | "kmeans = KMeans(n_clusters=k_clusters)\n", 368 | "kmeans.fit(X)\n", 369 | "bl_vt_labels = kmeans.labels_\n", 370 | "\n", 371 | "# Blacklist only columns\n", 372 | "bl_cols = [x for x in cols if not 'av_' in x]\n", 373 | "X = domains.as_matrix(bl_cols)\n", 374 | "\n", 375 | "k_clusters = 2\n", 376 | "kmeans = KMeans(n_clusters=k_clusters)\n", 377 | "kmeans.fit(X)\n", 378 | "bl_labels = kmeans.labels_\n" 379 | ], 380 | "language": "python", 381 | "metadata": {}, 382 | "outputs": [] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "Check your work! Make sure to print out at least a few elements of *vt_labels*." 
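If K-Means is new to you, the following self-contained sketch on two fabricated, well-separated blobs makes the mechanics visible without the lab data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of fabricated points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [9, 9], [9, 10], [10, 9]], dtype=float)
km = KMeans(n_clusters=2)
km.fit(X)
print(km.labels_)           # e.g. [0 0 0 1 1 1]; which blob gets 0 vs 1 is arbitrary
print(km.cluster_centers_)  # one centroid per cluster
```

Notice the label values are arbitrary run to run, which is exactly the caveat the next section makes about the domain clusters.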
389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "collapsed": false, 394 | "input": [ 395 | "print bl_labels[:10]\n", 396 | "print bl_vt_labels[:10]" 397 | ], 398 | "language": "python", 399 | "metadata": {}, 400 | "outputs": [] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "Remember, the algorithm doesn't know what's malicious or not, so don't place any inherent value in a label of *1* or *0*. It's only a label of what group the algorithm thinks the data belongs in, although you, as an analyst, might be able to infer whether it's the malicious or benign cluster.\n", 407 | "\n", 408 | "----\n", 409 | "\n", 410 | "Below is a way to spot check domains; explore a couple more on your own. You can see what blacklists and AV engines, if any, are associated with the domain." 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "collapsed": false, 416 | "input": [ 417 | "d = \"0lilioo0l0o00lilil.info\"\n", 418 | "print \"Domain %s has bl_label: %d\" %(d, bl_labels[domains[domains['host'] == d].index[0]])\n", 419 | "print \"Domain %s has bl_vt_label: %d\" %(d, bl_vt_labels[domains[domains['host'] == d].index[0]])\n", 420 | "R = zip(list(domains.columns), domains[domains['host'] == d].values.tolist()[0])\n", 421 | "for r in R:\n", 422 | "    if r[1] == 1:\n", 423 | "        print r" 424 | ], 425 | "language": "python", 426 | "metadata": {}, 427 | "outputs": [] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "collapsed": false, 432 | "input": [], 433 | "language": "python", 434 | "metadata": {}, 435 | "outputs": [] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "collapsed": false, 440 | "input": [], 441 | "language": "python", 442 | "metadata": {}, 443 | "outputs": [] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "collapsed": false, 448 | "input": [], 449 | "language": "python", 450 | "metadata": {}, 451 | "outputs": [] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "### PCA\n", 458 | "PCA is used for dimensionality reduction; one of the major advantages of this is being able to visualize data. Our current dataset has 92 features/dimensions, which unless you have super powers is pretty hard to visualize. One awesome use of PCA is to reduce these dimensions down into something that we as mortals can see.\n", 459 | "\n", 460 | "The first exercise is reducing all 92 dimensions down to three for easy and pretty graphing. The colors in the graph are set by the labels from the K-Means clustering above.\n", 461 | "\n", 462 | "Do the same as the cell below, but with one set of graphs for the blacklist-only data and one set for the VirusTotal-only data. What kinds of patterns emerge?\n", 463 | "\n", 464 | "**Hint** don't forget to use the right labels for the right columns." 
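A handy companion to the PCA graphs is asking how much of the original variance the kept components actually explain. Here's a self-contained sketch on a fabricated 0/1 matrix (the attribute name *explained_variance_ratio_* is scikit-learn's own):

```python
import numpy as np
from sklearn.decomposition import PCA

# Fabricated 0/1 indicator matrix: 6 hosts x 4 features.
X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
              [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]], dtype=float)
pca = PCA(n_components=3)
Xr = pca.fit_transform(X)
print(Xr.shape)                       # (6, 3): same rows, fewer dimensions
print(pca.explained_variance_ratio_)  # share of variance each component keeps
```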
465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "collapsed": false, 470 | "input": [ 471 | "import pylab\n", 472 | "from matplotlib import pyplot as plt\n", 473 | "from sklearn.decomposition import PCA\n", 474 | "from mpl_toolkits.mplot3d import Axes3D\n", 475 | "\n", 476 | "pylab.rcParams['figure.figsize'] = (16.0, 5.0)\n", 477 | "\n", 478 | "X = PCA(n_components=3).fit_transform(domains.as_matrix(cols))\n", 479 | "colors = ['green' if x == 1 else 'red' for x in bl_vt_labels]\n", 480 | "\n", 481 | "figsize(12,8)\n", 482 | "fig = plt.figure(figsize=plt.figaspect(.5))\n", 483 | "fig.suptitle(\"Exploding Tacos!\")\n", 484 | "ax = fig.add_subplot(1, 2, 1, projection='3d')\n", 485 | "ax.scatter(X[:,0], X[:,1], X[:,2], alpha=.5, color=colors, s=50)\n", 486 | "ax.set_title(\"Kmeans Clusters\")\n", 487 | "ax = fig.add_subplot(1, 2, 2, projection='3d')\n", 488 | "ax.set_xlim(-5,2)\n", 489 | "ax.set_ylim(-2,2)\n", 490 | "ax.set_zlim(-2,2)\n", 491 | "ax.scatter(X[:,0], X[:,1], X[:,2], alpha=.5, color=colors, s=50)\n", 492 | "ax.set_title(\"KMeans Clusters (zoomed in)\")\n", 493 | "plt.show()" 494 | ], 495 | "language": "python", 496 | "metadata": {}, 497 | "outputs": [] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "collapsed": false, 502 | "input": [], 503 | "language": "python", 504 | "metadata": {}, 505 | "outputs": [] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "collapsed": false, 510 | "input": [], 511 | "language": "python", 512 | "metadata": {}, 513 | "outputs": [] 514 | }, 515 | { 516 | "cell_type": "markdown", 517 | "metadata": {}, 518 | "source": [ 519 | "### 2D\n", 520 | "Now that you're a wiz at reducing various dimensions to three, it's possible to reduce down to two and graph that. Perhaps some more or different structure will pop out at you.\n", 521 | "\n", 522 | "Once again the blacklist and VirusTotal scenario is done for you, do the same as above and examine the blacklist only and VirusTotal cases in 2D." 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "collapsed": false, 528 | "input": [ 529 | "colors = ['green' if x == 1 else 'red' for x in bl_vt_labels]\n", 530 | "DD = PCA(n_components=2).fit_transform(domains.as_matrix(cols))\n", 531 | "figsize(12,8)\n", 532 | "fig = plt.figure(figsize=plt.figaspect(.5))\n", 533 | "ax = fig.add_subplot(1, 1, 1)\n", 534 | "ax.scatter(DD[:,0], DD[:,1], s=50, alpha=.5, color=colors)\n", 535 | "ax.set_title(\"Raw Data 2D\")\n", 536 | "plt.show()" 537 | ], 538 | "language": "python", 539 | "metadata": {}, 540 | "outputs": [] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "collapsed": false, 545 | "input": [], 546 | "language": "python", 547 | "metadata": {}, 548 | "outputs": [] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "collapsed": false, 553 | "input": [], 554 | "language": "python", 555 | "metadata": {}, 556 | "outputs": [] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "### 1D\n", 563 | "Our last stop on this journey is 1D. The insights gained by visualizing the data in both three and two dimensions can be pretty helpful. As the beginning of the lab stated our goal is to create some kind of ranking or prioritization of the domains which is just a one-dimensional task. We'll cheat a little bit since looking at a list of numbers isn't that pretty. 
For the graphing we'll plot our points along the X-axis with a Y value of 0 for each point.\n", 564 | "\n", 565 | "The case of all the features has been provided for you; repeat the process for blacklist only and AV only." 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "collapsed": false, 571 | "input": [ 572 | "import numpy as np\n", 573 | "\n", 574 | "colors = ['green' if x == 1 else 'red' for x in bl_vt_labels]\n", 575 | "D = PCA(n_components=1).fit_transform(domains.as_matrix(cols))\n", 576 | "print len(D)\n", 577 | "DD = np.ndarray(shape=(len(D),2), dtype=float, order='F')\n", 578 | "for i in range(0,len(D)):\n", 579 | "    DD[i] = [D[i], 0.0]\n", 580 | "\n", 581 | "figsize(12,8)\n", 582 | "fig = plt.figure(figsize=plt.figaspect(.5))\n", 583 | "ax = fig.add_subplot(1, 1, 1)\n", 584 | "ax.scatter(DD[:,0], DD[:,1], s=50, color=colors)\n", 585 | "ax.set_title(\"Line 'em up!\")\n", 586 | "plt.show()" 587 | ], 588 | "language": "python", 589 | "metadata": {}, 590 | "outputs": [] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "collapsed": false, 595 | "input": [], 596 | "language": "python", 597 | "metadata": {}, 598 | "outputs": [] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "collapsed": false, 603 | "input": [], 604 | "language": "python", 605 | "metadata": {}, 606 | "outputs": [] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "## Scaled Data\n", 613 | "One of the final things we can do with this information is scale the feature returned by PCA. This shifts the data so all values are between zero and one, giving a really nice scale.\n", 614 | "\n", 615 | "The case of both AV and blacklist is once again provided; perform the same operation/graph for AV only and blacklist only." 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "collapsed": false, 621 | "input": [ 622 | "D = PCA(n_components=1).fit_transform(domains.as_matrix(cols))\n", 623 | "D = [(x - D.min())/(D.max() - D.min()) for x in D]\n", 624 | "DD = np.ndarray(shape=(len(D),2), dtype=float, order='F')\n", 625 | "for i in range(0,len(D)):\n", 626 | "    DD[i] = [D[i], 0.0]\n", 627 | "\n", 628 | "figsize(12,8)\n", 629 | "fig = plt.figure(figsize=plt.figaspect(.5))\n", 630 | "ax = fig.add_subplot(1, 1, 1)\n", 631 | "ax.scatter(DD[:,0], DD[:,1], s=50, alpha=.5, color=colors)\n", 632 | "ax.set_title(\"Normalized/Scaled between 0 and 1\")\n", 633 | "plt.show()" 634 | ], 635 | "language": "python", 636 | "metadata": {}, 637 | "outputs": [] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": {}, 642 | "source": [ 643 | "## Putting It All Together\n", 644 | "\n", 645 | "After doing all that work to attempt to order and group data, it's time to make use of the results. Remember that the labels *0* and *1* are arbitrary, so it will take assigning the values back and interpreting the data to understand what's going on.\n", 646 | "\n", 647 | "Here's one of the ways to assign and look at domains. This is just for the AV and blacklist results, so you should do the same with the other labels/values.\n", 648 | "\n", 649 | "When does this seem to work, and when does it seem to fail? How valuable do you think this kind of technique is?" 
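The scaling used above is plain min-max normalization, x' = (x - min) / (max - min). Here's a worked, self-contained sketch with fabricated values:

```python
# Min-max scaling by hand: x' = (x - min) / (max - min); values fabricated.
D = [2.0, 5.0, 11.0]
lo, hi = min(D), max(D)
scaled = [(x - lo) / (hi - lo) for x in D]
print(scaled)   # [0.0, 0.333..., 1.0]; the extremes always map to 0 and 1
```

For what it's worth, scikit-learn's *MinMaxScaler* implements the same transform, but the one-liner above keeps the formula visible.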
650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "collapsed": false, 655 | "input": [ 656 | "D = PCA(n_components=1).fit_transform(domains.as_matrix(cols))\n", 657 | "D = [(x - D.min())/(D.max() - D.min()) for x in D]\n", 658 | "domains['bl_vt_scaled'] = D\n", 659 | "domains[['host','bl_vt_scaled']].head()" 660 | ], 661 | "language": "python", 662 | "metadata": {}, 663 | "outputs": [] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "collapsed": false, 668 | "input": [ 669 | "domains[domains['host'] == '0td4nbde7.ttl60.com'][['detections','detection_count','av_detections','av_count']]" 670 | ], 671 | "language": "python", 672 | "metadata": {}, 673 | "outputs": [] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "collapsed": false, 678 | "input": [ 679 | "domains[domains['bl_vt_scaled'] == 1]['host'].unique()" 680 | ], 681 | "language": "python", 682 | "metadata": {}, 683 | "outputs": [] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "collapsed": false, 688 | "input": [ 689 | "domains[domains['host'] == 'turningsbyterry.com'][['detections','detection_count','av_detections','av_count']]" 690 | ], 691 | "language": "python", 692 | "metadata": {}, 693 | "outputs": [] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "collapsed": false, 698 | "input": [ 699 | "domains[domains['bl_vt_scaled'] == 0]['host'].unique()" 700 | ], 701 | "language": "python", 702 | "metadata": {}, 703 | "outputs": [] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "collapsed": false, 708 | "input": [ 709 | "domains[domains['host'] == 'download.yspbrsz.net'][['detections','detection_count','av_detections','av_count']]" 710 | ], 711 | "language": "python", 712 | "metadata": {}, 713 | "outputs": [] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "collapsed": false, 718 | "input": [], 719 | "language": "python", 720 | "metadata": {}, 721 | "outputs": [] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "collapsed": false, 726 | "input": [], 727 | "language": "python", 728 | "metadata": {}, 729 | "outputs": [] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "collapsed": false, 734 | "input": [], 735 | "language": "python", 736 | "metadata": {}, 737 | "outputs": [] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "collapsed": false, 742 | "input": [], 743 | "language": "python", 744 | "metadata": {}, 745 | "outputs": [] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "collapsed": false, 750 | "input": [], 751 | "language": "python", 752 | "metadata": {}, 753 | "outputs": [] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "collapsed": false, 758 | "input": [], 759 | "language": "python", 760 | "metadata": {}, 761 | "outputs": [] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "collapsed": false, 766 | "input": [], 767 | "language": "python", 768 | "metadata": {}, 769 | "outputs": [] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "collapsed": false, 774 | "input": [], 775 | "language": "python", 776 | "metadata": {}, 777 | "outputs": [] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "collapsed": false, 782 | "input": [], 783 | "language": "python", 784 | "metadata": {}, 785 | "outputs": [] 786 | }, 787 | { 788 | "cell_type": "code", 789 | "collapsed": false, 790 | "input": [], 791 | "language": "python", 792 | "metadata": {}, 793 | "outputs": [] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "collapsed": false, 798 | "input": [], 799 | "language": "python", 800 | "metadata": {}, 801 | "outputs": [] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "collapsed": false, 806 | "input": [], 807 | 
"language": "python", 808 | "metadata": {}, 809 | "outputs": [] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "collapsed": false, 814 | "input": [], 815 | "language": "python", 816 | "metadata": {}, 817 | "outputs": [] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "collapsed": false, 822 | "input": [], 823 | "language": "python", 824 | "metadata": {}, 825 | "outputs": [] 826 | }, 827 | { 828 | "cell_type": "markdown", 829 | "metadata": {}, 830 | "source": [ 831 | "# Fin" 832 | ] 833 | } 834 | ], 835 | "metadata": {} 836 | } 837 | ] 838 | } -------------------------------------------------------------------------------- /Lab_1/Lab_1-Solutions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:e7553f59d5e28227c460b7c26f3b048bd5723079634bb861a32a0b13ff7b0f3c" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# Lab 1\n", 16 | "\n", 17 | "## Introduction\n", 18 | "This is a basic introduction to IPython and panads functionality. Pandas (Python Data Analysis Library) \"is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\" It (pandas) provides dataframe functionality for reading/accessing/manipulating data in memory. You can think of a data frame as a table of indexed values.\n", 19 | "\n", 20 | "What you're currently looking at is an IPython Notebook, this acts as a way to interactively use the python interpreter as well as a way to display graphs/charts/images/markdown along with code. IPython is commonly used in scientific computing due to its flexibility. Much more information is available on the IPython website.\n", 21 | "\n", 22 | "Often data is stored in files, and the first goal is to get that information off of disk and into a dataframe. Since we're working with limited resources in this VM we'll have to use samples of some of the files. Don't worry though, the same techniques apply if you're not sampling the files for exploration.\n", 23 | "\n", 24 | "## Tip\n", 25 | "If you ever want to know the various keyboard shortcuts, just click on a (non-code) cell or the text \"In []\" to the left of the cell, and press the *H* key. Or select *Help* from the menu above, and then *Keyboard Shortcuts*.\n", 26 | "___" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## Exercises\n", 34 | "\n", 35 | "### File sampling\n", 36 | "First off, let's take a look at a log file generated from Bro this log is similar to netflow logs as well. However, this log file is rather large and doesn't fit in memory.\n", 37 | "\n", 38 | "As part of the first exercise, figure out what setting the variable **sample_percent** should be in order to read in between 200k and 300k worth of (randomly selected) lines from the file. Change the variable, after doing that either click the *play* button above (it's the arrow) or hit the *[Shift]+[Enter]* keys as the same time." 
39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "collapsed": false, 44 | "input": [ 45 | "import random\n", 46 | "logfile = 'conn.log'\n", 47 | "sample_percent = .01\n", 48 | "num_lines = sum(1 for line in open(logfile))\n", 49 | "slines = set(sorted(random.sample(xrange(num_lines), int(num_lines * sample_percent))))\n", 50 | "print \"%s lines in %s, using a sample of %s lines\" %(num_lines, logfile, len(slines))" 51 | ], 52 | "language": "python", 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "output_type": "stream", 57 | "stream": "stdout", 58 | "text": [ 59 | "22694356 lines in conn.log, using a sample of 226943 lines\n" 60 | ] 61 | } 62 | ], 63 | "prompt_number": 5 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### File Creation\n", 70 | "Awesome! Now that you have a subset of lines to work with, let's write them to another file so we'll have something to practice reading in. Simply hit *[Shift]+[Enter]* below to run the code in the cell and create a new file." 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "collapsed": false, 76 | "input": [ 77 | "outfile = 'conn_sample.log'\n", 78 | "f = open(outfile, 'w+')\n", 79 | "i = open(logfile, 'r+')\n", 80 | "linecount = 0\n", 81 | "for line in i:\n", 82 | " if linecount in slines:\n", 83 | " f.write(line)\n", 84 | " linecount += 1\n", 85 | "f.close()\n", 86 | "i.close()" 87 | ], 88 | "language": "python", 89 | "metadata": {}, 90 | "outputs": [], 91 | "prompt_number": 6 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### File Input (CSV)\n", 98 | "This next cell does a couple of things, first it imports pandas so we can create a dataframe, and then it reads our newly created file from above into memory. You can see the separator is specified to \"\\t\" because Bro produces tab-delimited files by default. In this case we've also specified what we should call the columns in the dataframe." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "collapsed": false, 104 | "input": [ 105 | "import pandas as pd\n", 106 | "conn_df = pd.read_csv(outfile, sep=\"\\t\", header=None, names=['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','proto','service','duration','orig_bytes','resp_bytes','conn_state','local_orig','missed_bytes','history','orig_pkts','orig_ip_bytes','resp_pkts','resp_ip_bytes','tunnel_parents','threat','sample'])" 107 | ], 108 | "language": "python", 109 | "metadata": {}, 110 | "outputs": [], 111 | "prompt_number": 16 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Verifying Input\n", 118 | "Now (in theory) the contents of the file should be in a nicely laid-out dataframe.\n", 119 | "\n", 120 | "For this next exercise, experiment with calling the **head()** and **tail()** method to see the values at the beginning and end of the dataframe. You can also pass a number to **head()** and **tail()** to specify the number of lines you want to see. Remember to click *play* or press *[Shift]+[Enter]* to execute the code in the cell after you change it." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "collapsed": false, 126 | "input": [ 127 | "conn_df.head()" 128 | ], 129 | "language": "python", 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "html": [ 134 | "
\n", 135 | "\n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | "
tsuidid.orig_hid.orig_pid.resp_hid.resp_pprotoservicedurationorig_bytes...local_origmissed_byteshistoryorig_pktsorig_ip_bytesresp_pktsresp_ip_bytestunnel_parentsthreatsample
0 1.331901e+09 CddInw3BLL4rBo8aXh 192.168.202.79 55896 192.168.229.251 53 tcp - 0.010000 0... - 0 ShR 2 84 1 44 (empty)NaNNaN
1 1.331901e+09 CkoCJi2jAzoOjZocm9 192.168.202.100 45658 192.168.27.25 16948 udp - - -... - 0 D 1 28 0 0 (empty)NaNNaN
2 1.331901e+09 CCLTL53SZhguS8YHfi 192.168.202.100 45659 192.168.27.253 34578 udp - - -... - 0 D 1 28 0 0 (empty)NaNNaN
3 1.331901e+09 CJPvPZ2MG8hyzTwsva 192.168.202.79 47819 192.168.229.153 49160 tcp - 0.010000 198... - 0 ShADdfFa 5 466 4 334 (empty)NaNNaN
4 1.331901e+09 CS4Gks3dvOPavNvxp2 192.168.202.79 47827 192.168.229.153 49160 tcp - 0.100000 198... - 0 ShADdfFa 6 518 4 334 (empty)NaNNaN
\n", 285 | "

5 rows \u00d7 22 columns

\n", 286 | "
" 287 | ], 288 | "metadata": {}, 289 | "output_type": "pyout", 290 | "prompt_number": 17, 291 | "text": [ 292 | " ts uid id.orig_h id.orig_p \\\n", 293 | "0 1.331901e+09 CddInw3BLL4rBo8aXh 192.168.202.79 55896 \n", 294 | "1 1.331901e+09 CkoCJi2jAzoOjZocm9 192.168.202.100 45658 \n", 295 | "2 1.331901e+09 CCLTL53SZhguS8YHfi 192.168.202.100 45659 \n", 296 | "3 1.331901e+09 CJPvPZ2MG8hyzTwsva 192.168.202.79 47819 \n", 297 | "4 1.331901e+09 CS4Gks3dvOPavNvxp2 192.168.202.79 47827 \n", 298 | "\n", 299 | " id.resp_h id.resp_p proto service duration orig_bytes ... \\\n", 300 | "0 192.168.229.251 53 tcp - 0.010000 0 ... \n", 301 | "1 192.168.27.25 16948 udp - - - ... \n", 302 | "2 192.168.27.253 34578 udp - - - ... \n", 303 | "3 192.168.229.153 49160 tcp - 0.010000 198 ... \n", 304 | "4 192.168.229.153 49160 tcp - 0.100000 198 ... \n", 305 | "\n", 306 | " local_orig missed_bytes history orig_pkts orig_ip_bytes resp_pkts \\\n", 307 | "0 - 0 ShR 2 84 1 \n", 308 | "1 - 0 D 1 28 0 \n", 309 | "2 - 0 D 1 28 0 \n", 310 | "3 - 0 ShADdfFa 5 466 4 \n", 311 | "4 - 0 ShADdfFa 6 518 4 \n", 312 | "\n", 313 | " resp_ip_bytes tunnel_parents threat sample \n", 314 | "0 44 (empty) NaN NaN \n", 315 | "1 0 (empty) NaN NaN \n", 316 | "2 0 (empty) NaN NaN \n", 317 | "3 334 (empty) NaN NaN \n", 318 | "4 334 (empty) NaN NaN \n", 319 | "\n", 320 | "[5 rows x 22 columns]" 321 | ] 322 | } 323 | ], 324 | "prompt_number": 17 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "### Data Summarization\n", 331 | "Now create a new cell below this one. This can be accomplished by clicking on this cell once, and then clicking the *+* icon towards the top or selecting *Insert* from above and then selecting *Insert Cell Below*. After creating the new cell, it's time to learn about the **describe()** method that can be called on dataframes. This will give you a numeric summarization of all columns that contain numbers.\n", 332 | "\n", 333 | "Try it out!" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "collapsed": false, 339 | "input": [ 340 | "conn_df.describe()" 341 | ], 342 | "language": "python", 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "html": [ 347 | "
\n", 348 | "\n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | "
tsid.orig_pid.resp_pmissed_bytesorig_pktsorig_ip_bytesresp_pktsresp_ip_bytesthreatsample
count 2.269430e+05 226943.000000 226943.000000 226943.000000 226943.000000 226943.000000 226943.000000 226943.000000 0 0
mean 1.331949e+09 42673.618799 20442.668873 1.445028 1.388468 128.940637 0.852818 177.177454NaNNaN
std 4.273521e+04 15311.420570 20618.839391 688.390189 9.025720 3349.892018 11.067241 12805.400204NaNNaN
min 1.331901e+09 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000NaNNaN
25% 1.331908e+09 35968.000000 2121.000000 0.000000 1.000000 44.000000 0.000000 0.000000NaNNaN
50% 1.331928e+09 44316.000000 10244.000000 0.000000 1.000000 48.000000 1.000000 40.000000NaNNaN
75% 1.331997e+09 54416.000000 37716.500000 0.000000 1.000000 60.000000 1.000000 40.000000NaNNaN
max 1.332018e+09 65535.000000 65535.000000 327939.000000 2480.000000 996593.000000 4223.000000 5700628.000000NaNNaN
\n", 471 | "
" 472 | ], 473 | "metadata": {}, 474 | "output_type": "pyout", 475 | "prompt_number": 18, 476 | "text": [ 477 | " ts id.orig_p id.resp_p missed_bytes \\\n", 478 | "count 2.269430e+05 226943.000000 226943.000000 226943.000000 \n", 479 | "mean 1.331949e+09 42673.618799 20442.668873 1.445028 \n", 480 | "std 4.273521e+04 15311.420570 20618.839391 688.390189 \n", 481 | "min 1.331901e+09 0.000000 0.000000 0.000000 \n", 482 | "25% 1.331908e+09 35968.000000 2121.000000 0.000000 \n", 483 | "50% 1.331928e+09 44316.000000 10244.000000 0.000000 \n", 484 | "75% 1.331997e+09 54416.000000 37716.500000 0.000000 \n", 485 | "max 1.332018e+09 65535.000000 65535.000000 327939.000000 \n", 486 | "\n", 487 | " orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes threat \\\n", 488 | "count 226943.000000 226943.000000 226943.000000 226943.000000 0 \n", 489 | "mean 1.388468 128.940637 0.852818 177.177454 NaN \n", 490 | "std 9.025720 3349.892018 11.067241 12805.400204 NaN \n", 491 | "min 0.000000 0.000000 0.000000 0.000000 NaN \n", 492 | "25% 1.000000 44.000000 0.000000 0.000000 NaN \n", 493 | "50% 1.000000 48.000000 1.000000 40.000000 NaN \n", 494 | "75% 1.000000 60.000000 1.000000 40.000000 NaN \n", 495 | "max 2480.000000 996593.000000 4223.000000 5700628.000000 NaN \n", 496 | "\n", 497 | " sample \n", 498 | "count 0 \n", 499 | "mean NaN \n", 500 | "std NaN \n", 501 | "min NaN \n", 502 | "25% NaN \n", 503 | "50% NaN \n", 504 | "75% NaN \n", 505 | "max NaN " 506 | ] 507 | } 508 | ], 509 | "prompt_number": 18 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "### Data Types\n", 516 | "Wait a second, isn't the ts column supposed to be a timestamp? Perhaps this column would be better suited as a time data type vs. a number.\n", 517 | "\n", 518 | "Run the cell below to see what type of information Python stored in each column." 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "collapsed": false, 524 | "input": [ 525 | "conn_df.dtypes" 526 | ], 527 | "language": "python", 528 | "metadata": {}, 529 | "outputs": [ 530 | { 531 | "metadata": {}, 532 | "output_type": "pyout", 533 | "prompt_number": 19, 534 | "text": [ 535 | "ts float64\n", 536 | "uid object\n", 537 | "id.orig_h object\n", 538 | "id.orig_p int64\n", 539 | "id.resp_h object\n", 540 | "id.resp_p int64\n", 541 | "proto object\n", 542 | "service object\n", 543 | "duration object\n", 544 | "orig_bytes object\n", 545 | "resp_bytes object\n", 546 | "conn_state object\n", 547 | "local_orig object\n", 548 | "missed_bytes int64\n", 549 | "history object\n", 550 | "orig_pkts int64\n", 551 | "orig_ip_bytes int64\n", 552 | "resp_pkts int64\n", 553 | "resp_ip_bytes int64\n", 554 | "tunnel_parents object\n", 555 | "threat float64\n", 556 | "sample float64\n", 557 | "dtype: object" 558 | ] 559 | } 560 | ], 561 | "prompt_number": 19 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "### Converting Column Types\n", 568 | "Time to change the ts column to a datetime object! We will accomplish that by using a simple function provided called *to_datetime()*. The cell below runs this function on the ts column (what should be a time stamp), and then re-assigns this column back to the dataframe in the same place. A new timestamp column could have been added to the dataframe as well so both the float value and the datetime object columns are present.\n", 569 | "\n", 570 | "Run the cell below to convert the column type." 
571 | ]
572 | },
573 | {
574 | "cell_type": "code",
575 | "collapsed": false,
576 | "input": [
577 | "from datetime import datetime\n",
578 | "conn_df['ts'] = [datetime.fromtimestamp(float(date)) for date in conn_df['ts'].values]"
579 | ],
580 | "language": "python",
581 | "metadata": {},
582 | "outputs": [],
583 | "prompt_number": 20
584 | },
585 | {
586 | "cell_type": "markdown",
587 | "metadata": {},
588 | "source": [
589 | "### Data Value Exploration\n",
590 | "Verify that the conversion was successful. What is the datatype of the column now?\n",
591 | "\n",
592 | "Scroll back up the page and note where you ran the **describe()** function. You'll see that under the threat and sample columns the value is *NaN*. This stands for \"Not a Number\" and is the special value assigned to empty column values. There are a few ways to explore what values a column has. Two of these are **value_counts()** and **unique()**.\n",
593 | "\n",
594 | "Try them below on different columns. You can create new cells, or, if you want to see more than just the last command's output, you can put a print statement in front.\n",
595 | "\n",
596 | "What happens when you run them on a column with IPs (*id.orig_h, id.resp_h*)? What about sample or threat?"
597 | ]
598 | },
599 | {
600 | "cell_type": "code",
601 | "collapsed": false,
602 | "input": [
603 | "conn_df['sample'].unique()"
604 | ],
605 | "language": "python",
606 | "metadata": {},
607 | "outputs": [
608 | {
609 | "metadata": {},
610 | "output_type": "pyout",
611 | "prompt_number": 21,
612 | "text": [
613 | "array([ nan])"
614 | ]
615 | }
616 | ],
617 | "prompt_number": 21
618 | },
619 | {
620 | "cell_type": "markdown",
621 | "metadata": {},
622 | "source": [
623 | "### Remove Columns\n",
624 | "Another useful operation on a dataframe is removing and adding columns. Since the threat and sample columns contain only *NaNs*, we can safely remove them without impacting any analysis that may be performed.\n",
625 | "\n",
626 | "Below, the sample column is removed (dropped); add a similar line to drop the *threat* column, and use a method from above to verify they are no longer in the dataframe."
627 | ]
628 | },
629 | {
630 | "cell_type": "code",
631 | "collapsed": false,
632 | "input": [
633 | "conn_df.drop('sample', axis=1, inplace=True)"
634 | ],
635 | "language": "python",
636 | "metadata": {},
637 | "outputs": [],
638 | "prompt_number": 22
639 | },
640 | {
641 | "cell_type": "markdown",
642 | "metadata": {},
643 | "source": [
644 | "Can you think of other columns to remove? Select a few and remove them as well. What does your dataframe look like now? (Insert additional cells as needed)"
645 | ]
646 | },
647 | {
648 | "cell_type": "code",
649 | "collapsed": false,
650 | "input": [
651 | "conn_df.drop('threat', axis=1, inplace=True)\n",
652 | "conn_df.head()"
653 | ],
654 | "language": "python",
655 | "metadata": {},
656 | "outputs": [
657 | {
658 | "html": [
659 | "
\n", 660 | "\n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | "
tsuidid.orig_hid.orig_pid.resp_hid.resp_pprotoservicedurationorig_bytesresp_bytesconn_statelocal_origmissed_byteshistoryorig_pktsorig_ip_bytesresp_pktsresp_ip_bytestunnel_parents
02012-03-16 07:30:12.530000 CddInw3BLL4rBo8aXh 192.168.202.79 55896 192.168.229.251 53 tcp - 0.010000 0 0 RSTO - 0 ShR 2 84 1 44 (empty)
12012-03-16 07:30:01.700000 CkoCJi2jAzoOjZocm9 192.168.202.100 45658 192.168.27.25 16948 udp - - - - S0 - 0 D 1 28 0 0 (empty)
22012-03-16 07:30:06.210000 CCLTL53SZhguS8YHfi 192.168.202.100 45659 192.168.27.253 34578 udp - - - - S0 - 0 D 1 28 0 0 (empty)
32012-03-16 07:31:05.570000 CJPvPZ2MG8hyzTwsva 192.168.202.79 47819 192.168.229.153 49160 tcp - 0.010000 198 118 SF - 0 ShADdfFa 5 466 4 334 (empty)
42012-03-16 07:31:05.790000 CS4Gks3dvOPavNvxp2 192.168.202.79 47827 192.168.229.153 49160 tcp - 0.100000 198 118 SF - 0 ShADdfFa 6 518 4 334 (empty)
\n", 804 | "
" 805 | ], 806 | "metadata": {}, 807 | "output_type": "pyout", 808 | "prompt_number": 23, 809 | "text": [ 810 | " ts uid id.orig_h id.orig_p \\\n", 811 | "0 2012-03-16 07:30:12.530000 CddInw3BLL4rBo8aXh 192.168.202.79 55896 \n", 812 | "1 2012-03-16 07:30:01.700000 CkoCJi2jAzoOjZocm9 192.168.202.100 45658 \n", 813 | "2 2012-03-16 07:30:06.210000 CCLTL53SZhguS8YHfi 192.168.202.100 45659 \n", 814 | "3 2012-03-16 07:31:05.570000 CJPvPZ2MG8hyzTwsva 192.168.202.79 47819 \n", 815 | "4 2012-03-16 07:31:05.790000 CS4Gks3dvOPavNvxp2 192.168.202.79 47827 \n", 816 | "\n", 817 | " id.resp_h id.resp_p proto service duration orig_bytes resp_bytes \\\n", 818 | "0 192.168.229.251 53 tcp - 0.010000 0 0 \n", 819 | "1 192.168.27.25 16948 udp - - - - \n", 820 | "2 192.168.27.253 34578 udp - - - - \n", 821 | "3 192.168.229.153 49160 tcp - 0.010000 198 118 \n", 822 | "4 192.168.229.153 49160 tcp - 0.100000 198 118 \n", 823 | "\n", 824 | " conn_state local_orig missed_bytes history orig_pkts orig_ip_bytes \\\n", 825 | "0 RSTO - 0 ShR 2 84 \n", 826 | "1 S0 - 0 D 1 28 \n", 827 | "2 S0 - 0 D 1 28 \n", 828 | "3 SF - 0 ShADdfFa 5 466 \n", 829 | "4 SF - 0 ShADdfFa 6 518 \n", 830 | "\n", 831 | " resp_pkts resp_ip_bytes tunnel_parents \n", 832 | "0 1 44 (empty) \n", 833 | "1 0 0 (empty) \n", 834 | "2 0 0 (empty) \n", 835 | "3 4 334 (empty) \n", 836 | "4 4 334 (empty) " 837 | ] 838 | } 839 | ], 840 | "prompt_number": 23 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": {}, 845 | "source": [ 846 | "### Row Selection\n", 847 | "\n", 848 | "You can use column values to select rows from the dataframes (and even only view specific columns). First, select all rows that contain *SSL* traffic by running the cell below." 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "collapsed": false, 854 | "input": [ 855 | "conn_df[conn_df['service'] == 'ssl'].head()" 856 | ], 857 | "language": "python", 858 | "metadata": {}, 859 | "outputs": [ 860 | { 861 | "html": [ 862 | "
\n", 863 | "\n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | "
tsuidid.orig_hid.orig_pid.resp_hid.resp_pprotoservicedurationorig_bytesresp_bytesconn_statelocal_origmissed_byteshistoryorig_pktsorig_ip_bytesresp_pktsresp_ip_bytestunnel_parents
55192012-03-16 07:46:55.500000 C99nzvZoisDlH33U 192.168.202.79 51012 192.168.229.254 443 tcp ssl 0.260000 539 1060 SF - 0 ShADadfrF 15 1327 13 1744 (empty)
55342012-03-16 07:46:59.420000 CqtDAz3GW7CCxDlP3i 192.168.202.79 51100 192.168.229.254 443 tcp ssl 0.260000 560 1060 SF - 0 ShADadfrF 15 1348 13 1744 (empty)
56852012-03-16 07:47:31.430000 CG3EsB1fMQ5sWzzol 192.168.202.79 52181 192.168.229.254 443 tcp ssl 0.260000 550 1060 SF - 0 ShADadfrF 15 1338 13 1744 (empty)
57942012-03-16 07:47:54.040000 CCAZaL2jlKXFvCZtDb 192.168.202.79 53094 192.168.229.254 443 tcp ssl 0.260000 535 1060 SF - 0 ShADadfrF 15 1323 13 1744 (empty)
59022012-03-16 07:48:15.130000 CpZW6p3iO95iolPIte 192.168.202.79 53650 192.168.229.254 443 tcp ssl 0.260000 546 1060 SF - 0 ShADadfrF 15 1334 13 1744 (empty)
\n", 1007 | "
" 1008 | ], 1009 | "metadata": {}, 1010 | "output_type": "pyout", 1011 | "prompt_number": 24, 1012 | "text": [ 1013 | " ts uid id.orig_h \\\n", 1014 | "5519 2012-03-16 07:46:55.500000 C99nzvZoisDlH33U 192.168.202.79 \n", 1015 | "5534 2012-03-16 07:46:59.420000 CqtDAz3GW7CCxDlP3i 192.168.202.79 \n", 1016 | "5685 2012-03-16 07:47:31.430000 CG3EsB1fMQ5sWzzol 192.168.202.79 \n", 1017 | "5794 2012-03-16 07:47:54.040000 CCAZaL2jlKXFvCZtDb 192.168.202.79 \n", 1018 | "5902 2012-03-16 07:48:15.130000 CpZW6p3iO95iolPIte 192.168.202.79 \n", 1019 | "\n", 1020 | " id.orig_p id.resp_h id.resp_p proto service duration \\\n", 1021 | "5519 51012 192.168.229.254 443 tcp ssl 0.260000 \n", 1022 | "5534 51100 192.168.229.254 443 tcp ssl 0.260000 \n", 1023 | "5685 52181 192.168.229.254 443 tcp ssl 0.260000 \n", 1024 | "5794 53094 192.168.229.254 443 tcp ssl 0.260000 \n", 1025 | "5902 53650 192.168.229.254 443 tcp ssl 0.260000 \n", 1026 | "\n", 1027 | " orig_bytes resp_bytes conn_state local_orig missed_bytes history \\\n", 1028 | "5519 539 1060 SF - 0 ShADadfrF \n", 1029 | "5534 560 1060 SF - 0 ShADadfrF \n", 1030 | "5685 550 1060 SF - 0 ShADadfrF \n", 1031 | "5794 535 1060 SF - 0 ShADadfrF \n", 1032 | "5902 546 1060 SF - 0 ShADadfrF \n", 1033 | "\n", 1034 | " orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents \n", 1035 | "5519 15 1327 13 1744 (empty) \n", 1036 | "5534 15 1348 13 1744 (empty) \n", 1037 | "5685 15 1338 13 1744 (empty) \n", 1038 | "5794 15 1323 13 1744 (empty) \n", 1039 | "5902 15 1334 13 1744 (empty) " 1040 | ] 1041 | } 1042 | ], 1043 | "prompt_number": 24 1044 | }, 1045 | { 1046 | "cell_type": "markdown", 1047 | "metadata": {}, 1048 | "source": [ 1049 | "Next we can assign that result to a dataframe, and then look at all all the *SSL* connections that happen over ports other than 443." 1050 | ] 1051 | }, 1052 | { 1053 | "cell_type": "code", 1054 | "collapsed": false, 1055 | "input": [ 1056 | "ssl_df = conn_df[conn_df['service'] == 'ssl']\n", 1057 | "ssl_df[ssl_df['id.resp_p'] != 443].head()" 1058 | ], 1059 | "language": "python", 1060 | "metadata": {}, 1061 | "outputs": [ 1062 | { 1063 | "html": [ 1064 | "
\n", 1065 | "\n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | "
tsuidid.orig_hid.orig_pid.resp_hid.resp_pprotoservicedurationorig_bytesresp_bytesconn_statelocal_origmissed_byteshistoryorig_pktsorig_ip_bytesresp_pktsresp_ip_bytestunnel_parents
420682012-03-16 08:35:12.670000 CWEAWH17T5NGwgpsbk 192.168.202.110 52016 192.168.27.253 8089 tcp ssl 0.310000 697 1816 RSTO - 0 ShADadfR 10 1225 6 2136 (empty)
528082012-03-16 09:09:19.530000 CRotGy3jAXGOxvCMX3 192.168.204.70 37074 192.168.202.68 55553 tcp ssl 5.030000 637 8441 SF - 0 ShADadFfR 30 2802 26 18250 (empty)
764042012-03-16 12:15:48.180000 CMrDsn1lGAwM9ylHkd 192.168.203.64 43477 192.168.202.68 55553 tcp ssl 5.020000 535 8633 SF - 0 ShADadFfR 30 2598 26 18634 (empty)
860302012-03-16 12:51:17.590000 CRJFe04DM00TK2IOg9 192.168.202.79 60419 192.168.26.203 8089 tcp ssl 5.010000 402 143 SF - 0 ShADadFfRR 10 906 10 704 (empty)
983952012-03-16 13:28:17.950000 CJnsYp1hNeUxgCL4V1 192.168.202.110 40831 192.168.21.102 993 tcp ssl - - - RSTO - 0 ShADadR 5 318 3 396 (empty)
\n", 1209 | "
" 1210 | ], 1211 | "metadata": {}, 1212 | "output_type": "pyout", 1213 | "prompt_number": 25, 1214 | "text": [ 1215 | " ts uid id.orig_h \\\n", 1216 | "42068 2012-03-16 08:35:12.670000 CWEAWH17T5NGwgpsbk 192.168.202.110 \n", 1217 | "52808 2012-03-16 09:09:19.530000 CRotGy3jAXGOxvCMX3 192.168.204.70 \n", 1218 | "76404 2012-03-16 12:15:48.180000 CMrDsn1lGAwM9ylHkd 192.168.203.64 \n", 1219 | "86030 2012-03-16 12:51:17.590000 CRJFe04DM00TK2IOg9 192.168.202.79 \n", 1220 | "98395 2012-03-16 13:28:17.950000 CJnsYp1hNeUxgCL4V1 192.168.202.110 \n", 1221 | "\n", 1222 | " id.orig_p id.resp_h id.resp_p proto service duration \\\n", 1223 | "42068 52016 192.168.27.253 8089 tcp ssl 0.310000 \n", 1224 | "52808 37074 192.168.202.68 55553 tcp ssl 5.030000 \n", 1225 | "76404 43477 192.168.202.68 55553 tcp ssl 5.020000 \n", 1226 | "86030 60419 192.168.26.203 8089 tcp ssl 5.010000 \n", 1227 | "98395 40831 192.168.21.102 993 tcp ssl - \n", 1228 | "\n", 1229 | " orig_bytes resp_bytes conn_state local_orig missed_bytes history \\\n", 1230 | "42068 697 1816 RSTO - 0 ShADadfR \n", 1231 | "52808 637 8441 SF - 0 ShADadFfR \n", 1232 | "76404 535 8633 SF - 0 ShADadFfR \n", 1233 | "86030 402 143 SF - 0 ShADadFfRR \n", 1234 | "98395 - - RSTO - 0 ShADadR \n", 1235 | "\n", 1236 | " orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents \n", 1237 | "42068 10 1225 6 2136 (empty) \n", 1238 | "52808 30 2802 26 18250 (empty) \n", 1239 | "76404 30 2598 26 18634 (empty) \n", 1240 | "86030 10 906 10 704 (empty) \n", 1241 | "98395 5 318 3 396 (empty) " 1242 | ] 1243 | } 1244 | ], 1245 | "prompt_number": 25 1246 | }, 1247 | { 1248 | "cell_type": "markdown", 1249 | "metadata": {}, 1250 | "source": [ 1251 | "You can see the individual column selections above eg: *conn_df['service']*, and *ssl_df['id.resp_p']* respectively. You can use these to view output of specific columns. \n", 1252 | "\n", 1253 | "For example, run the cell below to see all the individual values of originator bytes associated with a *SSL* connection over port 443." 1254 | ] 1255 | }, 1256 | { 1257 | "cell_type": "code", 1258 | "collapsed": false, 1259 | "input": [ 1260 | "ssl_df[ssl_df['id.resp_p'] == 443]['orig_bytes'].head()" 1261 | ], 1262 | "language": "python", 1263 | "metadata": {}, 1264 | "outputs": [ 1265 | { 1266 | "metadata": {}, 1267 | "output_type": "pyout", 1268 | "prompt_number": 26, 1269 | "text": [ 1270 | "5519 539\n", 1271 | "5534 560\n", 1272 | "5685 550\n", 1273 | "5794 535\n", 1274 | "5902 546\n", 1275 | "Name: orig_bytes, dtype: object" 1276 | ] 1277 | } 1278 | ], 1279 | "prompt_number": 26 1280 | }, 1281 | { 1282 | "cell_type": "markdown", 1283 | "metadata": {}, 1284 | "source": [ 1285 | "## Final Exercise\n", 1286 | "Use all of the techniques above to display the unique ports and originator IPs (bonus points for the number of connections of each) associated with all *HTTP* connections **NOT** over port 80." 
1287 | ]
1288 | },
1289 | {
1290 | "cell_type": "code",
1291 | "collapsed": false,
1292 | "input": [
1293 | "http_df = conn_df[conn_df['service'] == 'http']\n",
1294 | "http_df[http_df['id.resp_p'] != 80]['id.orig_h'].value_counts()"
1295 | ],
1296 | "language": "python",
1297 | "metadata": {},
1298 | "outputs": [
1299 | {
1300 | "metadata": {},
1301 | "output_type": "pyout",
1302 | "prompt_number": 29,
1303 | "text": [
1304 | "192.168.202.110    423\n",
1305 | "192.168.202.140     77\n",
1306 | "192.168.202.138     65\n",
1307 | "192.168.202.79      20\n",
1308 | "192.168.204.45       6\n",
1309 | "192.168.202.108      5\n",
1310 | "192.168.202.100      3\n",
1311 | "192.168.202.112      2\n",
1312 | "192.168.202.4        1\n",
1313 | "192.168.202.103      1\n",
1314 | "192.168.202.95       1\n",
1315 | "192.168.202.144      1\n",
1316 | "192.168.202.68       1\n",
1317 | "dtype: int64"
1318 | ]
1319 | }
1320 | ],
1321 | "prompt_number": 29
1322 | },
1323 | {
1324 | "cell_type": "code",
1325 | "collapsed": false,
1326 | "input": [
1327 | "http_df[http_df['id.resp_p'] != 80]['id.resp_p'].value_counts()"
1328 | ],
1329 | "language": "python",
1330 | "metadata": {},
1331 | "outputs": [
1332 | {
1333 | "metadata": {},
1334 | "output_type": "pyout",
1335 | "prompt_number": 30,
1336 | "text": [
1337 | "3128    219\n",
1338 | "8080    190\n",
1339 | "8000    132\n",
1340 | "5488     52\n",
1341 | "5357     13\n",
1342 | "dtype: int64"
1343 | ]
1344 | }
1345 | ],
1346 | "prompt_number": 30
1347 | }
1348 | ],
1349 | "metadata": {}
1350 | }
1351 | ]
1352 | }
1353 | --------------------------------------------------------------------------------
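Note that this notebook targets Python 2-era IPython (print statements, xrange). A hypothetical Python 3 port of the sampling cell, for readers following along with a modern interpreter:

```python
import random

logfile = 'conn.log'
sample_percent = 0.01

# count the lines, then draw ~1% of the line indices at random
with open(logfile) as fh:
    num_lines = sum(1 for _ in fh)
slines = set(random.sample(range(num_lines), int(num_lines * sample_percent)))
print("%s lines in %s, using a sample of %s lines" % (num_lines, logfile, len(slines)))
```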