├── datasets
│   ├── sample.flow
│   └── ctu-13.sh
├── README.md
├── scripts
│   └── csvtomtx.sh
└── ANALYZE.md
--------------------------------------------------------------------------------
/datasets/sample.flow:
--------------------------------------------------------------------------------
Date flow start Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
2010-09-01 00:00:00.459 0.000 UDP 127.0.0.1:24920 -> 192.168.0.1:22126 1 46 1
2010-09-01 00:00:00.363 0.000 UDP 192.168.0.1:22126 -> 127.0.0.1:24920 1 80 1
--------------------------------------------------------------------------------
/datasets/ctu-13.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# THE CTU-13 DATASET: A LABELED DATASET WITH BOTNET, NORMAL AND BACKGROUND TRAFFIC.
# CTU-13 is a dataset of botnet traffic that was captured at CTU University,
# Czech Republic, in 2011. The goal of the dataset was to obtain a large capture
# of real botnet traffic mixed with normal traffic and background traffic. The
# CTU-13 dataset consists of thirteen captures (called scenarios) of different
# botnet samples; in each scenario a specific malware was executed, which used
# several protocols and performed different actions.

wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/CTU-13-Dataset.tar.bz2
tar xvjf CTU-13-Dataset.tar.bz2
echo "Convert PCAP to CSV: tshark -r test.pcap -T fields -e eth.src -e eth.dst"

# "An empirical comparison of botnet detection methods." Sebastian Garcia,
# Martin Grill, Honza Stiborek and Alejandro Zunino. Computers and Security
# Journal, Elsevier. 2014. Vol 45, pp 100-123.
# http://dx.doi.org/10.1016/j.cose.2014.05.011
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Netflow
Cybersecurity: Graph Processing using Gunrock.

## What is NetFlow?
NetFlow is a traffic-profiling technology that defines how a router exports statistics about the socket pairs it routes. When a network administrator enables NetFlow export on a router interface, traffic statistics for packets received on that interface are aggregated into "flows" and stored in a dynamic flow cache.

## What is a flow?
A flow is defined as a unidirectional sequence of packets between two endpoints (which means there are two flows for each connection session: one from server to client and one from client to server). A flow is identified by seven key fields: source IP address, destination IP address, source port number, destination port number, protocol type, type of service, and the router input interface. Each time a router receives a packet, it inspects these seven fields and makes a decision: if the packet belongs to an existing flow, the traffic statistics of that flow are updated; otherwise, a new flow entry is created.
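To make the seven-field lookup concrete, here is a minimal offline sketch (not part of this repo's toolchain): it groups packets exported by `tshark` into unidirectional flow keys and counts packets per flow. The capture name `test.pcap` is a placeholder, and only TCP ports are extracted for brevity.

```
# Sketch: approximate flow aggregation from a packet capture.
# Each unique (src addr, src port, dst addr, dst port, proto) tuple
# is treated as one unidirectional flow; packets per flow are counted.
tshark -r test.pcap -T fields -E separator=, \
    -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e ip.proto |
awk -F, '$2 != "" { pkts[$0]++ } END { for (k in pkts) print pkts[k], k }' |
sort -rn
```

An exported flow listing, like the sample below, then shows one record per unidirectional flow (here, the two halves of a single UDP exchange):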
```
Date flow start Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
2010-09-01 00:00:00.459 0.000 UDP 127.0.0.1:24920 -> 192.168.0.1:22126 1 46 1
2010-09-01 00:00:00.363 0.000 UDP 192.168.0.1:22126 -> 127.0.0.1:24920 1 80 1
```

## Analysis Methods
* Top N and Baseline
  * Top N session
  * Top N data
* Pattern Matching
  * Port matching
  * IP address matching

## Sources
* [Inter Projekt - NetFlow](https://pliki.ip-sa.pl/wiki/Wiki.jsp?page=NetFlow)
* [Detecting Worms and Abnormal Activities with NetFlow](https://www.symantec.com/connect/articles/detecting-worms-and-abnormal-activities-netflow-part-1)
--------------------------------------------------------------------------------
/scripts/csvtomtx.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# Convert a two-column (src,dst) CSV -- e.g. the output of:
#   tshark -r botnet-capture-20110810-neris.pcap -T fields -e ip.src -e ip.dst
# -- into a MatrixMarket edge list (.mtx) plus a node dictionary (.dict).

FILENAME=${1:-"botnet-capture-20110810-neris.csv"}
EXT=${FILENAME##*.}
DATASET=$(echo "$FILENAME" | cut -f 1 -d '.')
TAB=$(printf '\t')

OUTPUT=./output
mkdir -p ${OUTPUT}

# Normalize the CSV to tab-separated records and sort by the source column.
sed 's/,/\t/g' ${DATASET}.${EXT} > ${OUTPUT}/${DATASET}.tab
sort -k1 -n -t "$TAB" ${OUTPUT}/${DATASET}.tab > ${OUTPUT}/${DATASET}.sorted

# Collect every endpoint (sources and destinations) into one node list;
# note the append (>>) for the second column so the first pass is not clobbered.
awk '{print $1}' ${OUTPUT}/${DATASET}.sorted > ${OUTPUT}/${DATASET}.nodes
awk '{print $2}' ${OUTPUT}/${DATASET}.sorted >> ${OUTPUT}/${DATASET}.nodes

# Drop blank lines, deduplicate, and assign each node a 1-based integer id.
sed '/^\s*$/d' ${OUTPUT}/${DATASET}.nodes > ${OUTPUT}/${DATASET}.clean
sort ${OUTPUT}/${DATASET}.clean | uniq > ${OUTPUT}/${DATASET}.unique
awk '{printf("%d %s\n", NR, $0)}' ${OUTPUT}/${DATASET}.unique > ${OUTPUT}/${DATASET}.dict

# Build the edge list and map both endpoints through the id dictionary.
awk '{print $1"\t"$2}' ${OUTPUT}/${DATASET}.tab > ${OUTPUT}/${DATASET}.proc
sed '/^\s*$/d' ${OUTPUT}/${DATASET}.proc > ${OUTPUT}/${DATASET}.mm
awk 'NR==FNR{ a[$2]=$1; next }{ $1=a[$1]; $2=a[$2] }1' ${OUTPUT}/${DATASET}.dict ${OUTPUT}/${DATASET}.mm > ${OUTPUT}/${DATASET}.mtx

# Prepend the MatrixMarket header. The prepends run bottom-up, so the
# %%MatrixMarket banner ends up on the first line, as the format requires.
NODES=$(wc -l < ${OUTPUT}/${DATASET}.dict)
EDGES=$(wc -l < ${OUTPUT}/${DATASET}.mtx)
sed -i '1s/^/'"${NODES} ${NODES} ${EDGES}"'\n/' ${OUTPUT}/${DATASET}.mtx
sed -i '1s/^/%-------------------------------------------------------------------------------\n/' ${OUTPUT}/${DATASET}.mtx
sed -i '1s/^/%%MatrixMarket matrix coordinate pattern general\n/' ${OUTPUT}/${DATASET}.mtx

printf 'CSV to MTX Conversion:\n'
printf '  MatrixMarket: %s\n' "${OUTPUT}/${DATASET}.mtx"
printf '  Dictionary:   %s\n' "${OUTPUT}/${DATASET}.dict"
printf '  Raw:          %s\n' "${OUTPUT}/${DATASET}.mm"
--------------------------------------------------------------------------------
/ANALYZE.md:
--------------------------------------------------------------------------------
# How to efficiently analyze flow data?
There are several interesting approaches to studying flow data: Top N and Baseline, Pattern Matching, and so on. Whichever approach we take, we first have to extract useful records from the huge number of flow records available to us.

Our approach is to create a table-like structure that maps each IP address to an integer `id`. These ids are also stored in a Matrix Market file so that a graph (in CSR, CSC, or COO format) can easily be constructed using [gunrock](https://github.com/gunrock/gunrock).
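As a concrete illustration (constructed here, not shipped with the repo), feeding the two sample flows from the README through `scripts/csvtomtx.sh` as a two-column CSV would produce a dictionary and a Matrix Market file along these lines (file names are illustrative):

```
$ cat sample.csv
127.0.0.1,192.168.0.1
192.168.0.1,127.0.0.1

$ ./scripts/csvtomtx.sh sample.csv

$ cat output/sample.dict
1 127.0.0.1
2 192.168.0.1

$ cat output/sample.mtx
%%MatrixMarket matrix coordinate pattern general
%-------------------------------------------------------------------------------
2 2 2
1 2
2 1
```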
## Top N and Baseline

> A baseline is a model describing what 'normal' network activity is according to some historical traffic pattern; all traffic that falls outside the scope of this established traffic pattern will be flagged as anomalous.

> Trend and baseline analysis, commonly referred to as Top N and Baseline Analysis, is the most common and basic method of doing flow-based analysis. With this approach, attention is paid to flow records that have some "special high volume" characteristic, especially values of flow fields that deviate significantly from an established historical baseline.

Once the `.table` and `.mtx` files are created, Top N analysis is trivial to implement. If a single IP address accounts for an abnormally high volume of sessions, it can be flagged as a compromised host (or a host under attack). Similarly, when conducting Top N data analysis, we simply need to look in the `.table` file for hosts producing abnormally high amounts of data. All of these statistics are tracked in the table structure, so the problem boils down to sorting the data and extracting the top N results.

## Pattern Matching

> Pattern matching is another method we can use to identify abnormal network activities when doing flow-based analysis. With this method, the flow records are searched and those hosts associated with flow fields that look "suspicious" according to our criteria are flagged.

Once again, looking at the `.table` file, specific functional ports can be extracted (some common ones are 1434 for the SQL Slammer worm and 12345 for the NetBus trojan), as the sketch below illustrates. Similarly, IP addresses can be matched by sorting the `.table` file and comparing entries against the Internet Assigned Numbers Authority (IANA) database.
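A minimal port-matching sketch, assuming nfdump-style text records laid out like `datasets/sample.flow` (the watchlist and the field position are tied to that layout and are illustrative only):

```
# Flag flows whose destination port is on a small watchlist
# (1434: SQL Slammer, 12345: NetBus). For the sample.flow layout,
# awk field $7 is the "Dst IP Addr:Port" column.
PORTS="1434|12345"
awk -v ports="$PORTS" 'NR > 1 {
    split($7, dst, ":")                   # "192.168.0.1:22126" -> addr, port
    if (dst[2] ~ ("^(" ports ")$"))
        print "suspicious:", $0
}' datasets/sample.flow
```
--------------------------------------------------------------------------------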