├── datasets
│   ├── sample.flow
│   └── ctu-13.sh
├── README.md
├── scripts
│   └── csvtomtx.sh
└── ANALYZE.md
--------------------------------------------------------------------------------
/datasets/sample.flow:
--------------------------------------------------------------------------------
Date flow start Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
2010-09-01 00:00:00.459 0.000 UDP 127.0.0.1:24920 -> 192.168.0.1:22126 1 46 1
2010-09-01 00:00:00.363 0.000 UDP 192.168.0.1:22126 -> 127.0.0.1:24920 1 80 1
--------------------------------------------------------------------------------
/datasets/ctu-13.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# THE CTU-13 DATASET: A LABELED DATASET WITH BOTNET, NORMAL AND BACKGROUND TRAFFIC.
# CTU-13 is a dataset of botnet traffic that was captured at CTU University,
# Czech Republic, in 2011. The goal of the dataset was to obtain a large capture
# of real botnet traffic mixed with normal traffic and background traffic. The
# CTU-13 dataset consists of thirteen captures (called scenarios) of different
# botnet samples; in each scenario a specific malware was executed, which used
# several protocols and performed different actions.

wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/CTU-13-Dataset.tar.bz2
tar xvjf CTU-13-Dataset.tar.bz2
echo "Convert PCAP to CSV: tshark -r test.pcap -T fields -e eth.src -e eth.dst"

# "An empirical comparison of botnet detection methods." Sebastian Garcia,
# Martin Grill, Honza Stiborek and Alejandro Zunino. Computers and Security
# Journal, Elsevier. 2014. Vol 45, pp 100-123.
# http://dx.doi.org/10.1016/j.cose.2014.05.011
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Netflow
Cybersecurity: Graph Processing using Gunrock.

## What is NetFlow?
NetFlow is a traffic-profiling technology that defines how a router exports statistics about the socket pairs it routes. When a network administrator enables NetFlow export on a router interface, traffic statistics for packets received on that interface are aggregated into "flows" and stored in a dynamic flow cache.

## What is a flow?
A flow is defined as a unidirectional sequence of packets between two endpoints (which means there are two flows for each connection session: one from server to client and one from client to server). A flow is identified by seven key fields: source IP address, destination IP address, source port number, destination port number, protocol type, type of service, and the router input interface. Each time a router receives a packet, it inspects these seven fields and makes a decision: if the packet belongs to an existing flow, the traffic statistics of that flow are updated; otherwise, a new flow entry is created.
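To make the seven-field lookup concrete, here is a minimal offline sketch (not part of this repo's toolchain): it groups packets exported by `tshark` into unidirectional flow keys and counts packets per flow. The capture name `test.pcap` is a placeholder, and only TCP ports are extracted for brevity.

```
# Sketch: approximate flow aggregation from a packet capture.
# Each unique (src addr, src port, dst addr, dst port, proto) tuple
# is treated as one unidirectional flow; packets per flow are counted.
tshark -r test.pcap -T fields -E separator=, \
    -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e ip.proto |
awk -F, '$2 != "" { pkts[$0]++ } END { for (k in pkts) print pkts[k], k }' |
sort -rn
```

An exported flow listing, like the sample below, then shows one record per unidirectional flow (here, the two halves of a single UDP exchange):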
```
Date flow start Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
2010-09-01 00:00:00.459 0.000 UDP 127.0.0.1:24920 -> 192.168.0.1:22126 1 46 1
2010-09-01 00:00:00.363 0.000 UDP 192.168.0.1:22126 -> 127.0.0.1:24920 1 80 1
```

## Analysis Methods
* Top N and Baseline
  * Top N session
  * Top N data
* Pattern Matching
  * Port matching
  * IP address matching

## Sources
* [Inter Projekt - NetFlow](https://pliki.ip-sa.pl/wiki/Wiki.jsp?page=NetFlow)
* [Detecting Worms and Abnormal Activities with NetFlow](https://www.symantec.com/connect/articles/detecting-worms-and-abnormal-activities-netflow-part-1)
--------------------------------------------------------------------------------
/scripts/csvtomtx.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# Convert a two-column (src,dst) CSV -- e.g. the output of:
#   tshark -r botnet-capture-20110810-neris.pcap -T fields -e ip.src -e ip.dst
# -- into a MatrixMarket edge list (.mtx) plus a node dictionary (.dict).

FILENAME=${1:-"botnet-capture-20110810-neris.csv"}
EXT=${FILENAME##*.}
DATASET=$(echo "$FILENAME" | cut -f 1 -d '.')
TAB=$(printf '\t')

OUTPUT=./output
mkdir -p ${OUTPUT}

# Normalize the CSV to tab-separated records and sort by the source column.
sed 's/,/\t/g' ${DATASET}.${EXT} > ${OUTPUT}/${DATASET}.tab
sort -k1 -n -t "$TAB" ${OUTPUT}/${DATASET}.tab > ${OUTPUT}/${DATASET}.sorted

# Collect every endpoint (sources and destinations) into one node list;
# note the append (>>) for the second column so the first pass is not clobbered.
awk '{print $1}' ${OUTPUT}/${DATASET}.sorted > ${OUTPUT}/${DATASET}.nodes
awk '{print $2}' ${OUTPUT}/${DATASET}.sorted >> ${OUTPUT}/${DATASET}.nodes

# Drop blank lines, deduplicate, and assign each node a 1-based integer id.
sed '/^\s*$/d' ${OUTPUT}/${DATASET}.nodes > ${OUTPUT}/${DATASET}.clean
sort ${OUTPUT}/${DATASET}.clean | uniq > ${OUTPUT}/${DATASET}.unique
awk '{printf("%d %s\n", NR, $0)}' ${OUTPUT}/${DATASET}.unique > ${OUTPUT}/${DATASET}.dict

# Build the edge list and map both endpoints through the id dictionary.
awk '{print $1"\t"$2}' ${OUTPUT}/${DATASET}.tab > ${OUTPUT}/${DATASET}.proc
sed '/^\s*$/d' ${OUTPUT}/${DATASET}.proc > ${OUTPUT}/${DATASET}.mm
awk 'NR==FNR{ a[$2]=$1; next }{ $1=a[$1]; $2=a[$2] }1' ${OUTPUT}/${DATASET}.dict ${OUTPUT}/${DATASET}.mm > ${OUTPUT}/${DATASET}.mtx

# Prepend the MatrixMarket header. The prepends run bottom-up, so the
# %%MatrixMarket banner ends up on the first line, as the format requires.
NODES=$(wc -l < ${OUTPUT}/${DATASET}.dict)
EDGES=$(wc -l < ${OUTPUT}/${DATASET}.mtx)
sed -i '1s/^/'"${NODES} ${NODES} ${EDGES}"'\n/' ${OUTPUT}/${DATASET}.mtx
sed -i '1s/^/%-------------------------------------------------------------------------------\n/' ${OUTPUT}/${DATASET}.mtx
sed -i '1s/^/%%MatrixMarket matrix coordinate pattern general\n/' ${OUTPUT}/${DATASET}.mtx

printf 'CSV to MTX Conversion:\n'
printf '  MatrixMarket: %s\n' "${OUTPUT}/${DATASET}.mtx"
printf '  Dictionary:   %s\n' "${OUTPUT}/${DATASET}.dict"
printf '  Raw:          %s\n' "${OUTPUT}/${DATASET}.mm"
--------------------------------------------------------------------------------
/ANALYZE.md:
--------------------------------------------------------------------------------
# How to efficiently analyze flow data?
There are several interesting approaches to studying flow data: Top N and Baseline, Pattern Matching, and so on. Whichever approach we take, we first have to extract useful records from the huge number of flow records available to us.

Our approach is to create a table-like structure that maps each IP address to an integer `id`. These ids are also stored in a Matrix Market file so that a graph (in CSR, CSC, or COO format) can easily be constructed using [gunrock](https://github.com/gunrock/gunrock).
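As a concrete illustration (constructed here, not shipped with the repo), feeding the two sample flows from the README through `scripts/csvtomtx.sh` as a two-column CSV would produce a dictionary and a Matrix Market file along these lines (file names are illustrative):

```
$ cat sample.csv
127.0.0.1,192.168.0.1
192.168.0.1,127.0.0.1

$ ./scripts/csvtomtx.sh sample.csv

$ cat output/sample.dict
1 127.0.0.1
2 192.168.0.1

$ cat output/sample.mtx
%%MatrixMarket matrix coordinate pattern general
%-------------------------------------------------------------------------------
2 2 2
1 2
2 1
```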
## Top N and Baseline

> A baseline is a model describing what 'normal' network activity is according to some historical traffic pattern; all traffic that falls outside the scope of this established traffic pattern will be flagged as anomalous.

> Trend and baseline analysis, commonly referred to as Top N and Baseline Analysis, is the most common and basic method of doing flow-based analysis. With this approach, attention is paid to flow records that have some "special high volume" characteristic, especially values of flow fields that deviate significantly from an established historical baseline.

Once the `.table` and `.mtx` files are created, Top N analysis is trivial to implement. If a single IP address accounts for an abnormally high volume of sessions, it can be flagged as a compromised host (or a host under attack). Similarly, when conducting Top N data analysis, we simply need to look in the `.table` file for hosts producing abnormally high amounts of data. All of these statistics are tracked in the table structure, so the problem boils down to sorting the data and extracting the top N results.

## Pattern Matching

> Pattern matching is another method we can use to identify abnormal network activities when doing flow-based analysis. With this method, the flow records are searched and those hosts associated with flow fields that look "suspicious" according to our criteria are flagged.

Once again, looking at the `.table` file, specific functional ports can be extracted (some common ones are 1434 for the SQL Slammer worm and 12345 for the NetBus trojan), as the sketch below illustrates. Similarly, IP addresses can be matched by sorting the `.table` file and comparing entries against the Internet Assigned Numbers Authority (IANA) database.
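A minimal port-matching sketch, assuming nfdump-style text records laid out like `datasets/sample.flow` (the watchlist and the field position are tied to that layout and are illustrative only):

```
# Flag flows whose destination port is on a small watchlist
# (1434: SQL Slammer, 12345: NetBus). For the sample.flow layout,
# awk field $7 is the "Dst IP Addr:Port" column.
PORTS="1434|12345"
awk -v ports="$PORTS" 'NR > 1 {
    split($7, dst, ":")                   # "192.168.0.1:22126" -> addr, port
    if (dst[2] ~ ("^(" ports ")$"))
        print "suspicious:", $0
}' datasets/sample.flow
```
--------------------------------------------------------------------------------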