├── .gitignore ├── CHANGES ├── LICENSE ├── Readme.md ├── docker ├── .env ├── data │ └── .empty ├── docker-compose.yml └── zeek2es │ ├── Dockerfile │ └── entrypoint.sh ├── images ├── kibana-aggregation.png ├── kibana-map.png ├── kibana-subnet-search.png ├── kibana-timeseries.png ├── kibana.png └── multi-log-correlation.png ├── process_log.sh ├── process_logs_as_datastream.sh ├── process_logs_to_stdout.sh ├── setup.py └── zeek2es.py /.gitignore: -------------------------------------------------------------------------------- 1 | build/ 2 | *.so 3 | *.c 4 | .DS_Store 5 | docker/data 6 | -------------------------------------------------------------------------------- /CHANGES: -------------------------------------------------------------------------------- 1 | v0.3.15 Improved Humio import. 2 | v0.3.14 Removed a print statement. 3 | v0.3.13 Fixed some errors on Humio import. 4 | v0.3.12 Will continue to populate data after a Humio error. 5 | v0.3.11 Added Humio support. 6 | v0.3.10 Improved Docker components. 7 | v0.3.9 Fixed a variable check when there is no output. 8 | v0.3.8 Fixed up some minor issues with JSON stdout output. 9 | v0.3.7 Added Docker pieces. 10 | v0.3.6 Fixed a bug with the slash on the end of the ES url option. 11 | v0.3.5 Removed need for trailing slash on ES URL. 12 | v0.3.4 Made datastream names consistent with ES expectations if -d is used without an index name. 13 | v0.3.3 Added best compression option and fixed helper script. 14 | v0.3.2 Fixed a bug with a grep command. 15 | v0.3.1 Added more logic to make ready for Elastic v8. 16 | v0.3.0 Added filtering on keys. Cleaned up some argparse logic, breaking previous command lines. 17 | v0.2.20 Fix wording. 18 | v0.2.19 Fix a bug in a helper script. 19 | v0.2.18 Added the -p command line argument to split additional fields. 20 | v0.2.17 Fixed various things in the help scripts. Refactor. 21 | v0.2.16 Fixed a typo in a helper script. 22 | v0.2.15 Refactor helper script. 23 | v0.2.14 Added a fswatch helper script. 24 | v0.2.13 Refactored the helper script. 25 | v0.2.12 Added a supporting shell script for data streams. 26 | v0.2.11 Fixed a mapping issue with data streams. 27 | v0.2.10 Fixed help screen output. 28 | v0.2.9 Added hashdates option to use random hashes instead of dates in indices. 29 | v0.2.8 Added lifecycle policy for shard size rollover. 30 | v0.2.7 Added data stream capability. 31 | v0.2.6 Added capability to output only certain fields. 32 | v0.2.5 Added Cython and Python lambda filtering capabilities. 33 | v0.2.4 Added error checking for empty field. 34 | v0.2.3 Added keyword sub field capabilities with -k option. 35 | Added more documentation to readme. 36 | v0.2.2 Added a split ingest pipeline on the "service" field. 37 | v0.2.1 Added ES pipeline capability, which allows for Geolocation on IP addresses. 38 | v0.2.0 Removed some index checking, made indices on log type and day to 39 | reduce the number of open indices. Remove state documents. 40 | Other odds and ends. Added @timestamp for ease. 41 | v0.1.16 Added JSON input support with -j. 42 | v0.1.15 Fix a bug with timezone translation. 43 | v0.1.14 Add timezone support. 44 | v0.1.13 Tune down the -l parameter. 45 | v0.1.12 Added origtime command line option. 46 | v0.1.11 Improvements to processing speed. 47 | v0.1.10 Add option to keep original times. 48 | v0.1.9 Remove stderr output from zeek-cut. 49 | v0.1.8 Added system name to log, if available. 50 | v0.1.7 Improved index name generation. 51 | v0.1.6 Get date from log rather than path. 
52 | v0.1.5 Added more debug output. 53 | v0.1.4 Added some error checking. 54 | v0.1.3 Added number of items processed to state document. 55 | v0.1.2 Added state information and --checkstate command line option. 56 | v0.1.1 Added file name to JSON documents. 57 | v0.1.0 Initial release. 58 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2021, Corelight, Inc. All rights reserved. 2 | 3 | Redistribution and use in source and binary forms, with or without 4 | modification, are permitted provided that the following conditions are 5 | met: 6 | 7 | (1) Redistributions of source code must retain the above copyright 8 | notice, this list of conditions and the following disclaimer. 9 | 10 | (2) Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in 12 | the documentation and/or other materials provided with the 13 | distribution. 14 | 15 | (3) Neither the name of Corelight nor the names of any contributors 16 | may be used to endorse or promote products derived from this 17 | software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 20 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 21 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 22 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 23 | OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 24 | SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 25 | LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 26 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 27 | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 28 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # zeek2es.py 2 | 3 | This Python application translates [Zeek's](https://zeek.org/) ASCII TSV and JSON 4 | logs into [ElasticSearch's bulk load JSON format](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html#add-multiple-documents). 5 | 6 | ## Table of Contents: 7 | - [Introduction](#introduction) 8 | - [Installation](#installation) 9 | - [Elastic v8.0+](#elastic80) 10 | - [Docker](#docker) 11 | - [Upgrading zeek2es](#upgradingzeek2es) 12 | - [ES Ingest Pipeline](#esingestpipeline) 13 | - [Filtering Data](#filteringdata) 14 | - [Python Filters](#pythonfilters) 15 | - [Filter on Keys](#filteronkeys) 16 | - [Command Line Examples](#commandlineexamples) 17 | - [Command Line Options](#commandlineoptions) 18 | - [Requirements](#requirements) 19 | - [Notes](#notes) 20 | - [Humio](#humio) 21 | - [JSON Log Input](#jsonloginput) 22 | - [Data Streams](#datastreams) 23 | - [Helper Scripts](#helperscripts) 24 | - [Cython](#cython) 25 | 26 | ## Introduction 27 | 28 | ![Kibana](images/kibana.png) 29 | 30 | Want to see multiple Zeek logs for the same connection ID (uid) 31 | or file ID (fuid)? 
Here are the hits from files.log, http.log, and 32 | conn.log for a single uid: 33 | 34 | ![Kibana](images/multi-log-correlation.png) 35 | 36 | You can perform subnet searching on Zeek's 'addr' type: 37 | 38 | ![Kibana Subnet Searching](images/kibana-subnet-search.png) 39 | 40 | You can create time series graphs, such as this NTP and HTTP graph: 41 | 42 | ![Kibana Time Series](images/kibana-timeseries.png) 43 | 44 | IP addresses can be geolocated with the `-g` command line option: 45 | 46 | ![Kibana Mapping](images/kibana-map.png) 47 | 48 | Aggregations are simple and quick: 49 | 50 | ![Kibana Aggregation](images/kibana-aggregation.png) 51 | 52 | This application will "just work" when Zeek log formats change. The logic reads 53 | the field names and associated types to set up the mappings correctly in 54 | ElasticSearch. 55 | 56 | This application will recognize gzipped or uncompressed logs. This application assumes 57 | you have ElasticSearch set up on your localhost at the default port. 58 | If you do not have ElasticSearch you can output the JSON to stdout with the `-s -b` command line options 59 | to process with the [jq application](https://stedolan.github.io/jq). 60 | 61 | You can add a keyword subfield to text fields with the `-k` command line option. This is useful 62 | for aggregations in Kibana. 63 | 64 | If Python is already on your system, there is nothing additional to copy over 65 | to your machine other than [Elasticsearch, Kibana](https://www.elastic.co/start) and [zeek2es.py](zeek2es.py), 66 | provided you already have the [requests](https://docs.python-requests.org/en/latest/) library installed. 67 | 68 | ## Installation 69 | 70 | Assuming you meet the [requirements](#requirements), there is none. You just 71 | copy [zeek2es.py](zeek2es.py) to your host and run it with Python. Once Zeek 72 | logs have been imported with automatic index name generation (meaning, you did not supply the `-i` option) 73 | you will find your indices named "zeek_`zeeklogname`_`date`", where `zeeklogname` is a log name like `conn` 74 | and the `date` is in `YYYY-MM-DD` format. Set your Kibana index pattern to match `zeek*` in this case. If 75 | you named your index with the `-i` option, you will need to create a Kibana index pattern that 76 | matches your naming scheme. 77 | 78 | If you are upgrading zeek2es, please see [the section on upgrading zeek2es](#upgradingzeek2es). 79 | 80 | ### Elastic v8.0+ 81 | 82 | If you are using Elastic v8.0+, it has security enabled by default. This adds a requirement of a username 83 | and password, plus HTTPS. 84 | 85 | If you want to be able to delete indices/data streams with wildcards (as examples in this readme show), 86 | edit `elasticsearch.yml` with the following line: 87 | 88 | ``` 89 | action.destructive_requires_name: false 90 | ``` 91 | 92 | You will also need to change the curl commands in this readme to contain `-k -u elastic:` 93 | where the `elastic` user's password is set with a command like the following: 94 | 95 | ``` 96 | ./bin/elasticsearch-reset-password -u elastic -i 97 | ``` 98 | 99 | You can use `zeek2es.py` with the `--user` and `--passwd` command line options to specify your 100 | credentials to ES. You can also supply these options via the extra command line arguments for the helper 101 | scripts. 102 | 103 | ### Docker 104 | 105 | Probably the easiest way to use this code is through Docker. All of the files are in the `docker` directory.
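For reference, that directory is laid out as follows:

```
docker
├── .env
├── data
│   └── .empty
├── docker-compose.yml
└── zeek2es
    ├── Dockerfile
    └── entrypoint.sh
```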
106 | First, you will want to edit the lines with `CHANGEME!!!` in the `.env` file to fit your environment. 107 | You will also need to edit the Elastic password in `docker/zeek2es/entrypoint.sh` to match. It can be found after the `--passwd` option. 108 | Next, you can change directory into the `docker` directory and type the following commands to bring 109 | up a zeek2es and Elasticsearch cluster: 110 | 111 | ``` 112 | docker-compose build 113 | docker-compose up 114 | ``` 115 | 116 | Now you can put logs in the `VOLUME_MOUNT/data/logs` directory (where `VOLUME_MOUNT` is the directory you set in the `.env` file). 117 | When logs are CREATED in this directory, zeek2es will begin processing them and pushing them into Elasticsearch. 118 | You can then log in to https://localhost:5601 with the `elastic` user and the password you set up in the `.env` file. 119 | By default there is a self-signed certificate, but you can change that if you edit the docker compose files. Once inside 120 | Kibana you will go to Stack Management->Data Views and create a data view for `logs*` with the timestamp field `@timestamp`. 121 | Now you will be able to go to Discover and start searching your logs! Your data is persistent in the `VOLUME_MOUNT/data` directory you set. 122 | If you would like to remove all data, just `rm -rf VOLUME_MOUNT/data`, substituting the directory you set into that remove command. 123 | The next time you start your cluster it will be brand new for more data. 124 | 125 | ## Upgrading zeek2es 126 | 127 | Most upgrades should be as simple as copying the newer [zeek2es.py](zeek2es.py) over 128 | the old one. In some cases, the ES ingest pipeline required for the `-g` command line option 129 | might change during an upgrade. Therefore, it is strongly recommended that you delete 130 | your [ingest pipeline](#esingestpipeline) before you run a new version of zeek2es.py. 131 | 132 | ### ES Ingest Pipeline 133 | 134 | If you need to [delete the "zeekgeoip" ES ingest pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/delete-pipeline-api.html) 135 | used to geolocate IP addresses with the `-g` command line option, you can either do it graphically 136 | through Kibana's Stack Management->Ingest Pipelines or this command will do it for you: 137 | 138 | ``` 139 | curl -X DELETE "localhost:9200/_ingest/pipeline/zeekgeoip?pretty" 140 | ``` 141 | 142 | Running this command is strongly recommended whenever you update your copy of zeek2es.py. 143 | 144 | ## Filtering Data 145 | 146 | ### Python Filters 147 | 148 | zeek2es provides filtering capabilities for your Zeek logs before they are stored in ElasticSearch. This 149 | functionality can be enabled with the `-a` or `-f` options. The filters are constructed from Python 150 | lambda functions, where the input is a Python dictionary representing the output JSON document. You can add a 151 | filter to only store connection logs where the `service` field is populated using the `-f` option with 152 | this lambda filter file: 153 | 154 | ``` 155 | lambda x: 'service' in x and len(x['service']) > 0 156 | ``` 157 | 158 | Or maybe you'd like to filter for connections that have more than 1,024 bytes total, with at least 1 byte coming from 159 | the destination: 160 | 161 | ``` 162 | lambda x: 'orig_ip_bytes' in x and 'resp_ip_bytes' in x and x['orig_ip_bytes'] + x['resp_ip_bytes'] > 1024 and x['resp_ip_bytes'] > 0 163 | ``` 164 |
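To apply a filter file, pass its path with the `-f` option. For example, assuming one of the lambdas above is saved to a file named `conn_filter.txt` (an arbitrary name here, though it matches the naming convention the helper scripts look for), a connection log could be imported through it like this:

```
python zeek2es.py conn.log.gz -f conn_filter.txt
```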
165 | Simpler lambda filters can be provided on the command line via the `-a` option. This filter will only store 166 | connection log entries where the originator IP address is part of the `192.0.0.0/8` network: 167 | 168 | ``` 169 | python zeek2es.py conn.log.gz -a "lambda x: 'id.orig_h' in x and ipaddress.ip_address(x['id.orig_h']) in ipaddress.ip_network('192.0.0.0/8')" 170 | ``` 171 | 172 | For power users, the `-f` option will allow you to define a full function (instead of Python's lambda functions) so you can write functions that 173 | span multiple lines. 174 | 175 | ### Filter on Keys 176 | 177 | In some instances you might want to pull data from one log that depends on another. An 178 | example would be finding all `ssl.log` rows that have a `uid` matching previously 179 | indexed rows from `conn.log`, or vice versa. You can filter by importing your 180 | `conn.log` files with the `-o uid uid.txt` command line. This will log all uids that were 181 | indexed to a file named `uid.txt`. Then, when you import your `ssl.log` files you will provide 182 | the `-e uid uid.txt` command line. This will only import SSL rows 183 | containing `uid` values that are in `uid.txt`, previously built from our import of `conn.log`. 184 | 185 | ## Command Line Examples 186 | 187 | ``` 188 | python zeek2es.py your_zeek_log.gz -i your_es_index_name 189 | ``` 190 | 191 | This script can be run in parallel on all connection logs, 10 at a time, with the following command: 192 | 193 | ``` 194 | find /some/dir -name "conn*.log.gz" | parallel -j 10 python zeek2es.py {1} :::: - 195 | ``` 196 | 197 | If you would like to automatically import all conn.log files as they are created in a directory, the following 198 | [fswatch](https://emcrisostomo.github.io/fswatch/) command will do that for you: 199 | 200 | ``` 201 | fswatch -m poll_monitor --event Created -r /data/logs/zeek/ | awk '/^.*\/conn.*\.log\.gz$/' | parallel -j 5 python ~/zeek2es.py {} -g -d 50 :::: - 202 | ``` 203 | 204 | If you have the jq command installed you can perform searches across all your logs for a common 205 | field like connection uid, even without ElasticSearch: 206 | 207 | ``` 208 | find /usr/local/var/logs -name "*.log.gz" -exec python ~/Source/zeek2es/zeek2es.py {} -s -b -z \; | jq -c '. | select(.uid=="CLbPij1vThLvQ2qDKh")' 209 | ``` 210 | 211 | You can use much more complex jq queries than this if you are familiar with jq. 212 | 213 | If you want to remove all of your Zeek data from ElasticSearch, this command will do it for you: 214 | 215 | ``` 216 | curl -X DELETE http://localhost:9200/zeek* 217 | ``` 218 | 219 | Since the indices have the date appended to them, you could 220 | delete Dec 31, 2021 with the following command: 221 | 222 | ``` 223 | curl -X DELETE http://localhost:9200/zeek_*_2021-12-31 224 | ``` 225 | 226 | You could delete all conn.log entries with this command: 227 | 228 | ``` 229 | curl -X DELETE http://localhost:9200/zeek_conn_* 230 | ``` 231 | 232 | ## Command Line Options 233 | 234 | ``` 235 | $ python zeek2es.py -h 236 | usage: zeek2es.py [-h] [-i ESINDEX] [-u ESURL] [--user USER] [--passwd PASSWD] 237 | [-l LINES] [-n NAME] [-k KEYWORDS [KEYWORDS ...]] 238 | [-a LAMBDAFILTER] [-f FILTERFILE] 239 | [-y OUTPUTFIELDS [OUTPUTFIELDS ...]] [-d DATASTREAM] 240 | [--compress] [-o fieldname filename] [-e fieldname filename] 241 | [-g] [-p SPLITFIELDS [SPLITFIELDS ...]] [-j] [-r] [-t] [-s] 242 | [-b] [--humio HUMIO HUMIO] [-c] [-w] [-z] 243 | filename 244 | 245 | Process Zeek ASCII logs into ElasticSearch. 246 | 247 | positional arguments: 248 | filename The Zeek log in *.log or *.gz format.
Include the full path. 249 | 250 | optional arguments: 251 | -h, --help show this help message and exit 252 | -i ESINDEX, --esindex ESINDEX 253 | The Elasticsearch index/data stream name. 254 | -u ESURL, --esurl ESURL 255 | The Elasticsearch URL. Use ending slash. Use https for Elastic v8+. (default: http://localhost:9200) 256 | --user USER The Elasticsearch user. (default: disabled) 257 | --passwd PASSWD The Elasticsearch password. Note this will put your password in this shell history file. (default: disabled) 258 | -l LINES, --lines LINES 259 | Lines to buffer for RESTful operations. (default: 10,000) 260 | -n NAME, --name NAME The name of the system to add to the index for uniqueness. (default: empty string) 261 | -k KEYWORDS [KEYWORDS ...], --keywords KEYWORDS [KEYWORDS ...] 262 | A list of text fields to add a keyword subfield. (default: service) 263 | -a LAMBDAFILTER, --lambdafilter LAMBDAFILTER 264 | A Python lambda function, when eval'd will filter your output JSON dict. (default: empty string) 265 | -f FILTERFILE, --filterfile FILTERFILE 266 | A Python function file, when eval'd will filter your output JSON dict. (default: empty string) 267 | -y OUTPUTFIELDS [OUTPUTFIELDS ...], --outputfields OUTPUTFIELDS [OUTPUTFIELDS ...] 268 | A list of fields to keep for the output. Must include ts. (default: empty string) 269 | -d DATASTREAM, --datastream DATASTREAM 270 | Instead of an index, use a data stream that will rollover at this many GB. 271 | Recommended is 50 or less. (default: 0 - disabled) 272 | --compress If a datastream is used, enable best compression. 273 | -o fieldname filename, --logkey fieldname filename 274 | A field to log to a file. Example: uid uid.txt. 275 | Will append to the file! Delete file before running if appending is undesired. 276 | This option can be called more than once. (default: empty - disabled) 277 | -e fieldname filename, --filterkeys fieldname filename 278 | A field to filter with keys from a file. Example: uid uid.txt. (default: empty string - disabled) 279 | -g, --ingestion Use the ingestion pipeline to do things like geolocate IPs and split services. Takes longer, but worth it. 280 | -p SPLITFIELDS [SPLITFIELDS ...], --splitfields SPLITFIELDS [SPLITFIELDS ...] 281 | A list of additional fields to split with the ingestion pipeline, if enabled. 282 | (default: empty string - disabled) 283 | -j, --jsonlogs Assume input logs are JSON. 284 | -r, --origtime Keep the numerical time format, not milliseconds as ES needs. 285 | -t, --timestamp Keep the time in timestamp format. 286 | -s, --stdout Print JSON to stdout instead of sending to Elasticsearch directly. 287 | -b, --nobulk Remove the ES bulk JSON header. Requires --stdout. 288 | --humio HUMIO HUMIO First argument is the Humio URL, the second argument is the ingest token. 289 | -c, --cython Use Cython execution by loading the local zeek2es.so file through an import. 290 | Run python setup.py build_ext --inplace first to make your zeek2es.so file! 291 | -w, --hashdates Use hashes instead of dates for the index name. 292 | -z, --supresswarnings 293 | Supress any type of warning. Die stoically and silently. 
294 | 295 | To delete indices: 296 | 297 | curl -X DELETE http://localhost:9200/zeek*?pretty 298 | 299 | To delete data streams: 300 | 301 | curl -X DELETE http://localhost:9200/_data_stream/zeek*?pretty 302 | 303 | To delete index templates: 304 | 305 | curl -X DELETE http://localhost:9200/_index_template/zeek*?pretty 306 | 307 | To delete the lifecycle policy: 308 | 309 | curl -X DELETE http://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty 310 | 311 | You will need to add -k -u elastic_user:password if you are using Elastic v8+. 312 | ``` 313 | 314 | ## Requirements 315 | 316 | - A Unix-like environment (macOS works!) 317 | - Python 318 | - [requests](https://docs.python-requests.org/en/latest/) Python library installed, such as with `pip`. 319 | 320 | ## Notes 321 | 322 | ### Humio 323 | 324 | To import your data into Humio you will need to set up a repository with the `corelight-json` parser. Obtain 325 | the ingest token for the repository and you can import your data with a command such as: 326 | 327 | ``` 328 | python3 zeek2es.py -s -b --humio http://localhost:8080 b005bf74-1ed3-4871-904f-9460a4687202 http.log 329 | ``` 330 | 331 | The URL should be in the format `http://yourserver:8080`; the rest of the path is added automatically by the 332 | `zeek2es.py` script for you. 333 | 334 | ### JSON Log Input 335 | 336 | Since Zeek JSON logs do not have type information like the ASCII TSV versions, only limited type information 337 | can be provided to ElasticSearch. You will notice this most for Zeek "addr" log fields that 338 | are not id$orig_h and id$resp_h, since the type information is not available to translate the field into 339 | ElasticSearch's "ip" type. Since address fields will not be of type "ip", you will not be able to use 340 | subnet searches, for example, like you could for the TSV logs. Saving Zeek logs in ASCII TSV 341 | format provides greater long-term flexibility. 342 | 343 | ### Data Streams 344 | 345 | You can use data streams instead of indices for large logs with the `-d` command line option. This 346 | option creates index templates beginning with `zeek_`. It also creates a lifecycle policy 347 | named `zeek-lifecycle-policy`. If you would like to delete all of your data streams, lifecycle policies, 348 | and index templates, these commands will do it for you: 349 | 350 | ``` 351 | curl -X DELETE http://localhost:9200/_data_stream/zeek*?pretty 352 | curl -X DELETE http://localhost:9200/_index_template/zeek*?pretty 353 | curl -X DELETE http://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty 354 | ``` 355 | 356 | ### Helper Scripts 357 | 358 | There are two scripts that will help you make your logs into data streams such as `logs-zeek-conn`. 359 | The first script is [process_logs_as_datastream.sh](process_logs_as_datastream.sh), which, given 360 | a list of logs and directories, will import them as data streams. The second script 361 | is [process_log.sh](process_log.sh), and it can be used to import logs 362 | one at a time. This script can also be used to monitor logs created in a directory with 363 | [fswatch](https://emcrisostomo.github.io/fswatch/). Both scripts have example command lines 364 | if you run them without any parameters. 365 | 366 | ``` 367 | $ ./process_logs_as_datastream.sh 368 | Usage: ./process_logs_as_datastream.sh NJOBS "ADDITIONAL_ARGS_TO_ZEEK2ES" "LIST_OF_LOGS_DELIMITED_BY_SPACES" DIR1 DIR2 ...
369 | 370 | Example: 371 | time ./process_logs_as_datastream.sh 16 "" "amqp bgp conn dce_rpc dhcp dns dpd files ftp http ipsec irc kerberos modbus modbus_register_change mount mqtt mysql nfs notice ntlm ntp ospf portmap radius reporter rdp rfb rip ripng sip smb_cmd smb_files smb_mapping smtp snmp socks ssh ssl stun syslog tunnel vpn weird wireguard x509" /usr/local/var/logs 372 | ``` 373 | 374 | ``` 375 | $ ./process_log.sh 376 | Usage: ./process_log.sh LOGFILENAME "ADDITIONAL_ARGS_TO_ZEEK2ES" 377 | 378 | Example: 379 | fswatch -m poll_monitor --event Created -r /data/logs/zeek | awk '/^.*\/(conn|dns|http)\..*\.log\.gz$/' | parallel -j 16 ./process_log.sh {} "" :::: - 380 | ``` 381 | 382 | You will need to edit these scripts and command lines according to your environment. 383 | 384 | Any file named after a log type, such as `conn_filter.txt`, in the `filter_file_dir` (by default your home directory) will be applied as a lambda 385 | filter file to the corresponding log input. This allows you to set up all of your filters in one directory and import multiple log files with 386 | that set of filters in one command with [process_logs_as_datastream.sh](process_logs_as_datastream.sh). 387 | 388 | The following lines should delete all Zeek data in ElasticSearch whether you use indices, 389 | data streams, or these helper scripts: 390 | 391 | ``` 392 | curl -X DELETE http://localhost:9200/zeek*?pretty 393 | curl -X DELETE http://localhost:9200/_data_stream/zeek*?pretty 394 | curl -X DELETE http://localhost:9200/_data_stream/logs-zeek*?pretty 395 | curl -X DELETE http://localhost:9200/_index_template/zeek*?pretty 396 | curl -X DELETE http://localhost:9200/_index_template/logs-zeek*?pretty 397 | curl -X DELETE http://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty 398 | ``` 399 | 400 | ... or if using Elastic v8+ ... 401 | 402 | ``` 403 | curl -X DELETE -k -u elastic:password https://localhost:9200/zeek*?pretty 404 | curl -X DELETE -k -u elastic:password https://localhost:9200/_data_stream/zeek*?pretty 405 | curl -X DELETE -k -u elastic:password https://localhost:9200/_data_stream/logs-zeek*?pretty 406 | curl -X DELETE -k -u elastic:password https://localhost:9200/_index_template/zeek*?pretty 407 | curl -X DELETE -k -u elastic:password https://localhost:9200/_index_template/logs-zeek*?pretty 408 | curl -X DELETE -k -u elastic:password https://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty 409 | ``` 410 | 411 | But to be able to do this in v8+ you will need to configure Elastic as described 412 | in the section [Elastic v8.0+](#elastic80). 413 | 414 | ### Cython 415 | 416 | If you'd like to try [Cython](https://cython.org/), you must run `python setup.py build_ext --inplace` 417 | first to generate your compiled file. You must do this every time you update zeek2es! -------------------------------------------------------------------------------- /docker/.env: -------------------------------------------------------------------------------- 1 | # Password for the 'elastic' user (at least 6 characters) CHANGEME!!! 2 | ELASTIC_PASSWORD=elastic 3 | 4 | # Password for the 'kibana_system' user (at least 6 characters) CHANGEME!!!
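# (kibana_system is the internal account Kibana uses to connect to Elasticsearch; you still log in to Kibana as the 'elastic' user)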
5 | KIBANA_PASSWORD=elasticANDkibana 6 | 7 | # Version of Elastic products 8 | STACK_VERSION=8.1.3 9 | 10 | # Set the cluster name 11 | CLUSTER_NAME=docker-cluster 12 | 13 | # Set to 'basic' or 'trial' to automatically start the 30-day trial 14 | LICENSE=basic 15 | #LICENSE=trial 16 | 17 | # Port to expose Elasticsearch HTTP API to the host 18 | ES_PORT=9200 19 | #ES_PORT=127.0.0.1:9200 20 | 21 | # Port to expose Kibana to the host 22 | KIBANA_PORT=5601 23 | #KIBANA_PORT=80 24 | 25 | # Increase or decrease based on the available host memory (in bytes) 26 | MEM_LIMIT=1073741824 27 | 28 | # Project namespace (defaults to the current folder name if not set) 29 | #COMPOSE_PROJECT_NAME=myproject 30 | 31 | # Where the data directory resides for volumes CHANGEME!!! 32 | VOLUME_MOUNT=./ -------------------------------------------------------------------------------- /docker/data/.empty: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/corelight/zeek2es/078b531dc27741e3dad26880f58aaee859e8721d/docker/data/.empty -------------------------------------------------------------------------------- /docker/docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: "2.2" 2 | 3 | services: 4 | setup: 5 | image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION} 6 | volumes: 7 | - certs:/usr/share/elasticsearch/config/certs 8 | user: "0" 9 | command: > 10 | bash -c ' 11 | if [ x${ELASTIC_PASSWORD} == x ]; then 12 | echo "Set the ELASTIC_PASSWORD environment variable in the .env file"; 13 | exit 1; 14 | elif [ x${KIBANA_PASSWORD} == x ]; then 15 | echo "Set the KIBANA_PASSWORD environment variable in the .env file"; 16 | exit 1; 17 | fi; 18 | if [ ! -f config/certs/ca.zip ]; then 19 | echo "Creating CA"; 20 | bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip; 21 | unzip config/certs/ca.zip -d config/certs; 22 | fi; 23 | if [ ! -f config/certs/certs.zip ]; then 24 | echo "Creating certs"; 25 | echo -ne \ 26 | "instances:\n"\ 27 | " - name: es01\n"\ 28 | " dns:\n"\ 29 | " - es01\n"\ 30 | " - localhost\n"\ 31 | " ip:\n"\ 32 | " - 127.0.0.1\n"\ 33 | " - name: es02\n"\ 34 | " dns:\n"\ 35 | " - es02\n"\ 36 | " - localhost\n"\ 37 | " ip:\n"\ 38 | " - 127.0.0.1\n"\ 39 | " - name: es03\n"\ 40 | " dns:\n"\ 41 | " - es03\n"\ 42 | " - localhost\n"\ 43 | " ip:\n"\ 44 | " - 127.0.0.1\n"\ 45 | " - name: kibana\n"\ 46 | " dns:\n"\ 47 | " - kibana\n"\ 48 | " - localhost\n"\ 49 | " ip:\n"\ 50 | " - 127.0.0.1\n"\ 51 | > config/certs/instances.yml; 52 | bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key; 53 | unzip config/certs/certs.zip -d config/certs; 54 | fi; 55 | echo "Setting file permissions" 56 | chown -R root:root config/certs; 57 | find . -type d -exec chmod 750 \{\} \;; 58 | find . 
-type f -exec chmod 640 \{\} \;; 59 | echo "Waiting for Elasticsearch availability"; 60 | until curl -s --cacert config/certs/ca/ca.crt https://es01:9200 | grep -q "missing authentication credentials"; do sleep 30; done; 61 | echo "Setting kibana_system password"; 62 | until curl -s -X POST --cacert config/certs/ca/ca.crt -u elastic:${ELASTIC_PASSWORD} -H "Content-Type: application/json" https://es01:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done; 63 | echo "All done!"; 64 | ' 65 | healthcheck: 66 | test: ["CMD-SHELL", "[ -f config/certs/es01/es01.crt ]"] 67 | interval: 1s 68 | timeout: 5s 69 | retries: 120 70 | container_name: "setup" 71 | 72 | es01: 73 | depends_on: 74 | setup: 75 | condition: service_healthy 76 | image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION} 77 | restart: "unless-stopped" 78 | volumes: 79 | - certs:/usr/share/elasticsearch/config/certs 80 | - ${VOLUME_MOUNT}/data/es01:/usr/share/elasticsearch/data 81 | ports: 82 | - ${ES_PORT}:9200 83 | environment: 84 | - node.name=es01 85 | - cluster.name=${CLUSTER_NAME} 86 | - cluster.initial_master_nodes=es01,es02,es03 87 | - discovery.seed_hosts=es02,es03 88 | - ELASTIC_PASSWORD=${ELASTIC_PASSWORD} 89 | - bootstrap.memory_lock=true 90 | - xpack.security.enabled=true 91 | - xpack.security.http.ssl.enabled=true 92 | - xpack.security.http.ssl.key=certs/es01/es01.key 93 | - xpack.security.http.ssl.certificate=certs/es01/es01.crt 94 | - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt 95 | - xpack.security.http.ssl.verification_mode=certificate 96 | - xpack.security.transport.ssl.enabled=true 97 | - xpack.security.transport.ssl.key=certs/es01/es01.key 98 | - xpack.security.transport.ssl.certificate=certs/es01/es01.crt 99 | - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt 100 | - xpack.security.transport.ssl.verification_mode=certificate 101 | - xpack.license.self_generated.type=${LICENSE} 102 | mem_limit: ${MEM_LIMIT} 103 | ulimits: 104 | memlock: 105 | soft: -1 106 | hard: -1 107 | healthcheck: 108 | test: 109 | [ 110 | "CMD-SHELL", 111 | "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'", 112 | ] 113 | interval: 10s 114 | timeout: 10s 115 | retries: 120 116 | container_name: "es01" 117 | 118 | es02: 119 | depends_on: 120 | - es01 121 | image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION} 122 | restart: "unless-stopped" 123 | volumes: 124 | - certs:/usr/share/elasticsearch/config/certs 125 | - ${VOLUME_MOUNT}/data/es02:/usr/share/elasticsearch/data 126 | environment: 127 | - node.name=es02 128 | - cluster.name=${CLUSTER_NAME} 129 | - cluster.initial_master_nodes=es01,es02,es03 130 | - discovery.seed_hosts=es01,es03 131 | - bootstrap.memory_lock=true 132 | - xpack.security.enabled=true 133 | - xpack.security.http.ssl.enabled=true 134 | - xpack.security.http.ssl.key=certs/es02/es02.key 135 | - xpack.security.http.ssl.certificate=certs/es02/es02.crt 136 | - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt 137 | - xpack.security.http.ssl.verification_mode=certificate 138 | - xpack.security.transport.ssl.enabled=true 139 | - xpack.security.transport.ssl.key=certs/es02/es02.key 140 | - xpack.security.transport.ssl.certificate=certs/es02/es02.crt 141 | - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt 142 | - xpack.security.transport.ssl.verification_mode=certificate 143 | - 
xpack.license.self_generated.type=${LICENSE} 144 | mem_limit: ${MEM_LIMIT} 145 | ulimits: 146 | memlock: 147 | soft: -1 148 | hard: -1 149 | healthcheck: 150 | test: 151 | [ 152 | "CMD-SHELL", 153 | "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'", 154 | ] 155 | interval: 10s 156 | timeout: 10s 157 | retries: 120 158 | container_name: "es02" 159 | 160 | es03: 161 | depends_on: 162 | - es02 163 | image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION} 164 | restart: "unless-stopped" 165 | volumes: 166 | - certs:/usr/share/elasticsearch/config/certs 167 | - ${VOLUME_MOUNT}/data/es03:/usr/share/elasticsearch/data 168 | environment: 169 | - node.name=es03 170 | - cluster.name=${CLUSTER_NAME} 171 | - cluster.initial_master_nodes=es01,es02,es03 172 | - discovery.seed_hosts=es01,es02 173 | - bootstrap.memory_lock=true 174 | - xpack.security.enabled=true 175 | - xpack.security.http.ssl.enabled=true 176 | - xpack.security.http.ssl.key=certs/es03/es03.key 177 | - xpack.security.http.ssl.certificate=certs/es03/es03.crt 178 | - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt 179 | - xpack.security.http.ssl.verification_mode=certificate 180 | - xpack.security.transport.ssl.enabled=true 181 | - xpack.security.transport.ssl.key=certs/es03/es03.key 182 | - xpack.security.transport.ssl.certificate=certs/es03/es03.crt 183 | - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt 184 | - xpack.security.transport.ssl.verification_mode=certificate 185 | - xpack.license.self_generated.type=${LICENSE} 186 | mem_limit: ${MEM_LIMIT} 187 | ulimits: 188 | memlock: 189 | soft: -1 190 | hard: -1 191 | healthcheck: 192 | test: 193 | [ 194 | "CMD-SHELL", 195 | "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'", 196 | ] 197 | interval: 10s 198 | timeout: 10s 199 | retries: 120 200 | container_name: "es03" 201 | 202 | kibana: 203 | depends_on: 204 | es01: 205 | condition: service_healthy 206 | es02: 207 | condition: service_healthy 208 | es03: 209 | condition: service_healthy 210 | image: docker.elastic.co/kibana/kibana:${STACK_VERSION} 211 | restart: "unless-stopped" 212 | volumes: 213 | - certs:/usr/share/kibana/config/certs 214 | - ${VOLUME_MOUNT}/data/kibana:/usr/share/kibana/data 215 | ports: 216 | - ${KIBANA_PORT}:5601 217 | environment: 218 | - SERVERNAME=kibana 219 | - ELASTICSEARCH_HOSTS=https://es01:9200 220 | - ELASTICSEARCH_USERNAME=kibana_system 221 | - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD} 222 | - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt 223 | - SERVER_SSL_ENABLED=true 224 | - SERVER_SSL_KEY=/usr/share/kibana/config/certs/kibana/kibana.key 225 | - SERVER_SSL_CERTIFICATE=/usr/share/kibana/config/certs/kibana/kibana.crt 226 | - SERVER_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt 227 | # - SERVER_SSL_PASSWORD=${KIBANA_CERT_PASSWORD} 228 | mem_limit: ${MEM_LIMIT} 229 | healthcheck: 230 | test: 231 | [ 232 | "CMD-SHELL", 233 | "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'", 234 | ] 235 | interval: 10s 236 | timeout: 10s 237 | retries: 120 238 | container_name: "kibana" 239 | 240 | zeek2es: 241 | build: 242 | context: ./zeek2es 243 | dockerfile: Dockerfile 244 | restart: "unless-stopped" 245 | depends_on: 246 | es01: 247 | condition: service_healthy 248 | es02: 249 | condition: service_healthy 250 | es03: 251 | condition: service_healthy 252 | command: > 253 | bash -c ' 254 | chmod 755 
/entrypoint.sh; 255 | /entrypoint.sh 256 | ' 257 | volumes: 258 | - ./zeek2es/entrypoint.sh:/entrypoint.sh 259 | - ${VOLUME_MOUNT}/data/logs:/logs 260 | tty: true 261 | container_name: "zeek2es" 262 | 263 | volumes: 264 | certs: 265 | driver: local -------------------------------------------------------------------------------- /docker/zeek2es/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ubuntu:jammy 2 | 3 | RUN apt-get -q update && \ 4 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 5 | curl \ 6 | fswatch \ 7 | geoipupdate \ 8 | git \ 9 | iproute2 \ 10 | jq \ 11 | less \ 12 | netcat \ 13 | net-tools \ 14 | parallel \ 15 | python3 \ 16 | python3-dev \ 17 | python3-pip \ 18 | python3-setuptools \ 19 | python3-wheel \ 20 | swig \ 21 | tcpdump \ 22 | tcpreplay \ 23 | termshark \ 24 | tshark \ 25 | vim \ 26 | wget \ 27 | zeek-aux && \ 28 | pip3 install --no-cache-dir pre-commit requests && \ 29 | curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.2.0-amd64.deb && \ 30 | dpkg -i filebeat-8.2.0-amd64.deb && \ 31 | rm filebeat-8.2.0-amd64.deb && \ 32 | apt-get clean && rm -rf /var/lib/apt/lists/* && rm -rf ~/.cache/pip 33 | 34 | # Install zeek2es 35 | RUN cd / && git clone https://github.com/corelight/zeek2es.git 36 | 37 | #COPY entrypoint.sh /entrypoint.sh 38 | #RUN chmod 755 /entrypoint.sh -------------------------------------------------------------------------------- /docker/zeek2es/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | fswatch -m poll_monitor --event Created -r /logs | parallel -j 3 python3 /zeek2es/zeek2es.py {} --compress -g -l 5000 -d 25 -u https://es01:9200 --user elastic --passwd elastic :::: - -------------------------------------------------------------------------------- /images/kibana-aggregation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/corelight/zeek2es/078b531dc27741e3dad26880f58aaee859e8721d/images/kibana-aggregation.png -------------------------------------------------------------------------------- /images/kibana-map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/corelight/zeek2es/078b531dc27741e3dad26880f58aaee859e8721d/images/kibana-map.png -------------------------------------------------------------------------------- /images/kibana-subnet-search.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/corelight/zeek2es/078b531dc27741e3dad26880f58aaee859e8721d/images/kibana-subnet-search.png -------------------------------------------------------------------------------- /images/kibana-timeseries.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/corelight/zeek2es/078b531dc27741e3dad26880f58aaee859e8721d/images/kibana-timeseries.png -------------------------------------------------------------------------------- /images/kibana.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/corelight/zeek2es/078b531dc27741e3dad26880f58aaee859e8721d/images/kibana.png -------------------------------------------------------------------------------- /images/multi-log-correlation.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/corelight/zeek2es/078b531dc27741e3dad26880f58aaee859e8721d/images/multi-log-correlation.png -------------------------------------------------------------------------------- /process_log.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Things you can set: 4 | zeek2es_path=~/Source/zeek2es/zeek2es.py 5 | filter_file_dir=~/ 6 | num_of_lines=50000 7 | logfiledelim=\\. 8 | stream_prepend="logs-zeek-" 9 | stream_ending="" 10 | pythoncmd="python3" 11 | zeek2esargs="-g -l $num_of_lines" 12 | 13 | # Error checking 14 | if [ "$#" -ne 2 ]; then 15 | echo "Usage: $0 LOGFILENAME \"ADDITIONAL_ARGS_TO_ZEEK2ES\"" >&2 16 | echo >&2 17 | echo "Example:" >&2 18 | echo " fswatch -m poll_monitor --event Created -r /data/logs/zeek | awk '/^.*\/(conn|dns|http)\..*\.log\.gz$/' | parallel -j 16 $0 {} \"\"" :::: - >&2 19 | exit 1 20 | fi 21 | 22 | # Things set from the command line 23 | logfile=$1 24 | additional_args=$2 25 | 26 | echo "Processing $logfile..." 27 | regex="s/.*\/\([^0-9\.]*\)$logfiledelim[0-9].*\.log\.gz/\1/" 28 | log_type=`echo $logfile | sed $regex` 29 | echo $log_type 30 | 31 | zeek2esargsplus=$zeek2esargs" -i $stream_prepend$log_type$stream_ending "$additional_args 32 | 33 | filterfile=$filter_file_dir$log_type"_filter.txt" 34 | 35 | if [ -f $filterfile ]; then 36 | echo " Using filter file "$filterfile 37 | $pythoncmd $zeek2es_path $logfile $zeek2esargsplus -f $filterfile 38 | else 39 | echo " No filter file found for "$filterfile 40 | $pythoncmd $zeek2es_path $logfile $zeek2esargsplus 41 | fi -------------------------------------------------------------------------------- /process_logs_as_datastream.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Things you can set: 4 | zeek2es_path=~/Source/zeek2es/zeek2es.py 5 | lognamedelim=\\. 6 | #zeek2es_path=~/zeek2es.py 7 | #lognamedelim=_2 8 | filter_file_dir=~/ 9 | num_of_lines=50000 10 | num_of_gb=50 11 | pythoncmd="python3" 12 | zeek2esargs="-g -l $num_of_lines" 13 | 14 | # Error checking 15 | if [ "$#" -lt 4 ]; then 16 | echo "Usage: $0 NJOBS \"ADDITIONAL_ARGS_TO_ZEEK2ES\" \"LIST_OF_LOGS_DELIMITED_BY_SPACES\" DIR1 DIR2 ..." >&2 17 | echo >&2 18 | echo "Example:" >&2 19 | echo " time $0 16 \"\" \"amqp bgp conn dce_rpc dhcp dns dpd files ftp http ipsec irc kerberos modbus modbus_register_change mount mqtt mysql nfs notice ntlm ntp ospf portmap radius reporter rdp rfb rip ripng sip smb_cmd smb_files smb_mapping smtp snmp socks ssh ssl stun syslog tunnel vpn weird wireguard x509\" /usr/local/var/logs" >&2 20 | exit 1 21 | fi 22 | 23 | # Things set from the command line 24 | njobs=$1 25 | additional_args=$2 26 | logs=$3 27 | logdirs=${@:4} 28 | 29 | # Iterate through the *.log.gz files in the supplied directory 30 | for val in $logs; do 31 | zeek2esargsplus=$zeek2esargs" --compress -d "$num_of_gb" "$additional_args 32 | echo "Processing $val logs..." 
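# Build the awk pattern matching this log type's compressed files under the given directories
# (with the default lognamedelim of '\.' this matches paths like .../<log>.<anything>.log.gz)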
33 | filename_re="/^.*\/"$val$lognamedelim".*\.log\.gz$/" 34 | 35 | filterfile=$filter_file_dir$val"_filter.txt" 36 | 37 | if [ -f $filterfile ]; then 38 | echo " Using filter file "$filterfile 39 | find $logdirs | awk $filename_re | parallel -j $njobs $pythoncmd $zeek2es_path {} $zeek2esargsplus -f $filterfile :::: - 40 | else 41 | echo " No filter file found for "$filterfile 42 | find $logdirs | awk $filename_re | parallel -j $njobs $pythoncmd $zeek2es_path {} $zeek2esargsplus :::: - 43 | fi 44 | done -------------------------------------------------------------------------------- /process_logs_to_stdout.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Things you can set: 4 | zeek2es_path=~/Source/zeek2es/zeek2es.py 5 | lognamedelim=\\. 6 | #zeek2es_path=~/zeek2es.py 7 | #lognamedelim=_2 8 | filter_file_dir=~/ 9 | num_of_lines=50000 10 | stream_prepend="logs-zeek-" 11 | stream_ending="" 12 | pythoncmd="python3" 13 | zeek2esargs="-s -b" 14 | 15 | # Error checking 16 | if [ "$#" -lt 4 ]; then 17 | echo "Usage: $0 NJOBS \"ADDITIONAL_ARGS_TO_ZEEK2ES\" \"LIST_OF_LOGS_DELIMITED_BY_SPACES\" DIR1 DIR2 ..." >&2 18 | echo >&2 19 | echo "Example:" >&2 20 | echo " time $0 16 \"\" \"amqp bgp conn dce_rpc dhcp dns dpd files ftp http ipsec irc kerberos modbus modbus_register_change mount mqtt mysql nfs notice ntlm ntp ospf portmap radius reporter rdp rfb rip ripng sip smb_cmd smb_files smb_mapping smtp snmp socks ssh ssl stun syslog tunnel vpn weird wireguard x509\" /usr/local/var/logs" >&2 21 | exit 1 22 | fi 23 | 24 | # Things set from the command line 25 | njobs=$1 26 | additional_args=$2 27 | logs=$3 28 | logdirs=${@:4} 29 | 30 | # Iterate through the *.log.gz files in the supplied directory 31 | for val in $logs; do 32 | zeek2esargsplus=$zeek2esargs" "$additional_args 33 | filename_re="/^.*\/"$val$lognamedelim".*\.log\.gz$/" 34 | 35 | filterfile=$filter_file_dir$val"_filter.txt" 36 | 37 | if [ -f $filterfile ]; then 38 | find $logdirs | awk $filename_re | parallel -j $njobs $pythoncmd $zeek2es_path {} $zeek2esargsplus -f $filterfile :::: - 39 | else 40 | find $logdirs | awk $filename_re | parallel -j $njobs $pythoncmd $zeek2es_path {} $zeek2esargsplus :::: - 41 | fi 42 | done -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | from Cython.Build import cythonize 3 | 4 | setup( 5 | ext_modules = cythonize("zeek2es.py") 6 | ) 7 | -------------------------------------------------------------------------------- /zeek2es.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import subprocess 3 | import json 4 | import csv 5 | import io 6 | import requests 7 | from requests.auth import HTTPBasicAuth 8 | from urllib3.exceptions import InsecureRequestWarning 9 | import datetime 10 | import re 11 | import argparse 12 | import random 13 | import time 14 | # Making these available for lambda filter input. 15 | import ipaddress 16 | import os 17 | 18 | # The number of bits to use in a random hash. 19 | hashbits = 128 20 | 21 | # Disable SSL warnings. 22 | requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning) 23 | 24 | # We do this to add a little extra help at the end. 
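# MyParser only overrides print_help() so the curl deletion examples below are appended to the standard argparse help output.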
25 | class MyParser(argparse.ArgumentParser): 26 | def print_help(self): 27 | super().print_help() 28 | print("") 29 | print("To delete indices:\n\n\tcurl -X DELETE http://localhost:9200/zeek*?pretty\n") 30 | print("To delete data streams:\n\n\tcurl -X DELETE http://localhost:9200/_data_stream/zeek*?pretty\n") 31 | print("To delete index templates:\n\n\tcurl -X DELETE http://localhost:9200/_index_template/zeek*?pretty\n") 32 | print("To delete the lifecycle policy:\n\n\tcurl -X DELETE http://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty\n") 33 | print("You will need to add -k -u elastic_user:password if you are using Elastic v8+.\n") 34 | 35 | # This takes care of arg parsing 36 | def parseargs(): 37 | parser = MyParser(description='Process Zeek ASCII logs into ElasticSearch.', formatter_class=argparse.RawTextHelpFormatter) 38 | parser.add_argument('filename', 39 | help='The Zeek log in *.log or *.gz format. Include the full path.') 40 | parser.add_argument('-i', '--esindex', help='The Elasticsearch index/data stream name.') 41 | parser.add_argument('-u', '--esurl', default="http://localhost:9200", help='The Elasticsearch URL. Use ending slash. Use https for Elastic v8+. (default: http://localhost:9200)') 42 | parser.add_argument('--user', default="", help='The Elasticsearch user. (default: disabled)') 43 | parser.add_argument('--passwd', default="", help='The Elasticsearch password. Note this will put your password in this shell history file. (default: disabled)') 44 | parser.add_argument('-l', '--lines', default=10000, type=int, help='Lines to buffer for RESTful operations. (default: 10,000)') 45 | parser.add_argument('-n', '--name', default="", help='The name of the system to add to the index for uniqueness. (default: empty string)') 46 | parser.add_argument('-k', '--keywords', nargs="+", default="service", help='A list of text fields to add a keyword subfield. (default: service)') 47 | parser.add_argument('-a', '--lambdafilter', default="", help='A Python lambda function, when eval\'d will filter your output JSON dict. (default: empty string)') 48 | parser.add_argument('-f', '--filterfile', default="", help='A Python function file, when eval\'d will filter your output JSON dict. (default: empty string)') 49 | parser.add_argument('-y', '--outputfields', nargs="+", default="", help='A list of fields to keep for the output. Must include ts. (default: empty string)') 50 | parser.add_argument('-d', '--datastream', default=0, type=int, help='Instead of an index, use a data stream that will rollover at this many GB.\nRecommended is 50 or less. (default: 0 - disabled)') 51 | parser.add_argument('--compress', action="store_true", help='If a datastream is used, enable best compression.') 52 | parser.add_argument('-o', '--logkey', nargs=2, action='append', metavar=('fieldname','filename'), default=[], help='A field to log to a file. Example: uid uid.txt. \nWill append to the file! Delete file before running if appending is undesired. \nThis option can be called more than once. (default: empty - disabled)') 53 | parser.add_argument('-e', '--filterkeys', nargs=2, metavar=('fieldname','filename'), default="", help='A field to filter with keys from a file. Example: uid uid.txt. (default: empty string - disabled)') 54 | parser.add_argument('-g', '--ingestion', action="store_true", help='Use the ingestion pipeline to do things like geolocate IPs and split services. 
Takes longer, but worth it.') 55 | parser.add_argument('-p', '--splitfields', nargs="+", default="", help='A list of additional fields to split with the ingestion pipeline, if enabled.\n(default: empty string - disabled)') 56 | parser.add_argument('-j', '--jsonlogs', action="store_true", help='Assume input logs are JSON.') 57 | parser.add_argument('-r', '--origtime', action="store_true", help='Keep the numerical time format, not milliseconds as ES needs.') 58 | parser.add_argument('-t', '--timestamp', action="store_true", help='Keep the time in timestamp format.') 59 | parser.add_argument('-s', '--stdout', action="store_true", help='Print JSON to stdout instead of sending to Elasticsearch directly.') 60 | parser.add_argument('-b', '--nobulk', action="store_true", help='Remove the ES bulk JSON header. Requires --stdout.') 61 | parser.add_argument('--humio', nargs=2, default="", help='First argument is the Humio URL, the second argument is the ingest token.') 62 | parser.add_argument('-c', '--cython', action="store_true", help='Use Cython execution by loading the local zeek2es.so file through an import.\nRun python setup.py build_ext --inplace first to make your zeek2es.so file!') 63 | parser.add_argument('-w', '--hashdates', action="store_true", help='Use hashes instead of dates for the index name.') 64 | parser.add_argument('-z', '--supresswarnings', action="store_true", help='Supress any type of warning. Die stoically and silently.') 65 | args = parser.parse_args() 66 | return args 67 | 68 | # A function to send data in bulk to ES. 69 | def sendbulk(args, outstring, es_index, filename): 70 | # Elastic username and password auth 71 | auth = None 72 | if (len(args['user']) > 0): 73 | auth = HTTPBasicAuth(args['user'], args['passwd']) 74 | 75 | if len(args['humio']) != 2: 76 | if not args['stdout']: 77 | esurl = args['esurl'][:-1] if args['esurl'].endswith('/') else args['esurl'] 78 | 79 | res = requests.put(esurl+'/_bulk', headers={'Content-Type': 'application/json'}, 80 | data=outstring.encode('UTF-8'), auth=auth, verify=False) 81 | if not res.ok: 82 | if not args['supresswarnings']: 83 | print("WARNING! PUT did not return OK! Your index {} is incomplete. Filename: {} Response: {} {}".format(es_index, filename, res, res.text)) 84 | else: 85 | print(outstring.strip()) 86 | else: 87 | # Send to Humio 88 | Headers = { "Authorization" : "Bearer "+args['humio'][1] } 89 | data = [{"messages" : outstring.strip().split('\n') }] 90 | while True: 91 | try: 92 | r = requests.post(args['humio'][0]+'/api/v1/ingest/humio-unstructured', headers=Headers, json=data) 93 | break 94 | except Exception as exc: 95 | if not args['supresswarnings']: 96 | print("WARNING, Humio error: {}".format(exc)) 97 | time.sleep(1) 98 | 99 | # A function to send the datastream info to ES. 
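# It creates the 'zeek-lifecycle-policy' ILM policy (rollover at the -d size in GB) and an index template
# named after the data stream, optionally enabling best_compression when --compress is given.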
100 | def senddatastream(args, es_index, mappings): 101 | # Elastic username and password auth 102 | auth = None 103 | if (len(args['user']) > 0): 104 | auth = HTTPBasicAuth(args['user'], args['passwd']) 105 | 106 | esurl = args['esurl'][:-1] if args['esurl'].endswith('/') else args['esurl'] 107 | 108 | lifecycle_policy = {"policy": {"phases": {"hot": {"actions": {"rollover": {"max_primary_shard_size": "{}GB".format(args['datastream'])}}}}}} 109 | res = requests.put(esurl+"/_ilm/policy/zeek-lifecycle-policy", headers={'Content-Type': 'application/json'}, 110 | data=json.dumps(lifecycle_policy).encode('UTF-8'), auth=auth, verify=False) 111 | index_template = {"index_patterns": [es_index], "data_stream": {}, "composed_of": [], "priority": 500, 112 | "template": {"settings": {"index.lifecycle.name": "zeek-lifecycle-policy"}, "mappings": mappings["mappings"]}} 113 | if (args['compress']): 114 | index_template["template"]["settings"]["index"] = {"codec": "best_compression"} 115 | res = requests.put(esurl+"/_index_template/"+es_index, headers={'Content-Type': 'application/json'}, 116 | data=json.dumps(index_template).encode('UTF-8'), auth=auth, verify=False) 117 | 118 | # A function to send mappings to ES. 119 | def sendmappings(args, es_index, mappings): 120 | # Elastic username and password auth 121 | auth = None 122 | if (len(args['user']) > 0): 123 | auth = HTTPBasicAuth(args['user'], args['passwd']) 124 | 125 | esurl = args['esurl'][:-1] if args['esurl'].endswith('/') else args['esurl'] 126 | 127 | res = requests.put(esurl+"/"+es_index, headers={'Content-Type': 'application/json'}, 128 | data=json.dumps(mappings).encode('UTF-8'), auth=auth, verify=False) 129 | 130 | # A function to send the ingest pipeline to ES. 131 | def sendpipeline(args, ingest_pipeline): 132 | # Elastic username and password auth 133 | auth = None 134 | if (len(args['user']) > 0): 135 | auth = HTTPBasicAuth(args['user'], args['passwd']) 136 | 137 | esurl = args['esurl'][:-1] if args['esurl'].endswith('/') else args['esurl'] 138 | 139 | res = requests.put(esurl+"/_ingest/pipeline/zeekgeoip", headers={'Content-Type': 'application/json'}, 140 | data=json.dumps(ingest_pipeline).encode('UTF-8'), auth=auth, verify=False) 141 | 142 | # Everything important is in here. 143 | def main(**args): 144 | 145 | # Takes care of the fields we want to output, if not all. 146 | outputfields = [] 147 | if (len(args['outputfields']) > 0): 148 | outputfields = args['outputfields'] 149 | 150 | # Takes care of logging keys to a file. 151 | logkeyfields = [] 152 | logkeys_fds = [] 153 | if (len(args['logkey']) > 0): 154 | for lk in args['logkey']: 155 | thefield, thefile = lk[0], lk[1] 156 | f = open(thefile, "a+") 157 | logkeyfields.append(thefield) 158 | logkeys_fds.append(f) 159 | 160 | # Takes care of loading keys from a file to use in a filter. 161 | filterkeys = set() 162 | filterkeys_field = None 163 | if (len(args['filterkeys']) > 0): 164 | filterkeys_field = args['filterkeys'][0] 165 | filterkeys_file = args['filterkeys'][1] 166 | with open(filterkeys_file, "r") as infile: 167 | filterkeys = set(infile.read().splitlines()) 168 | 169 | # This takes care of fields where we want to add the keyword field. 
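# (keyword subfields let Kibana aggregate on the exact values of these otherwise analyzed text fields)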
170 | keywords = [] 171 | if (len(args['keywords']) > 0): 172 | keywords = args['keywords'] 173 | 174 | # Error checking 175 | if args['esindex'] and args['stdout']: 176 | if not args['supresswarnings']: 177 | print("Cannot write to Elasticsearch and stdout at the same time.") 178 | exit(-1) 179 | 180 | # Error checking 181 | if args['nobulk'] and not args['stdout']: 182 | if not args['supresswarnings']: 183 | print("The nobulk option can only be used with the stdout option.") 184 | exit(-2) 185 | 186 | # Error checking 187 | if len(args['humio']) > 0 and (not args['stdout'] or not args['nobulk'] or args['timestamp']): 188 | if not args['supresswarnings']: 189 | print("The Humio option can only be used with the stdout and nobulk options, and cannot have the timestamp option.") 190 | exit(-5) 191 | 192 | # Error checking 193 | if not args['timestamp'] and args['origtime']: 194 | if not args['supresswarnings']: 195 | print("The origtime option can only be used with the timestamp option.") 196 | exit(-3) 197 | 198 | # Error checking 199 | if len(args['lambdafilter']) > 0 and len(args['filterfile']) > 0: 200 | if not args['supresswarnings']: 201 | print("The lambdafilter option cannot be used with the filterfile option.") 202 | exit(-7) 203 | 204 | # This takes care of loading the Python filters. 205 | filterfilter = None 206 | if len(args['lambdafilter']) > 0: 207 | filterfilter = eval(args['lambdafilter']) 208 | 209 | if len(args['filterfile']) > 0: 210 | with open(args['filterfile'], "r") as ff: 211 | filterfilter = eval(ff.read()) 212 | 213 | # The file we are processing. 214 | filename = args['filename'] 215 | 216 | # Detect if the log is compressed or not. 217 | if filename.split(".")[-1].lower() == "gz": 218 | # This works on Linux and MacOs 219 | zcat_name = ["gzip", "-d", "-c"] 220 | else: 221 | zcat_name = ["cat"] 222 | 223 | # Setup the ingest pipeline 224 | ingest_pipeline = {"description": "Zeek Log Ingestion Pipeline.", "processors": [ ]} 225 | 226 | if args['ingestion']: 227 | fields_to_split = [] 228 | if len(args['splitfields']) > 0: 229 | fields_to_split = args['splitfields'] 230 | ingest_pipeline["processors"] += [{"dot_expander": {"field": "*"}}] 231 | ingest_pipeline["processors"] += [{"split": {"field": "service", "separator": ",", "ignore_missing": True, "ignore_failure": True}}] 232 | for f in fields_to_split: 233 | ingest_pipeline["processors"] += [{"split": {"field": f, "separator": ",", "ignore_missing": True, "ignore_failure": True}}] 234 | ingest_pipeline["processors"] += [{"geoip": {"field": "id.orig_h", "target_field": "geoip_orig", "ignore_missing": True}}] 235 | ingest_pipeline["processors"] += [{"geoip": {"field": "id.resp_h", "target_field": "geoip_resp", "ignore_missing": True}}] 236 | 237 | # This section takes care of TSV logs. Skip ahead for the JSON logic. 238 | if not args['jsonlogs']: 239 | # Get the date 240 | 241 | zcat_process = subprocess.Popen(zcat_name+[filename], 242 | stdout=subprocess.PIPE) 243 | 244 | head_process = subprocess.Popen(['head'], 245 | stdin=zcat_process.stdout, 246 | stdout=subprocess.PIPE) 247 | 248 | grep_process = subprocess.Popen(['grep', '#open'], 249 | stdin=head_process.stdout, 250 | stdout=subprocess.PIPE) 251 | 252 | try: 253 | log_date = datetime.datetime.strptime(grep_process.communicate()[0].decode('UTF-8').strip().split('\t')[1], "%Y-%m-%d-%H-%M-%S") 254 | except: 255 | if not args['supresswarnings']: 256 | print("Date not found from Zeek log! 
{}".format(filename)) 257 | exit(-4) 258 | 259 | # Get the Zeek log path 260 | 261 | zcat_process = subprocess.Popen(zcat_name+[filename], 262 | stdout=subprocess.PIPE) 263 | 264 | head_process = subprocess.Popen(['head'], 265 | stdin=zcat_process.stdout, 266 | stdout=subprocess.PIPE) 267 | 268 | grep_process = subprocess.Popen(['grep', '#path'], 269 | stdin=head_process.stdout, 270 | stdout=subprocess.PIPE) 271 | 272 | zeek_log_path = grep_process.communicate()[0].decode('UTF-8').strip().split('\t')[1] 273 | 274 | # Build the ES index. 275 | if not args['esindex']: 276 | if args['datastream'] > 0: 277 | es_index = "logs-zeek-{}".format(zeek_log_path) 278 | else: 279 | sysname = "" 280 | if (len(args['name']) > 0): 281 | sysname = "{}_".format(args['name']) 282 | # We allow for hashes instead of dates in the index name. 283 | if not args['hashdates']: 284 | es_index = "zeek_"+sysname+"{}_{}".format(zeek_log_path, log_date.date()) 285 | else: 286 | es_index = "zeek_"+sysname+"{}_{}".format(zeek_log_path, random.getrandbits(hashbits)) 287 | else: 288 | es_index = args['esindex'] 289 | 290 | es_index = es_index.replace(':', '_').replace("/", "_") 291 | 292 | # Get the Zeek fields from the log file. 293 | 294 | zcat_process = subprocess.Popen(zcat_name+[filename], 295 | stdout=subprocess.PIPE) 296 | 297 | head_process = subprocess.Popen(['head'], 298 | stdin=zcat_process.stdout, 299 | stdout=subprocess.PIPE) 300 | 301 | grep_process = subprocess.Popen(['grep', '#fields'], 302 | stdin=head_process.stdout, 303 | stdout=subprocess.PIPE) 304 | 305 | fields = grep_process.communicate()[0].decode('UTF-8').strip().split('\t')[1:] 306 | 307 | # Get the Zeek types from the log file. 308 | 309 | zcat_process = subprocess.Popen(zcat_name+[filename], 310 | stdout=subprocess.PIPE) 311 | 312 | head_process = subprocess.Popen(['head'], 313 | stdin=zcat_process.stdout, 314 | stdout=subprocess.PIPE) 315 | 316 | grep_process = subprocess.Popen(['grep', '#types'], 317 | stdin=head_process.stdout, 318 | stdout=subprocess.PIPE) 319 | 320 | types = grep_process.communicate()[0].decode('UTF-8').strip().split('\t')[1:] 321 | 322 | # Read TSV 323 | 324 | zcat_process = subprocess.Popen(zcat_name+[filename], 325 | stdout=subprocess.PIPE) 326 | 327 | grep_process = subprocess.Popen(['grep', '-E', '-v', '^#'], 328 | stdin=zcat_process.stdout, 329 | stdout=subprocess.PIPE) 330 | 331 | # Make the max size 332 | csv.field_size_limit(sys.maxsize) 333 | 334 | # Only process if we have a valid log file. 
335 | if len(types) > 0 and len(fields) > 0: 336 | read_tsv = csv.reader(io.TextIOWrapper(grep_process.stdout), delimiter="\t", quoting=csv.QUOTE_NONE) 337 | 338 | # Put mappings 339 | 340 | mappings = {"mappings": {"properties": dict(geoip_orig=dict(properties=dict(location=dict(type="geo_point"))), geoip_resp=dict(properties=dict(location=dict(type="geo_point"))))}} 341 | 342 | for i in range(len(fields)): 343 | if types[i] == "time": 344 | mappings["mappings"]["properties"][fields[i]] = {"type": "date"} 345 | elif types[i] == "addr": 346 | mappings["mappings"]["properties"][fields[i]] = {"type": "ip"} 347 | elif types[i] == "string": 348 | # Special cases 349 | if fields[i] in keywords: 350 | mappings["mappings"]["properties"][fields[i]] = {"type": "text", "fields": { "keyword": { "type": "keyword" }}} 351 | else: 352 | mappings["mappings"]["properties"][fields[i]] = {"type": "text"} 353 | 354 | # Put index template for data stream 355 | 356 | if args["datastream"] > 0: 357 | senddatastream(args, es_index, mappings) 358 | 359 | # Put data 360 | 361 | putmapping = False 362 | putpipeline = False 363 | n = 0 364 | items = 0 365 | outstring = "" 366 | ofl = len(outputfields) 367 | 368 | # Iterate through every row in the TSV. 369 | for row in read_tsv: 370 | # Build the dict and fill in any default info. 371 | d = dict(zeek_log_filename=filename, zeek_log_path=zeek_log_path) 372 | if (len(args['name']) > 0): 373 | d["zeek_log_system_name"] = args['name'] 374 | i = 0 375 | added_val = False 376 | 377 | # For each column in the row. 378 | for col in row: 379 | # Process the data using a method for each type. We also will only output fields of a certain name, 380 | # if identified on the command line. 381 | if types[i] == "time": 382 | if col != '-' and col != '(empty)' and col != '' and (ofl == 0 or fields[i] in outputfields): 383 | gmt_mydt = datetime.datetime.utcfromtimestamp(float(col)) 384 | if not args['timestamp']: 385 | d[fields[i]] = "{}T{}".format(gmt_mydt.date(), gmt_mydt.time()) 386 | else: 387 | if args['origtime']: 388 | d[fields[i]] = gmt_mydt.timestamp() 389 | else: 390 | d[fields[i]] = gmt_mydt.timestamp()*1000 391 | added_val = True 392 | elif types[i] == "interval" or types[i] == "double": 393 | if col != '-' and col != '(empty)' and col != '' and (ofl == 0 or fields[i] in outputfields): 394 | d[fields[i]] = float(col) 395 | added_val = True 396 | elif types[i] == "bool": 397 | if col != '-' and col != '(empty)' and col != '' and (ofl == 0 or fields[i] in outputfields): 398 | d[fields[i]] = col == "T" 399 | added_val = True 400 | elif types[i] == "port" or types[i] == "count" or types[i] == "int": 401 | if col != '-' and col != '(empty)' and col != '' and (ofl == 0 or fields[i] in outputfields): 402 | d[fields[i]] = int(col) 403 | added_val = True 404 | elif types[i].startswith("vector") or types[i].startswith("set"): 405 | if col != '-' and col != '(empty)' and col != '' and (ofl == 0 or fields[i] in outputfields): 406 | d[fields[i]] = col.split(",") 407 | added_val = True 408 | else: 409 | if col != '-' and col != '(empty)' and col != '' and (ofl == 0 or fields[i] in outputfields): 410 | d[fields[i]] = col 411 | added_val = True 412 | i += 1 413 | 414 | # Here we only add data if there is a timestamp, and if the filter keys are used we make sure our key exists. 415 | if added_val and "ts" in d and (not filterkeys_field or (filterkeys_field and d[filterkeys_field] in filterkeys)): 416 | # This is the Python function filtering logic. 
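# The callable built earlier from lambdafilter/filterfile is applied to the one-row list [d];
# if it returns False for the row, the row is dropped. An illustrative (assumed) filter that
# keeps only traffic to port 53 could be: lambda x: "id.resp_p" in x and x["id.resp_p"] == 53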
417 | filter_data = False 418 | if filterfilter: 419 | output = list(filter(filterfilter, [d])) 420 | if len(output) == 0: 421 | filter_data = True 422 | 423 | # If we haven't filtered using the Python filter function... 424 | if not filter_data: 425 | # Log the keys to a file, if desired. 426 | i = 0 427 | for lkf in logkeyfields: 428 | lkfd = logkeys_fds[i] 429 | if lkf in d: 430 | if isinstance(d[lkf], list): 431 | for z in d[lkf]: 432 | lkfd.write(z) 433 | lkfd.write("\n") 434 | else: 435 | lkfd.write(d[lkf]) 436 | lkfd.write("\n") 437 | i += 1 438 | 439 | # Create the bulk header. 440 | if not args['nobulk']: 441 | i = dict(create=dict(_index=es_index)) 442 | if len(ingest_pipeline["processors"]) > 0: 443 | i["create"]["pipeline"] = "zeekgeoip" 444 | outstring += json.dumps(i)+"\n" 445 | # Prepare the output and increment counters 446 | if args['humio']: 447 | d['ts'] = d['ts'] + "Z" 448 | if "_write_ts" in d: 449 | d['_write_ts'] = d['_write_ts'] + "Z" 450 | else: 451 | d["_write_ts"] = d["ts"] 452 | if "_path" not in d: 453 | d["_path"] = zeek_log_path 454 | if (len(args['name'].strip()) > 0): 455 | d["_system_name"] = args['name'].strip() 456 | d["@timestamp"] = d["ts"] 457 | outstring += json.dumps(d)+"\n" 458 | n += 1 459 | items += 1 460 | # If we aren't using stdout, prepare the ES index/datastream. 461 | if not args['stdout']: 462 | if putmapping == False: 463 | sendmappings(args, es_index, mappings) 464 | putmapping = True 465 | if putpipeline == False and len(ingest_pipeline["processors"]) > 0: 466 | sendpipeline(args, ingest_pipeline) 467 | putpipeline = True 468 | 469 | # Once we get more than "lines", we send it to ES 470 | if n >= args['lines'] and len(outstring) > 0: 471 | sendbulk(args, outstring, es_index, filename) 472 | outstring = "" 473 | n = 0 474 | 475 | # We do this one last time to get rid of any remaining lines. 476 | if n != 0 and len(outstring) > 0: 477 | sendbulk(args, outstring, es_index, filename) 478 | else: 479 | # This does everything the TSV version does, but for JSON 480 | # Read JSON log 481 | zcat_process = subprocess.Popen(zcat_name+[filename], 482 | stdout=subprocess.PIPE) 483 | j_in = io.TextIOWrapper(zcat_process.stdout) 484 | 485 | zeek_log_path = "" 486 | items = 0 487 | n = 0 488 | outstring = "" 489 | es_index = "" 490 | 491 | # Put mappings 492 | 493 | mappings = {"mappings": {"properties": dict(ts=dict(type="date"), geoip_orig=dict(properties=dict(location=dict(type="geo_point"))), 494 | geoip_resp=dict(properties=dict(location=dict(type="geo_point"))))}} 495 | mappings["mappings"]["properties"]["id.orig_h"] = {"type": "ip"} 496 | mappings["mappings"]["properties"]["id.resp_h"] = {"type": "ip"} 497 | putmapping = False 498 | putpipeline = False 499 | putdatastream = False 500 | 501 | # We continue until broken. 502 | while True: 503 | line = j_in.readline() 504 | 505 | # Here is where we break out of the while True loop. 506 | if not line: 507 | break 508 | 509 | # Load our data so we can process it. 510 | j_data = json.loads(line) 511 | 512 | # Only process data that has a timestamp field. 513 | if "ts" in j_data: 514 | # Here we deal with the time output format. 
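# As in the TSV branch, the epoch "ts" is rewritten either as an ISO-8601-style string
# (e.g. 2021-01-01T00:00:00, illustrative value) or, when the timestamp option is set, left
# as an epoch value in seconds (origtime) or in milliseconds, which is what Elasticsearch expects.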
515 | gmt_mydt = datetime.datetime.utcfromtimestamp(float(j_data["ts"])) 516 | 517 | if not args['timestamp']: 518 | j_data["ts"] = "{}T{}".format(gmt_mydt.date(), gmt_mydt.time()) 519 | else: 520 | if args['origtime']: 521 | j_data["ts"] = gmt_mydt.timestamp() 522 | else: 523 | # ES uses ms 524 | j_data["ts"] = gmt_mydt.timestamp()*1000 525 | 526 | # This happens when we go through this loop the first time and do not have an es_index name. 527 | if es_index == "": 528 | sysname = "" 529 | 530 | if (len(args['name']) > 0): 531 | sysname = "{}_".format(args['name']) 532 | 533 | # Since the JSON logs do not include the Zeek log path, we try to guess it from the name. 534 | try: 535 | zeek_log_path = re.search(".*\/([^\._]+).*", filename).group(1).lower() 536 | except: 537 | print("Log path cannot be found from filename: {}".format(filename)) 538 | exit(-5) 539 | 540 | # We allow for hahes instead of dates in our index name. 541 | if not args['hashdates']: 542 | es_index = "zeek_{}{}_{}".format(sysname, zeek_log_path, gmt_mydt.date()) 543 | else: 544 | es_index = "zeek_{}{}_{}".format(sysname, zeek_log_path, random.getrandbits(hashbits)) 545 | 546 | es_index = es_index.replace(':', '_').replace("/", "_") 547 | 548 | # If we are not sending the data to stdout, we prepare the ES index or datastream. 549 | if not args['stdout']: 550 | if putmapping == False: 551 | sendmappings(args, es_index, mappings) 552 | putmapping = True 553 | if putpipeline == False and len(ingest_pipeline["processors"]) > 0: 554 | sendpipeline(args, ingest_pipeline) 555 | putpipeline = True 556 | if args["datastream"] > 0 and putdatastream == False: 557 | senddatastream(args, es_index, mappings) 558 | putdatastream = True 559 | 560 | # We add the system name, if desired. 561 | if (len(args['name']) > 0): 562 | j_data["zeek_log_system_name"] = args['name'] 563 | 564 | # Here we are checking if the keys will filter the data in. 565 | if not filterkeys_field or (filterkeys_field and j_data[filterkeys_field] in filterkeys): 566 | # This check below is for the Python filters. 567 | filter_data = False 568 | if filterfilter: 569 | output = list(filter(filterfilter, [j_data])) 570 | if len(output) == 0: 571 | filter_data = True 572 | 573 | if not filter_data: 574 | # We log the keys, if so desired. 575 | i = 0 576 | for lkf in logkeyfields: 577 | lkfd = logkeys_fds[i] 578 | if lkf in j_data: 579 | if isinstance(j_data[lkf], list): 580 | for z in j_data[lkf]: 581 | lkfd.write(z) 582 | lkfd.write("\n") 583 | else: 584 | lkfd.write(j_data[lkf]) 585 | lkfd.write("\n") 586 | i += 1 587 | items += 1 588 | 589 | if not args['nobulk']: 590 | i = dict(create=dict(_index=es_index)) 591 | if len(ingest_pipeline["processors"]) > 0: 592 | i["create"]["pipeline"] = "zeekgeoip" 593 | outstring += json.dumps(i)+"\n" 594 | j_data["@timestamp"] = j_data["ts"] 595 | # Here we only include the output fields identified via the command line. 596 | if len(outputfields) > 0: 597 | new_j_data = {} 598 | for o in outputfields: 599 | if o in j_data: 600 | new_j_data[o] = j_data[o] 601 | j_data = new_j_data 602 | outstring += json.dumps(j_data) + "\n" 603 | n += 1 604 | 605 | # Here we output a set of lines to the ES server. 606 | if n >= args['lines'] and len(outstring) > 0: 607 | sendbulk(args, outstring, es_index, filename) 608 | outstring = "" 609 | n = 0 610 | 611 | # We send the last of the data to the ES server, if there is any left. 
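# Rows are accumulated into outstring in batches of args['lines']; this final call flushes
# any partial batch that never reached that threshold before the input ended.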
612 | if n != 0 and len(outstring) > 0: 613 | sendbulk(args, outstring, es_index, filename) 614 | 615 | # This deals with running as a plain script vs. as a Cython-compiled module. 616 | if __name__ == "__main__": 617 | args = parseargs() 618 | if args.cython: 619 | import zeek2es 620 | zeek2es.main(**vars(args)) 621 | else: 622 | main(**vars(args)) --------------------------------------------------------------------------------