├── .gitignore ├── README.md ├── translations └── ja-jp │ ├── README_JP.md │ ├── openstack │ ├── devstack │ │ ├── stack_setup.sh │ │ └── deploy_droplet.py │ └── host_aggregates.md │ ├── haproxy │ └── monitor-haproxy-with-datadog.md │ ├── cassandra │ └── monitoring_cassandra_with_datadog.md │ ├── mysql │ └── mysql_monitoring_with_datadog.md │ └── varnish │ └── how_to_collect_varnish_metrics.md ├── CONTRIBUTING.md ├── openstack ├── devstack │ ├── stack_setup.sh │ └── deploy_droplet.py ├── host_aggregates.md └── how_lithium_monitors_openstack.md ├── varnish └── how_to_collect_varnish_metrics.md ├── haproxy └── monitor_haproxy_with_datadog.md ├── azure ├── how_to_collect_azure_metrics.md └── monitor_azure_vms_using_datadog.md ├── mongodb ├── monitor-mongodb-performance-with-datadog.md └── collecting-mongodb-metrics-and-statistics.md ├── monitoring-101 └── monitoring_101_investigating_performance_issues.md ├── cassandra └── monitoring_cassandra_with_datadog.md ├── mysql └── mysql_monitoring_with_datadog.md ├── google-compute-engine └── monitor-google-compute-engine-with-datadog.md ├── elasticache ├── how-coursera-monitors-elasticache-and-memcached-performance.md └── collecting-elasticache-metrics-its-redis-memcached-metrics.md ├── elb ├── monitor_elb_performance_with_datadog.md └── how_to_collect_aws_elb_metrics.md ├── elasticsearch └── how_to_monitor_elasticsearch_with_datadog.md ├── cilium └── monitor_cilium_and_kubernetes_performance_with_hubble.md ├── dynamodb ├── how_to_collect_dynamodb_metrics.md └── how_medium_monitors_dynamodb_performance.md ├── hadoop └── monitoring_hadoop_with_datadog.md └── azure-sql-database └── azure-sql-database-monitoring-datadog.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *~ 3 | *.swp 4 | PDF 5 | .python-version 6 | build/html 7 | build/doctrees 8 | *.mo 9 | .idea 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # the-monitor 2 | Markdown files for Datadog's longform blog posts: https://www.datadoghq.com/blog/ 3 | 4 | Please read our [contribution guidelines](https://github.com/DataDog/the-monitor/blob/master/CONTRIBUTING.md) before opening a new issue or pull request. 5 | 6 | -------------------------------------------------------------------------------- /translations/ja-jp/README_JP.md: -------------------------------------------------------------------------------- 1 | # The monitor 2 | Markdown files for Datadog's long-form blog posts: https://www.datadoghq.com/blog/ 3 | 4 | ## ザ・ITインフラ監視 5 | 6 | このディレクトリーには、Datadog社のblogに掲載されている記事用HTMLを生成する際に使っているオリジナルソースをコピーし、その行間に日本語訳を追記し保存しています。 7 | 8 | これらのコンテンツは、Datadog社が蓄積ている監視に関する重要なノウハウの一部をブログとして公開し、インフラ監視を最短でスタートする際のベースラインを提供するものです。又、これから監視を始めようと考えている人に取っては、無料で基本を学ぶ第一歩になることは間違いありません。 9 | 10 | オリジナルのコンテンツは英語で公開されていますが、一人でも多くの人に読んでもらうためには、日本語でのコンテンツ提供が必要だと思い、日常の業務の合間を縫って翻訳作業を進めています。しかしながら、翻訳工数が絶対的に足りていません。もしも、このREADMEを読んでいる人で、翻訳の趣旨に賛同してくれる人がいるなら、このja-jp内のどの項目のディレクトリでもよいので翻訳をしPRを送ってくれると本当に助かります。 11 | 12 | **翻訳協力者、絶賛募集中!** 13 | 14 | 翻訳者: 15 | 1. 堀田直孝 : Naotaka Hotta (jay@datadoghq.com) 16 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | 2 | # Contribution Guidelines 3 | 4 | Thanks for taking the time to contribute to Datadog's blog, [The Monitor](https://www.datadoghq.com/blog)! 
We use this repo to review content updates to our longform guides. 5 | 6 | There are a few guidelines we'd like contributors to follow when they submit a new issue or pull request (PR). 7 | 8 | ### Issues 9 | 10 | Issues can be used to report broken links, outdated content, or other content issues. As a general rule, issues should include: 11 | 12 | - a brief description of the issue 13 | - a proposed fix with supporting documentation 14 | 15 | **Questions or ideas for a new blog post?** Please do not use issues for pitches or general questions about Datadog. You can [contact support](https://www.datadoghq.com/support/) for any questions about using a particular Datadog feature. 16 | 17 | To help us focus on our guides, we will close all issues that are content pitches or requests for general support and redirect users to our support page where applicable. 18 | 19 | ### PRs 20 | 21 | Submitting a PR is the best way to fast-track an update to one of our guides. In general, PRs should: 22 | 23 | - address a single concern 24 | - include a summary of the issue and your motivation for submitting the PR 25 | - your proposed changes with supporting documentation -------------------------------------------------------------------------------- /openstack/devstack/stack_setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | if [[ $(whoami) != 'root' ]]; then echo "Run as root"; exit; fi 3 | apt-get update && apt-get install git -y 4 | cd /usr/local/src || echo "/usr/local/src does not exist" 5 | git clone -b stable/newton https://github.com/openstack-dev/devstack.git 6 | cd devstack || exit 7 | sed -i 's/HOST_IP=${HOST_IP:-}/HOST_IP=`dig +short myip.opendns.com @resolver1.opendns.com`/g' stackrc 8 | ./tools/create-stack-user.sh 9 | echo "[[local|localrc]] 10 | disable_service n-net 11 | enable_service q-svc 12 | enable_service q-agt 13 | enable_service q-dhcp 14 | enable_service q-l3 15 | enable_service q-meta 16 | enable_service n-cauth 17 | 18 | # We don't need no stinkin' Tempest 19 | disable_service tempest 20 | 21 | # Enable the ceilometer services 22 | enable_service ceilometer-acompute,ceilometer-acentral,ceilometer-collector,ceilometer-api 23 | 24 | #Configure events for both Stacktach and custom listener 25 | notification_driver=nova.openstack.common.notifier.rpc_notifier 26 | notification_topics=notifications,monitor 27 | notify_on_state_change=vm_and_task_state 28 | notify_on_any_change=True 29 | instance_usage_audit=True 30 | instance_usage_audit_period=hour 31 | 32 | # Password configuration below 33 | ADMIN_PASSWORD=devstack 34 | DATABASE_PASSWORD=devstack 35 | RABBIT_PASSWORD=devstack 36 | SERVICE_PASSWORD=devstack 37 | SERVICE_TOKEN=a682f596-76f3-11e3-b3b2-e716f9080d50 38 | 39 | HOST_IP=`dig +short myip.opendns.com @resolver1.opendns.com` 40 | " >> local.conf 41 | 42 | chown -R stack:stack . 
43 | -------------------------------------------------------------------------------- /translations/ja-jp/openstack/devstack/stack_setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | if [[ $(whoami) != 'root' ]]; then echo "Run as root"; exit; fi 3 | apt-get update && apt-get install git -y 4 | cd /usr/local/src || echo "/usr/local/src does not exist" 5 | git clone -b stable/kilo https://github.com/openstack-dev/devstack.git 6 | cd devstack || exit 7 | sed -i 's/HOST_IP=${HOST_IP:-}/HOST_IP=`dig +short myip.opendns.com @resolver1.opendns.com`/g' stackrc 8 | ./tools/create-stack-user.sh 9 | echo "[[local|localrc]] 10 | disable_service n-net 11 | enable_service q-svc 12 | enable_service q-agt 13 | enable_service q-dhcp 14 | enable_service q-l3 15 | enable_service q-meta 16 | enable_service n-cauth 17 | 18 | # We don't need no stinkin' Tempest 19 | disable_service tempest 20 | 21 | # Enable the ceilometer services 22 | enable_service ceilometer-acompute,ceilometer-acentral,ceilometer-collector,ceilometer-api 23 | 24 | #Configure events for both Stacktach and custom listener 25 | notification_driver=nova.openstack.common.notifier.rpc_notifier 26 | notification_topics=notifications,monitor 27 | notify_on_state_change=vm_and_task_state 28 | notify_on_any_change=True 29 | instance_usage_audit=True 30 | instance_usage_audit_period=hour 31 | 32 | # Password configuration below 33 | ADMIN_PASSWORD=devstack 34 | DATABASE_PASSWORD=devstack 35 | RABBIT_PASSWORD=devstack 36 | SERVICE_PASSWORD=devstack 37 | SERVICE_TOKEN=a682f596-76f3-11e3-b3b2-e716f9080d50 38 | 39 | HOST_IP=`dig +short myip.opendns.com @resolver1.opendns.com` 40 | " >> local.conf 41 | 42 | chown -R stack:stack . 43 | -------------------------------------------------------------------------------- /openstack/devstack/deploy_droplet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os # For running system commands & env vars 3 | import sys # For command line arguments 4 | 5 | # Find your SSH key ID using 'tugboat keys' 6 | 7 | # Try to get values from environment variables 8 | try: 9 | RAM = os.environ['DO_RAM_SIZE'] 10 | IMAGE = os.environ['DO_IMAGE'] 11 | 12 | except KeyError: 13 | RAM = "4gb" # Minimum required RAM to host ~2 instances 14 | # You can see a list of sizes with "tugboat sizes" 15 | IMAGE = "14782728" # Ubuntu 14.04 x64 image 16 | # You can see a list of images with "tugboat images" 17 | 18 | 19 | def main(args): 20 | if args == '': 21 | args = raw_input("You must name your droplet: ") 22 | main(args) 23 | tug_com = "tugboat create " + args + " -s " + RAM + " -i " + IMAGE 24 | 25 | os.system(tug_com) 26 | # Wait for droplet to become active 27 | os.system("tugboat wait " + args + " -s") 28 | # Give it a couple of seconds after it reports activity 29 | os.system("sleep 8") 30 | t = os.popen("tugboat droplets | grep " + args) 31 | t = t.read() 32 | IP = t[t.find('ip') + 4 : t.find(',')] 33 | print "IP: " + IP 34 | 35 | command_to_run = "ssh root@" + IP + " 'bash -s' < stack_setup.sh" 36 | print "Run the following (in order): \n" + command_to_run 37 | print "ssh root@" + IP 38 | print "sudo -iu stack /usr/local/src/devstack/stack.sh" 39 | 40 | if __name__ == "__main__": 41 | try: 42 | main(sys.argv[1]) 43 | except IndexError: 44 | name = raw_input("Name your droplet: ") 45 | main(name) 46 | -------------------------------------------------------------------------------- 
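For context, `deploy_droplet.py` reads the droplet size and image from the `DO_RAM_SIZE` and `DO_IMAGE` environment variables (falling back to the defaults in the script) and takes the droplet name as its only argument. A minimal usage sketch, assuming `tugboat` is already installed and authorized against your DigitalOcean account:

```
# Usage sketch only; the droplet name and overrides below are hypothetical.
export DO_RAM_SIZE=8gb        # optional; the script defaults to "4gb"
export DO_IMAGE=14782728      # optional; defaults to the Ubuntu 14.04 x64 image ID above
python deploy_droplet.py my-devstack-droplet
# Then run the SSH commands the script prints, in order, to bootstrap DevStack.
```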
/translations/ja-jp/openstack/devstack/deploy_droplet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os # For running system commands & env vars 3 | import sys # For command line arguments 4 | 5 | # Find your SSH key ID using 'tugboat keys' 6 | 7 | # Try to get values from environment variables 8 | try: 9 | RAM = os.environ['DO_RAM_SIZE'] 10 | IMAGE = os.environ['DO_IMAGE'] 11 | 12 | except KeyError: 13 | RAM = "4gb" # Minimum required RAM to host ~2 instances 14 | # You can see a list of sizes with "tugboat sizes" 15 | IMAGE = "14782728" # Ubuntu 14.04 x64 image 16 | # You can see a list of images with "tugboat images" 17 | 18 | 19 | def main(args): 20 | if args == '': 21 | args = raw_input("You must name your droplet: ") 22 | main(args) 23 | tug_com = "tugboat create " + args + " -s " + RAM + " -i " + IMAGE 24 | 25 | os.system(tug_com) 26 | # Wait for droplet to become active 27 | os.system("tugboat wait " + args + " -s") 28 | # Give it a couple of seconds after it reports activity 29 | os.system("sleep 8") 30 | t = os.popen("tugboat droplets | grep " + args) 31 | t = t.read() 32 | IP = t[t.find('ip') + 4 : t.find(',')] 33 | print "IP: " + IP 34 | 35 | command_to_run = "ssh root@" + IP + " 'bash -s' < stack_setup.sh" 36 | print "Run the following (in order): \n" + command_to_run 37 | print "ssh root@" + IP 38 | print "sudo -iu stack /usr/local/src/devstack/stack.sh" 39 | 40 | if __name__ == "__main__": 41 | try: 42 | main(sys.argv[1]) 43 | except IndexError: 44 | name = raw_input("Name your droplet: ") 45 | main(name) 46 | -------------------------------------------------------------------------------- /openstack/host_aggregates.md: -------------------------------------------------------------------------------- 1 | # OpenStack: host aggregates, flavors, and availability zones 2 | 3 | When discussing [OpenStack], correct word choice is essential. OpenStack uses many familiar terms in [unfamiliar ways][semantic-overloading], which can lead to confusing conversations. 4 | 5 | **Host aggregates** (or simply **aggregates**), are commonly confused with the more-familiar term **availability zones**—however the two are not identical. Customers using OpenStack as a service never see host **aggregates**; administrators use them to group hardware according to various properties. Most commonly, host aggregates are used to differentiate between physical host configurations. For example, you can have an aggregate composed of machines with 2GB of RAM and another aggregate composed of machines with 64GB of RAM. This highlights the typical use case of aggregates: defining static hardware profiles. 6 | 7 | Once an aggregate is created, administrators can then define specific public **flavors** from which clients can choose to run their virtual machines (the same concept as EC2 [instance types] on AWS). Flavors are used by customers and clients to choose the type of hardware that will host their instance. 8 | 9 | Contrast aggregates with **availability zones** (AZ) in OpenStack, which are customer-facing and usually partitioned geographically. To cement the concept, think of availability zones and flavors as customer-accessible subsets of host aggregates. 10 | [![Host aggregates and availability zones in OpenStack][agg-and-avail]][agg-and-avail] 11 | _As you can see, host aggregates can span across availability zones._ 12 | 13 | ## Host aggregate or availablity zone? 14 | As an OpenStack end user, you don't really have a choice. 
Only administrators can create host aggregates, so you will be using availability zones and flavors defined by your cloud administrator. 15 | 16 | OpenStack admins, on the other hand, should carefully consider the subtle distinction between the two when planning their deployments. Hosts separated geographically should be segregated with availability zones, while hosts sharing the same specs should be grouped with host aggregates. 17 | 18 | ## Conclusion 19 | You should now have a better sense of the differences between host aggregates, flavors, and availability zones. More information on host aggregates and availability zones is available in the [OpenStack documentation]. Additional terms and definitions can be found in the OpenStack [glossary]. 20 | 21 | Check out our 3-part series about [how to monitor][part 1] and [collect][part 2] OpenStack Nova performance [metrics][part 3]. Also, be sure to take a look at our piece on [How Lithium monitors OpenStack][part 4]. 22 | 23 | 24 | [agg-and-avail]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-OpenStack/host-aggregates/aggregates1.png 25 | [glossary]: http://docs.openstack.org/glossary/content/glossary.html 26 | [instance types]: https://aws.amazon.com/ec2/instance-types/ 27 | [OpenStack]: https://openstack.org 28 | [OpenStack documentation]: http://docs.openstack.org/developer/nova/aggregates.html 29 | [semantic-overloading]: https://en.wikipedia.org/wiki/Semantic_overload 30 | 31 | [part 1]: https://www.datadoghq.com/blog/openstack-monitoring-nova 32 | [part 2]: https://www.datadoghq.com/blog/collecting-metrics-notifications-openstack-nova 33 | [part 3]: https://www.datadoghq.com/blog/openstack-monitoring-datadog 34 | [part 4]: https://www.datadoghq.com/blog/how-lithium-monitors-openstack/ -------------------------------------------------------------------------------- /translations/ja-jp/openstack/host_aggregates.md: -------------------------------------------------------------------------------- 1 | # OpenStack: host aggregates, flavors, and availability zones 2 | 3 | When discussing [OpenStack], correct word choice is essential. OpenStack uses many familiar terms in [unfamiliar ways][semantic-overloading], which can lead to confusing conversations. 4 | 5 | **Host aggregates** (or simply **aggregates**), are commonly confused with the more-familiar term **availability zones**—however the two are not identical. Customers using OpenStack as a service never see host **aggregates**; administrators use them to group hardware according to various properties. Most commonly, host aggregates are used to differentiate between physical host configurations. For example, you can have an aggregate composed of machines with 2GB of RAM and another aggregate composed of machines with 64GB of RAM. This highlights the typical use case of aggregates: defining static hardware profiles. 6 | 7 | Once an aggregate is created, administrators can then define specific public **flavors** from which clients can choose to run their virtual machines (the same concept as EC2 [instance types] on AWS). Flavors are used by customers and clients to choose the type of hardware that will host their instance. 8 | 9 | Contrast aggregates with **availability zones** (AZ) in OpenStack, which are customer-facing and usually partitioned geographically. To cement the concept, think of availability zones and flavors as customer-accessible subsets of host aggregates. 
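To make the admin-side workflow concrete, here is a minimal sketch using the legacy `nova` CLI. The aggregate, host, and flavor names are hypothetical, and matching flavors to aggregates this way assumes the `AggregateInstanceExtraSpecsFilter` is enabled in the scheduler:

```
# Hypothetical example: group two hosts into an aggregate and expose it through a flavor.
nova aggregate-create fast-storage nova              # create the aggregate in the "nova" AZ
nova aggregate-add-host fast-storage compute-01      # add hosts to the aggregate
nova aggregate-add-host fast-storage compute-02
nova aggregate-set-metadata fast-storage ssd=true    # tag the aggregate
nova flavor-create ssd.large 100 8192 80 4           # name, ID, RAM (MB), disk (GB), vCPUs
nova flavor-key ssd.large set aggregate_instance_extra_specs:ssd=true
```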
10 | [![Host aggregates and availability zones in OpenStack][agg-and-avail]][agg-and-avail] 11 | _As you can see, host aggregates can span across availability zones._ 12 | 13 | ## Host aggregate or availablity zone? 14 | As an OpenStack end user, you don't really have a choice. Only administrators can create host aggregates, so you will be using availability zones and flavors defined by your cloud administrator. 15 | 16 | OpenStack admins, on the other hand, should carefully consider the subtle distinction between the two when planning their deployments. Hosts separated geographically should be segregated with availability zones, while hosts sharing the same specs should be grouped with host aggregates. 17 | 18 | ## Conclusion 19 | You should now have a better sense of the differences between host aggregates, flavors, and availability zones. More information on host aggregates and availability zones is available in the [OpenStack documentation]. Additional terms and definitions can be found in the OpenStack [glossary]. 20 | 21 | Check out our 3-part series about [how to monitor][part 1] and [collect][part 2] OpenStack Nova performance [metrics][part 3]. Also, be sure to take a look at our piece on [How Lithium monitors OpenStack][part 4]. 22 | 23 | 24 | [agg-and-avail]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-OpenStack/host-aggregates/aggregates1.png 25 | [glossary]: http://docs.openstack.org/glossary/content/glossary.html 26 | [instance types]: https://aws.amazon.com/ec2/instance-types/ 27 | [OpenStack]: https://openstack.org 28 | [OpenStack documentation]: http://docs.openstack.org/developer/nova/aggregates.html 29 | [semantic-overloading]: https://en.wikipedia.org/wiki/Semantic_overload 30 | 31 | [part 1]: https://www.datadoghq.com/blog/openstack-monitoring-nova 32 | [part 2]: https://www.datadoghq.com/blog/collecting-metrics-notifications-openstack-nova 33 | [part 3]: https://www.datadoghq.com/blog/openstack-monitoring-datadog 34 | [part 4]: https://www.datadoghq.com/blog/how-lithium-monitors-openstack/ -------------------------------------------------------------------------------- /varnish/how_to_collect_varnish_metrics.md: -------------------------------------------------------------------------------- 1 | # How to collect Varnish metrics 2 | 3 | *This post is part 2 of a 3-part series on Varnish monitoring. [Part 1](https://www.datadoghq.com/blog/top-varnish-performance-metrics/) explores the key metrics available in Varnish, and [Part 3](https://www.datadoghq.com/blog/monitor-varnish-using-datadog/) details how Datadog can help you to monitor Varnish.* 4 | 5 | ## How to get the Varnish metrics you need 6 | 7 | Varnish Cache ships with very useful and precise monitoring and logging tools. As explained in [the first post of this series](https://www.datadoghq.com/blog/top-varnish-performance-metrics/), for monitoring purposes, the most useful of the available tools is `varnishstat` which gives you a detailed snapshot of Varnish’s current performance. It provides access to in-memory statistics such as cache hits and misses, resource consumption, threads created, and more. 8 | 9 | ### varnishstat 10 | 11 | If you run `varnishstat` from the command line you will see a continuously updating list of all available Varnish metrics. If you add the `-1` flag, varnishstat will exit after printing the list one time. 
Example output below: 12 | 13 | ``` 14 | $ varnishstat 15 | 16 | MAIN.uptime Child process uptime 17 | MAIN.sess_conn Sessions accepted 18 | MAIN.sess_drop Sessions dropped 19 | MAIN.sess_fail Session accept failures 20 | MAIN.sess_pipe_overflow Session pipe overflow 21 | MAIN.client_req Good client requests received 22 | MAIN.cache_hit Cache hits 23 | MAIN.cache_hitpass Cache hits for pass 24 | MAIN.cache_miss Cache misses 25 | MAIN.backend_conn Backend conn. success 26 | MAIN.backend_unhealthy Backend conn. not attempted 27 | MAIN.backend_busy Backend conn. too many 28 | MAIN.backend_fail Backend conn. failures 29 | MAIN.backend_reuse Backend conn. reuses 30 | MAIN.backend_toolate Backend conn. was closed 31 | MAIN.backend_recycle Backend conn. recycles 32 | MAIN.backend_retry Backend conn. retry 33 | MAIN.pools Number of thread pools 34 | MAIN.threads Total number of threads 35 | MAIN.threads_limited Threads hit max 36 | MAIN.threads_created Threads created 37 | MAIN.threads_destroyed Threads destroyed 38 | MAIN.threads_failed Thread creation failed 39 | MAIN.thread_queue_len Length of session queue 40 | ``` 41 | 42 | To list specific values, pass them with the `-f` flag: e.g. `varnishstat -f field1,field2,field3`, followed by `-1` if needed. 43 | 44 | For example: `varnishstat -f MAIN.threads` will display a continuously updating count of threads currently being used: 45 | [![varnishstat output](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-07-varnish/2-01.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-07-varnish/2-01.png) 46 | 47 | Varnishstat is useful as a standalone tool if you need to spot-check the health of your cache. However, if Varnish is an important part of your software service, you will almost certainly want to graph its performance over time, correlate it with other metrics from across your infrastructure, and be alerted about any problems that may arise. To do this you will probably want to integrate the metrics that Varnishstat is reporting with a dedicated monitoring service. 48 | 49 | ### varnishlog 50 | 51 | If you need to debug your system or tune configuration, `varnishlog` can be a useful tool, as it provides detailed information about each individual request. 52 | 53 | Here is an edited example of `varnishlog` output generated by a single request—a full example would be several times longer: 54 | 55 | ``` 56 | $ varnishlog 57 | 58 | 3727 RxRequest c GET 59 | 3727 RxProtocol c HTTP/1.1 60 | 3727 RxHeader c Content-Type: application/x-www-form-urlencoded; 61 | 3727 RxHeader c Accept-Encoding: gzip,deflate,sdch 62 | 3727 RxHeader c Accept-Language: en-US,en;q=0.8 63 | 3727 VCL_return c hit 64 | 3727 ObjProtocol c HTTP/1.1 65 | 3727 TxProtocol c HTTP/1.1 66 | 3727 TxStatus c 200 67 | 3727 Length c 316 68 | … 69 | ``` 70 | 71 | The 4 columns represent: 72 | 73 | 1. Request ID 74 | 2. Data type (if the type starts with “Rx” that means Varnish is receiving data, and “Tx” means Varnish is sending data) 75 | 3. Whether this entry records communication between Varnish and the client: “c”, or backend: “b” (see [Part 1](https://www.datadoghq.com/blog/top-varnish-performance-metrics/)) 76 | 4. The data, or details about this entry 77 | 78 | ### varnishlog’s children 79 | 80 | You can display a subset of `varnishlog`’s information via three specialized tools built on top of varnishlog: 81 | 82 | - `varnishtop` presents a continuously updating list of the most commonly occurring log entries. 
Using filters, you can display the topmost requested documents, most common clients, user agents, or any other information which is recorded in the log. 83 | - `varnishhist` presents a continuously updating histogram showing the distribution of the last N requests bucketed by total time between request and response. The value of N and the vertical scale are displayed in the top left corner. 84 | - `varnishsizes` is very similar to varnishhist, except it shows the size of the objects requested rather than the processing time. 85 | 86 | ## Conclusion 87 | 88 | Which metrics you monitor will depend on your use case, the tools available to you, and whether the insight provided by a given metric justifies the overhead of monitoring it. 89 | 90 | At Datadog, we have built an integration with Varnish so that you can begin collecting and monitoring its metrics with a minimum of setup. Learn how Datadog can help you to monitor Varnish in the [next and final part of this series of articles](https://www.datadoghq.com/blog/monitor-varnish-using-datadog/). 91 | 92 | ------------------------------------------------------------------------ 93 | 94 | *Source Markdown for this post is available [on GitHub](https://github.com/DataDog/the-monitor/blob/master/varnish/how_to_collect_varnish_metrics.md). Questions, corrections, additions, etc.? Please [let us know](https://github.com/DataDog/the-monitor/issues).* -------------------------------------------------------------------------------- /haproxy/monitor_haproxy_with_datadog.md: -------------------------------------------------------------------------------- 1 | # Monitor HAProxy with Datadog 2 | _This post is part 3 of a 3-part series on HAProxy monitoring. [Part 1](http://www.datadoghq.com/blog/monitoring-haproxy-performance-metrics) evaluates the key metrics emitted by HAProxy, and [Part 2](http://www.datadoghq.com/blog/how-to-collect-haproxy-metrics) details how to collect metrics from HAProxy._ If you’ve already read [our post](http://www.datadoghq.com/blog/how-to-collect-haproxy-metrics) on accessing HAProxy metrics, you’ve seen that it’s relatively simple to run occasional spot checks using HAProxy’s built-in tools. 3 | 4 | To implement ongoing, [meaningful monitoring](https://www.datadoghq.com/blog/haproxy-monitoring/), however, you will need a dedicated system that allows you to store, visualize, and correlate your HAProxy metrics with the rest of your infrastructure. You also need to be alerted when any system starts to misbehave. 5 | 6 | In this post, we will show you how to use Datadog to capture and monitor all the key metrics identified in [Part 1](http://www.datadoghq.com/blog/monitoring-haproxy-performance-metrics) of this series, and more. 7 | 8 | [![Default HAProxy dashboard in Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-screen2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-screen2.png) 9 | 10 |
_Built-in HAProxy dashboard in Datadog_
11 | 12 | ## Integrating Datadog and HAProxy 13 | 14 | ### Verify HAProxy’s status 15 | 16 | Before you begin, you must verify that HAProxy is set to output metrics over HTTP. To read more about enabling the HAProxy status page, refer to [Part 2](http://www.datadoghq.com/blog/how-to-collect-haproxy-metrics#Stats) of this series. Simply open a browser to the stats URL listed in `haproxy.cfg`. 17 | You should see something like this: 18 | 19 | [![HAProxy Stats Page](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/haproxy-stats-page.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/haproxy-stats-page.png) 20 | 21 | ### Install the Datadog Agent 22 | 23 | The [Datadog Agent](https://github.com/DataDog/dd-agent) is open-source software that collects and reports metrics from all of your hosts so you can view, monitor, and correlate them on the Datadog platform. Installing the Agent usually requires just a single command. Installation instructions are platform-dependent and can be found [here](https://app.datadoghq.com/account/settings#agent). As soon as the Datadog Agent is up and running, you should see your host reporting basic system metrics [in your Datadog account](https://app.datadoghq.com/infrastructure). [![Reporting host in Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-host.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-host.png) 24 | 25 | ### Configure the Agent 26 | 27 | Next you will need to create an HAProxy configuration file for the Agent. You can find the location of the Agent configuration directory for your OS [here](http://docs.datadoghq.com/guides/basic_agent_usage/). In that directory you will find a [sample HAProxy config file](https://github.com/DataDog/dd-agent/blob/master/conf.d/haproxy.yaml.example) named **haproxy.yaml.example**. Copy this file to **haproxy.yaml**. You must edit the file to match the _username_, _password_, and _URL_ specified in your `haproxy.cfg`. 28 | 29 | init_config: 30 | 31 | instances: 32 | - url: http://localhost/admin?stats 33 | # username: username 34 | # password: password 35 | 36 | Save and close the file. 37 | 38 | ### Restart the Agent 39 | 40 | Restart the Agent to load your new configuration. The restart command varies somewhat by platform; see the specific commands for your platform [here](http://docs.datadoghq.com/guides/basic_agent_usage/). 41 | 42 | ### Verify the configuration settings 43 | 44 | To check that Datadog and HAProxy are properly integrated, execute the Datadog `info` command. The command for each platform is available [here](http://docs.datadoghq.com/guides/basic_agent_usage/). If the configuration is correct, you will see a section resembling the one below in the `info` output: 45 | 46 | Checks 47 | ====== 48 | [...] 49 | 50 | haproxy 51 | ------- 52 | - instance #0 [OK] Last run duration: 0.00831699371338 53 | - Collected 26 metrics & 0 events 54 | 55 | ### Turn on the integration 56 | 57 | Finally, click the HAProxy **Install Integration** button inside your Datadog account. The button is located under the _Configuration_ tab in the [HAProxy integration settings](https://app.datadoghq.com/account/settings#integrations/haproxy). [![Install the integration](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/install-integration.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/install-integration.png) 58 | 59 | ### Show me the metrics! 
60 | 61 | Once the Agent begins reporting metrics, you will see a comprehensive HAProxy dashboard among [your list of available dashboards](https://app.datadoghq.com/dash/list) in Datadog. The default HAProxy dashboard, as seen at the top of this article, displays the key metrics highlighted in our [introduction to HAProxy monitoring](http://www.datadoghq.com/blog/monitoring-haproxy-performance-metrics). 62 | 63 | You can easily create a more comprehensive dashboard to monitor HAProxy as well as your entire web stack by adding additional graphs and metrics from your other systems. For example, you might want to graph HAProxy metrics alongside metrics from your [NGINX web servers](https://www.datadoghq.com/blog/how-to-monitor-nginx-with-datadog/), or alongside host-level metrics such as memory usage on application servers. 64 | 65 | To start building a custom dashboard, clone the default HAProxy dashboard by clicking on the gear on the upper right of the dashboard and selecting **Clone Dash**. 66 | 67 | ![Clone dash](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/clone-dash.png) 68 | 69 | ### Alerting on HAProxy metrics 70 | 71 | Once Datadog is capturing and visualizing your metrics, you will likely want to [set up some alerts](http://docs.datadoghq.com/guides/monitoring/) to be automatically notified of potential issues. With our recently released [outlier detection](https://www.datadoghq.com/blog/introducing-outlier-detection-in-datadog/) feature, you can get alerted on the things that matter. For example. you can set an alert if a particular backend is experiencing an increase in latency while the others are operating normally. Datadog can monitor individual hosts, containers, services, processes—or virtually any combination thereof. For instance, you can monitor HAProxy frontends, backends, or all hosts in a certain availability zone, or even a single metric being reported by all hosts with a specific tag. 72 | 73 | ### Conclusion 74 | 75 | In this post we’ve walked you through integrating HAProxy with Datadog to visualize your key metrics and notify the right team whenever your infrastructure shows signs of trouble. If you’ve followed along using your own Datadog account, you should now have improved visibility into what’s happening in your environment, as well as the ability to create automated alerts tailored to your infrastructure, your usage patterns, and the metrics that are most valuable to your organization. If you don’t yet have a Datadog account, you can sign up for a [free trial](https://app.datadoghq.com/signup) and learn to monitor HAProxy right away. -------------------------------------------------------------------------------- /translations/ja-jp/haproxy/monitor-haproxy-with-datadog.md: -------------------------------------------------------------------------------- 1 | # Monitor HAProxy with Datadog 2 | _This post is part 3 of a 3-part series on HAProxy monitoring. [Part 1](http://www.datadoghq.com/blog/monitoring-haproxy-performance-metrics) evaluates the key metrics emitted by HAProxy, and [Part 2](http://www.datadoghq.com/blog/how-to-collect-haproxy-metrics) details how to collect metrics from HAProxy._ If you’ve already read [our post](http://www.datadoghq.com/blog/how-to-collect-haproxy-metrics) on accessing HAProxy metrics, you’ve seen that it’s relatively simple to run occasional spot checks using HAProxy’s built-in tools. 
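For reference, two common ways to run those spot checks from the command line; the socket path and stats URL below are examples and depend on what is configured in `haproxy.cfg`:

```
# Spot-check HAProxy from the command line (paths and URL are placeholders).
# Via the runtime stats socket, if one is configured:
echo "show stat" | socat stdio /var/run/haproxy.sock
# Or pull the built-in stats page in CSV form:
curl "http://localhost/admin?stats;csv"
```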
3 | 4 | To implement ongoing, [meaningful monitoring](https://www.datadoghq.com/blog/haproxy-monitoring/), however, you will need a dedicated system that allows you to store, visualize, and correlate your HAProxy metrics with the rest of your infrastructure. You also need to be alerted when any system starts to misbehave. 5 | 6 | In this post, we will show you how to use Datadog to capture and monitor all the key metrics identified in [Part 1](http://www.datadoghq.com/blog/monitoring-haproxy-performance-metrics) of this series, and more. 7 | 8 | [![Default HAProxy dashboard in Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-screen2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-screen2.png) 9 | 10 |
_Built-in HAProxy dashboard in Datadog_
11 | 12 | ## Integrating Datadog and HAProxy 13 | 14 | ### Verify HAProxy’s status 15 | 16 | Before you begin, you must verify that HAProxy is set to output metrics over HTTP. To read more about enabling the HAProxy status page, refer to [Part 2](http://www.datadoghq.com/blog/how-to-collect-haproxy-metrics#Stats) of this series. Simply open a browser to the stats URL listed in `haproxy.cfg`. 17 | You should see something like this: 18 | 19 | [![HAProxy Stats Page](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/haproxy-stats-page.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/haproxy-stats-page.png) 20 | 21 | ### Install the Datadog Agent 22 | 23 | The [Datadog Agent](https://github.com/DataDog/dd-agent) is open-source software that collects and reports metrics from all of your hosts so you can view, monitor, and correlate them on the Datadog platform. Installing the Agent usually requires just a single command. Installation instructions are platform-dependent and can be found [here](https://app.datadoghq.com/account/settings#agent). As soon as the Datadog Agent is up and running, you should see your host reporting basic system metrics [in your Datadog account](https://app.datadoghq.com/infrastructure). [![Reporting host in Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-host.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/default-host.png) 24 | 25 | ### Configure the Agent 26 | 27 | Next you will need to create an HAProxy configuration file for the Agent. You can find the location of the Agent configuration directory for your OS [here](http://docs.datadoghq.com/guides/basic_agent_usage/). In that directory you will find a [sample HAProxy config file](https://github.com/DataDog/dd-agent/blob/master/conf.d/haproxy.yaml.example) named **haproxy.yaml.example**. Copy this file to **haproxy.yaml**. You must edit the file to match the _username_, _password_, and _URL_ specified in your `haproxy.cfg`. 28 | 29 | init_config: 30 | 31 | instances: 32 | - url: http://localhost/admin?stats 33 | # username: username 34 | # password: password 35 | 36 | Save and close the file. 37 | 38 | ### Restart the Agent 39 | 40 | Restart the Agent to load your new configuration. The restart command varies somewhat by platform; see the specific commands for your platform [here](http://docs.datadoghq.com/guides/basic_agent_usage/). 41 | 42 | ### Verify the configuration settings 43 | 44 | To check that Datadog and HAProxy are properly integrated, execute the Datadog `info` command. The command for each platform is available [here](http://docs.datadoghq.com/guides/basic_agent_usage/). If the configuration is correct, you will see a section resembling the one below in the `info` output: 45 | 46 | Checks 47 | ====== 48 | [...] 49 | 50 | haproxy 51 | ------- 52 | - instance #0 [OK] Last run duration: 0.00831699371338 53 | - Collected 26 metrics & 0 events 54 | 55 | ### Turn on the integration 56 | 57 | Finally, click the HAProxy **Install Integration** button inside your Datadog account. The button is located under the _Configuration_ tab in the [HAProxy integration settings](https://app.datadoghq.com/account/settings#integrations/haproxy). [![Install the integration](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/install-integration.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/install-integration.png) 58 | 59 | ### Show me the metrics! 
60 | 61 | Once the Agent begins reporting metrics, you will see a comprehensive HAProxy dashboard among [your list of available dashboards](https://app.datadoghq.com/dash/list) in Datadog. The default HAProxy dashboard, as seen at the top of this article, displays the key metrics highlighted in our [introduction to HAProxy monitoring](http://www.datadoghq.com/blog/monitoring-haproxy-performance-metrics). 62 | 63 | You can easily create a more comprehensive dashboard to monitor your entire web stack by adding additional graphs and metrics from your other systems. For example, you might want to graph HAProxy metrics alongside metrics from your [NGINX web servers](https://www.datadoghq.com/blog/how-to-monitor-nginx-with-datadog/), or alongside host-level metrics such as memory usage on application servers. 64 | 65 | To start building a custom dashboard, clone the default HAProxy dashboard by clicking on the gear on the upper right of the dashboard and selecting **Clone Dash**. 66 | 67 | ![Clone dash](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-haproxy/clone-dash.png) 68 | 69 | ### Alerting on HAProxy metrics 70 | 71 | Once Datadog is capturing and visualizing your metrics, you will likely want to [set up some alerts](http://docs.datadoghq.com/guides/monitoring/) to be automatically notified of potential issues. With our recently released [outlier detection](https://www.datadoghq.com/blog/introducing-outlier-detection-in-datadog/) feature, you can get alerted on the things that matter. For example. you can set an alert if a particular backend is experiencing an increase in latency while the others are operating normally. Datadog can monitor individual hosts, containers, services, processes—or virtually any combination thereof. For instance, you can monitor all of your HAProxy frontends, backends, or all hosts in a certain availability zone, or even a single metric being reported by all hosts with a specific tag. 72 | 73 | ### Conclusion 74 | 75 | In this post we’ve walked you through integrating HAProxy with Datadog to visualize your key metrics and notify the right team whenever your infrastructure shows signs of trouble. If you’ve followed along using your own Datadog account, you should now have improved visibility into what’s happening in your environment, as well as the ability to create automated alerts tailored to your infrastructure, your usage patterns, and the metrics that are most valuable to your organization. If you don’t yet have a Datadog account, you can sign up for a [free trial](https://app.datadoghq.com/signup) and start monitoring HAProxy right away. -------------------------------------------------------------------------------- /azure/how_to_collect_azure_metrics.md: -------------------------------------------------------------------------------- 1 | # How to collect Azure metrics 2 | 3 | *This post is part 2 of a 3-part series on monitoring Azure virtual machines. [Part 1](/azure/how_to_monitor_microsoft_azure_vms.md) explores the key metrics available in Azure, and [Part 3](/azure/monitor_azure_vms_using_datadog.md) details how to monitor Azure with Datadog.* 4 | 5 | How you go about capturing and monitoring Azure metrics depends on your use case and the scale of your infrastructure. 
There are several ways to access metrics from Azure VMs: you can graph and monitor metrics using the Azure web portal, you can access the raw metric data via Azure storage, or you can use a monitoring service that integrates directly with Azure to gather metrics from your VMs. This post addresses the first two options (using the Azure web portal and accessing raw data); [a companion post](/azure/monitor_azure_vms_using_datadog.md) describes how to monitor your VMs by integrating Azure with Datadog. 6 | 7 | ## Viewing Azure metrics in the web portal 8 | 9 | The [Azure web portal](https://portal.azure.com/) has built-in monitoring functionality for viewing and alerting on performance metrics. You can graph any of the metrics available in Azure and set simple alert rules to send email notifications when metrics exceed minimum or maximum thresholds. 10 | 11 | ### Enabling Azure VM monitoring 12 | 13 | Azure’s Diagnostics extension can be enabled when you create a new virtual machine via the Azure web portal. But even if you disabled Diagnostics when creating a VM, you can turn it on later from the “Settings” menu in the VM view. You can select which metrics you wish to collect (Basic metrics, Network and web metrics, .NET metrics, etc.) in the Diagnostics tile as well. You will have to link the VM to an Azure storage account to store your Diagnostics data. 14 | 15 | [![Enable Azure diagnostics](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-enable-diagnostics-2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-enable-diagnostics-2.png) 16 | 17 | ### Viewing metrics in the web portal 18 | 19 | Once monitoring is enabled, you will see several default metric graphs when you click on your VM in the Azure portal. 20 | 21 | ![Default graphs](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-default-graphs.png) 22 | Clicking on any monitoring graph opens a larger view, along with two important settings options: “Edit chart,” which allows you to select the metrics and the timeframe displayed on that graph, and “Add alert,” which opens the Azure alerting tile. 23 | 24 | ![Metric graphs](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-bigger-graph.png) 25 | 26 | ### Adding alert rules 27 | 28 | In the alerting tile you can set alerts on Azure VM metrics. Azure alerts can be set against any upper or lower threshold and will alert whenever the selected metric exceeds (or falls below) that threshold for a set amount of time. In the example below, we have set an alert that will notify us by email whenever the CPU usage on the given virtual machine exceeds 90 percent over a 10-minute interval. 29 | 30 | [![Create alert](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-alert-rule.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-alert-rule.png) 31 | 32 | ## Accessing raw metric data in Azure storage 33 | 34 | Because Azure metrics are written to storage tables, you can access the raw data from Azure if you want to use external tools to graph or analyze your metrics. This post focuses on accessing metrics via Microsoft’s Visual Studio IDE, but you can also copy metric data tables to local storage using the [AzCopy utility](https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/) for Windows, or you can access metric data programmatically [using the .NET SDK](https://www.nuget.org/packages/Microsoft.Azure.Insights). 
Note that the [Azure command-line interface](https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-command-line-tools/), which is available for Mac, Linux, and Windows, will allow you to view the list of tables in your storage accounts (via the `azure storage table list` command) but not the actual contents of those tables. 35 | 36 | ### Connecting to Azure in Visual Studio Cloud Explorer 37 | 38 | Starting with Visual Studio 2015 and Azure SDK 2.7, you can now use Visual Studio’s [Cloud Explorer](https://msdn.microsoft.com/en-us/library/azure/mt185741.aspx) to view and manage your Azure resources. (Similar functionality is available using [Server Explorer](https://msdn.microsoft.com/en-us/library/azure/ff683677.aspx#BK_ViewResources) in older versions of Visual Studio, but not all Azure resources may be accessible.) 39 | 40 | To view the Cloud Explorer interface in Visual Studio 2015, go to View > Other Windows > Cloud Explorer. 41 | 42 | [![Cloud Explorer](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-cloud-explorer.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-cloud-explorer.png) 43 | Connect to your Azure account with Cloud Explorer by clicking on the gear and entering your account credentials. 44 | 45 | ![Add Azure account to Visual Studio](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-add-account.png) 46 | 47 | ### View stored Azure metrics 48 | 49 | Once you have signed in to your Azure subscription, you will find your metric storage listed under “Storage Accounts” or “Storage Accounts (Classic),” depending on whether the storage account was launched on Azure’s Resource Manager stack or on the Classic Stack. 50 | 51 | Metrics are stored in tables, the names of which usually start with “WADMetrics.” Open up a metric table in one of your metric storage accounts, and you will see your VM metrics. Each table contains 10 days worth of data to prevent any one table from growing too large; the date is appended to the end of the table name. 52 | 53 | ![Azure metrics in storage](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-wad-metrics2.png) 54 | 55 | ### Using stored metrics 56 | 57 | The name of your VM can be found at the end of each row’s partition key, which is helpful for filtering metrics when multiple VMs share the same metric storage account. The metric type can be found in the CounterName column. 58 | 59 | [![Metrics in tables](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-metric-table.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-metric-table.png) 60 | To export your data for use in Excel or another analytics tool, click the “Export to CSV File” button on the toolbar just above your table. 61 | 62 | ![Export metrics to CSV](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/2-export-to-csv.png) 63 | 64 | ## Conclusion 65 | 66 | In this post we have demonstrated how to use Azure’s built-in monitoring functionality to graph VM metrics and generate alerts when those metrics go out of bounds. We have also walked through the process of exporting raw metric data from Azure for custom analysis. 67 | 68 | At Datadog, we have integrated directly with Azure so that you can begin collecting and monitoring VM metrics with a minimum of setup. Learn how Datadog can help you to monitor Azure in the [next and final post](/azure/monitor_azure_vms_using_datadog.md) of this series. 
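As a quick reference for the raw-data workflow above, the classic Azure CLI command mentioned earlier can confirm which metric tables exist before you export them; exact flag names vary by CLI version, and the account values below are placeholders:

```
# Sketch only: list the WADMetrics* tables in a metric storage account
# with the classic Azure CLI (account name and key are placeholders).
azure storage table list --account-name <STORAGE_ACCOUNT> --account-key <STORAGE_KEY>
```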
69 | 70 | ------------------------------------------------------------------------ 71 | 72 | *Source Markdown for this post is available [on GitHub](https://github.com/DataDog/the-monitor/blob/master/azure/how_to_collect_azure_metrics.md). Questions, corrections, additions, etc.? Please [let us know](https://github.com/DataDog/the-monitor/issues).* 73 | -------------------------------------------------------------------------------- /mongodb/monitor-mongodb-performance-with-datadog.md: -------------------------------------------------------------------------------- 1 | #How to monitor MongoDB performance with Datadog 2 | 3 | *This post is the last of a 3-part series about how to best monitor MongoDB performance. Part 1 presents the key performance metrics available from MongoDB: there is [one post for the WiredTiger](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger) storage engine and [one for MMAPv1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-mmap). [Part 2](https://www.datadoghq.com/blog/collecting-mongodb-metrics-and-statistics) explains the different ways to collect MongoDB metrics.* 4 | 5 | If you’ve already read our first two parts in this series, you know that monitoring MongoDB gives you a range of metrics that allow you to explore its health and performance in great depth. But for databases running in production, you need a robust monitoring system that collects, aggregates, and visualizes MongoDB metrics along with metrics from the other parts of your infrastructure. Advanced alert mechanisms are also essential to be able to quickly react when things go awry. In this post, we’ll show you how to start monitoring MongoDB in a few minutes with Datadog. 6 | [![MongoDB graphs on Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/1-monitor/mongodb-performance-metrics.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/1-monitor/mongodb-performance-metrics.png) 7 | 8 | ## Monitor MongoDB performance in 3 easy steps 9 | 10 | ### Step 1: install the Datadog Agent 11 | 12 | The Datadog Agent is [the open-source software](https://github.com/DataDog/dd-agent) that collects and reports metrics from your hosts so that you can visualize and monitor them in Datadog. Installing the agent usually takes just a single command. 13 | 14 | Installation instructions for a variety of platforms are available [here](https://app.datadoghq.com/account/settings#agent). 15 | 16 | MongoDB also requires a user with “[read](https://docs.mongodb.com/manual/reference/built-in-roles/#read)” and “[clusterMonitor](https://docs.mongodb.com/manual/reference/built-in-roles/#clusterMonitor)” client [roles](https://docs.mongodb.com/manual/reference/built-in-roles/#database-user-roles) for Datadog so the Agent can collect all the server statistics. The commands to run in the mongo shell differs between MongoDB versions 2.x and 3.x. They are detailed in the “configuration” tab of the [MongoDB’s integration tile on the integrations page on Datadog](https://app.datadoghq.com/account/settings#integrations/mongodb). 17 | [![MongoDB graphs on Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/3-datadog/mongodb-integration.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/3-datadog/mongodb-integration.png) 18 | 19 | As soon as your Agent is up and running, you should see your host reporting metrics [on your Datadog account](https://app.datadoghq.com/infrastructure). 
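To make the user-creation step above concrete, here is a minimal sketch using MongoDB 3.x syntax; the exact commands for your version are listed in the integration tile referenced above, and the user name and password here are placeholders:

```
# Hedged sketch (MongoDB 3.x): create a monitoring user with the "read" and
# "clusterMonitor" roles mentioned above. User name and password are placeholders.
mongo admin --eval '
  db.createUser({
    user: "datadog",
    pwd: "CHANGE_ME",
    roles: [
      { role: "read", db: "admin" },
      { role: "clusterMonitor", db: "admin" }
    ]
  })
'
```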
20 | [![MongoDB Datadog Agent reporting metrics](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/3-datadog/mongodb-agent-setup.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/3-datadog/mongodb-agent-setup.png)
21 |
22 | ### Step 2: configure the Agent
23 |
24 | Then you’ll need to create a simple MongoDB configuration file for the Agent. For Linux hosts, configuration files are typically located in **/etc/dd-agent/conf.d/**, but you can find OS-specific config information [here](http://docs.datadoghq.com/guides/basic_agent_usage/).
25 |
26 | The Agent configuration file **mongo.yaml** is where you provide instance information. You can also apply tags to your MongoDB instances so you can filter and aggregate your metrics later.
27 |
28 | The Agent ships with a **mongo.yaml.example** template, but to access all of the metrics described in [Part 1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger) of this series, you should use the modified YAML template available [here](https://github.com/DataDog/dd-agent/blob/master/conf.d/mongo.yaml.example).
29 |
30 | ### Step 3: verify the configuration settings
31 |
32 | Restart the Agent using the [right command](http://docs.datadoghq.com/guides/basic_agent_usage/) for your platform, then check that Datadog and MongoDB are properly integrated by running the Datadog **info** command.
33 | If the configuration is correct, you should see a section like this in the info output:
34 |
35 | Checks
36 | ======
37 |
38 | [...]
39 |
40 | mongo
41 | -----
42 | - instance #0 [OK]
43 | - Collected 8 metrics & 0 events
44 |
45 | ### That’s it! You can now turn on the integration
46 |
47 | You can now switch on the MongoDB integration inside your Datadog account. It’s as simple as clicking the “Install Integration” button under the Configuration tab in the [MongoDB integration tile](https://app.datadoghq.com/account/settings#integrations/mongodb) on your Datadog account.
48 |
49 | ## Metrics! Metrics everywhere!
50 |
51 | Now that the Agent is properly configured, you will see all the MongoDB metrics available for monitoring, graphing, and correlation on Datadog.
52 |
53 | You can immediately see your metrics populating a default dashboard for MongoDB containing the essential MongoDB metrics presented in [Part 1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger). It should be a great starting point for your monitoring. You can clone this dashboard and customize it as you wish, even adding metrics from other parts of your infrastructure so that you can easily correlate what’s happening in MongoDB with what’s happening throughout your stack.
54 |
55 | [![MongoDB Dashboard on Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/1-monitor/new-datadog-mongodb-dashboard.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/1-monitor/new-datadog-mongodb-dashboard.png)
56 |
57 | *MongoDB default dashboard on Datadog*
58 |
59 | ## Alerting
60 |
61 | Once Datadog is capturing and graphing your metrics, you will likely want to set up some [alerts](https://www.datadoghq.com/blog/monitoring-101-alerting/) to keep watch over your metrics and to notify your teams about any issues.
62 |
63 | Datadog allows you to alert on individual hosts, services, processes, and metrics—or virtually any combination thereof.
For instance, you can monitor all of your hosts in a certain availability zone, or you can monitor a single key metric being reported by each of your MongoDB hosts. 64 | For example, as explained in [Part 1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger), the number of current connections is limited to 65,536 simultaneous connections by default since v3.0. So you might want to set up an alert whenever the corresponding metric is getting close to this maximum. 65 | [![MongoDB Datadog alert](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/3-datadog/mongodb-datadog-alert.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/3-datadog/mongodb-datadog-alert.png) 66 | 67 | Datadog also integrates with many communication tools such as Slack, PagerDuty or HipChat so you can notify your teams via the channels you already use. 68 | 69 | ## You are now a MongoDB pro! 70 | 71 | This concludes the series on how to monitor MongoDB performance. In this post we’ve walked you through integrating MongoDB with Datadog to visualize your key metrics and notify your team whenever your database shows signs of trouble. 72 | 73 | If you’ve followed along using your own Datadog account, you should now have unparalleled visibility into MongoDB’s activity and performance. You are also aware of the ability to create automated alerts tailored to your environment, usage patterns, and the metrics that are most valuable to your teams. 74 | 75 | If you don’t yet have a Datadog account, you can sign up for [a free trial](https://app.datadoghq.com/signup) and begin to monitor MongoDB performance alongside the rest of your infrastructure, your applications, and your services today. 76 | -------------------------------------------------------------------------------- /monitoring-101/monitoring_101_investigating_performance_issues.md: -------------------------------------------------------------------------------- 1 | # Monitoring 101: Investigating performance issues 2 | 3 | *This post is part of a series on effective monitoring. Be sure to check out the rest of the series: [Collecting the right data](/blog/monitoring-101-collecting-data/) and [Alerting on what matters](/blog/monitoring-101-alerting/).* 4 | 5 | The responsibilities of a monitoring system do not end with symptom detection. Once your monitoring system has notified you of a real symptom that requires attention, its next job is to help you diagnose the root cause by making your systems [observable](https://en.wikipedia.org/wiki/Observability) via the monitoring data you have collected. Often this is the least structured aspect of monitoring, driven largely by hunches and guess-and-check. This post describes a more directed approach that can help you to find and correct root causes more efficiently. 6 | 7 | This series of articles comes out of our experience monitoring large-scale infrastructure for [our customers](https://www.datadoghq.com/customers/). It also draws on the work of [Brendan Gregg](http://dtdg.co/use-method), [Rob Ewaschuk](http://dtdg.co/philosophy-alerting), and [Baron Schwartz](http://dtdg.co/metrics-attention). 8 | 9 | ## A word about data 10 | 11 | ![metric types](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-05-how-to-monitor/alerting101_chart_1.png) 12 | 13 | There are three main types of monitoring data that can help you investigate the root causes of problems in your infrastructure. 
Data types and best practices for their collection are discussed in-depth [in a companion post](https://www.datadoghq.com/blog/2015/06/monitoring-101-collecting-data/), but in short: 14 | 15 | - **Work metrics** indicate the top-level health of your system by measuring its useful output 16 | - **Resource metrics** quantify the utilization, saturation, errors, or availability of a resource that your system depends on 17 | - **Events** describe discrete, infrequent occurrences in your system such as code changes, internal alerts, and scaling events 18 | 19 | By and large, work metrics will surface the most serious symptoms and should therefore generate [the most serious alerts](https://www.datadoghq.com/blog/2015/06/monitoring-101-alerting/#page-on-symptoms). But the other metric types are invaluable for investigating the *causes* of those symptoms. In order for your systems to be [observable](https://en.wikipedia.org/wiki/Observability), you need sufficiently comprehensive measurements to provide a full picture of each system's health and function. 20 | 21 | ## It’s resources all the way down 22 | 23 | ![metric uses](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-05-how-to-monitor/alerting101_2_chart.png) 24 | 25 | Most of the components of your infrastructure can be thought of as resources. At the highest levels, each of your systems that produces useful work likely relies on other systems. For instance, the Apache server in a LAMP stack relies on a MySQL database as a resource to support its work. One level down, within MySQL are database-specific resources that MySQL uses to do *its* work, such as the finite pool of client connections. At a lower level still are the physical resources of the server running MySQL, such as CPU, memory, and disks. 26 | 27 | Thinking about which systems *produce* useful work, and which resources *support* that work, can help you to efficiently get to the root of any issues that surface. Understanding these hierarchies helps you build a mental model of how your systems interact, so you can quickly focus in on the key diagnostic metrics for any incident. When an alert notifies you of a possible problem, the following process will help you to approach your investigation systematically. 28 | 29 | ![recursive investigation](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-05-how-to-monitor/investigating_diagram_4.png) 30 | 31 | ### 1. Start at the top with work metrics 32 | 33 | First ask yourself, “Is there a problem? How can I characterize it?” If you don’t describe the issue clearly at the outset, it’s easy to lose track as you dive deeper into your systems to diagnose the issue. 34 | 35 | Next examine the work metrics for the highest-level system that is exhibiting problems. These metrics will often point to the source of the problem, or at least set the direction for your investigation. For example, if the percentage of work that is successfully processed drops below a set threshold, diving into error metrics, and especially the types of errors being returned, will often help narrow the focus of your investigation. Alternatively, if latency is high, and the throughput of work being requested by outside systems is also very high, perhaps the system is simply overburdened. 36 | 37 | ### 2. 
Dig into resources 38 | 39 | If you haven’t found the cause of the problem by inspecting top-level work metrics, next examine the resources that the system uses—physical resources as well as software or external services that serve as resources to the system. If you’ve already set up dashboards for each system as outlined below, you should be able to quickly find and peruse metrics for the relevant resources. Are those resources unavailable? Are they highly utilized or saturated? If so, recurse into those resources and begin investigating each of them at step 1. 40 | 41 | ### 3. Did something change? 42 | 43 | Next consider alerts and other events that may be correlated with your metrics. If a code release, [internal alert](https://www.datadoghq.com/blog/2015/06/monitoring-101-alerting/#levels-of-urgency), or other event was registered slightly before problems started occurring, investigate whether they may be connected to the problem. 44 | 45 | ### 4. Fix it (and don’t forget it) 46 | 47 | Once you have determined what caused the issue, correct it. Your investigation is complete when symptoms disappear—you can now think about how to change the system to avoid similar problems in the future. If you did not have the data you needed to quickly diagnose the problem, add more instrumentation to your system to ensure that those metrics and events are available for future responders. 48 | 49 | ## Build dashboards before you need them 50 | 51 | [![dashboard](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-05-how-to-monitor/example-dashboard-2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-05-how-to-monitor/example-dashboard-2.png) 52 | 53 | In an outage, every minute is crucial. To speed your investigation and keep your focus on the task at hand, set up dashboards in advance that help you observe the current and recent state of each system. You may want to set up one dashboard for your high-level application metrics, and one dashboard for each subsystem. Each system’s dashboard should render the work metrics of that system, along with resource metrics of the system itself and key metrics of the subsystems it depends on. If event data is available, overlay relevant events on the graphs for correlation analysis. 54 | 55 | ## Conclusion: Follow the metrics 56 | 57 | Adhering to a standardized monitoring framework allows you to investigate problems more systematically: 58 | 59 | - For each system in your infrastructure, set up a dashboard ahead of time that displays all its key metrics, with relevant events overlaid. 60 | - Investigate causes of problems by starting with the highest-level system that is showing symptoms, reviewing its [work and resource metrics](https://www.datadoghq.com/blog/2015/06/monitoring-101-collecting-data/#metrics) and any associated events. 61 | - If problematic resources are detected, apply the same investigation pattern to the resource (and its constituent resources) until your root problem is discovered and corrected. 62 | 63 | We would like to hear about your experiences as you apply this framework to your own monitoring practice. If it is working well, please [let us know on Twitter](https://twitter.com/datadoghq)! Questions, corrections, additions, complaints, etc? Please [let us know on GitHub](https://github.com/DataDog/the-monitor/blob/master/monitoring-101/monitoring_101_investigating_performance_issues.md). 
64 | -------------------------------------------------------------------------------- /cassandra/monitoring_cassandra_with_datadog.md: -------------------------------------------------------------------------------- 1 | *This post is the last of a 3-part series about monitoring Cassandra. [Part 1](https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/) is about the key performance metrics available from Cassandra, and [Part 2](https://www.datadoghq.com/blog/how-to-collect-cassandra-metrics/) details several ways to collect those metrics.* 2 | 3 | If you’ve already read our [first](https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/) [two](https://www.datadoghq.com/blog/how-to-collect-cassandra-metrics/) posts in this series, you know that monitoring Cassandra gives you a range of metrics that allow you to explore the health of your data store in great depth. But to get lasting value from those metrics, you need a robust monitoring system that collects, aggregates, and visualizes your Cassandra metrics—and alerts you when things go awry. In this post, we’ll show you how to set up Cassandra monitoring in Datadog. 4 | 5 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/intro-dashboard.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/intro-dashboard.png) 6 | 7 | ## Integrating Datadog and Cassandra 8 | 9 | ### Install the Datadog Agent 10 | 11 | The Datadog Agent is [the open-source software](https://github.com/DataDog/dd-agent) that collects and reports metrics from your hosts so that you can view and monitor them in Datadog. Installing the agent usually takes just a single command. 12 | 13 | Install instructions for a variety of platforms are available [here](https://app.datadoghq.com/account/settings#agent). 14 | 15 | As soon as your Agent is up and running, you should see your host reporting metrics [in your Datadog account](https://app.datadoghq.com/infrastructure). 16 | 17 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/infra_2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/infra_2.png) 18 | 19 | ### Configure the Agent 20 | 21 | Next you’ll need to create a simple Cassandra configuration file for the Agent. For Linux hosts, the configuration files are typically located in `/etc/dd-agent/conf.d/`, but you can find OS-specific config information [here](http://docs.datadoghq.com/guides/basic_agent_usage/). 22 | 23 | The Agent configuration file `cassandra.yaml` is where you provide the hostname and the port (note that Cassandra uses port 7199 by default for JMX monitoring), as well as your JMX authentication credentials (if enabled on your cluster). You can also use the config to define which Cassandra metrics Datadog will collect, or to apply tags to your Cassandra instances for filtering and aggregating your metrics. The Agent ships with a `cassandra.yaml.example` [template](https://github.com/DataDog/dd-agent/blob/master/conf.d/cassandra.yaml.example) that enables you to monitor all of the metrics described in [Part 1](https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/) of this series. 24 | 25 | ### Restart the Agent 26 | 27 | You must restart the Agent to load your new configuration file. The restart command varies somewhat by platform (see the specific commands for your platform [here](http://docs.datadoghq.com/guides/basic_agent_usage/)). 
For Debian/Ubuntu: 28 | 29 | ``` 30 | sudo /etc/init.d/datadog-agent restart 31 | ``` 32 | 33 | ### Verify the configuration settings 34 | 35 | To check that Datadog and Cassandra are properly integrated, run the Datadog `info` command. The command for each platform is available [here](http://docs.datadoghq.com/guides/basic_agent_usage/). For Debian/Ubuntu the command is: 36 | 37 | ``` 38 | sudo /etc/init.d/datadog-agent info 39 | ``` 40 | 41 | If the configuration is correct, you will see a section like this in the info output: 42 | 43 | ``` 44 | Checks 45 | ====== 46 | 47 | [...] 48 | 49 | cassandra 50 | --------- 51 | - instance #cassandra-localhost [OK] collected 81 metrics 52 | - Collected 81 metrics, 0 events & 0 service checks 53 | ``` 54 | 55 | ### Install the integration 56 | 57 | Finally, switch on the Cassandra integration inside your Datadog account. It’s as simple as clicking the “Install Integration” button under the Configuration tab in the [Cassandra integration settings](https://app.datadoghq.com/account/settings#integrations/cassandra) of your Datadog account. 58 | 59 | ## Metrics! 60 | 61 | Once the Agent is properly configured, you will see dozens of Cassandra metrics available for monitoring, graphing, and correlation in Datadog. 62 | 63 | You can easily create a comprehensive dashboard for your data store and its associated systems by graphing the Cassandra metrics from Part 1 with important metrics from outside Cassandra. For example, you may want to monitor system metrics, such as CPU and memory usage, as well as JVM metrics, such as the duration of stop-the-world garbage collection (GC) episodes, which is captured by the `jvm.gc.parnew.time` metric: 64 | 65 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/gc-parnew.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/gc-parnew.png) 66 | 67 | You can also manipulate the raw metrics that come out of Cassandra into something much more usable. For instance, recent versions of Cassandra expose metrics on *total* latency but not recent latency, which is the metric you will likely want. In Datadog you can easily extract and graph real-time latency, resampled several times a minute, using two metrics scoped to `clientrequest:read`: 68 | 69 | - `cassandra.total_latency.count `(the total number of microseconds elapsed in servicing client read requests) 70 | - `cassandra.latency.count` (the total number of read requests processed) 71 | 72 | By taking the diffs of each metric at every sampling interval and dividing them, you can monitor the real-time read latency (divided by 1,000 here to measure latency in milliseconds). In Datadog this just takes a few clicks in the graph editor: 73 | 74 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/diff.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/diff.png) 75 | 76 | ## Monitoring Cassandra 77 | 78 | Once Datadog is capturing and visualizing your metrics, you will likely want to set up some monitors to keep watch over your metrics—and to [alert](https://www.datadoghq.com/blog/monitoring-101-alerting/) you when there are problems. 79 | 80 | Datadog allows you to monitor individual hosts, services, processes, and metrics—or virtually any combination thereof. For instance, you can monitor all of your hosts in a certain availability zone, or you can monitor a single key metric being reported by each of your Cassandra hosts. 
As an example, you can set a change alert to notify you if your request throughput drops by a certain percentage in a short time, which can be a high-level indicator of problems in your systems. 81 | 82 | ## Conclusion 83 | 84 | In this post we’ve walked you through integrating Cassandra with Datadog to visualize your key metrics and notify your team whenever Cassandra shows signs of trouble. 85 | 86 | If you’ve followed along using your own Datadog account, you should now have unparalleled visibility into what’s happening in your Cassandra infrastructure, as well as the ability to create automated alerts tailored to your environment, your usage patterns, and the metrics that are most valuable to your organization. 87 | 88 | If you don’t yet have a Datadog account, you can sign up for [a free 14-day trial](https://app.datadoghq.com/signup) and start monitoring Cassandra alongside the rest of your infrastructure, your applications, and your services today. 89 | 90 | ------------------------------------------------------------------------ 91 | 92 | *Source Markdown for this post is available [on GitHub](https://github.com/DataDog/the-monitor/blob/master/cassandra/monitoring_cassandra_with_datadog.md). Questions, corrections, additions, etc.? Please [let us know](https://github.com/DataDog/the-monitor/issues).* 93 | 94 |

95 | -------------------------------------------------------------------------------- /translations/ja-jp/cassandra/monitoring_cassandra_with_datadog.md: -------------------------------------------------------------------------------- 1 | *This post is the last of a 3-part series about monitoring Cassandra. [Part 1](https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/) is about the key performance metrics available from Cassandra, and [Part 2](https://www.datadoghq.com/blog/how-to-collect-cassandra-metrics/) details several ways to collect those metrics.* 2 | 3 | If you’ve already read our [first](https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/) [two](https://www.datadoghq.com/blog/how-to-collect-cassandra-metrics/) posts in this series, you know that monitoring Cassandra gives you a range of metrics that allow you to explore the health of your data store in great depth. But to get lasting value from those metrics, you need a robust monitoring system that collects, aggregates, and visualizes your Cassandra metrics—and alerts you when things go awry. In this post, we’ll show you how to set up Cassandra monitoring in Datadog. 4 | 5 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/intro-dashboard.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/intro-dashboard.png) 6 | 7 | ## Integrating Datadog and Cassandra 8 | 9 | ### Install the Datadog Agent 10 | 11 | The Datadog Agent is [the open-source software](https://github.com/DataDog/dd-agent) that collects and reports metrics from your hosts so that you can view and monitor them in Datadog. Installing the agent usually takes just a single command. 12 | 13 | Install instructions for a variety of platforms are available [here](https://app.datadoghq.com/account/settings#agent). 14 | 15 | As soon as your Agent is up and running, you should see your host reporting metrics [in your Datadog account](https://app.datadoghq.com/infrastructure). 16 | 17 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/infra_2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/infra_2.png) 18 | 19 | ### Configure the Agent 20 | 21 | Next you’ll need to create a simple Cassandra configuration file for the Agent. For Linux hosts, the configuration files are typically located in `/etc/dd-agent/conf.d/`, but you can find OS-specific config information [here](http://docs.datadoghq.com/guides/basic_agent_usage/). 22 | 23 | The Agent configuration file `cassandra.yaml` is where you provide the hostname and the port (note that Cassandra uses port 7199 by default for JMX monitoring), as well as your JMX authentication credentials (if enabled on your cluster). You can also use the config to define which Cassandra metrics Datadog will collect, or to apply tags to your Cassandra instances for filtering and aggregating your metrics. The Agent ships with a `cassandra.yaml.example` [template](https://github.com/DataDog/dd-agent/blob/master/conf.d/cassandra.yaml.example) that enables you to monitor all of the metrics described in [Part 1](https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/) of this series. 24 | 25 | ### Restart the Agent 26 | 27 | You must restart the Agent to load your new configuration file. The restart command varies somewhat by platform (see the specific commands for your platform [here](http://docs.datadoghq.com/guides/basic_agent_usage/)). 
For Debian/Ubuntu: 28 | 29 | ``` 30 | sudo /etc/init.d/datadog-agent restart 31 | ``` 32 | 33 | ### Verify the configuration settings 34 | 35 | To check that Datadog and Cassandra are properly integrated, run the Datadog `info` command. The command for each platform is available [here](http://docs.datadoghq.com/guides/basic_agent_usage/). For Debian/Ubuntu the command is: 36 | 37 | ``` 38 | sudo /etc/init.d/datadog-agent info 39 | ``` 40 | 41 | If the configuration is correct, you will see a section like this in the info output: 42 | 43 | ``` 44 | Checks 45 | ====== 46 | 47 | [...] 48 | 49 | cassandra 50 | --------- 51 | - instance #cassandra-localhost [OK] collected 81 metrics 52 | - Collected 81 metrics, 0 events & 0 service checks 53 | ``` 54 | 55 | ### Install the integration 56 | 57 | Finally, switch on the Cassandra integration inside your Datadog account. It’s as simple as clicking the “Install Integration” button under the Configuration tab in the [Cassandra integration settings](https://app.datadoghq.com/account/settings#integrations/cassandra) of your Datadog account. 58 | 59 | ## Metrics! 60 | 61 | Once the Agent is properly configured, you will see dozens of Cassandra metrics available for monitoring, graphing, and correlation in Datadog. 62 | 63 | You can easily create a comprehensive dashboard for your data store and its associated systems by graphing the Cassandra metrics from Part 1 with important metrics from outside Cassandra. For example, you may want to monitor system metrics, such as CPU and memory usage, as well as JVM metrics, such as the duration of stop-the-world garbage collection (GC) episodes, which is captured by the `jvm.gc.parnew.time` metric: 64 | 65 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/gc-parnew.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/gc-parnew.png) 66 | 67 | You can also manipulate the raw metrics that come out of Cassandra into something much more usable. For instance, recent versions of Cassandra expose metrics on *total* latency but not recent latency, which is the metric you will likely want. In Datadog you can easily extract and graph real-time latency, resampled several times a minute, using two metrics scoped to `clientrequest:read`: 68 | 69 | - `cassandra.total_latency.count `(the total number of microseconds elapsed in servicing client read requests) 70 | - `cassandra.latency.count` (the total number of read requests processed) 71 | 72 | By taking the diffs of each metric at every sampling interval and dividing them, you can monitor the real-time read latency (divided by 1,000 here to measure latency in milliseconds). In Datadog this just takes a few clicks in the graph editor: 73 | 74 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/diff.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-cassandra/diff.png) 75 | 76 | ## Monitoring Cassandra 77 | 78 | Once Datadog is capturing and visualizing your metrics, you will likely want to set up some monitors to keep watch over your metrics—and to [alert](https://www.datadoghq.com/blog/monitoring-101-alerting/) you when there are problems. 79 | 80 | Datadog allows you to monitor individual hosts, services, processes, and metrics—or virtually any combination thereof. For instance, you can monitor all of your hosts in a certain availability zone, or you can monitor a single key metric being reported by each of your Cassandra hosts. 
As an example, you can set a change alert to notify you if your request throughput drops by a certain percentage in a short time, which can be a high-level indicator of problems in your systems. 81 | 82 | ## Conclusion 83 | 84 | In this post we’ve walked you through integrating Cassandra with Datadog to visualize your key metrics and notify your team whenever Cassandra shows signs of trouble. 85 | 86 | If you’ve followed along using your own Datadog account, you should now have unparalleled visibility into what’s happening in your Cassandra infrastructure, as well as the ability to create automated alerts tailored to your environment, your usage patterns, and the metrics that are most valuable to your organization. 87 | 88 | If you don’t yet have a Datadog account, you can sign up for [a free 14-day trial](https://app.datadoghq.com/signup) and start monitoring Cassandra alongside the rest of your infrastructure, your applications, and your services today. 89 | 90 | ------------------------------------------------------------------------ 91 | 92 | *Source Markdown for this post is available [on GitHub](https://github.com/DataDog/the-monitor/blob/master/cassandra/monitoring_cassandra_with_datadog.md). Questions, corrections, additions, etc.? Please [let us know](https://github.com/DataDog/the-monitor/issues).* 93 | 94 |

95 | -------------------------------------------------------------------------------- /azure/monitor_azure_vms_using_datadog.md: -------------------------------------------------------------------------------- 1 | # Monitor Azure VMs using Datadog 2 | 3 | *This post is part 3 of a 3-part series on monitoring Azure virtual machines. [Part 1](/blog/how-to-monitor-microsoft-azure-vms) explores the key metrics available in Azure, and [Part 2](/blog/how-to-collect-azure-metrics) is about collecting Azure VM metrics.* 4 | 5 | If you’ve already read [our post](/blog/how-to-collect-azure-metrics) on collecting Azure performance metrics, you’ve seen that you can view and alert on metrics from individual VMs using the Azure web portal. For a more dynamic, comprehensive view of your infrastructure, you can connect Azure to Datadog. 6 | 7 | ## Why Datadog? 8 | 9 | By integrating Datadog and Azure, you can collect and view metrics from across your infrastructure, correlate VM metrics with application-level metrics, and slice and dice your metrics using any combination of properties and custom tags. You can use the Datadog Agent to collect more metrics—and at higher resolution—than are available in the Azure portal. And with more than 100 supported integrations, you can route automated alerts to your team using third-party collaboration tools such as PagerDuty and Slack. 10 | 11 | In this post we’ll show you how to get started. 12 | 13 | ## How to integrate Datadog and Azure 14 | 15 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-azure-dash-2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-azure-dash-2.png) 16 | 17 | *Host map of Azure VMs by region* 18 | 19 | As with all hosts, you can install the Datadog Agent on an Azure VM (whether [Windows](https://app.datadoghq.com/account/settings#agent/windows) or [Linux](https://app.datadoghq.com/account/settings#agent/ubuntu)) using the command line or as part of your automated deployments. But Azure users can also integrate with Datadog using the Azure and Datadog web interfaces. There are two ways to set up the integration from your browser: 20 | 21 | 1. Enable Datadog to collect metrics via the Azure API 22 | 2. Install the Datadog Agent using the Azure web portal 23 | 24 | Both options provide basic metrics about your Azure VMs with a minimum of overhead, but the two approaches each provide somewhat different metric sets, and hence can be complementary. In this post we’ll walk you through both options and explain the benefits of each. 25 | 26 | ## Enable Datadog to collect Azure performance metrics 27 | 28 | The easiest way to start gathering metrics from Azure is to connect Datadog to Azure’s read-only monitoring API. You won’t need to install anything, and you’ll start seeing basic metrics from all your VMs right away. 29 | 30 | To authorize Datadog to collect metrics from your Azure VMs, simply click [this link](https://app.datadoghq.com/azure/landing) and follow the directions on the configuration pane under the heading “To start monitoring all your Azure Resources”. 
31 | 32 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/azure-config-update.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/azure-config-update.png) 33 | 34 | ### View your Azure performance metrics 35 | 36 | Once you have successfully integrated Datadog with Azure, you will see [an Azure VM default screenboard](https://app.datadoghq.com/screen/integration/azure_vm) on your list of [Integration Dashboards](https://app.datadoghq.com/dash/list). The basic Azure dashboard displays all of the key CPU, disk I/O, and network metrics highlighted in Part 1 of this series, “How to monitor Microsoft Azure VMs”. 37 | 38 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/azure-vm-screenboard-update.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/azure-vm-screenboard-update.png) 39 | 40 | ### Customize your Azure dashboards 41 | 42 | Once you are capturing Azure metrics in Datadog, you can build on the default screenboard by adding additional Azure VM metrics or even graphs and metrics from outside systems. To start building a custom screenboard, clone the default Azure dashboard by clicking on the gear on the upper right of the dashboard and selecting “Clone Dash”. You can also add VM metrics to any custom timeboard, which is an interactive Datadog dashboard displaying the evolution of multiple metrics across any timeframe. 43 | 44 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/azure-clone-update.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/azure-clone-update.png) 45 | 46 | ## Install the Datadog Agent on an Azure VM 47 | 48 | Installing the Datadog Agent lets you monitor additional server-level metrics from the host, as well as real-time metrics from the applications running on the VM. Agent metrics are collected at higher resolution than per-minute Azure portal metrics. 49 | 50 | Azure users can install the Datadog Agent as an Azure extension in seconds. 51 | 52 | ### Install the Agent from the Azure portal 53 | 54 | In the [Azure web portal](https://portal.azure.com/), click on the name of your VM to bring up the details of that VM. From the details pane, click the “Settings” gear and select “Extensions.” 55 | 56 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-extensions.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-extensions.png) 57 | 58 | On the Extensions tile, click “Add” to select a new extension. From the list of extensions, select the Datadog Agent for your operating system. 59 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-dd-agent.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-dd-agent.png) 60 | 61 | Click “Create” to add the extension. 62 | 63 | ### Configure the Agent with your Datadog API key 64 | 65 | At this point you will need to provide your Datadog API key to connect the Agent to your Datadog account. You can find your API key via [this link](https://app.datadoghq.com/azure/landing/). 66 | 67 | ### Viewing your Azure VMs and metrics 68 | 69 | Once the Agent starts reporting metrics, you will see your Azure VMs appear as part of your monitored infrastructure in Datadog. 
70 | 71 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-hostmap.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-hostmap.png) 72 | 73 | Clicking on any VM allows you to view the integrations and metrics from that VM. 74 | 75 | ### Agent metrics 76 | 77 | Installing the Agent provides you with system metrics (such as `system.disk.in_use`) for each VM, as opposed to the Azure metrics (such as `azure.vm.memory_pages_per_sec`) collected via the Azure monitoring API as described above. 78 | 79 | The Agent can also collect application metrics so that you can correlate your application’s performance with the host-level metrics from your compute layer. The Agent monitors services running in an Azure VM, such as IIS and SQL Server, as well as non-Windows integrations such as MySQL, NGINX, and Cassandra. 80 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-wmi.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-08-azure/3-wmi.png) 81 | 82 | ## Conclusion 83 | 84 | In this post we’ve walked you through integrating Azure with Datadog so you can visualize and alert on your key metrics. You can also see which VMs are overutilized or underutilized and should be resized to improve performance or save costs. 85 | 86 | Monitoring Azure with Datadog gives you critical visibility into what’s happening with your VMs and your Azure applications. You can easily create automated alerts on any metric across any group of VMs, with triggers tailored precisely to your infrastructure and your usage patterns. 87 | 88 | If you don’t yet have a Datadog account, you can sign up for a [free trial](https://app.datadoghq.com/signup) and start monitoring your cloud infrastructure, your applications, and your services today. 89 | 90 | ------------------------------------------------------------------------ 91 | 92 | *Source Markdown for this post is available [on GitHub](https://github.com/DataDog/the-monitor/blob/master/azure/monitor_azure_vms_using_datadog.md). Questions, corrections, additions, etc.? Please [let us know](https://github.com/DataDog/the-monitor/issues).* 93 | -------------------------------------------------------------------------------- /mysql/mysql_monitoring_with_datadog.md: -------------------------------------------------------------------------------- 1 | *This is the final post in a 3-part series about MySQL monitoring. [Part 1][part-1] explores the key metrics available from MySQL, and [Part 2][part-2] explains how to collect those metrics.* 2 | 3 | If you’ve already read [our post][part-2] on collecting MySQL metrics, you’ve seen that you have several options for ad hoc performance checks. For a more comprehensive view of your database's health and performance, however, you need a monitoring system that continually collects MySQL statistics and metrics, that lets you identify both recent and long-term performance trends, and that can help you identify and investigate issues when they arise. This post will show you how to set up comprehensive MySQL monitoring by installing the Datadog Agent on your database servers. 4 | 5 | [![MySQL dashboard in Datadog][dash-img]][dash-img] 6 | 7 | ## Integrate Datadog with MySQL 8 | As explained in [Part 1][part-1], MySQL exposes hundreds of valuable metrics and statistics about query execution and database performance. 
To collect those metrics on an ongoing basis, the Datadog Agent connects to MySQL at regular intervals, queries for the latest values, and reports them to Datadog for graphing and alerting. 9 | 10 | ### Installing the Datadog Agent 11 | 12 | Installing the Agent on your MySQL server is easy: it usually requires just a single command, and the Agent can collect basic metrics even if [the MySQL performance schema][p_s] is not enabled and the sys schema is not installed. Installation instructions for a variety of operating systems and platforms are available [here][agent-install]. 13 | 14 | ### Configure the Agent to collect MySQL metrics 15 | 16 | Once the Agent is installed, you need to grant it access to read metrics from your database. In short, this process has four steps: 17 | 18 | 1. Create a `datadog` user in MySQL and grant it permission to run metric queries on your behalf. 19 | 2. Copy Datadog's `conf.d/mysql.yaml.example` [template][conf] to `conf.d/mysql.yaml` to create a configuration file for Datadog. 20 | 3. Add the login credentials for your newly created `datadog` user to `conf.d/mysql.yaml`. 21 | 4. Restart the agent. 22 | 23 | The [MySQL configuration tile][mysql-config] in the Datadog app has the full instructions, including the exact SQL commands you need to run to create the `datadog` user and apply the appropriate permissions. 24 | 25 | ### Configure collection of additional MySQL metrics 26 | 27 | Out of the box, Datadog collects more than 60 standard metrics from modern versions of MySQL. Definitions and measurement units for most of those standard metrics can be found [here][metric-list]. 28 | 29 | Starting with Datadog Agent version 5.7, many additional metrics are available by enabling specialized checks in the `conf.d/mysql.yaml` file (see [the configuration template][conf] for context): 30 | 31 | ``` 32 | # options: 33 | # replication: false 34 | # galera_cluster: false 35 | # extra_status_metrics: true 36 | # extra_innodb_metrics: true 37 | # extra_performance_metrics: true 38 | # schema_size_metrics: false 39 | # disable_innodb_metrics: false 40 | ``` 41 | 42 | To collect average statistics on query latency, as described in [Part 1][runtime] of this series, you will need to enable the `extra_performance_metrics` option and ensure that [the performance schema is enabled][p_s]. The Agent's `datadog` user in MySQL will also need the [additional permissions][mysql-config] detailed in the MySQL configuration instructions in the Datadog app. 43 | 44 | Note that the `extra_performance_metrics` and `schema_size_metrics` options trigger heavier queries against your database, so you may be subject to performance impacts if you enable those options on servers with a large number of schemas or tables. Therefore you may wish to test out these options on a limited basis before deploying them to production. 45 | 46 | Other options include: 47 | 48 | * `extra_status_metrics` to expand the set of server status variables reported to Datadog 49 | * `extra_innodb_metrics` to collect more than 80 additional metrics specific to the InnoDB storage engine 50 | * `replication` to collect basic metrics (such as replica lag) on MySQL replicas 51 | 52 | To override default behavior for any of these optional checks, simply uncomment the relevant lines of the configuration file (along with the `options:` line) and restart the agent. 53 | 54 | The specific metrics associated with each option are detailed in [the source code][checks] for the MySQL Agent check. 
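
Putting the steps above together, a minimal `conf.d/mysql.yaml` might look roughly like the sketch below. Treat it as an illustration rather than a reference: the authoritative field names are the ones in [the configuration template][conf], and the host, port, tag, and password shown here are placeholders for your own values.

```
init_config:

instances:
  # Placeholder connection details: substitute your own host, port, and the
  # password you chose for the datadog user.
  - server: localhost
    user: datadog
    pass: <YOUR_CHOSEN_PASSWORD>
    port: 3306
    tags:
      - 'env:prod'
    options:
      extra_status_metrics: true
      extra_innodb_metrics: true
      extra_performance_metrics: true
      schema_size_metrics: false
```

As with any change to the Agent configuration, restart the Agent after saving the file so the new settings take effect.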
55 | 56 | ## View your comprehensive MySQL dashboard 57 | 58 | [![MySQL dashboard in Datadog][dash-img]][dash-img] 59 | 60 | Once you have integrated Datadog with MySQL, a comprehensive dashboard called “[MySQL - Overview][mysql-dash]” will appear in your list of [integration dashboards][dash-list]. The dashboard gathers key MySQL metrics highlighted in [Part 1][part-1] of this series, along with server resource metrics, such as CPU and I/O wait, which are invaluable for investigating performance issues. 61 | 62 | ### Customize your dashboard 63 | 64 | The Datadog Agent can also collect metrics from the rest of your infrastructure so that you can correlate your entire system's performance with metrics from MySQL. The Agent collects metrics from [Docker][docker], [NGINX][nginx], [Redis][redis], and 100+ other applications and services. You can also easily instrument your own application code to [report custom metrics to Datadog using StatsD][statsd]. 65 | 66 | To add more metrics from MySQL or related systems to your MySQL dashboard, simply clone [the template dash][mysql-dash] by clicking on the gear in the upper right. 67 | 68 | ## Conclusion 69 | 70 | In this post we’ve walked you through integrating MySQL with Datadog so you can access all your database metrics in one place, whether standard metrics from MySQL, more detailed metrics from the InnoDB storage engine, or automatically computed metrics on query latency. 71 | 72 | Monitoring MySQL with Datadog gives you critical visibility into what’s happening with your database and the applications that depend on it. You can easily create automated alerts on any metric, with triggers tailored precisely to your infrastructure and your usage patterns. 73 | 74 | If you don’t yet have a Datadog account, you can sign up for a [free trial][trial] to start monitoring all your servers, applications, and services today. 75 | 76 | - - - 77 | 78 | *Source Markdown for this post is available [on GitHub][markdown]. Questions, corrections, additions, etc.? 
Please [let us know][issues].* 79 | 80 | [part-1]: https://www.datadoghq.com/blog/monitoring-mysql-performance-metrics/ 81 | [part-2]: https://www.datadoghq.com/blog/collecting-mysql-statistics-and-metrics/ 82 | [mysql-config]: https://app.datadoghq.com/account/settings#integrations/mysql 83 | [mysql-dash]: https://app.datadoghq.com/dash/integration/mysql 84 | [dd-agent]: https://github.com/DataDog/dd-agent 85 | [agent-install]: https://app.datadoghq.com/account/settings#agent 86 | [statsd]: https://www.datadoghq.com/blog/statsd/ 87 | [dash-list]: https://app.datadoghq.com/dash/list 88 | [trial]: https://app.datadoghq.com/signup 89 | [docker]: https://www.datadoghq.com/blog/the-docker-monitoring-problem/ 90 | [nginx]: https://www.datadoghq.com/blog/how-to-monitor-nginx/ 91 | [redis]: https://www.datadoghq.com/blog/how-to-monitor-redis-performance-metrics/ 92 | [elb]: https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics/ 93 | [p_s]: https://www.datadoghq.com/blog/collecting-mysql-statistics-and-metrics/#querying-the-performance-schema-and-sys-schema 94 | [checks]: https://github.com/DataDog/dd-agent/blob/master/checks.d/mysql.py 95 | [conf]: https://github.com/DataDog/dd-agent/blob/master/conf.d/mysql.yaml.example 96 | [metric-list]: http://docs.datadoghq.com/integrations/mysql/#metrics 97 | [runtime]: https://www.datadoghq.com/blog/monitoring-mysql-performance-metrics/#query-performance 98 | [dash-img]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-04-mysql/mysql-dash-dd.png 99 | [markdown]: https://github.com/DataDog/the-monitor/blob/master/mysql/mysql_monitoring_with_datadog.md 100 | [issues]: https://github.com/DataDog/the-monitor/issues 101 | 102 | -------------------------------------------------------------------------------- /google-compute-engine/monitor-google-compute-engine-with-datadog.md: -------------------------------------------------------------------------------- 1 | --- 2 | blog/category: ["Series"] 3 | blog/tag: ["GCE", "GCP", "cloud"] 4 | date: 2017-03-07T00:00:05Z 5 | description: "Learn how to use Datadog to collect Google Compute Engine metrics." 6 | draft: false 7 | email: evan@datadoghq.com 8 | featured: false 9 | image: monitor-gce-with-dd-hero.png 10 | meta_title: null 11 | preview_image: monitor-gce-with-dd-hero.png 12 | scribbler: "Evan Mouzakitis" 13 | scribbler_image: eea029c16d26d3928612cf4f22613ee7.jpg 14 | slug: monitor-google-compute-engine-with-datadog 15 | sub_featured: true 16 | title: "How to monitor Google Compute Engine with Datadog" 17 | twitter_handle: vagelim 18 | toc_cta_text: "Start monitoring GCE" 19 | --- 20 | 21 | _This post is the final part of a 3-part series on how to monitor Google Compute Engine. [Part 1][part1] explores the key metrics available from GCE, and [part 2][part2] is about collecting those metrics using Google-native tools._ 22 | 23 | To have a clear picture of GCE's operations, you need a system dedicated to storing, visualizing, and correlating your Google Compute Engine metrics with metrics from the rest of your infrastructure. If you’ve read [our post][part2] on collecting GCE metrics, you've seen how you can quickly and easily pull metrics using the Stackdriver Monitoring API and gcloud, and had a chance to see Google's monitoring service, Stackdriver, in action. 
24 | 25 | Though these solutions are excellent starting points, they have their limitations, especially when it comes to integration with varied infrastructure components and platforms, as well as data retention for long-term monitoring and trend analysis. 26 | 27 | {{< img src="gce-dashboard1.png" alt="Datadog's out-of-the-box, customizable Google Compute Engine dashboard" popup="true" >}} 28 | 29 | Datadog enables you to collect GCE metrics for visualization, alerting, and full-infrastructure correlation. Datadog will automatically collect the key performance metrics discussed in [parts one][part1] and [two][part2] of this series, and make them available in a customizable dashboard, as seen above. Datadog retains your data for {{< translate key="retention" >}} at full granularity, so you can easily compare real-time metrics against values from last month, last quarter, or last year. And if you [install the Datadog Agent](#install-the-agent), you gain additional system resource metrics (including memory usage, disk I/O, and more) and benefit from integrations with more than 150 technologies and services. 30 | 31 | You can integrate Datadog with GCE in two ways: 32 | 33 | - [Enable the Google Cloud Platform integration](#enable-integration) to collect all of the metrics from the first part of this series 34 | - [Install the Agent](#install-the-agent) to collect all system metrics, including those not available from Google's monitoring APIs 35 | 36 | 37 | ## Enable the Google Cloud Platform integration 38 | Enabling the [Google Cloud Platform integration][gcp-tile] is the quickest way to start monitoring your GCE instances and the rest of your Google Cloud Platform resources. And since Datadog supports OAuth login with your GCP account, you can start seeing your GCE metrics in just a few clicks. 39 | 40 | {{< img src="gcp-oauth-login.png" alt="Integrating GCP with Datadog is as easy as signing into your Google account." popup="true" >}} 41 | 42 | Once signed in, add the [id of the project][project-id] you want to monitor, optionally restrict the set of hosts to monitor, and click _Update Configuration_. 43 | 44 | After a couple of minutes you should see metrics streaming into the customizable [Google Compute Engine][gce-dash-link] dashboard. 45 | 46 | [gce-dash-link]: https://app.datadoghq.com/screen/integration/gce 47 | [gcp-tile]: https://app.datadoghq.com/account/settings#integrations/google_cloud_platform 48 | [project-id]: https://console.cloud.google.com/project 49 | 50 | ## Install the Agent 51 | 52 | The Datadog Agent is [open source software](https://github.com/DataDog/dd-agent) that collects and reports metrics from your hosts so that you can view and monitor them in Datadog. Installing the Agent usually takes just a single command. 53 | 54 | Installation instructions for a variety of platforms are available [here](https://app.datadoghq.com/account/settings#agent). 55 | 56 | As soon as the Agent is up and running, you should see your host reporting metrics in your [Datadog account](https://app.datadoghq.com/infrastructure). 57 | 58 | {{< img src="host1.png" alt="Hosts reporting in." popup="true" >}} 59 | 60 | No additional configuration is necessary, but if you want to collect more than just host metrics, head over to the [integrations page](https://app.datadoghq.com/account/settings) to enable monitoring for over {{< translate key="integration_count" >}} applications and services. 
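
For a Linux instance, that single command is the one-line installer shown on the [Agent installation page](https://app.datadoghq.com/account/settings#agent). As a rough sketch (Agent v5 on a Debian/Ubuntu GCE instance, with the API key as a placeholder and the exact installer command to be copied from that page), the install-and-verify sequence looks something like this:

```
# Sketch only: copy the exact one-liner from the Agent installation page in
# your Datadog account; DD_API_KEY below is a placeholder for your own API key.
DD_API_KEY=<YOUR_API_KEY> bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"

# Confirm that the Agent is running and submitting metrics (Agent v5, Debian/Ubuntu)
sudo /etc/init.d/datadog-agent info
```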
61 | 62 | ## Monitoring GCE with Datadog dashboards 63 | 64 | The template GCE dashboard in Datadog is a great resource, but you can easily create a more comprehensive dashboard to monitor your entire application stack by adding graphs and metrics from your other systems. For example, you might want to graph GCE metrics alongside metrics from [Kubernetes](https://www.datadoghq.com/blog/monitoring-kubernetes-era/) or [Docker](https://www.datadoghq.com/blog/the-docker-monitoring-problem/), [performance metrics](https://www.datadoghq.com/blog/announcing-apm/) from your applications, or host-level metrics such as memory usage on application servers. To start extending the template dashboard, clone the default GCE dashboard by clicking on the gear on the upper right of the dashboard and selecting _Clone Dashboard_. 65 | 66 | {{< img src="clone-dash.png" alt="Customize the out-of-the-box dashboard by making a clone." popup="true" >}} 67 | 68 | ## Drilling down with tags 69 | All Google Compute Engine metrics are [tagged](https://docs.datadoghq.com/guides/tagging/) with the following information: 70 | 71 | - `availability-zone` 72 | - `cloud_provider` 73 | - `instance-type` 74 | - `instance-id` 75 | - `automatic-restart` 76 | - `on-host-maintenace` 77 | - `numeric_project_id` 78 | - `name` 79 | - `project` 80 | - `zone` 81 | - any additional labels and tags you added in GCP 82 | 83 | {{< img src="template-vars.png" alt="Use template variables to slice and dice with tags." popup="true" >}} 84 | 85 | You can easily [slice your metrics](https://www.datadoghq.com/blog/the-power-of-tagged-metrics/) to isolate a particular subset of hosts using tags. In the out-of-the-box GCE screenboard, you can use the template variable selectors in the upper left to drill down to a specific host or set of hosts. And you can similarly use tags in any Datadog graph or alert definition to filter or aggregate your metrics. 86 | 87 | ## Alerts 88 | 89 | Once Datadog is capturing and visualizing your metrics, you will likely want to [set up some alerts](https://docs.datadoghq.com/guides/monitoring/) to be automatically notified of potential issues. With powerful algorithmic alerting features like [outlier](https://www.datadoghq.com/blog/introducing-outlier-detection-in-datadog/) [detection](https://www.datadoghq.com/blog/scaling-outlier-algorithms/) and [anomaly detection](https://www.datadoghq.com/blog/introducing-anomaly-detection-datadog/), you can be automatically alerted to unexpected instance behavior. 90 | 91 | 92 | ## Observability awaits 93 | We’ve now walked through how to use Datadog to collect, visualize, and alert on your Google Compute Engine metrics. If you’ve followed along with your Datadog account, you should now have greater visibility into the state of your instances. 94 | 95 | If you don’t yet have a Datadog account, you can start monitoring Google Compute Engine right away with a free trial. 96 | 97 | 98 | _Source Markdown for this post is available [on GitHub][the-monitor]. Questions, corrections, additions, etc.? 
Please [let us know][issues]._ 99 | 100 | [the-monitor]: https://github.com/datadog/the-monitor 101 | 102 | [part1]: /blog/monitoring-google-compute-engine-performance 103 | [part2]: /blog/how-to-collect-gce-metrics 104 | [part3]: /blog/monitor-google-compute-engine-with-datadog 105 | 106 | [issues]: https://github.com/DataDog/the-monitor/issues 107 | -------------------------------------------------------------------------------- /translations/ja-jp/mysql/mysql_monitoring_with_datadog.md: -------------------------------------------------------------------------------- 1 | *This is the final post in a 3-part series about MySQL monitoring. [Part 1][part-1] explores the key metrics available from MySQL, and [Part 2][part-2] explains how to collect those metrics.* 2 | 3 | If you’ve already read [our post][part-2] on collecting MySQL metrics, you’ve seen that you have several options for ad hoc performance checks. For a more comprehensive view of your database's health and performance, however, you need a monitoring system that continually collects MySQL statistics and metrics, that lets you identify both recent and long-term performance trends, and that can help you identify and investigate issues when they arise. This post will show you how to set up comprehensive MySQL monitoring by installing the Datadog Agent on your database servers. 4 | 5 | [![MySQL dashboard in Datadog][dash-img]][dash-img] 6 | 7 | ## Integrate Datadog with MySQL 8 | As explained in [Part 1][part-1], MySQL exposes hundreds of valuable metrics and statistics about query execution and database performance. To collect those metrics on an ongoing basis, the Datadog Agent connects to MySQL at regular intervals, queries for the latest values, and reports them to Datadog for graphing and alerting. 9 | 10 | ### Installing the Datadog Agent 11 | 12 | Installing the Agent on your MySQL server is easy: it usually requires just a single command, and the Agent can collect basic metrics even if [the MySQL performance schema][p_s] is not enabled and the sys schema is not installed. Installation instructions for a variety of operating systems and platforms are available [here][agent-install]. 13 | 14 | ### Configure the Agent to collect MySQL metrics 15 | 16 | Once the Agent is installed, you need to grant it access to read metrics from your database. In short, this process has four steps: 17 | 18 | 1. Create a `datadog` user in MySQL and grant it permission to run metric queries on your behalf. 19 | 2. Copy Datadog's `conf.d/mysql.yaml.example` [template][conf] to `conf.d/mysql.yaml` to create a configuration file for Datadog. 20 | 3. Add the login credentials for your newly created `datadog` user to `conf.d/mysql.yaml`. 21 | 4. Restart the agent. 22 | 23 | The [MySQL configuration tile][mysql-config] in the Datadog app has the full instructions, including the exact SQL commands you need to run to create the `datadog` user and apply the appropriate permissions. 24 | 25 | ### Configure collection of additional MySQL metrics 26 | 27 | Out of the box, Datadog collects more than 60 standard metrics from modern versions of MySQL. Definitions and measurement units for most of those standard metrics can be found [here][metric-list]. 
28 | 29 | Starting with Datadog Agent version 5.7, many additional metrics are available by enabling specialized checks in the `conf.d/mysql.yaml` file (see [the configuration template][conf] for context): 30 | 31 | ``` 32 | # options: 33 | # replication: false 34 | # galera_cluster: false 35 | # extra_status_metrics: true 36 | # extra_innodb_metrics: true 37 | # extra_performance_metrics: true 38 | # schema_size_metrics: false 39 | # disable_innodb_metrics: false 40 | ``` 41 | 42 | To collect average statistics on query latency, as described in [Part 1][runtime] of this series, you will need to enable the `extra_performance_metrics` option and ensure that [the performance schema is enabled][p_s]. The Agent's `datadog` user in MySQL will also need the [additional permissions][mysql-config] detailed in the MySQL configuration instructions in the Datadog app. 43 | 44 | Note that the `extra_performance_metrics` and `schema_size_metrics` options trigger heavier queries against your database, so you may be subject to performance impacts if you enable those options on servers with a large number of schemas or tables. Therefore you may wish to test out these options on a limited basis before deploying them to production. 45 | 46 | Other options include: 47 | 48 | * `extra_status_metrics` to expand the set of server status variables reported to Datadog 49 | * `extra_innodb_metrics` to collect more than 80 additional metrics specific to the InnoDB storage engine 50 | * `replication` to collect basic metrics (such as replica lag) on MySQL replicas 51 | 52 | To override default behavior for any of these optional checks, simply uncomment the relevant lines of the configuration file (along with the `options:` line) and restart the agent. 53 | 54 | The specific metrics associated with each option are detailed in [the source code][checks] for the MySQL Agent check. 55 | 56 | ## View your comprehensive MySQL dashboard 57 | 58 | [![MySQL dashboard in Datadog][dash-img]][dash-img] 59 | 60 | Once you have integrated Datadog with MySQL, a comprehensive dashboard called “[MySQL - Overview][mysql-dash]” will appear in your list of [integration dashboards][dash-list]. The dashboard gathers key MySQL metrics highlighted in [Part 1][part-1] of this series, along with server resource metrics, such as CPU and I/O wait, which are invaluable for investigating performance issues. 61 | 62 | ### Customize your dashboard 63 | 64 | The Datadog Agent can also collect metrics from the rest of your infrastructure so that you can correlate your entire system's performance with metrics from MySQL. The Agent collects metrics from [Docker][docker], [NGINX][nginx], [Redis][redis], and 100+ other applications and services. You can also easily instrument your own application code to [report custom metrics to Datadog using StatsD][statsd]. 65 | 66 | To add more metrics from MySQL or related systems to your MySQL dashboard, simply clone [the template dash][mysql-dash] by clicking on the gear in the upper right. 67 | 68 | ## Conclusion 69 | 70 | In this post we’ve walked you through integrating MySQL with Datadog so you can access all your database metrics in one place, whether standard metrics from MySQL, more detailed metrics from the InnoDB storage engine, or automatically computed metrics on query latency. 71 | 72 | Monitoring MySQL with Datadog gives you critical visibility into what’s happening with your database and the applications that depend on it. 
You can easily create automated alerts on any metric, with triggers tailored precisely to your infrastructure and your usage patterns. 73 | 74 | If you don’t yet have a Datadog account, you can sign up for a [free trial][trial] to start monitoring all your servers, applications, and services today. 75 | 76 | - - - 77 | 78 | *Source Markdown for this post is available [on GitHub][markdown]. Questions, corrections, additions, etc.? Please [let us know][issues].* 79 | 80 | [part-1]: https://www.datadoghq.com/blog/monitoring-mysql-performance-metrics/ 81 | [part-2]: https://www.datadoghq.com/blog/collecting-mysql-statistics-and-metrics/ 82 | [mysql-config]: https://app.datadoghq.com/account/settings#integrations/mysql 83 | [mysql-dash]: https://app.datadoghq.com/dash/integration/mysql 84 | [dd-agent]: https://github.com/DataDog/dd-agent 85 | [agent-install]: https://app.datadoghq.com/account/settings#agent 86 | [statsd]: https://www.datadoghq.com/blog/statsd/ 87 | [dash-list]: https://app.datadoghq.com/dash/list 88 | [trial]: https://app.datadoghq.com/signup 89 | [docker]: https://www.datadoghq.com/blog/the-docker-monitoring-problem/ 90 | [nginx]: https://www.datadoghq.com/blog/how-to-monitor-nginx/ 91 | [redis]: https://www.datadoghq.com/blog/how-to-monitor-redis-performance-metrics/ 92 | [elb]: https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics/ 93 | [p_s]: https://www.datadoghq.com/blog/collecting-mysql-statistics-and-metrics/#querying-the-performance-schema-and-sys-schema 94 | [checks]: https://github.com/DataDog/dd-agent/blob/master/checks.d/mysql.py 95 | [conf]: https://github.com/DataDog/dd-agent/blob/master/conf.d/mysql.yaml.example 96 | [metric-list]: http://docs.datadoghq.com/integrations/mysql/#metrics 97 | [runtime]: https://www.datadoghq.com/blog/monitoring-mysql-performance-metrics/#query-performance 98 | [dash-img]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-04-mysql/mysql-dash-dd.png 99 | [markdown]: https://github.com/DataDog/the-monitor/blob/master/mysql/mysql_monitoring_with_datadog.md 100 | [issues]: https://github.com/DataDog/the-monitor/issues 101 | 102 | -------------------------------------------------------------------------------- /elasticache/how-coursera-monitors-elasticache-and-memcached-performance.md: -------------------------------------------------------------------------------- 1 | *This post is part 3 of a 3-part series on monitoring Amazon ElastiCache.* [*Part 1*](https://www.datadoghq.com/blog/monitoring-elasticache-performance-metrics-with-redis-or-memcached) *explores the key ElastiCache performance metrics, and* [*Part 2*](https://www.datadoghq.com/blog/collecting-elasticache-metrics-its-redis-memcached-metrics) *explains how to collect those metrics.* 2 | 3 | [Coursera](https://www.coursera.org/) launched its online course platform in 2013, and quickly became a leader in online education. With more than 1,000 courses and millions of students, Coursera uses [ElastiCache](https://aws.amazon.com/elasticache/) to cache course metadata, as well as membership data for courses and users, helping to ensure a smooth user experience for their growing audience. In this article we take you behind the scenes with Coursera’s engineering team to learn their best practices and tips for using ElastiCache, keeping it performant, and monitoring it with Datadog. 4 | 5 | ## Why monitoring ElastiCache is crucial 6 | 7 | ElastiCache is a critical piece of Coursera’s cloud infrastructure. 
Coursera uses ElastiCache as a read-through cache on top of [Cassandra](https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/). They decided to use [Memcached](https://www.datadoghq.com/blog/speed-up-web-applications-memcached/) as the backing cache engine because they only needed a simple key-value cache, because they found it easier than [Redis](https://www.datadoghq.com/blog/how-to-monitor-redis-performance-metrics/) to manage with its simpler model, and because it is multi-threaded. 8 | 9 | Among other uses, they cache most of the elements on a course page, such as title, introduction video, course description, and other information about the course. 10 | 11 | If ElastiCache is not properly monitored, the cache could run out of memory, leading to evicted items. This in turn could impact the hit rate, which would increase the latency of the application. That’s why Coursera’s engineers continuously monitor ElastiCache. They use [Datadog](https://www.datadoghq.com/) so they can correlate all the relevant ElastiCache performance metrics with metrics from other parts of their infrastructure, all in one place. They can spot at a glance if their cache is the root cause of any application performance issue, and set up advanced alerts on crucial metrics. 12 | 13 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/screenboard.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/screenboard.png) 14 | 15 | ## Key metrics for Coursera 16 | 17 | ### CPU Utilization 18 | 19 | Unlike with Redis, Memcached’s CPU utilization can climb to 90 percent without impacting performance, so it is not a metric that Coursera alerts on. Nonetheless, they track it to facilitate investigation of problems. 20 | 21 | ### Memory 22 | 23 | Memory metrics, on the other hand, are critical and are closely monitored. By making sure the memory allocated to the cache is always higher than the **memory usage**, Coursera’s engineering team avoids **evictions**. They want to keep a very high **hit rate**, both to ensure optimal performance and to protect their databases. Coursera’s traffic is so high that their backend wouldn’t be able to handle the massive volume of requests it would receive if the cache hit rate were to decrease significantly. 24 | 25 | They tolerate some swap usage for one of their cache clusters but it remains far below the 50-megabyte limit AWS recommends when using Memcached (see [part 1](https://www.datadoghq.com/blog/monitoring-elasticache-performance-metrics-with-redis-or-memcached)). 26 | 27 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/memory.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/memory.png) 28 | 29 | ### Get and Set 30 | 31 | Coursera uses a consistent hashing mechanism, which means that keys are distributed evenly across the nodes of a cluster. Thus monitoring Get and Set commands, broken down by node, allows them to check that all nodes are healthy and that traffic is well balanced among them. If a node has significantly more gets than its peers, this may indicate that it is hosting one or more hot keys (items requested far more frequently than the others). A very hot key can max out the capacity of a node, and adding more nodes may not help, since the hot key will still be hosted on a single node. Nodes with higher throughput performance may be required.
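If you want to spot-check that balance yourself outside of CloudWatch, the per-node `cmd_get` counters in Memcached’s `stats` output are enough to surface an outlier. Below is a minimal sketch; it assumes your nodes are reachable on port 11211 from inside the VPC, the endpoint names are placeholders, and since the counters are cumulative from node startup you should compare relative magnitudes or sample twice and diff.

```
# Compare cumulative get counts across nodes to spot a potentially hot node
for node in node-0001.example.cache.amazonaws.com node-0002.example.cache.amazonaws.com; do
    count=$(printf 'stats\r\nquit\r\n' | nc -w 1 "$node" 11211 | tr -d '\r' | awk '$2 == "cmd_get" {print $3}')
    echo "$node cmd_get=$count"
done
```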
32 | 33 | ### Network 34 | 35 | Coursera also tracks network throughput because ElastiCache is so fast that it can easily saturate the network. A bottleneck would prevent more bytes from being sent despite available CPU and memory. That’s why Coursera needs to visualize these network metrics broken down by host and by cluster separately to be able to quickly investigate and act before saturation occurs. 36 | 37 | ### Events 38 | 39 | Lastly, seeing ElastiCache events along with cache performance metrics allows them to keep track of cache activities—such as cluster created, node added, or node restarted—and their impact on performance. 40 | 41 | ![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/events.png) 42 | 43 | ## Alerting via the right channel 44 | 45 | ### Critical alerts 46 | 47 | Some metrics are critical, and Coursera’s engineers want to make sure they never exceed a certain threshold. Datadog alerts allow them to send notifications via their usual communication channels (PagerDuty, chat apps, emails…) so they can target specific teams or people, and quickly act before a metric goes out of bounds. 48 | 49 | Coursera’s engineers have set up alerts on eviction rate, available memory, hit rate, and swap usage. 50 | 51 | Datadog alerts can also be configured to trigger on host health, whether services or processes are up or down, events, [outliers](https://www.datadoghq.com/blog/introducing-outlier-detection-in-datadog/), and more. 52 | 53 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/monitor-type.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/monitor-type.png) 54 | 55 | For example, as explained in [part 1](https://www.datadoghq.com/blog/monitoring-elasticache-performance-metrics-with-redis-or-memcached) of this series, where we detail the key ElastiCache metrics and which ones to alert on, CPU usage shouldn’t exceed 90 percent with Memcached. Here is how an alert can be triggered any time any individual node sees its CPU utilization approaching this threshold: 56 | 57 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/define-metric.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/define-metric.png) 58 | 59 | ### The right communication channel 60 | 61 | Coursera uses [PagerDuty](https://www.datadoghq.com/blog/pagerduty/) for critical issues, and [Slack](https://www.datadoghq.com/blog/collaborate-share-track-performance-slack-datadog/) or email for low-priority problems. When configuring an alert, you can define a custom message (including suggested fixes or links to internal documentation), the people or team that will be notified, and the specific channel by which the alert will be sent. For example, you can send the notification to the people on-call via PagerDuty and to a specific Slack channel: 62 | 63 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/alert-msg.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/alert-msg.png) 64 | 65 | ## Why Datadog? 66 | 67 | Using Datadog allows Coursera to track all the metrics they need from the different parts of their infrastructure, in one place, with any relevant type of visualization. Thus they can spot at a glance any potential issue related to their cache and quickly find the root cause. 
68 | 69 | By creating [timeboards](http://help.datadoghq.com/hc/en-us/articles/204580349-What-is-the-difference-between-a-ScreenBoard-and-a-TimeBoard-) they can overlay events from a specific service like ElastiCache and correlate them with performance metrics from other parts of their infrastructure. 70 | 71 | Datadog also makes it easy to collect and monitor native cache metrics from Redis or Memcached, in addition to generic ElastiCache metrics from Amazon, for even deeper insight into cache performance. 72 | 73 | If you’re using ElastiCache and Datadog already, we hope that these tips help you gain improved visibility into what’s happening in your cache. If you don’t yet have a Datadog account, you can start tracking your cache’s health and performance today with a [free trial](https://app.datadoghq.com/signup). 74 | 75 | ## Acknowledgments 76 | 77 | We want to thank the Coursera team, and especially [Daniel Chia](https://twitter.com/DanielChiaJH), who worked with us to share their monitoring techniques for Amazon ElastiCache. 78 | -------------------------------------------------------------------------------- /elb/monitor_elb_performance_with_datadog.md: -------------------------------------------------------------------------------- 1 | *This post is the last of a 3-part series on monitoring Amazon ELB. [Part 1](https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics) explores its key performance metrics, and [Part 2](https://www.datadoghq.com/blog/how-to-collect-aws-elb-metrics) explains how to collect these metrics.* 2 | 3 | *__Note:__ The metrics referenced in this article pertain to [classic](https://aws.amazon.com/elasticloadbalancing/classicloadbalancer/) ELB load balancers. We will cover Application Load Balancer metrics in a future article.* 4 | 5 | If you’ve already read [our post](https://www.datadoghq.com/blog/how-to-collect-aws-elb-metrics) on collecting Elastic Load Balancing metrics, you’ve seen that you can visualize their recent evolution and set up simple alerts using the AWS Management Console’s web interface. For a more dynamic and comprehensive view, you can connect ELB to Datadog. 6 | 7 | Datadog lets you collect and view ELB metrics, access their historical evolution, and slice and dice them using any combination of properties or custom tags. Crucially, you can also correlate ELB metrics with metrics from any other part of your infrastructure for better insight—especially native metrics from your backend instances. And with more than 100 supported integrations, you can create and send advanced alerts to your team using collaboration tools such as [PagerDuty](https://www.datadoghq.com/blog/pagerduty/) and [Slack](https://www.datadoghq.com/blog/collaborate-share-track-performance-slack-datadog/). 8 | 9 | In this post we’ll show you how to get started with the ELB integration, and how to correlate your load balancer performance metrics with your backend instance metrics. 10 | 11 | [![ELB metrics graphs](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-01.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-01.png) 12 | 13 | *ELB metrics graphs on Datadog* 14 | 15 | ## Integrate Datadog and ELB 16 | 17 | To start monitoring ELB metrics, you only need to [configure our integration with AWS CloudWatch](http://docs.datadoghq.com/integrations/aws/).
Create a new user via the [IAM Console](https://console.aws.amazon.com/iam/home#s=Home) and grant that user (or group of users) the required set of permissions. These can be set via the [Policy management](https://console.aws.amazon.com/iam/home?#policies) in the console or using the Amazon API. 18 | 19 | Once these credentials are configured within AWS, follow the simple steps on the [AWS integration tile](https://app.datadoghq.com/account/settings#integrations/amazon_web_services) on Datadog to start pulling ELB data. 20 | 21 | Note that if, in addition to ELB, you are using RDS, SES, SNS, or other AWS products, you may need to grant additional permissions to the user. [See here](http://docs.datadoghq.com/integrations/aws/) for the complete list of permissions required to take full advantage of the Datadog–AWS integration. 22 | 23 | ## Keep an eye on all key ELB metrics 24 | 25 | Once you have successfully integrated Datadog with ELB, you will see [a default dashboard](https://app.datadoghq.com/screen/integration/aws_elb) called “AWS-Elastic Load Balancers” in your list of [integration dashboards](https://app.datadoghq.com/dash/list). The ELB dashboard displays all of the key metrics highlighted in [Part 1](https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics) of this series: requests per second, latency, surge queue length, spillover count, healthy and unhealthy hosts counts, HTTP code returned, and more. 26 | 27 | [![ELB default dashboard on Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-02.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-02.png) 28 | 29 | *ELB default dashboard on Datadog* 30 | 31 | ## Customize your dashboards 32 | 33 | Once you are capturing metrics from Elastic Load Balancing in Datadog, you can build on the default dashboard and edit or add additional graphs of metrics from ELB or even from other parts of your infrastructure. To start building a custom [screenboard](https://www.datadoghq.com/blog/introducing-screenboards-your-data-your-way/), clone the default ELB dashboard by clicking on the gear on the upper right of the default dashboard. 34 | 35 | You can also create [timeboards](http://help.datadoghq.com/hc/en-us/articles/204580349-What-is-the-difference-between-a-ScreenBoard-and-a-TimeBoard-), which are interactive Datadog dashboards displaying the evolution of multiple metrics across any timeframe. 36 | 37 | ## Correlate ELB with EC2 metrics 38 | 39 | As explained in [Part 1](https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics), CloudWatch’s ELB-related metrics inform you about your load balancers’ health and performance. ELB also provides backend-related metrics reflecting your backend instances health and performance. However, to fully monitor your backend instances, you should consider collecting these backend metrics directly from EC2 as well for better insight. By correlating ELB with EC2 metrics, you will be able to quickly investigate whether, for example, the high number of requests being queued by your load balancers is due to resource saturation on your backend instances (memory usage, CPU utilization, etc.). 40 | 41 | Thanks to our integration with CloudWatch and the permissions you set up, you can already access EC2 metrics on Datadog. Here is [your default dashboard](https://app.datadoghq.com/screen/integration/aws_ec2) for EC2. 
42 | 43 | [![Default EC2 dashboard on Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-03.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-03.png) 44 | 45 | *Default EC2 dashboard on Datadog* 46 | 47 | You can add graphs to your custom dashboards and view ELB and EC2 metrics side by side. Correlating peaks in two different metrics to see if they are linked is very easy. 48 | 49 | You can also, for example, display a host map to spot at a glance whether all your backend instances have reasonable CPU utilization: 50 | 51 | [![Default EC2 dashboard on Datadog](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-04.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/3-04.png) 52 | 53 | ### Native metrics for more precision 54 | 55 | In addition to pulling in EC2 metrics via CloudWatch, Datadog also allows you to monitor your EC2 instances’ performance with higher resolution by installing the Datadog Agent to pull native metrics directly from the servers. The Agent is [open-source software](https://github.com/DataDog/dd-agent) that collects and reports metrics from your individual hosts so you can view, monitor and correlate them on the Datadog platform. Installing the Agent usually requires just a single command. Installation instructions for different operating systems are available [here](https://app.datadoghq.com/account/settings#agent). 56 | 57 | By using the [Datadog Agent](https://www.datadoghq.com/blog/dont-fear-the-agent/), you can collect backend instance metrics with a higher granularity for a better view of their health and performance. The Agent reports metrics directly, at rapid intervals, and does not rely on polling an intermediary (such as CloudWatch), so you can access metrics more frequently without being limited by the provider’s monitoring API. 58 | 59 | The Agent provides higher-resolution views of all key system metrics, such as CPU utilization or memory consumption by process. 60 | 61 | Once you have set up the Agent, correlating native metrics from your EC2 instances with ELB’s CloudWatch metrics is a piece of cake (as explained above), and will give you a full and precise picture of your infrastructure’s performance. 62 | 63 | The Agent can also collect application metrics so that you can correlate your application’s performance with the host-level metrics from your compute layer. The Agent integrates seamlessly with applications such as MySQL, [NGINX](https://www.datadoghq.com/blog/how-to-monitor-nginx/), Cassandra, and many more. It can collect custom application metrics as well. 64 | 65 | To install the Datadog Agent, follow the [instructions here](http://docs.datadoghq.com/guides/basic_agent_usage/) depending on the OS your EC2 machines are running. 66 | 67 | ## Conclusion 68 | 69 | In this post we’ve walked you through integrating Elastic Load Balancing with Datadog so you can visualize and alert on its key metrics. You can also visualize EC2 metrics to keep tabs on your backend instances, to improve performance, and to save costs. 70 | 71 | Monitoring ELB with Datadog gives you critical visibility into what’s happening with your load balancers and applications. You can easily create automated [alerts](https://www.datadoghq.com/blog/monitoring-101-alerting/) on any metric across any group of instances, with triggers tailored precisely to your infrastructure and usage patterns.
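If you manage alerts as code rather than through the UI, such a monitor can also be created through the Datadog API. The snippet below is only a rough sketch: the endpoint and key headers reflect the current public API, and the query, threshold, and notification handle are placeholders to adapt to your own ELB setup.

```
# Create a metric alert on average ELB backend latency via the Datadog API
curl -s -X POST "https://api.datadoghq.com/api/v1/monitor" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
    -d '{
          "type": "metric alert",
          "name": "ELB latency is elevated",
          "query": "avg(last_5m):avg:aws.elb.latency{*} > 1",
          "message": "Average ELB latency is above 1s. Check backend instance health. @your-team-handle"
        }'
```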
72 | 73 | If you don’t yet have a Datadog account, you can sign up for a [free trial](https://app.datadoghq.com/signup) and start monitoring your cloud infrastructure, applications, and services. 74 | -------------------------------------------------------------------------------- /translations/ja-jp/varnish/how_to_collect_varnish_metrics.md: -------------------------------------------------------------------------------- 1 | # How to collect Varnish metrics 2 | 3 | *This post is part 2 of a 3-part series on Varnish monitoring. [Part 1](https://www.datadoghq.com/blog/how-to-monitor-varnish/) explores the key metrics available in Varnish, and [Part 3](https://www.datadoghq.com/blog/monitor-varnish-using-datadog/) details how Datadog can help you to monitor Varnish.* 4 | 5 | *このポストは、"Varnishの監視"3回シリーズのPart 2です。 Part 1は、[「Varnishの監視方法」」](https://www.datadoghq.com/blog/how-to-monitor-varnish/)で、Part 3は、[「Datadogを使ったVarnishの監視」](https://www.datadoghq.com/blog/monitor-varnish-using-datadog/)になります。* 6 | 7 | ## How to get the Varnish metrics you need 8 | 9 | > Varnish Cache ships with very useful and precise monitoring and logging tools. As explained in [the first post of this series](https://www.datadoghq.com/blog/how-to-monitor-varnish/), for monitoring purposes, the most useful of the available tools is `varnishstat` which gives you a detailed snapshot of Varnish’s current performance. It provides access to in-memory statistics such as cache hits and misses, resource consumption, threads created, and more. 10 | 11 | Varnishキャッシュには、非常に便利で正確なモニタリングとロギングツールが同胞されています。このシリーズの最初のポスト[「Varnishの監視方法」](https://www.datadoghq.com/blog/how-to-monitor-varnish/)で紹介したように、監視のために最も便利なツールは`varnishstat`です。`varnishstat`は、Varnishの現時点でのパフォーマンスの詳細な情報を提供してくれます。更に、キャッシュのヒットやミス、リソース消費量、作成されたスレッドなど、メモリー内にある統計情報にもアクセスも提供しています。 12 | 13 | ### varnishstat 14 | 15 | > If you run `varnishstat` from the command line you will see a continuously updating list of all available Varnish metrics. If you add the `-1` flag, varnishstat will exit after printing the list one time. Example output below: 16 | 17 | コマンドラインから`varnishstat`を実行すると、継続的に更新されていくVarnishの全てのメトリクを見ることができます。このコマンドに、表示回数オプションの`-1`を付けて実行すれば、`varnishstat`は、リストを1度表示し、終了します。 18 | 19 | 以下は出力結果の例です: 20 | 21 | ``` 22 | $ varnishstat 23 | 24 | MAIN.uptime Child process uptime 25 | MAIN.sess_conn Sessions accepted 26 | MAIN.sess_drop Sessions dropped 27 | MAIN.sess_fail Session accept failures 28 | MAIN.sess_pipe_overflow Session pipe overflow 29 | MAIN.client_req Good client requests received 30 | MAIN.cache_hit Cache hits 31 | MAIN.cache_hitpass Cache hits for pass 32 | MAIN.cache_miss Cache misses 33 | MAIN.backend_conn Backend conn. success 34 | MAIN.backend_unhealthy Backend conn. not attempted 35 | MAIN.backend_busy Backend conn. too many 36 | MAIN.backend_fail Backend conn. failures 37 | MAIN.backend_reuse Backend conn. reuses 38 | MAIN.backend_toolate Backend conn. was closed 39 | MAIN.backend_recycle Backend conn. recycles 40 | MAIN.backend_retry Backend conn. retry 41 | MAIN.pools Number of thread pools 42 | MAIN.threads Total number of threads 43 | MAIN.threads_limited Threads hit max 44 | MAIN.threads_created Threads created 45 | MAIN.threads_destroyed Threads destroyed 46 | MAIN.threads_failed Thread creation failed 47 | MAIN.thread_queue_len Length of session queue 48 | ``` 49 | 50 | > To list specific values, pass them with the `-f` flag: e.g. `varnishstat -f field1,field2,field3`, followed by `-1` if needed. 
51 | 52 | 特定の値を表示したい場合は、`-f`フラグを追記します: 例えば、`varnishstat -f field1,field2,field3`を追記し値を指定し、`-1`を追加し表示回数を指定します。 53 | 54 | > For example: `varnishstat -f MAIN.threads` will display a continuously updating count of threads currently being used: 55 | 56 | 例: `varnishstat -f MAIN.threads`は、現在使用しているスレッドの値を継続的に更新しながら表示します。 57 | 58 | [![varnishstat output](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-07-varnish/2-01.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-07-varnish/2-01.png) 59 | 60 | > Varnishstat is useful as a standalone tool if you need to spot-check the health of your cache. However, if Varnish is an important part of your software service, you will almost certainly want to graph its performance over time, correlate it with other metrics from across your infrastructure, and be alerted about any problems that may arise. To do this you will probably want to integrate the metrics that Varnishstat is reporting with a dedicated monitoring service. 61 | 62 | Varnishstatは、キャッシュの状態をスポット的にチェックするのはは非常に優れた単体ツールです。しかし、Varnishがソフトウェアサービスの重要な構成要素である場合は、パフォーマンスヒストリーを表示したグラフ、インフラ上の他のメトリクスと連携した分析、障害が発生した時のアラートのような機能が必要と感じるようになるでしょう。このようなニーズを満たすためにあなたは、Varnishstatが収集しているメトリクスを専用の監視サービスへ送信することになるでしょう。 63 | 64 | ### varnishlog 65 | 66 | > If you need to debug your system or tune configuration, `varnishlog` can be a useful tool, as it provides detailed information about each individual request. 67 | 68 | Varnishのシステムのデバッグや設定のチューニングをする必要がある場合は、`varnishlog`は、便利なツールになります。`varnishlog`は、個々のリクエストに関する詳細な情報を提供してくれます。 69 | 70 | > Here is an edited example of `varnishlog` output generated by a single request—a full example would be several times longer: 71 | 72 | 下記に示す内容は、`varnishlog`の出力の一部です。正式な出力は、下記に示したものの数倍の長さになります: 73 | 74 | ``` 75 | $ varnishlog 76 | 77 | 3727 RxRequest c GET 78 | 3727 RxProtocol c HTTP/1.1 79 | 3727 RxHeader c Content-Type: application/x-www-form-urlencoded; 80 | 3727 RxHeader c Accept-Encoding: gzip,deflate,sdch 81 | 3727 RxHeader c Accept-Language: en-US,en;q=0.8 82 | 3727 VCL_return c hit 83 | 3727 ObjProtocol c HTTP/1.1 84 | 3727 TxProtocol c HTTP/1.1 85 | 3727 TxStatus c 200 86 | 3727 Length c 316 87 | … 88 | ``` 89 | 90 | > The 4 columns represent: 91 | > 1. Request ID 92 | > 2. Data type (if the type starts with “Rx” that means Varnish is receiving data, and “Tx” means Varnish is sending data) 93 | > 3. Whether this entry records communication between Varnish and the client: “c”, or backend: “b” (see [Part 1](https://www.datadoghq.com/blog/how-to-monitor-varnish/)) 94 | > 4. The data, or details about this entry 95 | 96 | 出力内の4つ列は、内容は以下のようになります: 97 | 98 | 1. リクエストID 99 | 2. データタイプ (“Rx”の文字列で始まる項目は、Varnishがデーターを受信。“Tx”の文字列で始まる項目は、Varnishがデーターを送信。) 100 | 3. このメトリクスが、Varnishとクライアント間通信のものは、"c"。Varnishとバックエンド間通信のものは、"b"。(詳細は、[「Varnishの監視方法」](https://www.datadoghq.com/blog/how-to-monitor-varnish/ )を参照してください) 101 | 4. データー、又はその詳細 102 | 103 | ### varnishlog’s children 104 | 105 | > You can display a subset of `varnishlog`’s information via three specialized tools built on top of varnishlog: 106 | 107 | > - `varnishtop` presents a continuously updating list of the most commonly occurring log entries. Using filters, you can display the topmost requested documents, most common clients, user agents, or any other information which is recorded in the log. 108 | > - `varnishhist` presents a continuously updating histogram showing the distribution of the last N requests bucketed by total time between request and response. 
The value of N and the vertical scale are displayed in the top left corner. 109 | > - `varnishsizes` is very similar to varnishhist, except it shows the size of the objects requested rather than the processing time. 110 | 111 | `varnishlog`が提供している情報のサブセットを、varnishlogの上に構築された、次のような三つの特殊ツールで表示することができます。 112 | 113 | - `varnishtop`は、最も一般的なログエントリのリストを継続的に表示します。フィルターを使用し、最もリクエストされたドキュメンント、最も一般的なクライアント、ユーザエージェント、またはログに記録されている他の情報を表示することができます。 114 | - `varnishhist`は、直近N個のリクエストの受信から応答までの時間のヒストグラムを継続的に表示します。Nの値と縦軸の単位は、左上隅に表示されます。 115 | - `varnishsizes`は、varnishhistに非常に似ています。ただし、`varnishsizes`は、処理時間ではなく、リクエストされたオブジェクトの大きさを表示します。 116 | 117 | ## Conclusion 118 | 119 | Which metrics you monitor will depend on your use case, the tools available to you, and whether the insight provided by a given metric justifies the overhead of monitoring it. 120 | 121 | 122 | 何を監視するかは、持っているツールと監視可能なメトリクスの種類に依存します。そして、それぞれのメトリクスが提供する見識の価値こそがそのメトリクスを監視するための労力を正当化してくれるでしょう。 123 | 124 | At Datadog, we have built an integration with Varnish so that you can begin collecting and monitoring its metrics with a minimum of setup. Learn how Datadog can help you to monitor Varnish in the [next and final part of this series of articles](https://www.datadoghq.com/blog/monitor-varnish-using-datadog/). 125 | 126 | Datadogでは、Varnishに向けてIntegrationを提供しています。このIntegrationを採用することで、最小限の設定でメトリクスを収集し監視できるようになります。このシリーズに含まれる[「Datadogを使ったVarnishの監視」](https://www.datadoghq.com/blog/monitor-varnish-using-datadog/)では、Datadogを使ったVarnishの監視方法を解説しています。 127 | 128 | ------------------------------------------------------------------------ 129 | 130 | > *Source Markdown for this post is available [on GitHub](https://github.com/DataDog/the-monitor/blob/master/varnish/how_to_collect_varnish_metrics.md). Questions, corrections, additions, etc.? Please [let us know](https://github.com/DataDog/the-monitor/issues).* 131 | 132 | *このポストのMarkdownソースは、[GitHub](https://github.com/DataDog/the-monitor/blob/master/varnish/how_to_collect_varnish_metrics.md)で閲覧することができます。質問、訂正、追加、などがありましたら、[GitHubのissueページ](https://github.com/DataDog/the-monitor/issues)を使って連絡を頂けると幸いです。* 133 | -------------------------------------------------------------------------------- /elasticsearch/how_to_monitor_elasticsearch_with_datadog.md: -------------------------------------------------------------------------------- 1 | # How to monitor Elasticsearch with Datadog 2 | 3 | *This post is part 3 of a 4-part series on monitoring Elasticsearch performance. [Part 1][part-1-link] provides an overview of Elasticsearch and its key performance metrics, [Part 2][part-2-link] explains how to collect these metrics, and [Part 4][part-4-link] describes how to solve five common Elasticsearch problems.* 4 | 5 | If you've read [our post][part-2-link] on collecting Elasticsearch metrics, you already know that the Elasticsearch APIs are a quick way to gain a snapshot of performance metrics at any particular moment in time. However, to truly get a grasp on performance, you need to track Elasticsearch metrics over time and monitor them in context with the rest of your infrastructure. 
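As a quick reminder, such a snapshot is only a curl away (the examples below assume a default install listening on port 9200), but these point-in-time views are exactly what this post helps you move beyond:

```
# Spot-check cluster health and node-level stats on a default local install
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_nodes/stats?pretty'
```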
6 | 7 | [![elasticsearch datadog dashboard](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-09-elasticsearch/elasticsearch-dashboard-final2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-09-elasticsearch/elasticsearch-dashboard-final2.png) 8 | *Datadog's out-of-the-box Elasticsearch dashboard* 9 | 10 | This post will show you how to set up Datadog to automatically collect the key metrics discussed in [Part 1][part-1-link] of this series. We'll also show you how to set alerts and use tags to effectively monitor your clusters by focusing on the metrics that matter most to you. 11 | 12 | ## Set up Datadog to fetch Elasticsearch metrics 13 | Datadog's integration enables you to automatically collect, tag, and graph all of the performance metrics covered in [Part 1][part-1-link], and correlate that data with the rest of your infrastructure. 14 | 15 | ### Install the Datadog Agent 16 | The [Datadog Agent][agent-docs] is open source software that collects and reports metrics from each of your nodes, so you can view and monitor them in one place. Installing the Agent usually only takes a single command. View installation instructions for various platforms [here][Agent-installation]. You can also install the Agent automatically with configuration management tools like [Chef][datadog-chef-blog] or [Puppet][datadog-puppet-blog]. 17 | 18 | ### Configure the Agent 19 | After you have installed the Agent, it's time to create your integration configuration file. In your [Agent configuration directory][agent-docs], you should see a [sample Elasticsearch config file][elastic-config-file] named `elastic.yaml.example`. Make a copy of the file in the same directory and save it as `elastic.yaml`. 20 | 21 | Modify `elastic.yaml` with your instance URL, and set `pshard_stats` to true if you wish to collect metrics specific to your primary shards, which are prefixed with `elasticsearch.primaries`. For example, `elasticsearch.primaries.docs.count` tells you the document count across all primary shards, whereas `elasticsearch.docs.count` is the total document count across all primary **and** replica shards. In the example configuration file below, we've indicated that we want to collect primary shard metrics. We have also added a custom tag, `elasticsearch-role:data-node`, to indicate that this is a data node. 22 | 23 | ``` 24 | # elastic.yaml 25 | 26 | init_config: 27 | 28 | instances: 29 | 30 | - url: http://localhost:9200 31 | # username: username 32 | # password: password 33 | # cluster_stats: false 34 | pshard_stats: true 35 | # pending_task_stats: true 36 | # ssl_verify: false 37 | # ssl_cert: /path/to/cert.pem 38 | # ssl_key: /path/to/cert.key 39 | tags: 40 | - 'elasticsearch-role:data-node' 41 | ``` 42 | 43 | Save your changes and verify that the integration is properly configured by restarting the Agent and [running the Datadog `info` command][agent-docs]. If everything is working properly, you should see an `elastic` section in the output, similar to the below: 44 | 45 | ``` 46 | Checks 47 | ====== 48 | [...] 49 | elastic 50 | ------- 51 | - instance #0 [OK] 52 | - Collected 142 metrics, 0 events & 3 service checks 53 | ``` 54 | 55 | The last step is to navigate to [Elasticsearch's integration tile][es-tile] in the Datadog App and click on the **Install Integration** button under the "Configuration" tab. 
Once the Agent is up and running, you should see your hosts reporting metrics in [Datadog][datadog-infrastructure], as shown below: 56 | 57 | ![elastic-datadog-infra.png](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-09-elasticsearch/pt3-1-elastic-datadog-infra.png) 58 | 59 | ## Dig into the Elasticsearch metrics! 60 | Once the Agent is configured on your nodes, you should see an Elasticsearch overview screenboard among your [list of available dashboards][dashboard-link]. 61 | 62 | Datadog's [out-of-the-box dashboard][datadog-es-dash] displays many of the key performance metrics presented in [Part 1][part-1-link] and is a great starting point to gain more visibility into your clusters. You may want to clone and customize it by adding system-level metrics from your nodes, like I/O utilization, CPU, and memory usage, as well as metrics from other elements of your infrastructure. 63 | 64 | ### Tag your metrics 65 | In addition to any [tags][tags-docs] assigned or inherited from your nodes' other integrations (e.g. Chef `role`, AWS `availability-zone`, etc.), the Agent will automatically tag your Elasticsearch metrics with `host` and `url`. Starting in Agent 5.9.0, Datadog also tags your Elasticsearch metrics with `cluster_name` and `node_name`, which are pulled from `cluster.name` and `node.name` in the node's Elasticsearch configuration file (located in `elasticsearch/config`). (Note: If you do not provide a `cluster.name`, it will default to `elasticsearch`.) 66 | 67 | You can also add your own custom tags in the `elastic.yaml` file, such as the node type and environment, in order to [slice and dice your metrics][tagging-blog] and alert on them accordingly. 68 | 69 | For example, if your cluster includes dedicated master, data, and client nodes, you may want to create an `elasticsearch-role` tag for each type of node in the `elastic.yaml` configuration file. You can then use these tags in Datadog to view and alert on metrics from only one type of node at a time. 70 | 71 | ### Tag, you're (alerting) it 72 | Now that you've finished tagging your nodes, you can set up smarter, targeted alerts to watch over your metrics and notify the appropriate people when issues arise. In the screenshot below, we set up an alert to [notify team members][datadog-alerts] when any data node (tagged with `elasticsearch-role:data-node` in this case) starts running out of disk space. The `elasticsearch-role` tag is quite useful for this alert—we can exclude dedicated master-eligible nodes, which don't store any data. 73 | 74 | ![es-disk-space-monitor.png](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-09-elasticsearch/pt3-2-es-disk-space-monitor.png) 75 | 76 | Other useful alert triggers include long garbage collection times and search latency thresholds. You might also want to set up an Elasticsearch integration check in Datadog to find out if any of your master-eligible nodes have failed to connect to the Agent in the past five minutes, as shown below: 77 | 78 | ![es-status-check-monitor.png](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-09-elasticsearch/pt3-3-es-status-check-monitor.png) 79 | 80 | ## Conclusion 81 | In this post, we've walked through how to use Datadog to collect, visualize, and alert on your Elasticsearch metrics. If you've followed along with your Datadog account, you should now have greater visibility into the state of your clusters and be better prepared to address potential issues.
The [next part in this series][part-4-link] describes how to solve five common Elasticsearch scaling and performance issues. 82 | 83 | If you don't yet have a Datadog account, you can start monitoring Elasticsearch right away with a free trial. 84 | 85 | [datadog-agent]: https://github.com/DataDog/dd-agent 86 | [agent-docs]: http://docs.datadoghq.com/guides/basic_agent_usage/ 87 | [Agent-installation]: https://app.datadoghq.com/account/settings#agent 88 | [datadog-chef-blog]: https://www.datadoghq.com/blog/monitor-chef-with-datadog/ 89 | [datadog-puppet-blog]: https://www.datadoghq.com/blog/monitor-puppet-datadog/ 90 | [elastic-config-file]: https://github.com/DataDog/dd-agent/blob/master/conf.d/elastic.yaml.example 91 | [datadog-es-dash]: https://app.datadoghq.com/dash/integration/elasticsearch 92 | [datadog-infrastructure]: https://app.datadoghq.com/infrastructure 93 | [dashboard-link]: https://app.datadoghq.com/dash/list 94 | [system-docs]: http://docs.datadoghq.com/integrations/system/ 95 | [datadog-alerts]: https://www.datadoghq.com/blog/monitoring-101-alerting/ 96 | [tagging-blog]: https://www.datadoghq.com/blog/the-power-of-tagged-metrics/ 97 | [es-tile]: https://app.datadoghq.com/account/settings#integrations/elasticsearch 98 | [tags-docs]: http://docs.datadoghq.com/guides/tagging/ 99 | [part-1-link]: https://www.datadoghq.com/blog/monitor-elasticsearch-performance-metrics 100 | [part-2-link]: https://www.datadoghq.com/blog/collect-elasticsearch-metrics/ 101 | [part-4-link]: https://www.datadoghq.com/blog/elasticsearch-performance-scaling-problems/ 102 | -------------------------------------------------------------------------------- /elasticache/collecting-elasticache-metrics-its-redis-memcached-metrics.md: -------------------------------------------------------------------------------- 1 | *This post is part 2 of a 3-part series on monitoring Amazon ElastiCache. [Part 1](https://www.datadoghq.com/blog/monitoring-elasticache-performance-metrics-with-redis-or-memcached) explores its key performance metrics, and [Part 3](https://www.datadoghq.com/blog/how-coursera-monitors-elasticache-and-memcached-performance) describes how Coursera monitors ElastiCache.* 2 | 3 | Many ElastiCache metrics can be collected from AWS via CloudWatch or directly from the cache engine, whether Redis or Memcached. When a metric is available from both sources, as discussed in [Part 1](https://www.datadoghq.com/blog/monitoring-elasticache-performance-metrics-with-redis-or-memcached), you should favor monitoring the native cache metric to ensure higher resolution and greater awareness and responsiveness. Therefore this article covers three different ways to access ElastiCache metrics from AWS CloudWatch, as well as the collection of native metrics from both caching engines: 4 | 5 | - CloudWatch metrics 6 | - [Using the AWS Management Console](#console) 7 | - [Using the command line interface (CLI)](#cli) 8 | - [Using a monitoring tool that accesses the CloudWatch API](#tool) 9 | - Caching engine metrics 10 | - [Redis](#redis) 11 | - [Memcached](#memcached) 12 | 13 | ## Using the AWS Management Console 14 | 15 | Using the online management console is the simplest way to monitor your cache with CloudWatch. It allows you to set up basic automated alerts and to get a visual picture of recent changes in individual metrics.
Of course, you won’t be able to access native metrics from your cache engine, but their CloudWatch equivalent is sometimes available (see [Part 1](https://www.datadoghq.com/blog/monitoring-elasticache-performance-metrics-with-redis-or-memcached)). 16 | 17 | ### Graphs 18 | 19 | Once you are signed in to your AWS account, you can open the [CloudWatch console](https://console.aws.amazon.com/cloudwatch/home#metrics:) and then browse the metrics related to the different AWS services. 20 | 21 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-1.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-1.png) 22 | 23 | By clicking on the ElastiCache Metrics category, you will see the list of available metrics: 24 | 25 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-2.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-2.png) 26 | 27 | You can also view these metrics per cache cluster: 28 | 29 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-3.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-3.png) 30 | 31 | Just select the checkbox next to the metrics you want to visualize, and they will appear in the graph at the bottom of the console: 32 | 33 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-4.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-4.png) 34 | 35 | ### Alerts 36 | 37 | With the CloudWatch Management Console you can also create simple alerts that trigger when a metric crosses a specified threshold. 38 | 39 | Click on the “Create Alarm” button at the right of your graph, and you will be able to set up the alert and configure it to notify a list of email addresses. 40 | 41 | [![](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-5.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-elasticache/2-5.png) 42 | 43 | ## Using the CloudWatch Command Line Interface 44 | 45 | You can also retrieve metrics related to your cache from the command line. First you will need to install the CloudWatch Command Line Interface (CLI) by following [these instructions](http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html). You will then be able to query for any CloudWatch metric, using different filters. 46 | 47 | Command line queries can be useful for spot checks and ad hoc investigations when you can’t, or don’t want to, use a browser. 48 | 49 | For example, if you want to know the CPU utilization statistics for a cache cluster, you can use the CloudWatch command **mon-get-stats** with the parameters you need: 50 | 51 | (on Linux) 52 | 53 | ``` lang:sh 54 | mon-get-stats CPUUtilization \ 55 | --dimensions="CacheClusterId=yourcachecluster,CacheNodeId=0004" \ 56 | --statistics=Average \ 57 | --namespace="AWS/ElastiCache" \ 58 | --start-time 2015-08-13T00:00:00 \ 59 | --end-time 2015-08-14T00:00:00 \ 60 | --period=60 61 | ``` 62 | 63 | [Here](http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/CLIReference.html) are all the commands you can run with the CloudWatch CLI.
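If you have the newer unified AWS CLI installed rather than the legacy CloudWatch CLI, a roughly equivalent query looks like the sketch below (the cluster and node IDs are placeholders):

``` lang:sh
# Same CPUUtilization query using the unified AWS CLI
aws cloudwatch get-metric-statistics \
    --namespace AWS/ElastiCache --metric-name CPUUtilization \
    --dimensions Name=CacheClusterId,Value=yourcachecluster Name=CacheNodeId,Value=0004 \
    --statistics Average \
    --start-time 2015-08-13T00:00:00 --end-time 2015-08-14T00:00:00 \
    --period 60
```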
64 | 65 | ## Monitoring tool integrated with CloudWatch 66 | 67 | The third way to collect CloudWatch metrics is via a dedicated monitoring tool that offers extended monitoring functionality, such as: 68 | 69 | - Correlation of CloudWatch metrics with metrics from the caching engine and from other parts of your infrastructure 70 | - Dynamic slicing, aggregation, and filters on metrics 71 | - Historical data access 72 | - Sophisticated alerting mechanisms 73 | 74 | CloudWatch can be integrated with outside monitoring systems via API, and in many cases the integration only needs to be enabled once to deliver metrics from all your AWS services. 75 | 76 | ## Collecting native Redis or Memcached metrics 77 | 78 | CloudWatch’s ElastiCache metrics can give you good insight about your cache’s health and performance. However, as explained in [Part 1](https://www.datadoghq.com/blog/monitoring-elasticache-performance-metrics-with-redis-or-memcached), supplementing CloudWatch metrics with native cache metrics provides a fuller picture with higher-resolution data. 79 | 80 | ### Redis 81 | 82 | Redis provides extensive monitoring out of the box. The `info` command in the Redis command line interface gives you a snapshot of current cache performance. If you want to dig deeper, Redis also provides a number of tools offering a more detailed look at specific metrics. You will find all the information you need in [our recent post about collecting Redis metrics](https://www.datadoghq.com/blog/how-to-collect-redis-metrics/). 83 | 84 | For spot-checking the health of your server or looking into causes of significant latency, Redis’s built-in tools offer good insights. 85 | 86 | However, with so many metrics exposed, getting the information you want all in one place can be a challenge. Moreover, accessing data history and correlating Redis metrics with metrics from other parts of your infrastructure can be essential. That’s why using a monitoring tool integrating with Redis, such as Datadog, will help to take the pain out of your monitoring work. 87 | 88 | ### Memcached 89 | 90 | Memcached is more limited than Redis when it comes to monitoring. The most useful tool is the stats command, which returns a snapshot of Memcached metrics. Here is an example of its output: 91 | 92 | ``` lang:sh 93 | stats 94 | 95 | STAT pid 14868 96 | STAT uptime 175931 97 | STAT time 1220540125 98 | STAT version 1.2.2 99 | STAT pointer_size 32 100 | STAT rusage_user 620.299700 101 | STAT rusage_system 1545.703017 102 | STAT curr_items 228 103 | STAT total_items 779 104 | STAT bytes 15525 105 | STAT curr_connections 92 106 | STAT total_connections 1740 107 | STAT connection_structures 165 108 | STAT cmd_get 7411 109 | STAT cmd_set 28445156 110 | STAT get_hits 5183 111 | STAT get_misses 2228 112 | STAT evictions 0 113 | STAT bytes_read 2112768087 114 | STAT bytes_written 1000038245 115 | STAT limit_maxbytes 52428800 116 | STAT threads 1 117 | END 118 | ``` 119 | 120 | If you need more details about the commands you can run with Memcached, you can check their [documentation on Github](https://github.com/memcached/memcached/blob/master/doc/protocol.txt). 121 | 122 | Obviously, you can’t rely only on this snapshot to properly monitor Memcached performance; it tells you nothing about historical values or acceptable bounds, and it is not easy to quickly digest and understand the raw data. From a devops perspective, Memcached is largely a black box, and it becomes even more complex if you run multiple or distributed instances. 
Other basic tools like [memcache-top](http://code.google.com/p/memcache-top/) (for a changing, real-time snapshot) are useful but remain very limited. 123 | 124 | Thus if you are using Memcached as your ElastiCache engine, like Coursera does (see [Part 3](https://www.datadoghq.com/blog/how-coursera-monitors-elasticache-and-memcached-performance)), you should use CloudWatch or a dedicated monitoring tool that integrates with [Memcached](https://www.datadoghq.com/blog/speed-up-web-applications-memcached/), such as Datadog. 125 | 126 | ## Conclusion 127 | 128 | In this post we have walked through how to use CloudWatch to collect, visualize, and alert on ElastiCache metrics, as well as how to access higher-resolution, native cache metrics from Redis or Memcached. 129 | 130 | In the [next and final part of this series](https://www.datadoghq.com/blog/how-coursera-monitors-elasticache-and-memcached-performance) we take you behind the scenes with Coursera’s engineering team to learn their best practices and tips for using ElastiCache and monitoring its performance with Datadog. 131 | -------------------------------------------------------------------------------- /cilium/monitor_cilium_and_kubernetes_performance_with_hubble.md: -------------------------------------------------------------------------------- 1 | # Monitor Cilium and Kubernetes Performance with Hubble 2 | 3 | In [Part 1][part-1], we looked at some key metrics for monitoring the health and performance of your Cilium-managed Kubernetes clusters and network. In this post, we'll look at how Hubble enables you to visualize network traffic via a [CLI](#the-hubble-cli) and [user interface](#the-hubble-ui). But first, we'll briefly look at Hubble's underlying infrastructure and how it provides visibility into your environment. 4 | 5 | ## Hubble's underlying infrastructure 6 | There are several Linux-based and Kubernetes command-line tools that enable you to review network data for individual {{< tooltip "pods" "top" "kubernetes" >}}, such as their IP addresses and hostnames. But in order to efficiently troubleshoot any performance degradation, such as service latency, you need a better understanding of pod-to-pod and client-to-pod communication. Hubble collects and aggregates network data from every pod in your environment to give you a better view into request throughput, status, errors, and more. Hubble also integrates with [OpenTelemetry](https://isovalent.com/blog/post/cilium-release-112/#opentel), enabling you to export log and trace data from Cilium-managed networks to a third-party monitoring platform. 7 | 8 | Because Cilium can control traffic at layers 3, 4, and 7 of the [OSI model][osi-docs], Hubble enables you to monitor multiple levels of network traffic, such as TCP connections, DNS queries, and HTTP requests across {{< tooltip "clusters" "top" "kubernetes" >}} or cluster meshes. To accomplish this, Hubble leverages two primary components: servers and the Hubble Relay. 9 | 10 | {{< img src="cilium-hubble-diagram.png" alt="Diagram of Hubble's architecture" border="false" box-shadow="false">}} 11 | 12 | Hubble servers run alongside the Cilium agent on each cluster node. Each server implements an [Observer service][observer-docs] to monitor pod traffic and a [Peer service][peer-docs] to keep track of Hubble instances on other nodes. The Hubble Relay is a stand-alone component that collects network flow data from each server instance and makes it available to the Hubble UI and CLI via a set of APIs.
13 | 14 | Though the Hubble platform is deployed automatically with Cilium, it is not enabled by default. You can enable it by running the following command on your host: 15 | 16 | {{}} 17 | cilium hubble enable 18 | {{}} 19 | 20 | You can also check the status of both Hubble and Cilium by running the `cilium status` command, which should give you output similar to the following: 21 | 22 | 23 | {{< img src="cilium-hubble-cli-output.png" alt="Cilium CLI output" border="true">}} 24 | 25 | You will see an `error` status in the command's output if either service failed to launch. This issue can sometimes happen if underlying nodes are running out of memory. Allocating more memory and relaunching Cilium can help resolve the problem. 26 | 27 | ## The Hubble CLI 28 | Hubble's CLI extends the visibility that is provided by standard kubectl commands like `kubectl get pods` to give you more network-level details about a request, such as its status and the [security identities](https://isovalent.com/blog/post/cilium-release-112/#better-hubble-cli) associated with its source and destination. You can view this information via the `hubble observe` command and monitor traffic to, from, and between pods in order to determine if your policies are working as expected. For example, you can view all dropped requests between services by using the following command: 29 | 30 | {{}} 31 | hubble observe --verdict DROPPED 32 | 33 | May 12 13:35:35.923: default/service-a:58578 (ID:1469) -> default/service-c:80 (ID:851) http-request DROPPED (HTTP/1.1 PUT http://service-c.default.svc.cluster.local/v1/endpoint-1) 34 | {{}} 35 | 36 | The sample output above shows that the destination pod (`service-c`) dropped requests from the source pod (`service-a`). You can investigate further by adding the `-o json` option to the `hubble observe` command. The JSON output provides more context for an event, including: 37 | 38 | - the request event's verdict and relevant error message (e.g., `drop_reason_desc`) 39 | - the direction of the request (e.g., `traffic_direction`) 40 | - the type of policy that manages the pods associated with the request (e.g., `"Type"`) 41 | - the IP addresses and ports for the source and destination endpoints 42 | 43 | Using our previous example, you can review the command's JSON output to determine why the `service-b` pod is dropping requests: 44 | 45 | {{}} 46 | { 47 | "time": "2022-05-12T14:16:09.475485361Z", 48 | "verdict": "DROPPED", 49 | "drop_reason": 133, 50 | "ethernet": {...}, 51 | 52 | 53 | "IP": { 54 | "source": "10.0.0.87", 55 | "destination": "10.0.0.154", 56 | "ipVersion": "IPv4" 57 | }, 58 | "l4": {...}, 59 | 60 | 61 | 62 | "source": { 63 | "ID": 3173, 64 | "identity": 12878, 65 | "namespace": "default", 66 | "labels": [ 67 | "k8s:app.kubernetes.io/name=service-b", 68 | "k8s:class=service-b", 69 | "k8s:io.cilium.k8s.policy.cluster=minikube", 70 | "k8s:io.cilium.k8s.policy.serviceaccount=default", 71 | "k8s:io.kubernetes.pod.namespace=default", 72 | "k8s:org=gobs-1" 73 | ], 74 | "pod_name": "service-b" 75 | }, 76 | "destination": { 77 | "ID": 939, 78 | "identity": 4418, 79 | "namespace": "default", 80 | "labels": [ 81 | "k8s:app.kubernetes.io/name=service-c", 82 | "k8s:class=service-c", 83 | "k8s:io.cilium.k8s.policy.cluster=minikube", 84 | "k8s:io.cilium.k8s.policy.serviceaccount=default", 85 | "k8s:io.kubernetes.pod.namespace=default", 86 | "k8s:org=gobs-2" 87 | ], 88 | "pod_name": "service-c", 89 | "workloads": [...] 
90 | 91 | }, 92 | "Type": "L3_L4", 93 | "node_name": "minikube/minikube", 94 | "event_type": { 95 | "type": 1, 96 | "sub_type": 133 97 | }, 98 | "traffic_direction": "INGRESS", 99 | "drop_reason_desc": "POLICY_DENIED", 100 | "Summary": "TCP Flags: SYN" 101 | } 102 | {{}} 103 | 104 | In the sample snippet above, you can see that requests were dropped (`"drop_reason_desc": "POLICY_DENIED"`) due to an L3/L4 policy (`"Type": "L3_L4"`), which indicates that Cilium was managing traffic appropriately in this case. You can modify your policy if you need to enable communication between these two pods. 105 | 106 | 107 | ## The Hubble UI 108 | While the CLI provides insight into networking issues for individual pods, you still need visibility into how these problems affect the entire cluster. The Hubble UI offers a high-level service map for monitoring network activity and policy behavior, enabling you to get a better understanding of how each of your pods interact with one another. Service maps can automatically capture interdependencies between Kubernetes services, making them especially useful for monitoring large-scale environments. This level of visibility enables you to confirm that your network is routing traffic to the appropriate endpoints. 109 | 110 | To get started, you can enable and access the Hubble UI by running the following commands: 111 | 112 | {{}} 113 | cilium hubble enable --ui 114 | cilium hubble ui 115 | {{}} 116 | 117 | The Cilium CLI will automatically navigate to your Hubble UI instance at `http://localhost:12000/`, where you can select a Kubernetes {{< tooltip "namespace" "top" "kubernetes" >}} to view the service map for a particular set of pods. In the example service map below, the `service-b` pod is attempting to communicate with the `service-c` pod, but its requests are failing. 118 | 119 | {{< img src="cilium-hubble-service-map.png" alt="Hubble UI service map" border="true">}} 120 | 121 | In the request list below the service map, you can see that Cilium is dropping requests between the `service-b` and `service-c` pods. You can troubleshoot further by selecting an individual request to view more details and determine if the drop is the result of a network policy or another issue. Hubble's UI leverages the same data points as its CLI, so you have complete context for mitigating the problem. 122 | 123 | ## Monitor network traffic with Hubble 124 | In this post, we looked at how Hubble enables you to monitor network traffic across your Cilium-managed infrastructure. Check out [Cilium's documentation][hubble-docs] to learn more about leveraging the Hubble platform for monitoring the health and performance of your Kubernetes network. In the [next part][part-3] of this series, we'll show you how Datadog provides complete visibility into Cilium metrics, logs, and network data. 
125 | 126 | [part-1]: /blog/cilium-metrics-and-architecture/ 127 | [part-3]: /blog/monitor-cilium-cni-with-datadog 128 | [osi-docs]: https://www.cloudflare.com/learning/ddos/glossary/open-systems-interconnection-model-osi/ 129 | [observer-docs]: https://docs.cilium.io/en/v1.11/internals/hubble/#the-observer-service 130 | [peer-docs]: https://docs.cilium.io/en/v1.11/internals/hubble/#the-peer-service 131 | [hubble-docs]: https://docs.cilium.io/en/stable/intro/ 132 | [dropped-error-codes]: https://github.com/cilium/hubble/blob/master/vendor/github.com/cilium/cilium/pkg/monitor/api/drop.go 133 | -------------------------------------------------------------------------------- /dynamodb/how_to_collect_dynamodb_metrics.md: -------------------------------------------------------------------------------- 1 | # How to Collect DynamoDB Metrics 2 | 3 | *This post is part 2 of a 3-part series on monitoring DynamoDB. [Part 1](https://www.datadoghq.com/blog/top-dynamodb-performance-metrics) explores its key performance metrics, and [Part 3](https://www.datadoghq.com/blog/how-medium-monitors-dynamodb-performance) describes the strategies Medium uses to monitor DynamoDB.* 4 | 5 | This section of the article is about collecting native DynamoDB metrics, which are available exclusively from AWS via CloudWatch. For other non-native metrics, see [Part 3](https://www.datadoghq.com/blog/how-medium-monitors-dynamodb-performance). 6 | 7 | CloudWatch metrics can be accessed in three different ways: 8 | 9 | - [Using the AWS Management Console and its web interface](#console) 10 | - [Using the command-line interface (CLI)](#cli) 11 | - [Using a monitoring tool integrating the CloudWatch API](#integrations) 12 | 13 |
14 | 15 | ## Using the AWS Management Console 16 | 17 | Using the online management console is the simplest way to monitor DynamoDB with CloudWatch. It allows you to set up simple automated alerts, and get a visual picture of recent changes in individual metrics. 18 | 19 | ### Graphs 20 | 21 | Once you are signed in to your AWS account, you can open the [CloudWatch console](https://console.aws.amazon.com/cloudwatch/home) where you will see the metrics related to the different AWS technologies. 22 | 23 | [![CloudWatch console](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-01b.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-01b.png) 24 | 25 | By clicking on DynamoDB’s “Table Metrics” you will see the list of your tables with the available metrics for each one: 26 | 27 | [![DynamoDB metrics in CloudWatch](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-02b.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-02b.png) 28 | 29 | Just select the checkbox next to the metrics you want to visualize, and they will appear in the graph at the bottom of the console. 30 | 31 | [![DynamoDB metric graph in CloudWatch](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-03b.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-03b.png) 32 | 33 | ### Alerts 34 | 35 | With the CloudWatch Management Console you can also create alerts which trigger when a certain metric threshold is crossed. 36 | 37 | Click on **Create Alarm** at the right of your graph, and you will be able to set up the alert, and configure it to notify a list of email addresses: 38 | 39 | [![DynamoDB CloudWatch alert](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-04b.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/2-04b.png) 40 | 41 |
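If you would rather script alarm creation than click through the console, the same kind of threshold alert can be defined with the AWS CLI (installing the CLI is covered in the next section). This is only a sketch: the alarm name, table name, threshold, and SNS topic ARN below are placeholders to adapt to your own setup.

```
# Alarm when a table sees 10 or more throttled requests in a 5-minute window
aws cloudwatch put-metric-alarm \
    --alarm-name dynamodb-throttled-requests-yourtable \
    --namespace AWS/DynamoDB --metric-name ThrottledRequests \
    --dimensions Name=TableName,Value=YourTable \
    --statistic Sum --period 300 --evaluation-periods 1 \
    --threshold 10 --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:your-alerts-topic
```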
42 | 43 | ## Using the Command Line Interface 44 | 45 | You can also retrieve metrics related to a specific table using the command line. To do so, you will need to install the AWS Command Line Interface by following [these instructions](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html). You will then be able to query for any CloudWatch metrics you want, using different filters. 46 | 47 | Command line queries can be useful for spot checks and ad hoc investigations when you can’t or don’t want to use a browser. 48 | 49 | For example, if you want to retrieve the metrics related to `PutItem` requests throttled during a specified time period, you can run: 50 |
```
aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB --metric-name ThrottledRequests \
    --dimensions Name=TableName,Value=YourTable Name=Operation,Value=PutItem \
    --start-time 2015-08-02T00:00:00Z --end-time 2015-08-04T00:00:00Z \
    --period 300 --statistics=Sum
```
84 | 85 | Here is an example of a JSON output format you will see from a `get-metric-statistics` query like above: 86 |
```
{
    "Datapoints": [
        {
            "Timestamp": "2015-08-02T11:18:00Z",
            "Average": 44.79,
            "Unit": "Count"
        },
        {
            "Timestamp": "2015-08-02T20:18:00Z",
            "Average": 47.92,
            "Unit": "Count"
        },
        {
            "Timestamp": "2015-08-02T19:18:00Z",
            "Average": 50.85,
            "Unit": "Count"
        }
    ],
    "Label": "ThrottledRequests"
}
```
165 | 166 |
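Beyond ad hoc queries, the alarm shown earlier in the console section can also be created programmatically through the CloudWatch API. The snippet below is only a sketch of that idea using Python and boto3 (it is not part of the original walkthrough); the alarm name, table name, threshold, and SNS topic ARN are placeholder assumptions to adapt to your own setup.
```
# Sketch: recreate the console's "Create Alarm" step with boto3.
# The table name, threshold, and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="YourTable-PutItem-throttles",
    AlarmDescription="Throttled PutItem requests detected on YourTable",
    Namespace="AWS/DynamoDB",
    MetricName="ThrottledRequests",
    Dimensions=[
        {"Name": "TableName", "Value": "YourTable"},
        {"Name": "Operation", "Value": "PutItem"},
    ],
    Statistic="Sum",                 # total throttles per period
    Period=300,                      # 5-minute evaluation windows
    EvaluationPeriods=1,
    Threshold=1.0,                   # alarm on any throttling in a window
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dynamodb-alerts"],
)
```
Because the alarm definition lives in code, it is easy to repeat for every table you care about instead of clicking through the console each time.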
167 | 168 | ## Integrations 169 | 170 | The third way to collect CloudWatch metrics is via your own monitoring tools, which can offer extended monitoring functionality. For example, if you want the ability to correlate metrics from one part of your infrastructure with other parts (including custom infrastructure or applications), or you want to dynamically slice, aggregate, and filter your metrics on any attribute, or you need specific alerting mechanisms, you are probably using a dedicated monitoring tool or platform. CloudWatch can be integrated with these platforms via API, and in many cases the integration just needs to be enabled to start working. 171 | 172 | In [Part 3](https://www.datadoghq.com/blog/how-medium-monitors-dynamodb-performance), we cover a real-world example of this type of metrics collection: [Medium](https://medium.com/)’s engineering team monitors DynamoDB using an integration with Datadog. 173 | 174 | ## Conclusion 175 | 176 | In this post we have walked through how to use CloudWatch to collect and visualize DynamoDB metrics, and how to generate alerts when these metrics go out of bounds. 177 | 178 | As discussed in [Part 1](https://www.datadoghq.com/blog/top-dynamodb-performance-metrics), DynamoDB can’t see, or doesn’t report on, all the events that you likely want to monitor, including true latency and error rates. In the [next and final part of this series](https://www.datadoghq.com/blog/how-medium-monitors-dynamodb-performance), we take you behind the scenes with Medium’s engineering team to learn about the issues they’ve encountered, and how they’ve solved them with a mix of tools that includes CloudWatch, ELK, and Datadog. 179 | -------------------------------------------------------------------------------- /openstack/how_lithium_monitors_openstack.md: -------------------------------------------------------------------------------- 1 | # How Lithium monitors OpenStack 2 | 3 | [Lithium] was founded in 2001 as an offshoot from gamers.com (a gaming community) and has since evolved into a leading social software provider whose Total Community platform helps brands connect with, engage, and understand their customers. With more than [400 communities][communities] and growing, Lithium uses [OpenStack] as a private datacenter, with the flexibility to deploy customized, public-facing communities for major brands across industries and regions. 4 | 5 | In this article we will pull back the curtain to learn Lithium's best practices and tips for using OpenStack, and how Lithium monitors OpenStack with the help of Datadog. 6 | 7 | ## Why monitoring OpenStack is critical 8 | 9 | [OpenStack] is a central component in Lithium's infrastructure, forming the backbone of their service platform. Lithium leverages OpenStack for both production and development environments, with OpenStack hosting a large number of production communities, as well as demo communities for sales engineers. 10 | In addition to community hosting, OpenStack also hosts infrastructure services, including [Kubernetes], [Chef] servers, and [BIND] slaves. 11 | 12 | With such a far-reaching deployment, failure is not an option. If OpenStack were not properly monitored and managed, numerous and noticeable failures could occur: sales engineers wouldn't be able to create demo environments for prospects, developers wouldn't be able to spawn test environments, and the communities in production could go down or see increased response times as computing resources became unavailable.
13 | 14 | That's why Lithium's engineers monitor OpenStack around the clock. Using [Datadog], they can correlate all the relevant OpenStack metrics with metrics from other parts of their infrastructure, all in one place. Lithium engineers can spot issues at a glance and determine the root cause of the problem, in addition to setting up advanced alerts on mission-critical metrics. 15 | 16 | [![Lithium OpenStack dashboard][lithium-dash]][lithium-dash] 17 | _A Datadog dashboard that Lithium uses to monitor OpenStack_ 18 | 19 | ## Key metrics for Lithium 20 | 21 | ### Number of instances running 22 | Lithium engineers track the total number of instances running across their OpenStack deployment to correlate with changes in other metrics. For example, a large increase in total RAM used makes sense in light of additional instances being spun up. Tracking the number of instances running alongside other metrics helps inform decisions for capacity and [tenant quota][quotas] planning. 23 | 24 | ### Instances per project 25 | Like the total number of instances running, Lithium tracks the number of instances used per project to get a better idea of how their private cloud is being used. A common problem they found was that engineers would often spin up development environments and forget to shut them down, which means resources were provisioned but unused. By tracking the number of instances per project, admins could rein in excessive or unnecessary usage and free up resources without resorting to installing additional hardware. 26 | 27 | ### Available memory 28 | As mentioned in [Part 1][part 1] of our [series][part 2] on [monitoring OpenStack Nova][part 3], visibility into OpenStack's resource consumption is essential to ensuring smooth operation and preventing user frustration. If available resources were insufficient, sales engineers would be breathing down the neck of the techops team, unable to create demo accounts for prospects, and developers would be stuck without a dev environment. 29 | 30 | ### VCPU available 31 | Just like available memory, tracking the number of VCPUs available for allocation is critical—a lack of available CPUs prevents provisioning of additional instances. 32 | 33 | ### Metric deltas 34 | [![Change in instances used][instance-change-graph]][instance-change-graph] 35 | 36 | Finally, Lithium tracks the changes in metrics' values over time to give insight into the causes of changes in resource availability and consumption. 37 | 38 | Using Datadog's [Change graph feature][graph-change], engineers have a bird's eye view of week-to-week changes in resource usage. By analyzing resource deltas, engineers and decision makers have the data they need to inform hardware purchasing decisions and perform diligent capacity planning. 39 | 40 | ## Alerting the right people 41 | Alerting is an [essential component][alerting-101] of any monitoring strategy—alerts let engineers react to issues as they occur, before users are affected. With Datadog alerts, Lithium is able to send notifications via their usual communication channels (chat, PagerDuty, email, etc.), as well as provide engineers with suggested fixes or troubleshooting techniques—all without human intervention. 42 | 43 | Lithium generally uses [PagerDuty] for priority alerts, and [HipChat] or email for lower-priority alerts and for alerting specific engineers to a particular issue. For OpenStack, Lithium alerts on excessive resource consumption. 
As mentioned in [Part 1][part 1] of our OpenStack series, monitoring resource consumption is a **critical** part of a comprehensive OpenStack monitoring strategy. 44 | 45 | [![Lithium alerts][lithium-alert]][lithium-alert] 46 | 47 | Datadog alerts give Lithium engineers the flexibility to inform the right people that a problem has occurred, at the right time, across an [ever-growing][integration-list] list of platforms. 48 | 49 | ## Why Datadog? 50 | Before adopting Datadog, Lithium admins were relying on [Horizon][Horizon] (OpenStack's canonical dashboard) to extract meaningful metrics from their deployment. This approach was severely limited—engineers could only access rudimentary statistics about their deployment and lacked the ability to correlate metrics from OpenStack with metrics from across their infrastructure. 51 | 52 | With Datadog [screenboards], they can combine the historical perspective of graphed timeseries data with alert values to put current operations metrics in context. 53 | 54 | [![Lithium widgets][lithium-widgets]][lithium-widgets] 55 | 56 | Datadog also makes it easy to collect and monitor [RabbitMQ] and MySQL metrics, in addition to general OpenStack metrics, for even deeper insight into performance issues. For Lithium, having Datadog in place has allowed engineers to adjust internal workflows, reducing the total number of elements that need monitoring. 57 | 58 | ### Saving time, money, and reputation 59 | Adopting Datadog has allowed Lithium to catch problems in OpenStack as well as applications running on their OpenStack cloud. Now, Lithium engineers have the tools and information they need to react quickly to problems and resolve infrastructure issues with minimal customer impact, saving time, money, and reputation. 60 | 61 | ## Conclusion 62 | [![Nova default dash][nova-dash]][nova-dash] 63 | _Default Datadog OpenStack dashboard_ 64 | 65 | If you're already using OpenStack and Datadog, we hope these strategies will help you gain improved visibility into what's happening in your deployment. If you don't yet have a Datadog account, you can [start monitoring][part 3] OpenStack performance today with a [free trial]. 66 | 67 | ## Acknowledgments 68 | 69 | Thanks to [Lithium] and especially [Mike Tougeron][twit], Lead Cloud Platform Engineer, for generously sharing their OpenStack expertise and monitoring strategies for this article. 
70 | 71 | 72 | 73 | [alerting-101]: https://www.datadoghq.com/blog/monitoring-101-alerting/ 74 | [BIND]: https://www.isc.org/downloads/bind/ 75 | [Chef]: https://www.chef.io/chef/ 76 | [communities]: http://www.lithium.com/why-lithium/customer-success/ 77 | [Datadog]: https://datadoghq.com 78 | [graph-change]: http://docs.datadoghq.com/graphing/#select-your-visualization 79 | [HipChat]: https://www.hipchat.com/ 80 | [Horizon]: http://docs.openstack.org/developer/horizon/ 81 | [instance-change-graph]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-OpenStack/lithium/instances-change-graph.png 82 | [integration-list]: http://docs.datadoghq.com/integrations/ 83 | [Kubernetes]: https://www.datadoghq.com/blog/corral-your-docker-containers-with-kubernetes-monitoring/ 84 | [Lithium]: http://www.lithium.com 85 | [lithium-alert]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-OpenStack/lithium/lithium-alert.png 86 | [lithium-dash]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-OpenStack/lithium/lithium-dashboard.png 87 | [lithium-widgets]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-OpenStack/lithium/lithium-widgets.png 88 | [OpenStack]: https://www.openstack.org 89 | [PagerDuty]: https://www.pagerduty.com/ 90 | [quotas]: http://docs.openstack.org/user-guide-admin/dashboard_set_quotas.html 91 | [nova-dash]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-12-OpenStack/default-dash.png 92 | [RabbitMQ]: https://www.datadoghq.com/blog/openstack-monitoring-nova/#rabbitmq-metrics 93 | [screenboards]: http://help.datadoghq.com/hc/en-us/articles/204580349-What-is-the-difference-between-a-ScreenBoard-and-a-TimeBoard- 94 | [twit]: https://twitter.com/mtougeron 95 | 96 | [free trial]: https://app.datadoghq.com/signup 97 | [part 1]: https://www.datadoghq.com/blog/openstack-monitoring-nova/ 98 | [part 2]: https://www.datadoghq.com/blog/collecting-metrics-notifications-openstack-nova/ 99 | [part 3]: https://www.datadoghq.com/blog/openstack-monitoring-datadog/ -------------------------------------------------------------------------------- /hadoop/monitoring_hadoop_with_datadog.md: -------------------------------------------------------------------------------- 1 | _This post is part 4 of a 4-part series on monitoring Hadoop health and performance. [Part 1] gives a general overview of Hadoop's architecture and subcomponents, [Part 2] dives into the key metrics to monitor, and [Part 3] details how to monitor Hadoop performance natively._ 2 | 3 | If you’ve already read our post on collecting Hadoop metrics, you’ve seen that you have several options for ad hoc performance checks. For a more comprehensive view of your cluster's health and performance, however, you need a monitoring system that continually collects Hadoop statistics, events, and metrics, that lets you identify both recent and long-term performance trends, and that can help you quickly resolve issues when they arise. 4 | 5 | This post will show you how to set up detailed Hadoop monitoring by installing the Datadog Agent on your Hadoop nodes. 6 | 7 | [![Hadoop dashboard image][dash]][dash] 8 | 9 | With Datadog, you can collect Hadoop metrics for visualization, alerting, and full-infrastructure correlation. Datadog will automatically collect the key metrics discussed in parts [two][Part 2] and [three][Part 3] of this series, and make them available in a [template dashboard][dashboarding], as seen above. 
10 | 11 | ## Integrating Datadog, Hadoop, and ZooKeeper 12 | ### Verify Hadoop and ZooKeeper status 13 | Before you begin, you should verify that all Hadoop components, including ZooKeeper, are up and running. 14 | 15 | #### Hadoop 16 | To verify that all of the Hadoop processes are started, run `sudo jps` on your NameNode, ResourceManager, and DataNodes to return a list of the running services. 17 | 18 | Each service should be running a process that bears its name, e.g., `NameNode` on the NameNode: 19 | 20 | ``` 21 | [hadoop@sandbox ~]$ sudo jps 22 | 2354 NameNode 23 | [...] 24 | ``` 25 | 26 | #### ZooKeeper 27 | For ZooKeeper, you can run this one-liner, which uses the [4-letter-word] `ruok`: 28 | `echo ruok | nc localhost 2181` 29 | 30 | If ZooKeeper responds with `imok`, you are ready to install the Agent. 31 | 32 | ### Install the Datadog Agent 33 | The Datadog Agent is the [open source software][dd-agent] that collects and reports metrics from your hosts so that you can view and monitor them in Datadog. Installing the Agent usually takes just a single command. 34 | 35 | Installation instructions for a variety of platforms are available [here][agent-install]. 36 | 37 | As soon as the Agent is up and running, you should see your host reporting metrics in your [Datadog account][infra-list]. 38 | 39 | [![Agent reporting in][host0]][host0] 40 | 41 | ### Configure the Agent 42 | 43 | Next you will need to create Agent configuration files for your Hadoop infrastructure. In the [Agent configuration directory][os-config], you will find template configuration files for the NameNode, DataNodes, MapReduce, YARN, and ZooKeeper. If your services are running on their default ports (50075 for DataNodes, 50070 for the NameNode, 8088 for the ResourceManager, and 2181 for ZooKeeper), you can copy the templates without modification to create your config files. 44 | 45 | On your NameNode: 46 | `cp hdfs_namenode.yaml.example hdfs_namenode.yaml` 47 | On your DataNodes: 48 | `cp hdfs_datanode.yaml.example hdfs_datanode.yaml` 49 | On your (YARN) ResourceManager: 50 | `cp mapreduce.yaml.example mapreduce.yaml` 51 | `cp yarn.yaml.example yarn.yaml` 52 | Lastly, on your ZooKeeper nodes: 53 | `cp zk.yaml.example zk.yaml` 54 | _Windows users: use copy in place of cp_ 55 | 56 | ### Verify configuration settings 57 | 58 | To verify that all of the components are properly integrated, on each host [restart the Agent][os-config] and then run the Datadog [`info`][os-config] command. If the configuration is correct, you will see a section resembling the one below in the `info` output: 59 | 60 | ``` 61 | Checks 62 | ====== 63 | [...] 64 | hdfs_datanode 65 | ------------- 66 | - instance #0 [OK] 67 | - Collected 10 metrics, 0 events & 2 service checks 68 | hdfs_namenode 69 | ------------- 70 | - instance #0 [OK] 71 | - Collected 23 metrics, 0 events & 2 service checks 72 | 73 | mapreduce 74 | --------- 75 | - instance #0 [OK] 76 | - Collected 4 metrics, 0 events & 2 service checks 77 | 78 | yarn 79 | ---- 80 | - instance #0 [OK] 81 | - Collected 38 metrics, 0 events & 4 service checks 82 | ``` 83 | 84 | ### Enable the integrations 85 | Next, click the **Install Integration** button for [HDFS][hdfs-int], [MapReduce][mapreduce-int], [YARN][yarn-int], and [ZooKeeper][zk-int] under the *Configuration* tab in each technology's integration settings page. 86 | 87 | ![Install Hadoop integration][install-integration] 88 | 89 | ## Show me the metrics!
90 | Once the Agent begins reporting metrics, you will see a comprehensive Hadoop dashboard among your [list of available dashboards in Datadog][dash-list]. 91 | 92 | The Hadoop dashboard, as seen at the top of this article, displays the key metrics highlighted in our [introduction on how to monitor Hadoop][Part 1]. 93 | 94 | [![ZooKeeper dashboard image][zk-dash]][zk-dash] 95 | 96 | The default ZooKeeper dashboard above displays the key metrics highlighted in our [introduction on how to monitor Hadoop][Part 1]. 97 | 98 | You can easily create a more comprehensive dashboard to monitor your entire data-processing infrastructure by adding additional graphs and metrics from your other systems. For example, you might want to graph Hadoop metrics alongside metrics from [Cassandra] or [Kafka], or alongside host-level metrics such as memory usage on application servers. To start building a custom dashboard, clone the template Hadoop dashboard by clicking on the gear on the upper right of the dashboard and selecting **Clone Dash**. 99 | 100 | 101 | [![Clone dashboard image][clone-dash]][clone-dash] 102 | 103 | ## Alerting 104 | Once Datadog is capturing and visualizing your metrics, you will likely want to [set up some alerts][alerting] to be automatically notified of potential issues. 105 | 106 | Datadog can monitor individual hosts, containers, services, processes—or virtually any combination thereof. For instance, you can view all of your DataNodes, NameNodes, and containers, or all nodes in a certain availability zone, or even a single metric being reported by all hosts with a specific tag. Datadog can also monitor Hadoop events, so you can be notified if jobs fail or take abnormally long to complete. 107 | 108 | ## Start monitoring today 109 | 110 | In this post we’ve walked you through integrating Hadoop with Datadog to visualize your [key metrics][Part 1] and [set alerts][monitoring] so you can keep your Hadoop jobs running smoothly. If you’ve followed along using your own Datadog account, you should now have improved visibility into your data-processing infrastructure, as well as the ability to create automated alerts tailored to the metrics and events that are most important to you. If you don’t yet have a Datadog account, you can [sign up for a free trial][signup] and start monitoring Hadoop right away. 111 | 112 | _Source Markdown for this post is available [on GitHub][markdown]. Questions, corrections, additions, etc.?
Please [let us know][issues]._ 113 | 114 | []: Images 115 | 116 | [clone-dash]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-07-hadoop/dd/clone-dash.png 117 | [dash]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-07-hadoop/dd/default-dash2.png 118 | [zk-dash]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-07-hadoop/dd/zk-dash.png 119 | [host0]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-02-kafka/default-host.png 120 | [install-integration]: https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-07-hadoop/dd/install-integration.png 121 | 122 | []: Links 123 | 124 | [4-letter-word]: https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#The+Four+Letter+Words 125 | [agent-install]: https://app.datadoghq.com/account/settings#agent 126 | [alerting]: http://docs.datadoghq.com/guides/monitoring/ 127 | [Cassandra]: https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/ 128 | [dashboarding]: https://www.datadoghq.com/dashboarding/ 129 | [dash-list]: https://app.datadoghq.com/dash/list 130 | [dd-agent]: https://github.com/DataDog/dd-agent 131 | [HAProxy]: https://www.datadoghq.com/blog/monitoring-haproxy-performance-metrics 132 | [Kafka]: https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ 133 | [infra-list]: https://app.datadoghq.com/infrastructure 134 | [monitoring]: http://docs.datadoghq.com/guides/monitoring/ 135 | [os-config]: http://docs.datadoghq.com/guides/basic_agent_usage/ 136 | [outlier]: https://www.datadoghq.com/blog/introducing-outlier-detection-in-datadog/ 137 | [signup]: https://app.datadoghq.com/signup 138 | 139 | []: Conf_links 140 | 141 | [hdfs-int]: https://app.datadoghq.com/account/settings#integrations/hdfs 142 | [mapreduce-int]: https://app.datadoghq.com/account/settings#integrations/mapreduce 143 | [yarn-int]: https://app.datadoghq.com/account/settings#integrations/yarn 144 | [zk-int]: https://app.datadoghq.com/account/settings#integrations/zookeeper 145 | [dn-conf]: https://github.com/DataDog/dd-agent/blob/master/conf.d/hdfs_datanode.yaml.example 146 | [nn-conf]: https://github.com/DataDog/dd-agent/blob/master/conf.d/hdfs_namenode.yaml.example 147 | [mr-conf]: https://github.com/DataDog/dd-agent/blob/master/conf.d/mapreduce.yaml.example 148 | [yarn-conf]: https://github.com/DataDog/dd-agent/blob/master/conf.d/yarn.yaml.example 149 | [zk-conf]: https://github.com/DataDog/dd-agent/blob/master/conf.d/zk.yaml.example 150 | 151 | []: Bottom_Links 152 | 153 | [issues]: https://github.com/DataDog/the-monitor/issues 154 | [markdown]: https://github.com/DataDog/the-monitor/blob/master/hadoop/monitoring_hadoop_with_datadog.md 155 | [Part 1]: https://www.datadoghq.com/blog/hadoop-architecture-overview/ 156 | [Part 2]: https://www.datadoghq.com/blog/monitor-hadoop-metrics/ 157 | [Part 3]: https://www.datadoghq.com/blog/collecting-hadoop-metrics/ 158 | [Part 4]: https://www.datadoghq.com/blog/monitor-hadoop-metrics-datadog/ -------------------------------------------------------------------------------- /mongodb/collecting-mongodb-metrics-and-statistics.md: -------------------------------------------------------------------------------- 1 | #Collecting MongoDB metrics and statistics 2 | *This post is part 2 of a 3-part series about monitoring MongoDB metrics and performance. 
Part 1 presents the key performance metrics available from MongoDB: there is [one post for the WiredTiger](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger) storage engine and [one for MMAPv1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-mmap). In [Part 3](https://www.datadoghq.com/blog/monitor-mongodb-performance-with-datadog) you will discover how to monitor MongoDB performance with Datadog.* 3 | 4 | If you’ve already read our guide to key MongoDB metrics in Part 1 of this series, you’ve seen that MongoDB provides a vast array of metrics on performance and resource utilization. This post covers the different options for collecting MongoDB metrics in order to monitor them. There are three ways to collect MongoDB metrics from your hosts: 5 | 6 | - Using [utilities](#utilities) offered by MongoDB to collect real-time activity statistics 7 | - Using [database commands](#commands) to check the database’s current state 8 | - Using a dedicated [monitoring tool](#tools) for more advanced monitoring features and graphing capabilities, which are essential for databases running in production 9 | 10 | ## Utilities 11 | 12 | Utilities provide real-time statistics on the current activity of your MongoDB cluster. They can be useful for ad hoc checks, but to get actionable insights and more advanced monitoring features, you should check the last section about dedicated monitoring tools. 13 | The two main utilities are **mongostat** and **mongotop**. 14 | 15 | ### mongostat 16 | 17 | **mongostat** is the most powerful utility. It reports real-time statistics about connections, inserts, queries, updates, deletes, queued reads and writes, flushes, memory usage, page faults, and much more. It can be useful to quickly spot-check database activity, confirm that values are not abnormally high, and make sure you have enough capacity. 18 | 19 | However, **mongostat** does not provide insight into metrics about replication and the oplog, cursors, storage, resource saturation, asserts, or host-level metrics. **mongostat** returns cache statistics only if you use the WiredTiger storage engine. 20 | [![mongostat](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongostat.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongostat.png) 21 | 22 | The [MongoDB documentation](https://docs.mongodb.com/manual/reference/program/mongostat/#bin.mongostat) explains the meaning of [the different fields](https://docs.mongodb.com/manual/reference/program/mongostat/#fields) returned by mongostat, along with the available [options](https://docs.mongodb.com/manual/reference/program/mongostat/#options). 23 | 24 | mongostat relies on the `db.serverStatus()` command ([see below](#commands)). 25 | 26 | NOTE: Prior to version 3.2, MongoDB offered an HTTP console displaying monitoring statistics on a web page, but it has been deprecated since v3.2. 27 | 28 | ### mongotop 29 | 30 | **mongotop** returns the amount of time a MongoDB instance spends performing read and write operations. It is broken down by collection (namespace). This allows you to make sure there is no unexpected activity and see where resources are consumed. All active namespaces are reported.
31 | [![mongotop](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongotop.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongotop.png) 32 | 33 | By default, values are printed every second, but you can specify the frequency. For example, if you want it to report every 20 seconds, you can run `mongotop 20`. Many other [options](https://docs.mongodb.com/manual/reference/program/mongotop/#options) are available as well. 34 | 35 | Utilities are great for quick checks and ad hoc investigations, but for more detailed insights into the health and performance of your database, explore MongoDB commands discussed in the next section. 36 | 37 | ## Commands 38 | 39 | MongoDB provides several commands that can be used to collect the different metrics from your database presented in [Part 1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger). Here are the most useful ones. 40 | 41 | ### serverStatus 42 | 43 | **serverStatus** (`db.serverStatus()` if run from the mongo shell) is the most complete native metrics-gathering command for MongoDB. It provides a document with statistics from most of the key MongoDB metrics categories we talked about in [Part 1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger): connections, operations, journaling, background flushing, locking, cursors, memory, asserts, etc. You can find the full list of metrics it can return [here](https://docs.mongodb.com/manual/reference/command/serverStatus/#output). 44 | 45 | This command is used by most [third party monitoring tools](#tools) to collect MongoDB metrics, along with the dbStats and replSetGetStatus commands, which are still necessary to collect storage metrics and statistics about your replica sets (see the next paragraphs). 46 | 47 | ### dbStats 48 | 49 | **dbStats** (`db.stats()` in the mongo shell) provides metrics about storage usage of the database: number of objects, or memory taken by documents and padding in the database (see memory metrics in [Part 1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger) of this series). [Here](https://docs.mongodb.com/manual/reference/command/dbStats/#output) is the full list of metrics it returns. 50 | 51 | ### collStats 52 | 53 | **collStats** (`db.collection.stats()` in the shell) returns metrics similar to the dbStats output but for a specified collection: size of a collection, number of objects inside it, average size of objects, number of indexes in the collection, etc. See the full list [here](https://docs.mongodb.com/manual/reference/command/collStats/#output). 54 | 55 | 56 | 57 | For example, the following command runs collStats on the “restaurant” collection, with a scale of 1024 bytes: 58 | `db.runCommand( { collStats: "restaurant", scale: 1024 } )` 59 | 60 | ### getReplicationInfo 61 | 62 | **getReplicationInfo** (`db.printReplicationInfo()` in the shell) returns metrics about the oplogs of the different members of a replica set, such as the oplog size or the oplog window. See the list of output fields [here](https://docs.mongodb.com/manual/reference/method/db.printReplicationInfo/#output-fields). 63 | 64 | ### replSetGetStatus 65 | 66 | **replSetGetStatus** (`rs.status()` from the shell) reports metrics about members of your replica set: state, metrics required to calculate replication lag. [See Part 1](https://www.datadoghq.com/blog/monitoring-mongodb-performance-metrics-wiredtiger) for more info about these metrics.
This command is used to check the health of a replica set’s members and make sure replication is correctly configured. You can find the full list of metrics of the output [here](https://docs.mongodb.com/manual/reference/command/replSetGetStatus/#output). 67 | 68 | ### sh.status 69 | 70 | Sh.status (`sh.status()` from the shell) provides metrics about sharding configuration and existing chunks (contiguous range of shard key values in a specific [shard](https://docs.mongodb.com/manual/reference/glossary/#term-shard)) for a sharded cluster. The full list of metrics of the output is available [here](https://docs.mongodb.com/manual/reference/method/sh.status/#output-fields). 71 | 72 | ### getProfilingStatus 73 | 74 | getProfilingStatus (`db.getProfilingStatus()` in the shell) returns the current [profile](https://docs.mongodb.com/manual/reference/command/profile/#dbcmd.profile) level and the defined threshold above which the profiler considers a query slow (slowOpThresholdMs). 75 | 76 | ## Production monitoring 77 | 78 | The first two sections of this post cover built-in ways to manually access MongoDB metrics using simple lightweight tools. For databases running in production, you will likely want a more comprehensive monitoring system that ingests MongoDB metrics as well as metrics from other technologies in your stack. 79 | 80 | ### MongoDB’s own tools 81 | 82 | With [MongoDB Enterprise Advanced](https://www.mongodb.com/products/mongodb-enterprise-advanced), you will be able to collect performance metrics, automate, and backup your deployment through MongoDB’s management tools: 83 | 84 | - [Ops Manager](https://www.mongodb.com/products/ops-manager) is the easiest way to manage MongoDB from your own data center 85 | - [Cloud Manager](https://www.mongodb.com/cloud/) allows you to manage your MongoDB deployment through MongoDB’s cloud service 86 | 87 | [![MongoDB Cloud Manager](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongodb-cloud-manager.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongodb-cloud-manager.png) 88 | 89 | If you have it, MongoDB Ops Manager will likely be your go-to place to take actions to monitor, prevent or resolve MongoDB performance issues. 90 | 91 | ### Visibility into all your infrastructure with Datadog 92 | 93 | At Datadog, we worked with MongoDB’s team to develop a strong integration. Using Datadog you can start collecting, graphing, and monitoring all MongoDB metrics from your instances with a minimum of overhead, and immediately correlate what’s happening in MongoDB with the rest of your stack 94 | 95 | Datadog offers extended monitoring functionality, such as: 96 | 97 | - Dynamic slicing, aggregation, and filters on metrics 98 | - Historical data access 99 | - Advanced alerting mechanisms 100 | 101 | [![MongoDB Datadog dashboard](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongodb-metrics.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2016-05-mongodb/2-collect/mongodb-metrics.png) 102 | 103 | For more details, check out our guide to monitoring MongoDB metrics with Datadog in the [third and last part of this series](https://www.datadoghq.com/blog/monitor-mongodb-performance-with-datadog). 
104 | -------------------------------------------------------------------------------- /elb/how_to_collect_aws_elb_metrics.md: -------------------------------------------------------------------------------- 1 | *This post is part 2 of a 3-part series on monitoring Amazon ELB. [Part 1](https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics) explores its key performance metrics, and [Part 3](https://www.datadoghq.com/blog/monitor-elb-performance-with-datadog) shows you how Datadog can help you monitor ELB.* 2 | 3 | *__Note:__ The metrics referenced in this article pertain to [classic](https://aws.amazon.com/elasticloadbalancing/classicloadbalancer/) ELB load balancers. We will cover Application Load Balancer metrics in a future article.* 4 | 5 | This part of the series is about collecting ELB metrics, which are available from AWS via CloudWatch. They can be accessed in three different ways: 6 | 7 | - [Using the AWS Management Console](#console) 8 | - [Using the command-line interface (CLI)](#cli) 9 | - [Using a monitoring tool integrating the CloudWatch API](#tools) 10 | 11 | We will also explain how [using ELB access logs](#logs) can be useful when investigating specific request issues. 12 | 13 |
14 | 15 | ## Using the AWS Management Console 16 | 17 | Using the online management console is the simplest way to monitor your load balancers with CloudWatch. It allows you to set up basic automated alerts and to get a visual picture of recent changes in individual metrics. 18 | 19 | ### Graphs 20 | 21 | Once you are signed in to your AWS account, you can open the [CloudWatch console](https://console.aws.amazon.com/cloudwatch/home#metrics:) and then browse the metrics related to the different AWS services. 22 | 23 | [![ELB metrics in AWS Console](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-01.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-01.png) 24 | 25 | By clicking on the ELB Metrics category, you will see the list of available metrics per load balancer, per availability zone: 26 | 27 | [![List of ELB metrics in AWS Console](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-02.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-02.png) 28 | 29 | You can also view the metrics across all your load balancers: 30 | 31 | [![List of ELB metrics across all load balancers](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-03.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-03.png) 32 | 33 | Just select the checkbox next to the metrics you want to visualize, and they will appear in the graph at the bottom of the console: 34 | 35 | [![ELB metrics graphs in AWS Console](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-04.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-04.png) 36 | 37 | ### Alerts 38 | 39 | With the CloudWatch Management Console you can also create simple alerts that trigger when a metric crosses a specified threshold. 40 | 41 | Click on the “Create Alarm” button at the right of your graph, and you will be able to set up the alert and configure it to notify a list of email addresses: 42 | 43 | [![ELB alerts in AWS Console](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-05.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-10-elb/2-05.png) 44 | 45 |
46 | 47 | ## Using the AWS Command Line Interface 48 | 49 | You can also retrieve metrics related to a load balancer from the command line. To do so, you will need to install the AWS Command Line Interface (CLI) by following [these instructions](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html). You will then be able to query for any CloudWatch metric, using different filters. 50 | 51 | Command line queries can be useful for spot checks and ad hoc investigations when you can’t, or don’t want to, use a browser. 52 | 53 | For example, if you want to know the health state of all the backend instances registered to a load balancer, you can run: 54 | 55 | `aws elb describe-instance-health --load-balancer-name my-load-balancer` 56 | 57 | That command should return a JSON output of this form: 58 |
```
{
    "InstanceStates": [
        {
            "InstanceId": "i-xxxxxxxx",
            "ReasonCode": "N/A",
            "State": "InService",
            "Description": "N/A"
        },
        {
            "InstanceId": "i-xxxxxxxx",
            "ReasonCode": "N/A",
            "State": "InService",
            "Description": "N/A"
        }
    ]
}
```
125 | 126 | [Here](http://docs.aws.amazon.com/cli/latest/reference/elb/index.html) are all the ELB commands you can run with the CLI. 127 | 128 |
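Note that `describe-instance-health` reports instance state rather than the CloudWatch metrics themselves. If you want to script the collection of ELB metrics such as `Latency`, you can call the CloudWatch API directly. The snippet below is only a sketch using Python and boto3, not part of the original walkthrough; the load balancer name is a placeholder.
```
# Sketch: pull average ELB latency for the past hour from CloudWatch with boto3.
# "my-load-balancer" is a placeholder; classic ELB metrics live in the AWS/ELB namespace.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

# Datapoints are returned unordered, so sort by timestamp before printing.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 4), point["Unit"])
```
The same call with a different `MetricName` (for example `HTTPCode_Backend_5XX` with `Statistics=["Sum"]`) covers the error counts discussed in Part 1.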
129 | 130 | ## Monitoring tool integrated with CloudWatch 131 | 132 | The third way to collect CloudWatch metrics is via your own monitoring tools, which can offer extended monitoring functionality. 133 | 134 | You probably need a dedicated monitoring system if, for example, you want to: 135 | 136 | - Correlate metrics from one part of your infrastructure with others (including custom infrastructure or applications) 137 | - Dynamically slice, aggregate, and filter your metrics on any attribute 138 | - Access historical data 139 | - Set up sophisticated alerting mechanisms 140 | 141 | CloudWatch can be integrated with outside monitoring systems  via API, and in many cases the integration just needs to be enabled to start working. 142 | 143 | As explained in [Part 1](https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics), CloudWatch’s ELB-related metrics give you great insight about your load balancers’ health and performance. However, for more precision and granularity on your backend instances’ performance, you should consider monitoring their resources directly. Correlating native metrics from your EC2 instances with ELB metrics will give you a fuller, more precise picture. In [Part 3](https://www.datadoghq.com/blog/monitor-elb-performance-with-datadog), we cover a concrete example of this type of metrics collection and detail how to monitor ELB using Datadog. 144 | 145 |
146 | 147 | ## ELB Access Logs 148 | 149 | ELB access logs capture all the information about every request received by the load balancer, such as a time stamp, client IP address, path, backend response, latency, and so on. It can be useful to investigate the access logs for particular requests in case of issues. 150 | 151 | ### Configuring the access logs 152 | 153 | First, you must enable the access logs feature, which is disabled by default. Logs are stored in an Amazon S3 bucket, which incurs additional storage costs. 154 | 155 | Elastic Load Balancing creates [log files](http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html#access-log-file-format) at user-defined intervals, between 5 and 60 minutes. Every single request received by ELB is logged, including those requests that couldn’t be processed by your backend instances (see [Part 1](https://www.datadoghq.com/blog/top-elb-health-and-performance-metrics) for the different root causes of ELB issues). You can see more details [here](http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html#access-log-entry-format) about the log entry format and the different fields containing information about a request. 156 | 157 | ### Analyzing logs 158 | 159 | ELB access logs can be useful when troubleshooting and investigating specific requests. However, if you want to find and analyze patterns in the overall access log files, you might want to use dedicated log analytics tools, especially if you are dealing with a large amount of traffic that generates a heavy volume of log files. 160 | 161 | ## Conclusion 162 | 163 | In this post we have walked through how to use CloudWatch to collect and visualize ELB metrics, how to generate alerts when these metrics go out of bounds, and how to use access logs for troubleshooting. 164 | 165 | In the [next and final part of this series](https://www.datadoghq.com/blog/monitor-elb-performance-with-datadog), you will learn how you can monitor ELB metrics using the Datadog integration, along with native metrics from your backend instances for a complete view, with a minimum of setup. 166 | -------------------------------------------------------------------------------- /dynamodb/how_medium_monitors_dynamodb_performance.md: -------------------------------------------------------------------------------- 1 | #How Medium Monitors DynamoDB Performance 2 | *This post is the last of a 3-part series on monitoring Amazon DynamoDB. [Part 1](https://www.datadoghq.com/blog/top-dynamodb-performance-metrics) explores its key performance metrics, and [Part 2](https://www.datadoghq.com/blog/how-to-collect-dynamodb-metrics) explains how to collect these metrics.* 3 | 4 | [Medium](https://medium.com/) launched to the public in 2013 and has grown quickly ever since. Growing fast is great for any company, but requires continuous infrastructure scaling—which can be a significant challenge for any engineering team (remember the [fail whale](https://en.wikipedia.org/wiki/Twitter#Outages)?). Anticipating their growth, Medium used DynamoDB as one of its primary data stores, which successfully helped them scale up rapidly. In this article we share with you DynamoDB lessons that Medium learned over the last few years, and discuss the tools they use to monitor DynamoDB and keep it performant.
5 | 6 | ## **Throttling: the primary challenge** 7 | 8 | As explained in [Part 1](https://www.datadoghq.com/blog/top-dynamodb-performance-metrics), throttled requests are the most common cause of high latency in DynamoDB, and can also cause user-facing errors. Properly monitoring requests and provisioned capacity is essential for Medium in order to ensure an optimal user experience. 9 | 10 | ### Simple view of whole-table capacity 11 | 12 | Medium uses Datadog to track the number of reads and writes per second on each of their tables, and to compare the actual usage to provisioned capacity. A snapshot of one of their Datadog graphs is below. As you can see, except for one brief spike their actual usage is well below their capacity. 13 | 14 | [![DynamoDB Read Capacity](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-01.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-01.png) 15 | 16 | ### Invisibly partitioned capacity 17 | 18 | Unfortunately, tracking your remaining whole-database capacity is only the first step toward accurately anticipating throttling. Even though you can provision a specific amount of capacity for a table (or a [Global Secondary Index](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html)), the actual request-throughput limit can be much lower. As described by AWS [here](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions), DynamoDB automatically partitions your tables behind the scenes, and divides their provisioned capacity equally among these smaller partitions. 19 | 20 | That’s not a big issue if your items are accessed uniformly, with each key requested at about the same frequency as others. In this case, your requests will be throttled roughly when you reach your provisioned capacity, as expected. 21 | 22 | However, some elements of a Medium “story” can’t be cached, so when one of them goes viral, some of its assets are requested extremely frequently. These assets have “hot keys,” which create an extremely uneven access pattern. Since Medium’s tables can go up to 1 TB and can require tens of thousands of reads per second, they are highly partitioned. For example, if Medium has provisioned 1000 reads per second for a particular table, and this table is actually split into 10 partitions, then a popular post will be throttled at 100 requests per second at best, even if the other partitions’ allocated throughput is never consumed. 23 | 24 | The challenge is that the AWS console does not expose the number of partitions in a DynamoDB table even though [partitioning is well documented](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions). In order to anticipate throttling of hot keys, Medium calculates the number of partitions it expects per table, using the formula described in the AWS documentation. Then they calculate the throughput limit of each partition by dividing their total provisioned capacity by the expected number of partitions. 25 | 26 | Next, Medium logs each request and feeds the log to an ELK stack ([Elasticsearch](https://www.elastic.co/products/elasticsearch), [Logstash](https://www.elastic.co/products/logstash), and [Kibana](https://github.com/elastic/kibana)) so that they can track the hottest keys. As seen in the snapshot below (bottom chart), one post on Medium is getting more requests per second than the next 17 combined.
If the number of requests per second for that post starts to approach their estimated partitioned limit, they can take action to increase capacity. 27 | 28 | [![Kibana screenshot](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-02.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-02.png) 29 | 30 | Note that since partitioning is automatic and invisible, two “semi-hot” posts could be in the same partition. In that case, they may be throttled even before this strategy would predict. 31 | 32 | [Nathaniel Felsen](https://medium.com/@faitlezen) from Medium describes in detail, [in this post](https://medium.com/medium-eng/how-medium-detects-hotspots-in-dynamodb-using-elasticsearch-logstash-and-kibana-aaa3d6632cfd), how his team tackles the “hot key” issue. 33 | 34 | ### The impact on Medium’s users 35 | 36 | Since it can be difficult to predict when DynamoDB will throttle requests on a partitioned table, Medium also tracks how throttling is affecting its users. 37 | 38 | DynamoDB’s API [automatically retries its queries](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ErrorHandling.html#APIRetries) if they are throttled so the vast majority of Medium’s throttled requests eventually succeed. 39 | 40 | Using Datadog, Medium created the two graphs below. The bottom graph tracks each throttled request “as seen by CloudWatch”. The top graph, “as seen by the apps”, tracks requests that failed, despite retries. Note that there are about two orders of magnitude more throttling events than failed requests. That means retries work, which is good since throttling may only slow down page loads, while failed requests can cause user-facing issues. 41 | 42 | [![Throttling CloudWatch vs. application](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-03.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-03.png) 43 | 44 | In order to track throttling as seen by the app, Medium created a custom throttling metric: Each time that Medium’s application receives an error response from DynamoDB, it checks [the type of error](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ErrorHandling.html). If it’s a **ProvisionedThroughputExceededException**, it increments the custom metric. The metric is reported to Datadog via [DogStatsD](http://docs.datadoghq.com/guides/dogstatsd/), which implements the [StatsD](https://www.datadoghq.com/blog/statsd/) protocol (along with a few extensions for Datadog features). This approach also has the secondary benefit of providing real-time metrics and alerts on user-facing errors, rather than waiting through the slight [delay in information from CloudWatch metrics](http://docs.datadoghq.com/integrations/aws/#metrics-delayed). 45 | 46 | In any event, Medium still has some DynamoDB-throttled requests. To reduce throttling frequency, they use [Redis](https://www.datadoghq.com/blog/how-to-monitor-redis-performance-metrics/) as a cache in front of DynamoDB, which at the same time lowers consumed throughput and cost. 47 | 48 | ## Alerting the right people with the right tool 49 | 50 | [Properly alerting](https://www.datadoghq.com/blog/monitoring-101-alerting/) is essential to resolve issues as quickly as possible and preserve application performance. 
Medium uses Datadog’s alerting features: 51 | 52 | - When a table on “staging” or “development” is impacted, only email notifications are sent to the engineers who are on call, along with [Slack](https://www.datadoghq.com/blog/collaborate-share-track-performance-slack-datadog/) messages. 53 | - When a table on “prod” is impacted, a page alert is also sent via [PagerDuty](https://www.datadoghq.com/blog/end-end-reliability-testing-pagerduty-datadog/). 54 | 55 | Since Datadog alerts can be triggered by any metric (including custom metrics), they set up alerts on their production throttling metrics, which are collected by their application for each table: 56 | 57 | [![Throttling alert](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-04.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-04.png) 58 | 59 | Throttled requests reported by their application failed even after several retries, which means potential user-facing impact. So they send a high-priority alert, set up with the right channels and an adapted message: 60 | 61 | [![Throttling alert configuration](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-05.png)](https://don08600y3gfm.cloudfront.net/ps3b/blog/images/2015-09-dynamodb/3-05.png) 62 | 63 | ## Saving money 64 | 65 | By properly monitoring DynamoDB, Medium’s IT team can scale up easily and avoid most throttled requests to ensure an excellent user experience on their platform. But monitoring also helps them identify when they can scale down. Automatically tuning up and down the provisioned throughput for each table is on their road map and will help to optimize their infrastructure expenses. 66 | 67 | ## Tracking backups 68 | 69 | Medium’s engineering team created a Last\_Backup\_Age custom metric, which they submit to Datadog via statsd. This metric helps Medium ensure that DynamoDB tables are backed up regularly, which reduces the risk of data loss. They graph the evolution of this metric on their Datadog dashboard and trigger an alert if too much time passes between backups. 70 | 71 | ## Acknowledgements 72 | 73 | We want to thank the Medium teams who worked with us to share their monitoring techniques for Amazon DynamoDB. 74 | 75 | If you’re using DynamoDB and Datadog already, we hope that these strategies will help you gain improved visibility into what’s happening in your databases. If you don’t yet have a Datadog account, you can [start tracking](http://docs.datadoghq.com/integrations/aws/) DynamoDB performance today with a [free trial](https://app.datadoghq.com/signup). 76 | -------------------------------------------------------------------------------- /azure-sql-database/azure-sql-database-monitoring-datadog.md: -------------------------------------------------------------------------------- 1 | # Monitor Azure SQL databases with Datadog 2 | 3 | In [Part 2][p2-post] of this series, we showed you how to monitor Azure SQL Database metrics and logs using the Azure platform. In this post, we will look at how you can use Datadog to monitor your Azure SQL databases alongside other technologies in your infrastructure. Datadog provides turn-key integrations for Azure along with more than {{< translate key="integration_count" >}} other technologies, enabling you to track long-term performance trends across all systems in your infrastructure, not just your SQL databases.
4 | 5 | We will walk through how to: 6 | 7 | - [Collect Azure telemetry data](#integrate-azure-with-datadog) via Datadog's Azure integration 8 | - [Visualize](#get-full-visibility-into-your-azure-sql-databases) key performance metrics 9 | - [Analyze and alert on database logs](#proactively-monitor-database-queries-with-log-analytics-and-alerts) to get a better understanding of database performance and activity 10 | - [Use audit logs](#surface-potential-threats-to-your-sql-databases) to surface potential threats to your SQL databases 11 | 12 | ## Integrate Azure with Datadog 13 | Datadog's [Azure integration](https://docs.datadoghq.com/integrations/azure/) enables you to easily forward telemetry data from your database instances and elastic pools to Datadog in order to monitor key metrics, analyze database performance, and alert on potentially malicious activity on your database instances. The integration collects metrics from [Azure Monitor](https://docs.microsoft.com/en-us/azure/azure-monitor/overview) and enables Datadog to generate additional metrics that give you further visibility into resource limits and quotas, the state of your databases' [geo-replication links](https://docs.microsoft.com/en-us/azure/azure-sql/database/active-geo-replication-overview), and more. 14 | 15 | You can integrate Datadog with your Azure account using either the [Azure CLI](https://docs.datadoghq.com/integrations/azure/?tab=azurecliv20#installation) or [Azure Portal](https://docs.datadoghq.com/integrations/azure/?tab=azurecliv20#integrating-through-the-azure-portal). In either case, this process will generate client and tenant IDs and a client secret, which are required for creating a new app registration in [Datadog's Azure integration tile](https://app.datadoghq.com/account/settings#integrations/azure). 16 | 17 | Once you create a new app registration, Datadog will start collecting metrics from all available Azure resources, such as your virtual machines, app services, load balancers, and SQL database instances. We'll look at how you can visualize and alert on key database performance metrics [in more detail later](#get-full-visibility-into-your-azure-sql-databases). Next, we'll show you how to export diagnostic and audit logs from your Azure SQL databases to Datadog. 18 | 19 | ### Export database diagnostic and audit logs 20 | Azure SQL Database instances generate diagnostic and audit logs, which you can collect with Datadog to give you better visibility into database activity that could affect performance or pose a security risk. In Part 2, we looked at [how to configure your SQL databases][p2-logs] to write these logs to an event hub, which is the recommended method for forwarding logs to a third-party service for analysis. To export database resource and audit logs from an event hub to Datadog, you can run an [automated script](https://docs.datadoghq.com/integrations/azure/?tab=azurecliv20#sending-activity-logs-from-azure-to-datadog) via the cloud shell in Azure Portal, making sure to substitute your [Datadog API key](https://app.datadoghq.com/organization-settings/api-keys) and your Azure subscription ID where the script calls for them. 21 | 22 | Datadog automatically parses and enriches incoming database logs via a built-in Azure SQL [log pipeline](https://docs.datadoghq.com/logs/log_configuration/pipelines/?tab=source#integration-pipelines), enabling you to search them by key attributes, such as a database's name or availability zone.
23 | 24 | Now that you are collecting Azure telemetry data, you can use Datadog to monitor the key performance metrics we discussed in [Part 1][p1-metrics], [create visualizations](#proactively-monitor-database-queries-with-log-analytics-and-alerts) of your log data to analyze database activity, and [build custom threat detection rules](#surface-potential-threats-to-your-sql-databases) to proactively monitor database security. 25 | 26 | ## Get full visibility into your Azure SQL databases 27 | Datadog provides full visibility into the health and performance of your databases via an out-of-the-box integration dashboard. The dashboard gives you a high-level overview of all of your database instances and elastic pools and includes some of the key metrics we discussed in [Part 1][p1-metrics], such as CPU utilization and deadlocks. 28 | 29 | 30 | {{< img src="azure-sql-database-dashboard.png" alt="Datadog's out-of-the-box Azure SQL Database dashboard" border="true" popup="true">}} 31 | 32 | You can also create custom dashboards to track database performance alongside other Azure services and technologies in your stack. This can provide better visibility into how well your databases are performing in relation to the services they support. For example, you can use the dashboard to determine whether a sudden increase in database CPU utilization was caused by an increase in the number of incoming requests to an application server. If there is no correlated spike in server requests, then an inefficient query may be to blame. You can then review your database logs, which provide more details about query performance, to troubleshoot further. 33 | 34 | ## Proactively monitor database queries with Log Analytics and alerts 35 | It's important to always be aware of the status and performance of your database instances, but that can become more difficult as you scale your databases to support growing applications. You can use Datadog Log Analytics to surface performance issues that are easily missed, such as long-running queries. Inefficient queries are often the primary cause of poor database performance because they can quickly consume resources and trigger deadlocks. 36 | 37 | 38 | {{< img src="azure-sql-database-log-analytics.png" alt="Azure SQL Database log analytics" border="true" popup="true" caption="Review the top SQL database queries that took the longest amount of time to execute.">}} 39 | 40 | You can also [export any log query](https://docs.datadoghq.com/logs/explorer/export/) to create custom alerts that automatically notify you of a decline in database performance, such as when a query's duration exceeds a specified threshold. 41 | 42 | 43 | {{< img src="azure-sql-database-monitor.png" alt="Azure SQL Database alert" border="true" popup="true">}} 44 | 45 | Analyzing and alerting on your SQL Database logs enables you to prioritize the queries that you should optimize. Azure provides [several recommendations](https://docs.microsoft.com/en-us/azure/azure-sql/database/database-advisor-implement-performance-recommendations#performance-recommendation-options) for optimizing query performance, such as dropping indexes that are no longer used. 46 | 47 | ## Surface potential threats to your SQL databases 48 | Monitoring database performance is one aspect of ensuring that your database instances are able to continue supporting your application. It's also important to make sure that databases are secure, as they store valuable application and customer data, which makes them a primary target for attackers.
As mentioned in [Part 2][p2-auditing], Azure SQL Database can generate audit logs that contain key information about database activity, such as who is connecting to an instance, what queries they ran, and whether they accessed sensitive information. If you've enabled [auditing][p2-auditing] for Azure SQL databases, you can use Datadog to [collect these logs](#export-database-diagnostic-and-audit-logs) and surface potential threats to your database instances with [Datadog Security Monitoring](https://docs.datadoghq.com/security_platform/security_monitoring/). 49 | 50 | For example, Datadog's built-in detection rules can scan your audit logs and notify you when a firewall rule for a database has been modified to allow connections from unauthorized sources. 51 | 52 | {{< img src="azure-sql-database-security-rule2.png" alt="Azure SQL Database security rule" border="true" popup="true">}} 53 | 54 | You can also use rules to detect when a source attempts to execute a SQL query as part of an application's input (e.g., a username or password field), which could indicate that the application is vulnerable to [SQL injection attacks](https://docs.microsoft.com/en-us/sql/relational-databases/security/sql-injection?view=sql-server-ver15). This type of activity enables attackers to manipulate SQL queries in order to tamper with a database or gain access to sensitive information. 55 | 56 | When a detection rule flags an incoming log, Datadog will generate a [security signal](https://www.datadoghq.com/blog/announcing-security-monitoring/#correlate-and-triage-security-signals) that provides more context about the activity. You can use signals to share mitigation steps that help teams troubleshoot and resolve the issue faster, such as leveraging [stored procedures](https://cheatsheetseries.owasp.org/cheatsheets/SQL_Injection_Prevention_Cheat_Sheet.html) for a service's SQL queries or [sanitizing](https://docs.microsoft.com/en-us/azure/security/develop/threat-modeling-tool-input-validation) application inputs. 57 | 58 | ## Monitor Azure SQL databases with Datadog 59 | In this post, we've shown how to collect telemetry data from your Azure SQL databases and get full visibility into their health, performance, and security, alongside the other technologies supporting your applications. Check out our [documentation](https://docs.datadoghq.com/integrations/azure/) to learn more about Datadog's Azure integration and start collecting data from your SQL databases, or sign up for a [free trial](https://app.datadoghq.com/signup) today. 60 | 61 | [p2-post]: /blog/azure-sql-analytics-for-azure-sql-database-monitoring/ 62 | [p2-logs]: /blog/azure-sql-analytics-for-azure-sql-database-monitoring/#collect-azure-sql-database-metrics-and-logs 63 | [p1-metrics]: /blog/key-metrics-for-monitoring-azure-sql-database/ 64 | [p2-auditing]: /blog/azure-sql-analytics-for-azure-sql-database-monitoring/#review-audit-logs-in-log-analytics 65 | --------------------------------------------------------------------------------