Hi friends, I recently passed the AWS Certified Solutions Architect – Associate Exam. Woo Hoo! Although this was a cool accomplishment, I feel that I barely have my toe in the water when it comes to knowledge within AWS, so I have to keep going! I am knowledge hungry! After tossing a coin to choose my next AWS exam, Routing specialty or SysOps, the AWS SysOps Exam won out. Here is the first of a series of study sheets for the SysOps.
AWS CloudWatch comes up first in the Monitoring section. As Amazon puts it: “Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.”
CloudWatch is a metrics repository: AWS services place metrics in the repository, and you, the AWS user, view statistics from those metrics. Custom metrics are supported.
A namespace is a container for CloudWatch metrics. Metrics in separate namespaces are isolated from one another.
A metric represents a time-ordered set of data points that are published to CloudWatch. Each data point must be marked with a timestamp.
The timestamp can be up to two weeks in the past and up to two hours in the future. Timestamps are based on the current time in UTC.
CloudWatch retains metric data:
- Data points with a period of 60 seconds (1 min) are available for 15 days.
- Data points with a period of 300 seconds (5 min) are available for 63 days.
- Data points with a period of 3600 seconds (1 hr) are available for 455 days (15 months).
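As a quick way to remember these tiers, here is a minimal Python sketch that answers "which periods can I still query for data of a given age?". The table and the function name `available_periods` are my own illustration, not an AWS API, and the numbers are just the three tiers listed above.

```python
# Retention tiers from the list above: period in seconds -> days retained.
RETENTION_DAYS = {60: 15, 300: 63, 3600: 455}

def available_periods(age_days):
    """Return the periods (in seconds) that still hold data of this age."""
    return [period for period, days in sorted(RETENTION_DAYS.items())
            if age_days <= days]

print(available_periods(10))   # [60, 300, 3600]
print(available_periods(30))   # [300, 3600]
print(available_periods(100))  # [3600]
```

So data from a month ago can no longer be viewed at one-minute resolution, only at five minutes or coarser.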
CloudWatch also supports the concept of alarms, derived from the metrics in the repository. “An alarm watches a single metric over time; and performs one or more actions, based on the value of a metric threshold over time.” Alarms trigger actions when a metric stays in a specific state for a sustained period of time.
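To make the "sustained period of time" idea concrete, here is a simplified Python sketch of how an alarm could be evaluated. It is not how CloudWatch is implemented: the function `alarm_state` and its parameters are hypothetical, and it only handles a "greater than threshold" comparison over the most recent data points.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return a simplified alarm state for a single metric.

    datapoints: metric values, oldest first, one per period.
    The alarm fires only if the last `evaluation_periods` values
    are all above the threshold (a sustained breach).
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"

print(alarm_state([40, 85, 91, 95], threshold=80, evaluation_periods=3))  # ALARM
print(alarm_state([40, 85, 91, 70], threshold=80, evaluation_periods=3))  # OK
```

A single spike does not trip the alarm in this sketch; the breach has to last for every one of the evaluated periods.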
Dimensions are name/value pairs that identify a metric. AWS services that send data to CloudWatch attach dimensions to each metric. Dimensions are used to filter results. For example, you can get stats for a single EC2 instance by filtering on the ‘InstanceId’ dimension. You can assign up to ten dimensions to a metric.
Statistics are metric data aggregations over time. CloudWatch provides statistics based on the metric data points provided by custom data or by other services in AWS. Aggregations are made using the namespace, metric name, dimensions and the data point unit of measure, within the time period you specify. [ Minimum, Maximum, Sum, Average ] Each statistic has a unit of measure. If you do not specify one, CloudWatch uses None as the unit.
Period is the length of time associated with a specific AWS CloudWatch statistic.
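Here is a small Python sketch of how statistics and periods fit together: raw data points are grouped into period-sized buckets, and each bucket gets its own aggregate values. The `aggregate` function and the sample data are made up for illustration; timestamps are plain seconds to keep it short.

```python
from collections import defaultdict

def aggregate(points, period):
    """Group (timestamp_seconds, value) pairs into buckets of `period`
    seconds and compute the four statistics for each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % period].append(value)  # key = start of the period
    return {
        start: {
            "Minimum": min(vals),
            "Maximum": max(vals),
            "Sum": sum(vals),
            "Average": sum(vals) / len(vals),
        }
        for start, vals in sorted(buckets.items())
    }

# Three one-minute points fall in the first 5-minute period, one in the next.
points = [(0, 10.0), (60, 20.0), (120, 30.0), (300, 50.0)]
stats = aggregate(points, period=300)
print(stats[0]["Average"])  # 20.0
print(stats[300]["Sum"])    # 50.0
```

Choosing a longer period gives fewer, smoother values; a shorter period gives more detail.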
CloudWatch limits and free tier:
- Alarms: 10 per month per customer free; limit of 5,000 per region per account.
- API requests: 1,000,000 per month per customer free.
- Dimensions: 10 per metric.
- Metric data: retained for 15 months.
The 4 standard [ default ] CloudWatch metrics for EC2 are:
- CPU, Disk, Network and Status Checks
Memory metrics are non-standard / non-default on CloudWatch; they have to be published as custom metrics.
Two types of status checks:
- System Status Checks [ for the underlying physical host ] [ stop/start the VM to resolve ]
- Instance Status Checks [ for the actual VM ] [ reboot the instance to resolve ]
EBS Monitoring on CloudWatch
Two types of monitoring for EBS:
- Basic: 5-minute periods at no charge. This includes data for the root device volumes for EBS-backed instances.
- Detailed: Provisioned IOPS SSD (io1) volumes automatically send one-minute metrics to CloudWatch.
EBS sends several metrics to CloudWatch for these storage types:
- Amazon EBS General Purpose SSD (gp2), Throughput Optimized HDD (st1), and Cold HDD (sc1) volumes automatically send five-minute metrics to CloudWatch. # the () denotes the API name
- Magnetic (standard) volumes automatically send five-minute metrics to CloudWatch.
- Provisioned IOPS SSD (io1) volumes automatically send one-minute metrics to CloudWatch. # SUPER fast, high IOPS!
Specific EBS metric names are listed in the AWS documentation, with special emphasis on VolumeQueueLength: “The number of read and write operation requests waiting to be completed in a specified period of time”. If this number keeps growing, your disk may need more IOPS.
Two volume status values to which you should pay attention:
- warning means: “Degraded (Volume performance is below expectations) Severely Degraded (Volume performance is well below expectations)”
- impaired means: “Stalled (Volume performance is severely impacted) Not Available (Unable to determine I/O performance because I/O is disabled)” Your volume is basically hosed!
The EBS BurstBalance (percent) metric is described in the AWS documentation; here are my notes:
- General Purpose SSD (gp2) EBS volumes have a base of 3 IOPS per GiB of volume size, a max volume size of 16,384 GiB, and a max of 10,000 IOPS per volume [ if you need more than this, you need to move to a Provisioned IOPS SSD (io1) ]
Cloud Architect Dariusz Dwornikowski describes the I/O credit concept for burst balance very well in his blog: “think of I/O credits as of money a disk needs to spend to buy I/O operations (read or write). Each such operation costs 1 I/O credit. When you create a disk it is assigned given an initial credit of 5.4 million I/O credits. Now these credits are enough to sustain a burst of highly intensive I/O operations at the maximum rate of 3000 IOPS (I/O per second) for 30 minutes. When the balance is drained, we are left with an EBS that is totally non-responsive.”
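The arithmetic behind that quote is easy to check. Below is a rough Python sketch using only the numbers from my notes (3 IOPS per GiB baseline, 5.4 million initial credits, 3000 IOPS burst). The function names are my own, and the "with refill" version assumes credits are earned back at the baseline rate while the volume bursts; it also ignores any minimum baseline AWS may apply to small volumes.

```python
INITIAL_CREDITS = 5_400_000  # initial I/O credit balance from the quote
BURST_IOPS = 3000            # maximum burst rate from the quote
BASELINE_IOPS_PER_GIB = 3    # gp2 baseline from the notes above

def baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return BASELINE_IOPS_PER_GIB * size_gib

def burst_seconds_no_refill(credits=INITIAL_CREDITS, burst_iops=BURST_IOPS):
    """How long a full bucket lasts if no credits are earned back."""
    return credits / burst_iops

def burst_seconds_with_refill(size_gib, credits=INITIAL_CREDITS,
                              burst_iops=BURST_IOPS):
    """Same, but assuming credits refill at the baseline rate (assumption)."""
    drain = burst_iops - baseline_iops(size_gib)
    if drain <= 0:
        return float("inf")  # baseline already meets the burst rate
    return credits / drain

print(burst_seconds_no_refill() / 60)  # 30.0 minutes, matching the quote
print(burst_seconds_with_refill(100))  # 2000.0 seconds for a 100 GiB volume
```

The first result reproduces the 30 minutes in the quote; the second shows why larger volumes can sustain a burst a bit longer.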
Pre-Warming EBS – Initializing a volume restored from a snapshot by reading all of its blocks before you use it, for best performance.
- Per Amazon: “Amazon Relational Database Service sends metrics to CloudWatch for each active database instance every minute. Detailed monitoring is enabled by default.”
- In RDS itself, you monitor RDS by EVENTS
- In CloudWatch you monitor RDS by Metrics
Two metrics to which you should pay close attention in RDS:
- ReplicaLag: “The amount of time a Read Replica DB instance lags behind the source DB instance” [ MySQL, MariaDB, PostgreSQL ]
- DiskQueueDepth: “The number of outstanding IOs (read/write requests) waiting to access the disk”
- ELB reports metrics only when there is traffic. Or as Amazon puts it: “If there are requests flowing through the load balancer, Elastic Load Balancing measures and sends its metrics in 60-second intervals. If there are no requests flowing through the load balancer or no data for a metric, the metric is not reported.”
- HealthyHostCount metric: “The number of healthy instances registered with your load balancer.”
- Other useful counters are the stats that backend pool members send:
The ElastiCache guide “Which Metrics Should I Monitor?” is the source for the information below:
- Metrics for Memcached:
- CPUUtilization – This is a host-level metric reported as a percent. Since Memcached is multi-threaded, this metric can be as high as 90%. If you exceed this threshold, scale your cache cluster up by using a larger cache node type, or scale out by adding more cache nodes.
- SwapUsage: This metric should not exceed 50 MB. If it does, we recommend that you increase the ConnectionOverhead parameter value.
- Metrics for Redis:
- CPUUtilization – Because Redis is single-threaded, the threshold is calculated as (90 / number of processor cores). For example, suppose you are using a cache.m1.xlarge node, which has four cores. In this case, the threshold for CPUUtilization would be (90 / 4), or 22.5%.
- SwapUsage: No recommended setting with Redis; you can only scale out.
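The Redis CPUUtilization rule above is just a division, but it is easy to get backwards on exam day, so here is a tiny Python sketch. The function name `redis_cpu_threshold` is mine, not something from AWS.

```python
def redis_cpu_threshold(cores, single_core_limit=90.0):
    """Host-level CPUUtilization alarm threshold for a Redis node.

    Redis uses one core, so 90% of one core shows up as
    (90 / cores) percent of the whole host.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return single_core_limit / cores

print(redis_cpu_threshold(4))  # 22.5, the cache.m1.xlarge example
```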
Evictions: This is a cache engine metric, published for both Memcached and Redis cache clusters. We recommend that you determine your own alarm threshold for this metric based on your application needs.
- Memcached: If you exceed your chosen threshold, scale your cache cluster up by using a larger node type, or scale out by adding more nodes.
- Redis: If you exceed your chosen threshold, scale your cluster up by using a larger node type.