AWS Certified SysOps Administrator – Associate: Study Sheet – Monitoring Section

Hi friends, I recently passed the AWS Certified Solutions Architect – Associate Exam. Woo Hoo! Although this was a cool accomplishment, I feel that I barely have my toe in the water when it comes to AWS knowledge, so I have to keep going! I am knowledge hungry! After tossing a coin to choose my next AWS exam, Routing specialty or SysOps, the SysOps exam won out. Here is the first of a series of study sheets for the SysOps.

AWS CloudWatch comes up first in the Monitoring section. As Amazon puts it: “Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.”

CloudWatch is a metrics repository: AWS services place metrics in the repository, and you, the AWS user, view statistics based on those metrics. Custom metrics are supported.

A namespace is a container for CloudWatch metrics. Metrics in separate namespaces are isolated from one another.

A metric represents a time-ordered set of data points that are published to CloudWatch. Each data point must be marked with a timestamp.

The timestamp can be up to two weeks in the past and up to two hours in the future. Timestamps are based on the current time in UTC.

CloudWatch retains metric data as follows:

  • Data points with a period of 60 seconds (1 minute) are available for 15 days
  • Data points with a period of 300 seconds (5 minutes) are available for 63 days
  • Data points with a period of 3600 seconds (1 hour) are available for 455 days (15 months)
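To make the retention schedule above stick, here is a minimal Python sketch that answers "how fine-grained can my query be for data this old?". The function and constant names are my own, not a CloudWatch API, and the sketch assumes older data is only available at the coarser periods listed above.

```python
# Retention schedule from the notes above: (period in seconds, days available).
RETENTION_SCHEDULE = [(60, 15), (300, 63), (3600, 455)]


def finest_period_seconds(age_days):
    """Return the smallest period still available for data of the given age.

    Returns None when the data is older than the longest retention window.
    """
    for period, days_available in RETENTION_SCHEDULE:
        if age_days <= days_available:
            return period
    return None


if __name__ == "__main__":
    for age in (1, 30, 200, 500):
        print(age, "days old ->", finest_period_seconds(age))
```

So a graph covering last week can still use one-minute data points, while a graph covering last quarter has to settle for one-hour data points.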

CloudWatch also supports the concept of alarms, derived from the metrics in the repository. “An alarm watches a single metric over time; and performs one or more actions, based on the value of a metric threshold over time.” Alarms invoke actions only for sustained state changes; the metric has to be in a specific state for a sustained period of time.
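The "sustained period of time" idea is easier to see in code. This is a simplified sketch of the evaluation logic, not the CloudWatch implementation: the function name and parameters are hypothetical, and I assume a simple "greater than threshold" comparison. The three state names (OK, ALARM, INSUFFICIENT_DATA) are the ones CloudWatch uses.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a simplified alarm over the most recent data points.

    The alarm fires only when every one of the last `evaluation_periods`
    data points is above the threshold; a single spike is not enough.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    cpu = [50, 85, 90, 95]  # CPU percent, one value per period
    print(alarm_state(cpu, threshold=80, evaluation_periods=3))
```

With three evaluation periods, one hot data point followed by a cool one keeps the alarm in OK; only three breaches in a row flip it to ALARM.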

Dimensions are name/value pairs that identify a metric. AWS services that send data to CloudWatch attach dimensions to each metric. Dimensions are used to filter results; for example, you can get stats for a specific EC2 instance by specifying the ‘InstanceId’ dimension. You can assign up to ten dimensions to a metric.

Statistics are metric data aggregations over a specified period of time. CloudWatch provides statistics based on the metric data points supplied by your custom data or by other AWS services. Aggregations are made using the namespace, metric name, dimensions, and the data point unit of measure, within the time period you specify. The available statistics are Minimum, Maximum, Sum, Average, and SampleCount. Each statistic has a unit of measure. If you do not specify one, CloudWatch uses None as the unit.

Period is the length of time associated with a specific AWS CloudWatch statistic.
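Here is a small Python sketch of how statistics and periods fit together: raw data points are grouped into period-sized buckets, and each bucket is aggregated. This is my own illustration of the concept, not how you call CloudWatch; the function name and the sample values are made up.

```python
from collections import defaultdict


def statistics_by_period(datapoints, period_seconds):
    """Group (epoch_seconds, value) pairs into periods and aggregate each one."""
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        # Align each data point to the start of the period it falls in.
        period_start = timestamp - (timestamp % period_seconds)
        buckets[period_start].append(value)
    return {
        start: {
            "Minimum": min(values),
            "Maximum": max(values),
            "Sum": sum(values),
            "Average": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Four one-minute samples aggregated with a 300-second (5-minute) period.
samples = [(0, 10.0), (60, 30.0), (120, 20.0), (300, 40.0)]
stats = statistics_by_period(samples, 300)

if __name__ == "__main__":
    for start, values in stats.items():
        print(start, values)
```

The first three samples land in the same five-minute bucket and the fourth starts a new one, which is why a longer period gives you fewer, smoother data points.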

CloudWatch Limits:

  • Alarms: 10 per month per customer free; limit of 5,000 per region per account
  • API requests: 1,000,000 per month per customer free
  • Dimensions: 10 per metric
  • Metric Data: 15 months

4 Standard [ default ] CloudWatch Metrics for EC2 are:

  • CPU, Disk, Network and Status Checks

Memory metrics are NON-standard / non-default on CloudWatch; you have to publish them yourself as custom metrics.

Two types of status checks:

  • System Status Checks [ for the underlying physical host ] [ stop / start the VM to resolve, which moves it to a new host ]
  • Instance Status Checks [ for the actual VM ] [ reboot the instance to resolve ]

EBS Monitoring on CloudWatch

Two types of Monitoring for EBS

  • Basic: 5-minute periods at no charge. This includes data for the root device volumes for EBS-backed instances.
  • Detailed: Provisioned IOPS SSD (io1) volumes automatically send one-minute metrics to CloudWatch

EBS sends several metrics to CloudWatch for these storage types (the parentheses denote the API name):

  • General Purpose SSD (gp2) volumes automatically send five-minute metrics to CloudWatch
  • Throughput Optimized HDD (st1) volumes automatically send five-minute metrics to CloudWatch
  • Cold HDD (sc1) volumes automatically send five-minute metrics to CloudWatch
  • Magnetic (standard) volumes automatically send five-minute metrics to CloudWatch
  • Provisioned IOPS SSD (io1) volumes automatically send one-minute metrics to CloudWatch # SUPER fast, high IOPS!

Specific EBS metric names are listed in the AWS documentation; pay special attention to VolumeQueueLength: “The number of read and write operation requests waiting to be completed in a specified period of time.” If this value keeps climbing, your volume may need more IOPS.

Two volume status check states to which you should pay attention:

  • warning means: “Degraded (Volume performance is below expectations)” or “Severely Degraded (Volume performance is well below expectations)”
  • impaired means: “Stalled (Volume performance is severely impacted)” or “Not Available (Unable to determine I/O performance because I/O is disabled)”. Your volume is basically hosed!

Burst Balance

The EBS Burst Balance percent metric is described in the AWS documentation; here are my notes:

  • General Purpose SSD (gp2) EBS volumes have a baseline of 3 IOPS per GiB of volume size, a maximum volume size of 16,384 GiB, and a maximum of 10,000 IOPS [ if you need more than this, you need to move to a Provisioned IOPS SSD (io1) ]

Cloud Architect Dariusz Dwornikowski describes the I/O credit concept for burst balance very well in his blog: “Think of I/O credits as of money a disk needs to spend to buy I/O operations (read or write). Each such operation costs 1 I/O credit. When you create a disk it is assigned an initial credit of 5.4 million I/O credits. Now these credits are enough to sustain a burst of highly intensive I/O operations at the maximum rate of 3000 IOPS (I/O per second) for 30 minutes. When the balance is drained, we are left with an EBS that is totally non-responsive.”
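The numbers in these notes can be checked with a bit of arithmetic. The sketch below uses only the figures quoted above (3 IOPS per GiB, a 10,000 IOPS ceiling, 5.4 million initial credits, 3000 IOPS burst). The function names are mine, and the optional `refill_iops` parameter is an assumption on my part: it models credits being earned back at the baseline rate while you burst.

```python
BASE_IOPS_PER_GIB = 3
MAX_IOPS = 10_000            # ceiling quoted in the notes above
INITIAL_CREDITS = 5_400_000  # initial I/O credit balance
BURST_IOPS = 3_000           # maximum burst rate


def gp2_baseline_iops(volume_gib):
    """Baseline IOPS for a gp2 volume: 3 per GiB, capped at the ceiling."""
    return min(BASE_IOPS_PER_GIB * volume_gib, MAX_IOPS)


def burst_duration_seconds(credits=INITIAL_CREDITS,
                           demand_iops=BURST_IOPS,
                           refill_iops=0):
    """Seconds until the credit balance is drained at a sustained demand."""
    drain_rate = demand_iops - refill_iops
    if drain_rate <= 0:
        return float("inf")  # credits refill at least as fast as they are spent
    return credits / drain_rate


if __name__ == "__main__":
    print(burst_duration_seconds() / 60, "minutes with no refill")
    refill = gp2_baseline_iops(100)  # a 100 GiB volume
    print(burst_duration_seconds(refill_iops=refill) / 60, "minutes with refill")
```

Ignoring refill, 5.4 million credits spent at 3000 per second last 1800 seconds, which is the 30 minutes from the quote. A larger volume earns credits back faster, so its burst lasts a little longer.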

Pre-Warming EBS – initialize a volume restored from a snapshot by reading all of its blocks before you use it, for best performance.

RDS Monitoring 

  • Per Amazon: “Amazon Relational Database Service sends metrics to CloudWatch for each active database instance every minute. Detailed monitoring is enabled by default.”
  • In RDS itself, you monitor RDS by EVENTS
  • In CloudWatch you monitor RDS by Metrics 

Two metrics to which you should pay close attention in RDS:

  • ReplicaLag: “The amount of time a Read Replica DB instance lags behind the source DB instance” [ MySQL, MariaDB, PostgreSQL ]
  • DiskQueueDepth: “The number of outstanding IOs (read/write requests) waiting to access the disk”

Elastic Load Balancer Metrics for CloudWatch

  • ELB reports metrics only when there is traffic. Or as Amazon puts it: “If there are requests flowing through the load balancer, Elastic Load Balancing measures and sends its metrics in 60-second intervals. If there are no requests flowing through the load balancer or no data for a metric, the metric is not reported.”
  • HealthyHostCount metric: “The number of healthy instances registered with your load balancer.”
  • Other useful metrics count the HTTP response codes returned by back-end instances: HTTPCode_Backend_2XX, HTTPCode_Backend_3XX, HTTPCode_Backend_4XX, HTTPCode_Backend_5XX

ElastiCache Monitoring on CloudWatch

The AWS documentation page “Which Metrics Should I Monitor?” is the source for the information below:

Metrics for Memcached:

  • CPUUtilization: This is a host-level metric reported as a percent. For more information, see Host-Level Metrics. Since Memcached is multi-threaded, this metric can be as high as 90%. If you exceed this threshold, scale your cache cluster up by using a larger cache node type, or scale out by adding more cache nodes.
  • SwapUsage: This metric should not exceed 50 MB. If it does, we recommend that you increase the ConnectionOverhead parameter value.

Metrics for Redis:

  • CPUUtilization: Since Redis is single-threaded, the threshold is calculated as (90 / number of processor cores). For example, suppose you are using a cache.m1.xlarge node, which has four cores. In this case, the threshold for CPUUtilization would be (90 / 4), or 22.5%.
  • SwapUsage: No recommended setting with Redis; you can only scale out
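The Redis CPUUtilization rule above is simple arithmetic, so here it is as a tiny Python helper. The function name is my own; the 90% figure and the four-core example come straight from the notes above.

```python
def redis_cpu_threshold(cores, multi_threaded_threshold=90.0):
    """CPUUtilization alarm threshold for a single-threaded Redis node.

    Redis can only keep one core busy, so the host-level threshold is the
    90% figure divided by the number of cores on the node.
    """
    if cores < 1:
        raise ValueError("a node has at least one core")
    return multi_threaded_threshold / cores


if __name__ == "__main__":
    # The four-core cache.m1.xlarge example from the notes.
    print(redis_cpu_threshold(4))
```

The more cores a Redis node has, the lower the host-level CPU number at which you should start to worry.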

       

Evictions: This is a cache engine metric, published for both Memcached and Redis cache clusters. We recommend that you determine your own alarm threshold for this metric based on your application needs.

  • Memcached: If you exceed your chosen threshold, scale your cache cluster up by using a larger node type, or scale out by adding more nodes.
  • Redis: If you exceed your chosen threshold, scale your cluster up by using a larger node type.

Amazon AWS Certified Solutions Architect SWF / SQS Study Sheet

Simple WorkFlow Service – SWF

A web service to coordinate work across distributed application components [ human tasks outside of the process can be included as well ]. Tasks represent invocations of logical steps in applications.

SWF Task is assigned once, never duplicated.

SWF Tasks can be stored for up to one year

SWF keeps track of all tasks in an application

SWF ACTORS

  • Workflow Starters: an application or event that kicks off a workflow
  • Workflow Deciders: control the flow of activity based on the outcome of task state
  • Activity Workers: programs that interact with SWF to get tasks, process them, and return results

Simple Queue Service – SQS

SQS is a web service that gives access to message queues that can be used to store messages while they are waiting to be processed.

SQS is a distributed queue system that enables applications to queue messages that one part of an app generates, to be consumed by another [ de-coupled ] part of that application.

De-couple application components so they can run independently; SQS acts as a buffer between components.

SQS is “pull based”, meaning instances poll it and ask for work.

Messages can be up to 256 KB in size [ and are billed in 64 KB chunks ]

Messages can be stored in SQS for:

  • as little as 1 min
  • DEFAULT of 4 days
  • up to 14 days

For SQS STANDARD QUEUE: VisibilityTimeOut is the amount of time that a message is “invisible” in the SQS queue after an EC2 instance (or other reading software) retrieves that message.

  • If the job is processed BEFORE the VisibilityTimeOut expires, the consumer deletes the message from the queue
  • If the job is not processed within the VisibilityTimeOut, the message becomes “visible” again and another EC2 instance will pull it, possibly resulting in the same message being delivered twice.
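The visibility timeout behavior is easiest to understand by watching one message go through it. Below is a toy in-memory queue in Python. It is only a sketch of the concept: `TinyQueue` and its methods are my own invention, not the SQS API, and time is passed in explicitly so the example is deterministic.

```python
class TinyQueue:
    """In-memory stand-in for a standard queue (not the real SQS API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, time it becomes visible]
        self._next_id = 0

    def send(self, body):
        self._messages[self._next_id] = [body, 0.0]
        self._next_id += 1

    def receive(self, now):
        """Return (id, body) of a visible message and hide it, or None."""
        for message_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def delete(self, message_id):
        """The consumer calls this after it has finished the job."""
        self._messages.pop(message_id, None)


queue = TinyQueue(visibility_timeout=30)
queue.send("resize image 42")
first = queue.receive(now=0)         # worker A takes the message
hidden = queue.receive(now=10)       # worker B sees nothing: still invisible
redelivered = queue.receive(now=31)  # A never deleted it, so B gets it too
queue.delete(redelivered[0])         # B finishes the job and deletes it
empty = queue.receive(now=100)       # the message is gone for good
```

Worker A and worker B both received the same message, which is exactly the "delivered twice" case described above. The fix in real life is to set the timeout longer than your job takes, and to make the job safe to run twice.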

VisibilityTimeOut MAX is 12 hours 

SQS [ Standard Queue ] will guarantee a message is delivered at least once.

  • but will NOT guarantee message order
  • but will NOT guarantee message is ONLY delivered once ( e.g. could be delivered twice )

Long Polling vs. Short Polling: In almost all cases, Amazon SQS long polling is preferable to short polling. Long-polling requests let your queue consumers receive messages as soon as they arrive in your queue while reducing the number of empty ReceiveMessageResponse instances returned.

Long-Polling does not return a response until a message is in the queue or the wait time (up to 20 seconds) expires. [ This saves money, because you are not repeatedly polling an empty queue ]

Short-Polling returns immediately, even if the queue is empty.
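To see why long polling saves money, here is a rough Python simulation that counts empty responses for the same traffic under both styles. Everything in it is a simplifying assumption of mine: time moves in whole seconds, each receive returns at most one message, and the consumer pauses one second between calls. The 20-second wait is the maximum SQS allows for long polling.

```python
def count_empty_receives(arrivals, wait_seconds, horizon, poll_interval=1):
    """Count receive calls that come back empty over `horizon` seconds.

    arrivals      -- times (in seconds) at which messages reach the queue
    wait_seconds  -- 0 for short polling, up to 20 for long polling
    """
    pending = sorted(arrivals)
    now = 0
    empty = 0
    while now < horizon:
        deadline = now + wait_seconds
        if pending and pending[0] <= deadline:
            # A message is already waiting or arrives before the wait expires;
            # the call returns as soon as the message is there.
            now = max(now, pending.pop(0)) + poll_interval
        else:
            # Nothing showed up in time: an empty (but still billed) response.
            empty += 1
            now = deadline + poll_interval
    return empty


short_poll_empties = count_empty_receives([5, 25], wait_seconds=0, horizon=30)
long_poll_empties = count_empty_receives([5, 25], wait_seconds=20, horizon=30)

if __name__ == "__main__":
    print("short polling empty responses:", short_poll_empties)
    print("long polling empty responses:", long_poll_empties)
```

With only two messages in thirty seconds, the short poller spends almost all of its calls on an empty queue, while the long poller just sits and waits.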

AWS Certified Solutions Architect Associate ELB & AutoScaling Study Sheet

AWS Elastic Load Balancer is the “card dealer” that evenly distributes “cards” [traffic ] across “card players” [ EC2 instances ] .

Works across EC2 instances in multiple Availability Zones

  • supports HTTP, HTTPS, TCP and SSL traffic / listeners
  • reached by DNS name only [ e.g., a Route 53 CNAME ]; you do not get a fixed IP address
  • supports internet-facing and internal load balancers
  • supports SSL offload / SSL termination at the ELB, relieving load from EC2 instances

Idle Connection Timeout and Keep Alive Options

ELB manages two connections for each request, one to the client and one to the back-end instance, and sets the idle timeout at 60 seconds for both; a connection times out if no data is sent or received during that time. Increase this setting for longer operations ( file uploads, etc. ).

For HTTP and HTTPS listeners, enable Keep Alive on your EC2 instances so the load balancer can re-use back-end connections, reducing CPU utilization.

AWS CloudWatch for ELB and EC2

Service for monitoring AWS resources and applications in near real time. Collect and track metrics, collect and monitor log files, set alarms, and react to changes in your AWS environment. [ SNS notifications, kick off an Auto Scaling group ]

Basic Monitoring / Every 5 minutes  [ DEFAULT ]

Detailed Monitoring / every 1 minute ( more expensive ) 

Each account limited to 5000 alarms.

Metrics data used to be retained for two weeks by default; CloudWatch now keeps it for up to 15 months.

CloudWatch Logs Agent is available as an automated way to send log data to CloudWatch Logs from EC2 instances running Amazon Linux or Ubuntu.

The AWS/EC2 namespace includes the following default instance metrics:

CPU Metrics, Disk Metrics, Network Metrics.

Auto Scaling and Launch Configuration

A Launch Configuration is basically a template that AWS Auto Scaling will use to spin up new instances. Launch Configurations are composed of:

  • AMI
  • EC2 instance type
  • Security Group
  • Instance Key Pair

Auto Scaling is basically provisioning servers on demand and releasing them when no longer needed; you spin up more servers when there is peak demand, e.g., Black Friday, World Series ticket sales.

Auto-Scaling Plans:

Maintain Current Instance Levels – health checks on current instances; and if one dies; another will replace it.

Manual Scaling – This is a bad name for this plan, because the scaling itself is still automatic; only the input is manual. You change the min, max, or desired capacity of the group, and Auto Scaling spins up or terminates instances to match.

Scheduled Scaling – For predictable behavior [ Black Friday through Christmas ]; all actions are performed automatically as a function of date and time.

Dynamic Scaling – you define scaling policies driven by CloudWatch metrics ( CPU, network bandwidth, etc. )

Scaling Policy

A scaling policy is used by Auto Scaling with CloudWatch alarms to determine when your Auto Scaling group should scale in or scale out. Each CloudWatch alarm watches a single metric and sends a message when the metric breaches a threshold.
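As a rough sketch of what happens when an alarm triggers a scaling policy: the policy proposes an adjustment, and the group's min and max sizes clamp the result. The function name and the numbers below are hypothetical; this illustrates the idea, not the Auto Scaling API.

```python
def next_desired_capacity(current, adjustment, min_size, max_size):
    """Apply a scaling adjustment, clamped to the group's min and max size."""
    return max(min_size, min(max_size, current + adjustment))


if __name__ == "__main__":
    # A group allowed to run between 1 and 5 instances.
    print(next_desired_capacity(2, +1, min_size=1, max_size=5))  # scale out
    print(next_desired_capacity(4, +2, min_size=1, max_size=5))  # hits the max
    print(next_desired_capacity(1, -1, min_size=1, max_size=5))  # hits the min
```

No matter how loudly the alarm fires, the group never grows past its max size or shrinks below its min size.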

AWS Certified Architect Associate VPC Study Sheet

AWS VPC

As a Network Engineer; it fascinates me what Amazon has done to virtualize the Network in its Virtual Private Cloud.   Here go the notes!

AWS VPC is a logically isolated section of the AWS Cloud, a virtual network in which you can launch your EC2 instances that can be private or public.

All AWS VPCs contain: subnets, route tables, DHCP option sets, Security Groups and ACLs.

Optional VPC elements are: Internet Gateways, Elastic IPs, Elastic Network Interfaces, Endpoints, Peering, NAT instances or Gateways, Virtual Private Gateways, Customer Gateways, and VPNs.

Largest Subnet in a VPC is /16; Smallest Subnet is /28.

Each Subnet maps to exactly one Availability Zone; Subnets do not span Availability Zones.
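To put numbers on the /16 and /28 limits, here is a quick check with Python's standard `ipaddress` module. One fact is not in my notes above: AWS reserves five addresses in every subnet (the first four and the last one), so the usable count is a bit smaller than the raw size. The function name and the example CIDRs are my own.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_addresses(cidr):
    """Number of addresses you can actually assign in a VPC subnet."""
    network = ipaddress.ip_network(cidr)
    if not 16 <= network.prefixlen <= 28:
        raise ValueError("VPC subnets must be between /16 and /28")
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    for cidr in ("10.0.0.0/16", "10.0.1.0/24", "10.0.2.0/28"):
        print(cidr, "->", usable_addresses(cidr), "usable addresses")
```

The smallest subnet, a /28, has 16 addresses on paper but only 11 you can hand out to instances.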

VPC Route Tables:

VPC has a Router (Implicit).

VPC comes with a “Main” route table which you can change; you can also create separate route tables within your VPC that are not associated with “Main”. Each subnet you create has to be associated with one of the route tables.

VPC Internet Gateways [ IGW]:

An AWS VPC Internet Gateway is a horizontally scaled, redundant, highly available VPC component that allows communication between your EC2 instances and the internet. [ Basically, the default route target points out to the IGW ]

IGWs must be attached to VPCs

Route table must have a 0.0.0.0/0 route to send all non-VPC traffic out.
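Here is a small Python sketch of how a route table with a local route and a 0.0.0.0/0 route picks a target: the most specific matching route wins. The route table contents and the `igw-example` target name are made up for illustration.

```python
import ipaddress


def pick_target(route_table, destination):
    """Return the target of the most specific route matching the destination."""
    address = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if address in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # The longest prefix is the most specific match.
    return max(matches)[1]


# Hypothetical route table for a public subnet in a 10.0.0.0/16 VPC.
public_route_table = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "igw-example",
}

if __name__ == "__main__":
    print(pick_target(public_route_table, "10.0.1.25"))  # stays inside the VPC
    print(pick_target(public_route_table, "8.8.8.8"))    # heads out the IGW
```

Traffic for another instance in the VPC matches both routes, but the /16 local route is more specific, so only non-VPC traffic is sent to the Internet Gateway.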

ACLs and Security Groups MUST be configured so the bad guys don’t get in .

The AWS diagram for this setup is at: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Internet_Gateway.html

VPC DHCP Option SETS

AWS automatically creates and associates a DHCP Option Set for your VPC and sets two options: 1. DNS servers 2. Domain-Name. These are set to Amazon default DNS and Domain name for your region.  To assign your own Domain name; you can create a custom DHCP option set and configure the following:

  1. DNS servers
  2. Domain-Name
  3. NTP Servers
  4. NetBIOS Name servers
  5. NetBIOS Node type

Elastic IP addresses [ EIP ]: 

These are AWS Public IP Addresses that you can allocate to your account from their larger pool; reachable from anywhere on the internet.

  • Allocate the EIP for your VPC first, and then assign it to an EC2 instance
  • EIPs are specific to a region
  • One-to-one relationship between network interfaces and EIPs
  • EIPs can be moved from one EC2 instance to a different EC2 instance
  • EIPs remain with your account until you release them
  • Charges are incurred for EIPs allocated to your AWS account when they are not in use
  • Charged when the instance is stopped; charged when un-attached
  • Free only when there is one EIP per instance and the instance is running

Elastic Network Interfaces [ ENI ]:

This is a network interface available within a single VPC that can be attached to an instance; it is associated with a subnet when it is created

  • Can have one public and multiple private IPs
  • Can exist independently of an instance
  • Allow you to create a management network, use network and security AMIs / appliances, and create dual-homed solutions

VPC EndPoints

VPC Endpoints allow you to create a private connection between your VPC and other AWS services without going over the internet. They work with Route Tables, where the Endpoint for a particular service can be a target.

VPC Peering

Allows for communication between two VPCs; e.g., communication from instances in one VPC to instances in another VPC.

  • you can create peering between your VPC and:
    • another VPC in your account
    • a VPC in another AWS account
  • both VPCs must be within a single region

Peering Rules:

  • no peering between VPCs that have matching or overlapping CIDR blocks
  • cannot peer with VPC in different Regions
  • No transitive peering
  • No more than one peering connection between two VPCs
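The CIDR rule is easy to check before you even try to create a peering connection. This sketch uses Python's standard `ipaddress` module; the function name and the example CIDR blocks are mine, and it covers only the overlap rule, not the region or transitive rules.

```python
import ipaddress


def cidrs_allow_peering(cidr_a, cidr_b):
    """True when two VPC CIDR blocks neither match nor overlap."""
    network_a = ipaddress.ip_network(cidr_a)
    network_b = ipaddress.ip_network(cidr_b)
    return not network_a.overlaps(network_b)


if __name__ == "__main__":
    print(cidrs_allow_peering("10.0.0.0/16", "10.1.0.0/16"))    # separate ranges
    print(cidrs_allow_peering("10.0.0.0/16", "10.0.128.0/17"))  # one inside the other
    print(cidrs_allow_peering("10.0.0.0/16", "10.0.0.0/16"))    # identical
```

This is a good reason to plan your VPC address ranges up front; two VPCs both built on the same default range can never be peered.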

Security Groups (SG) in a VPC

A Security Group is a stateful firewall that controls inbound and outbound network traffic to individual EC2 instances and other AWS resources. All EC2 instances must be launched into a Security Group. Only the Default Security Group allows communication between all resources in that same Security Group. Instances in Security Groups you create cannot talk to each other by default.

  • 500 SGs per AWS VPC
  • 50 inbound, 50 outbound rules for each SG
  • 5 SG’s per network interface
  • applied selectively to individual instances
  • Can specify ALLOW rules / but no DENY [ whitelist ]
  • By default, no inbound traffic is allowed from anything not in SG
  • New SGs by Default have a permit all outbound.
  • Stateful
  • Evaluates ALL rules before deciding permit / deny

Network Access Lists (ACLs)

So you Cisco guys, this is pretty much the same...

Subnet level, stateless, numbered set of rules, processed top down. VPCs have a default ACL associated with every subnet that allows all traffic in and out. When you create a custom ACL, it denies all traffic until you add rules.

  • Supports ALLOW rules and DENY rules
  • Stateless; return traffic MUST be explicitly allowed
  • Rules are processed in order
  • Because it is at the subnet level, it applies to all instances in that subnet
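The difference in rule evaluation between Security Groups and Network ACLs trips people up, so here is a side-by-side Python sketch. It is heavily simplified (ports only, inbound only, no protocols or source addresses), and the rule formats and function names are my own, not an AWS API.

```python
def security_group_allows(rules, port):
    """Security Group: allow rules only, and ALL rules are considered.

    Traffic passes if any rule matches; anything else is implicitly denied.
    """
    return any(low <= port <= high for low, high in rules)


def network_acl_allows(rules, port):
    """Network ACL: numbered rules checked lowest first; first match decides."""
    for _number, low, high, action in sorted(rules):
        if low <= port <= high:
            return action == "allow"
    return False  # the final catch-all rule denies whatever matched nothing


# Hypothetical rule sets.
sg_rules = [(22, 22), (443, 443)]
acl_rules = [
    (100, 22, 22, "deny"),     # rule 100: block SSH
    (200, 0, 65535, "allow"),  # rule 200: allow everything else
]

if __name__ == "__main__":
    print(security_group_allows(sg_rules, 22), security_group_allows(sg_rules, 80))
    print(network_acl_allows(acl_rules, 22), network_acl_allows(acl_rules, 443))
```

In the ACL, rule 100 denies SSH before rule 200 ever gets a look, which is why rule numbering matters there and does not matter in a Security Group.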

NAT Instances ( AMI ) on VPC

AMIs on AWS that have amzn-ami-vpc-nat in their names; their use is taking traffic from a private subnet in the VPC and forwarding it to the IGW ( Internet Gateway ).

You need a SG with appropriate rules / in and out

Launched in a PUBLIC subnet in the VPC

Disable the Source / Destination Check on the NAT instance or it won’t work

The subnet with the PRIVATE hosts will have the NAT instance as the target of the default route in its route table:

0.0.0.0/0 goes to > [NAT instance name]

NAT Gateway on VPC

Designed to operate like the NAT AMI, but easier to manage ( no EC2 instance to patch ), and highly available within an AZ.

VPN, VPG and CGW in a VPC

A VPG ( Virtual Private Gateway ) is the VPN concentrator on the AWS side of a VPN connection.

A CGW ( Customer Gateway ) represents a physical device on the customer’s network; their end of the tunnel.

The VPN handshake must be initiated on the customer’s side.

EC2 Virtualized Types

Hardware Virtual Machines  (HVM) vs. ParaVirtual Machines  (PV)

HVM AMIs – present a fully virtualized set of hardware and boot by executing the master boot record of the image’s root block device; support special hardware extensions ( e.g., GPU acceleration ).

PV AMIs – use the PV-GRUB boot loader; can run on hardware without explicit support for virtualization, but cannot use the special extensions; currently only C3 and M3 types can be used.

T2 instances must be launched into a VPC ( not supported in classic )

T2 must be on HVM AMI

Recommendation: use current generation instance types with HVM AMIs