AWS GuardDuty CloudWatch Hell

I feel it is important to share this with the community. I have fought with GuardDuty and CloudWatch to develop an alerting policy that works, and in the midst of testing that policy I found an error in the AWS documentation, which AWS has since acknowledged. This all started when I was writing a CloudWatch Rule that does not deliver a lot of noise. For a while, all I could get CloudWatch to do was fire on the default GuardDuty rule:

{
  "source": [ "aws.guardduty" ],
  "detail-type": [ "GuardDuty Finding" ]
}

This was so noisy that I was getting alerted for every port scan of every EC2 instance, as well as a variety of other events that were not actionable.

The AWS Admin Guide for GuardDuty outlines the severity types and the alert levels, expressed as float decimals, associated with each. For instance, Low severity falls within the 0.1 to 3.9 range. Whenever I tried to add severity to my CloudWatch Rule, I never got alerted! The guide gives this example CloudWatch Rule for defining severity levels:

aws events put-rule --name Test --event-pattern "{\"source\":[\"aws.guardduty\"],\"detail-type\":[\"GuardDuty Finding\"],\"detail\":{\"severity\":[5.0,8.0]}}"

Notice that the "severity" key values of 5 and 8 in this example are float decimals. This is where the 'fun' began and why I could never get any of the sample alerts to work. When you go into the GuardDuty Console > Settings > Generate Sample alerts, all of the samples I generated (and I did this 25+ times) came in with the 'severity' key value as an integer. It took some digging and two different AWS Support cases, and I ended up finding the cause based on the different rules they were having me create and try. AWS Support was able to verify my findings (I opened tickets with both the CloudWatch team and the GuardDuty team), and they said the same thing:

—– GuardDuty Team

Hello Chris,

Thank you for contacting AWS Premium Support. It was a pleasure talking to you today.

So I did my own tests to conclude my findings about this case and I would like to share the results with you below. While I also noticed that the sample GuardDuty (GD) findings only produced a severity with whole numbers such as 5, 8, 2 etc., I tested your rule anyway that included the severity as decimals such as 5.0, 8.0 and 2.0 and noticed that I did not get any notifications about the same via the SNS topic that I had setup with my CW rule.

aws events put-rule --name HensonGDRule --event-pattern "{\"source\":[\"aws.guardduty\"],\"detail-type\":[\"GuardDuty Finding\"],\"detail\":{\"severity\":[4.0,5.0,8.0]}}"


However, the moment I entered those values as whole numbers, such as 4,5 and 8, I started receiving alerts!

aws events put-rule --name HensonGDRule --event-pattern "{\"source\":[\"aws.guardduty\"],\"detail-type\":[\"GuardDuty Finding\"],\"detail\":{\"severity\":[4,5,8]}}"

—– CloudWatch Team

Hi Chris,

Yes, you are correct. All the sample findings that were generated had an integer severity therefore our rule did not match. Adding integer values for severity in the rule did make it work.

I understand that you would like to do further testing. I have set the case to "Pending Merchant Action". This way the case will auto-resolve in 5 days if there is no activity. Even if the case auto-resolves you can reopen it by adding a correspondence. I have added myself to the case and will keep track of it till it is fully resolved.

Let me know if you have any questions.

Best regards,

Siddhart

What is unknown, and what I still need to test, is whether every sample alert generates 'severity' as an integer. The CloudWatch Rule in my Git repo will work with both float and integer values. I have yet to see an alert come in as a float… most of them look like this:

"severity":5, or "severity":8


I have yet to see a 'severity' of 5.5 . . . AWS Support is still looking into some things, so I will update this later on.
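If you want a rule that tolerates both representations in the meantime, one approach is to simply list each severity value twice, once as an integer and once as a float. Here is a minimal boto3 sketch of that idea (the rule name and the exact severity list are placeholders, not the rule from my repo):

import json
import boto3

events = boto3.client("events")

# Match GuardDuty findings whether severity arrives as 5 or 5.0, and so on.
pattern = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [4, 4.0, 5, 5.0, 6, 6.0, 7, 7.0, 8, 8.0]},
}

events.put_rule(
    Name="GuardDutyFindingAlerts",  # placeholder rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)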

UPDATED 5/29/2018: I received a new note on my support case that confirms AWS acknowledges the issue:

Hello Chris,

Thank you for your patience with this case.

To provide an update to you, this is actually a known issue to our internal service teams where GuardDuty findings, which are supposed to be formatted with decimals, are being passed as integers; you will probably see this if you were to try and export a finding directly from the GuardDuty console, you'll see the "Severity" element has an integer value. Because CloudWatch Events is a pattern-based trigger system, this means that technically the values in your pattern are not the same as the values GuardDuty is presenting which explains why your patterns may not be triggering.


The GuardDuty team is aware of the issue and they are pushing forward with a fix as soon as possible, though I can't disclose any affirmative ETA on this. However, as a workaround for now, please try passing a pattern like this:


{
    "source": [
        "aws.guardduty"
    ],
    "detail-type": [
        "GuardDuty Finding"
    ],
    "detail": {
        "severity": [
            5,
            6,
            7,
            8
        ]
    }
}


I would like to apologize for the inconvenience caused to you due to this however, if you have any further questions or concerns, please feel free to get back to us and I'll be glad to look into it as well.

Thank you and have a good day!

Best regards,

Ketan S.
Amazon Web Services

It's important to note that whenever they fix this, it could break the existing workaround, so I will keep on top of it.

 

Some good info unearthed during the investigation:


GlueCon2018: AWS Security for DevOps by Chris Henson

Gratitude is what comes to mind when reflecting back on my speaking opportunity at GlueCon2018. Back in January of this year, I came up with the topic of 'AWS Security for DevOps' as a way to introduce the concept of an IAM role, show some basic policies, and explain why the Principle of Least Privilege is needed when using apps inside AWS. I built the slide deck and prepared the talk without knowing whether I was going to be selected to speak.

GlueCon was amazing! I did get selected to speak in the main hall used for the keynotes (YAY! Thank you, Eric Norlin), and although the presentation was not filmed, you can download the deck here from the 'about me' page (at the bottom) and use the included MD5 hash to ensure you are getting the same file I put up.

 

 

 


Gluecon2018 Keynote w/ Adrian Cockcroft + AWESOME!

Cool things happen when a security person gets to attend a developer conference! In all seriousness, last January I planned to attend GlueCon this year because I feel development is a critical part of security, and I want to understand development concepts more in depth so I can add value there.

Below are my notes on the Gluecon2018 keynote with Adrian Cockcroft. Adrian works for Amazon Web Services as a Solutions Architect, and he was previously at Netflix as a key part of their team. The keynote was solid! It was one of the best presentations on cloud architecture I have seen.

 

Adrian opened with the fact that architects must ask the "awkward" questions of their customers: 'What should your system do when it fails?' [ because it will fail ] 'If a permissions lookup fails, what should you do?' 'Do you have a real DR?' 'How do you know your system works?' [ what do the metrics tell you? ] 'How often do you fail the entire data center all at once?' 'How exactly does your system return to normal after a DR?' Most customers don't want to talk about failure scenarios, and it is an architect's job to bring them up and address them in the design. Adrian pointed out that some companies have "availability theatre," where true failure scenarios are not part of the overall DR testing process, yet DR is touted as functional.

Next, Adrian moved on to talk about avoidable failure scenarios. He pointed out a SaaS company that forgot to renew their domain name and, due to the expiry, everything failed. He also pointed out SSL certificate expiry as a failure scenario. Aside from the obvious controls to avoid these types of failures, Adrian pointed out that you could program an alternative DNS name to which your API could fail over, so that DNS-dependent services could still function in the event of a domain name expiry. He mentioned that DNS is one of the weakest points of a large system and needs to be taken into consideration when we architect for chaos.

Transition to Chaos Architecture . . . 4 layers:

  1. Infrastructure Layer – No Single Point of Failure
  2. Switching and Interconnecting Layer – Data replication / traffic routing
  3. Application Layer – app Failures / Error handling
  4. Users / People Layer – Operator confusion, users not interpreting data properly and making changes based off of what they see vs. what is actually happening

To mitigate problems at the Users / People Layer: there is not enough emphasis on people training and fire drills when it comes to reacting and responding to system failures. Implement training to help users and operators behave in a consistent way when certain failures occur. Also, implement 'game days' where failures are purposely introduced as a way of training.

To mitigate problems at the Application Layer: leverage Simian Army toolsets that introduce specific problems into various components of the application, so you understand how your system reacts to these failures and can address them. Adrian mentioned ChAP to automate this.

To mitigate at the Switching / Interconnection and Infrastructure Layers: use Gremlin to run specific failures and experiments against your infrastructure to understand how it behaves in those scenarios. Adrian pointed out that we must not assume "the network is reliable" and should architect for failures in the network domain when we design our systems and applications.

Adrian talked about how a 'Chaos Engineering Team' is like a 'Security Red Team': where a Security Red Team identifies weaknesses in security, a Chaos Engineering Team identifies weaknesses in availability.

Other critical points Adrian touched on:

The 'Red Queen theory', wherein as we evolve, the people and environments around us evolve as well. I took this as a warning not to "architect in a bubble".

Read 'The Safety Anarchist' by Sidney Dekker.

Amazon is beginning to implement chaos tests for customer use via Aurora DB Cluster Fault Injection.

He recommended getting involved in the Chaos Engineering Working Group.

 

 


OSSEC / Auto-OSSEC Automation in AWS Linux – More GLUE!

 

OSSEC is a tricky devil to automate. And what I mean by automate is: install the ossec-hids-server, install the ossec-hids-agent, register the agent, and have the server recognize that registration without human prompts. If you've done this before, you know there are lots of manual steps. The smart folks over at BinaryDefense have added some automation to that process with their auto-ossec tool.

They really took a lot of work out of all of the manual steps needed to connect the client to the server, generate the key and exchange the key…

but… the process was still not as automated as I needed it to be. In AWS you don't know what the OSSEC server IP will be, and that IP needs to be passed to auto-ossec as an argument and placed in the ossec-hids-agent config file. Not to mention all of the repo additions and tweaks to the OSSEC config files that must happen just for OSSEC to start properly.

I have written two scripts, located in my git repository, that automate the installation of the remaining pieces that auto-ossec does not cover; they are built for AWS Linux.

The LinuxOssecServer script installs ossec-hids-server and the BinaryDefense auto-ossec listener on the AWS Linux EC2 instance that will be in the role of OSSEC server.

We leverage S3 as a storehouse for the needed files:
The atomic installer script that you run to install the OSSEC repositories (https://updates.atomicorp.com/installers/atomic) would go in s3://yourbucketname.

Also, a clone of the BinaryDefense repo (https://github.com/binarydefense/auto-ossec) would go in s3://yourbucketname.

You need to allow your EC2 instance access to S3 and the ability to query other instances, so an EC2 instance role with access to S3 and EC2 is required.
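For reference, here is a rough sketch of what that instance role's inline policy could look like, attached via boto3; the role and policy names are placeholders, and the bucket matches the s3://yourbucketname example above:

import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Pull the atomic installer and the auto-ossec clone from S3.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::yourbucketname",
                "arn:aws:s3:::yourbucketname/*",
            ],
        },
        {
            # Let the client script look up the OSSEC server by tag.
            "Effect": "Allow",
            "Action": "ec2:DescribeInstances",
            "Resource": "*",
        },
    ],
}

iam.put_role_policy(
    RoleName="OssecInstanceRole",       # placeholder role name
    PolicyName="OssecS3AndDescribe",    # placeholder policy name
    PolicyDocument=json.dumps(policy),
)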

The LinuxOssecClient script installs ossec-hids-agent and the BinaryDefense auto-ossec client, then automatically locates the OSSEC server's IP address (via a pre-set tag on the EC2 instance), registers the agent, and starts the services on AWS Linux. Same requirements as above for the role.

The line with 'aws ec2 describe-instances' must have the correct region, so put your region in there. For the public version of the code, the OSSEC server must have the AWS tag Name=tag:Role,Values=OssecMaster for the script to locate the IP address of the EC2 instance that is the OSSEC server, so when you start your OssecServer instance, be sure to add that tag.
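For anyone adapting the lookup outside of the shell script, the equivalent call in boto3 looks roughly like this (the region is a placeholder; the tag filter matches the Role=OssecMaster tag described above):

import boto3

# The region must match where the OSSEC server runs; put your region here.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Role", "Values": ["OssecMaster"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

# Grab the private IP of the first matching instance.
ossec_server_ip = resp["Reservations"][0]["Instances"][0]["PrivateIpAddress"]
print(ossec_server_ip)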

You'll notice some sleep commands I've put in the scripts. OSSEC initialization is a little buggy [ see ref links 1 and 2 below ]: you have to restart the ossec-hids-server process on the server after the first agent attempts to register; once that is done, all subsequent agents register with no problem. I don't know why this is, the behavior is lame, and I hated having to code around it. I need to come up with a better way than just sleeping the script during the first agent registration and then running a restart after x minutes. Or maybe the next version of OSSEC will fix this so the first agent will register without a restart.
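For now, the workaround boils down to waiting out the first registration and then bouncing the server process. Stripped to its essence, it is something like this (the delay is arbitrary and the path assumes a default OSSEC install):

import subprocess
import time

# Give the first agent time to attempt registration, then restart the
# server so that it (and every agent after it) gets recognized.
time.sleep(300)  # arbitrary delay; tune it for your environment
subprocess.run(["/var/ossec/bin/ossec-control", "restart"], check=True)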

Ref 1: Issue where you have to restart OSSEC after the first agent registers

Ref 2: Issue where you have to restart OSSEC after the first agent registers

Also, don’t forget to configure your Security Groups correctly.

You'll need 9654 TCP open on the OSSEC server for the auto-ossec listener.

You'll need 1514 UDP open on the OSSEC server to accept agent keep-alive messages.
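If you would rather script the security group as well, the two ingress rules can be added with something like the following (the group ID and source CIDR are placeholders):

import boto3

ec2 = boto3.client("ec2")

# Open the auto-ossec listener (9654/tcp) and the agent keep-alive
# port (1514/udp) on the OSSEC server's security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 9654,
            "ToPort": 9654,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],  # placeholder CIDR
        },
        {
            "IpProtocol": "udp",
            "FromPort": 1514,
            "ToPort": 1514,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],  # placeholder CIDR
        },
    ],
)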

 


Path to AWS Architect Professional – Storage Anti-Patterns

 

This post is a summary of my notes from reading the Storage Design Anti-Patterns addressed in this AWS whitepaper.

“An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive”

S3 Anti-Patterns: 

Amazon S3 doesn't suit all storage situations. The following list presents some storage needs for which you should consider other AWS storage options.

Storage Need: File System. S3 uses a flat namespace and is not meant to be a POSIX-compliant file system. Instead, consider Amazon EFS for a file system.

Storage Need: Structured Data with Query. S3 does not offer query capabilities for specific objects; when you use S3, you need to know the bucket name and key for the files you want to retrieve. Instead, use or pair S3 with Amazon DynamoDB, Amazon RDS, or Amazon CloudSearch.

Storage Need: Rapidly Changing Data. Use solutions that take read and write latencies into account, such as Amazon EFS, Amazon DynamoDB, Amazon RDS, or Amazon EBS.

Storage Need: Archival Data. Data that requires infrequent read access, encrypted archival storage, and a long RTO is ideal for Amazon Glacier.

Storage Need: Dynamic Website Hosting. Although S3 is ideal for hosting static content, dynamic websites that depend on server-side scripting or database interaction are better suited to Amazon EC2 or Amazon EFS.

Glacier Anti-Patterns: 

Amazon Glacier doesn’t suit all storage situations. The following list presents some storage needs for which you should consider other AWS storage options.

Storage Need: Rapidly Changing Data. Look for a storage solution with lower read and write latencies, such as Amazon RDS, Amazon EFS, Amazon DynamoDB, or databases running on Amazon EC2.

Storage Need: Immediate Access. Data stored in Glacier is not available immediately; retrieval typically takes 3-5 hours, so if you need to access your data immediately, Amazon S3 is a better choice.

Amazon EFS Anti-Patterns:

Amazon EFS doesn't suit all storage situations. The following list presents some storage needs for which you should consider other AWS storage options.

Storage Need: Archival Data. Data that requires infrequent read access, encrypted archival storage, and a long RTO is ideal for Amazon Glacier.

Storage Need: Relational Database Storage. In most cases, relational databases require storage that is mounted, accessed, and locked by a single node (an EC2 instance, etc.). Instead, use Amazon DynamoDB or Amazon RDS.

Storage Need: Temporary Storage. Consider using local instance store for items like buffers, caches, queues, and other temporary data.

Amazon EBS Anti-Patterns:

Amazon EBS doesn't suit all storage situations. The following list presents some storage needs for which you should consider other AWS storage options.

Storage Need: Temporary Storage. Consider using local instance store for items like buffers, caches, queues, and other temporary data.

Storage Need: Multi-Instance Storage. EBS volumes can only be attached to one EC2 instance at a time. If you need multiple instances attached to a single data store, consider using Amazon EFS.

Storage Need: Highly Durable Storage. Instead, use Amazon S3 or Amazon EFS. Amazon S3 Standard storage is designed for 99.999999999 percent (11 nines) annual durability per object. You can take a snapshot of an EBS volume, and that snapshot is saved to S3, thereby providing the durability of S3. Alternatively, Amazon EFS is designed for high durability and high availability, with data stored in multiple Availability Zones within an AWS Region.

Storage Need: Static Data or Web Content. If your data is mostly static, Amazon S3 might represent a more cost-effective and scalable solution for storing fixed information. Web content served out of Amazon EBS requires a web server running on Amazon EC2; in contrast, you can deliver web content directly out of Amazon S3 or from multiple EC2 instances using Amazon EFS.

Amazon EC2 Instance Store Anti-Patterns:

Amazon EC2 instance store doesn’t suit all storage situations. The following list presents some storage needs for which you should consider other AWS storage options.

Storage Need: Persistent Storage. If you need disks that are similar to a disk drive and must persist beyond the life of the instance, EBS volumes, EFS file systems, or S3 are more appropriate.

Storage Need: Relational Database Storage. In most cases, relational databases require storage that is mounted, accessed, and locked by a single node (an EC2 instance, etc.). Instead, use Amazon DynamoDB or Amazon RDS.

Storage Need: Shared Storage. Instance store can only be attached to one EC2 instance at a time. If you need multiple instances attached to a single data store, storage that can be detached from one instance and attached to a different instance, or the ability to share data easily, then Amazon EFS, Amazon S3, or Amazon EBS are better choices.

Storage Need: Snapshots. If you need long-term durability, availability, and the ability to share point-in-time disk snapshots, EBS volumes with snapshots stored in S3 are a better choice.


Path to AWS Architect Professional – Which DB to use? re:Invent Notes

AWS re:Invent videos on YouTube are a goldmine for knowledge seekers. What follows are my notes from the AWS re:Invent 2017 'Which Database to Use When' presentation. I am using these videos to study for my AWS Architect Professional exam coming up at the end of May. I hope my notes help you, too.

Amazon Database Philosophy: purpose-build databases to satisfy particular workloads at the best price, programmability, and performance for customers.

Self-Managed Database: You have full responsibility for upgrades and backups. You have full responsibility for security. You have full control over the parameters of the server and DB. Replication is expensive and requires significant engineering.

vs.

AWS-Managed Database: AWS provides upgrades, backups, and fail-over as a packaged service. AWS provides infrastructure security, certifications, and tools for security. The DB is a managed appliance, so you can easily automate. You can leverage API calls to the DB and S3. "Everything is at the end of an API call."

Generalities – what are you doing with your DB?

Operational [transactional, system of record, content management ]

Usually a good fit for caching. Small compute sizes; few rows, items, or documents per request. High throughput, high concurrency. Mission critical: HA, DR, and data protection. Size at limit: bounded or unbounded?

Things to consider: size at limit (bounded or unbounded), rows vs. key-values vs. documents, need for relational capabilities, pushing compute down to the DB, change velocity (insert-only workload vs. update workload), and ingestion requirements.

Relational stores are really good if you need referential integrity, strong consistency, transactions, and hardened scale. Complex query support with SQL.

Key-value stores: low-latency GET and PUT. High throughput, partitionable, fast ingestion of data. Simple query methods with filters.

Document stores: indexing and storing any document with support for querying any property. Simple queries with filters, projections, and aggregates.

Graph stores: creating and navigating relationships between data easily and quickly. Easily express queries in terms of relationships.

RDS

RDS is a great general-purpose DB; start very small and grow with the business. When you are building an application with common frameworks like Ruby on Rails, Django, etc., you can choose a particular engine based on the skills on your team (Python skills and PostgreSQL, for example) and the app requirements. When bringing apps into the cloud, people start with RDS (SQL Server for bringing in IIS or .NET) to get out of the business of DB management and focus on the application. Aurora vs. EBS.

Aurora is a fantastic storage environment because it's always Multi-AZ. For massive-scale relational workloads, choose Aurora, then choose features based on the application. RDS is bounded at a limit: you provision storage and grow with it, but there are ultimately limits. Aurora offers encryption at rest and in transit, 5x better performance than MySQL, and is cost efficient.

DynamoDB

DynamoDB breaks you out of the limitations of relational DBs and gives you the ability to operate efficiently at much greater scale. You can mix and match DynamoDB with RDS: put a shopping cart that needs high availability and high throughput on DynamoDB; if you have data with huge amounts of "push down" computation, you may put that in RDS. The Database Migration Service moves data between sources. DynamoDB Streams listen for changes and update data. Partitioned.

DAX for caching gives unbounded low latency: the application is implemented on DynamoDB with DAX in front, and you don't manage the cache yourself. DAX is a "write-through cache" and a massive acceleration in performance.

ElastiCache. You add a cache on top of an operational store, and the application takes responsibility for keeping the cache and the data consistent. Memcached and Redis interfaces, so you have an open approach.

If your data is bigger than the cache size and you are missing hits on the cache, then the cache does no good. Not everything is cacheable.

Neptune – Graph DB. Store billions of relationships and query with millisecond latency. Six replicas of data across three AZs. Build queries with Gremlin or SPARQL. Relationships are persisted in the graph store. Push-down compute.

Analytics – [ retrospective, streaming, predictive ]

Analytic Workloads – Columnar format. Data in columns tends to repeat and is very compressible. Analytic workloads are large and usually partitioned. Large compute sizes, heavy compute push-down, little to no updates, and a need for lots of memory and in-memory compute.

Analytic Workloads – Primary decisions to consider: streaming or not, latency requirements, ETL or no ETL, serverless or dedicated compute, always active or occasionally active. What is your data format?

Amazon Athena – Interactive analysis service. Treat data at rest [ structured or unstructured ] like a DB and query it. Zero setup cost: point it at S3 and start querying. Pay per query. ANSI SQL interface. Zero administration. Serverless. Good for retrospective analysis and developing trends.

Redshift – Fast, powerful data warehouse for when the schema and access patterns are understood; extremely fast queries at scale. Resize the cluster up and down. Data encrypted at rest and in transit. Manage your own keys with KMS. Inexpensive. Good for retrospective analysis and developing trends.

Kinesis – For real-time analytics; process real-time data with SQL. An example would be a customer support case where you want to react in real time, or brake sensors on a train. You don't have time to index that into a data warehouse; you need it now.

Elasticsearch – Good for log analysis, full-text search, and application monitoring, in more flexible and natural ways. It has Kibana bundled in.

 


Lambda Access Key Age Checker using Python

How old are those Keys, anyway? 

The story goes like this: I needed to automate a way of knowing how old every API key is and the user to which it belongs, in all the AWS accounts I work with.

I went looking for a Lambda script on the interwebs that would do some checks on access key age for me. It seems like a basic thing you'd think would be out there… but no, no one had done this in this way, at least that I could find. I did find some code pieces that others had done for pulling out just the key age, but I ended up putting most of this together myself and learning a lot!

So what does this script do? Specifically, leveraging the boto3 libraries, it calls out to IAM and returns all usernames into a list. The next loop iterates through that list of users, doing the following for each: get the keys for the user and compare each key's create date against the current date; if a key is older than 90 days, append that user and key age to a new list. It then converts that new list into a string (so it complies with the SNS message type) and passes the string to SNS, which emails it out to those who have subscribed to the topic.
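Here is a condensed, illustrative sketch of that flow; the topic ARN and the 90-day threshold are placeholders, and the actual Python 2.7 code is linked below:

import boto3
from datetime import datetime, timedelta, timezone

iam = boto3.client("iam")
sns = boto3.client("sns")

MAX_AGE = timedelta(days=90)
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:key-age-alerts"  # placeholder


def lambda_handler(event, context):
    old_keys = []
    # Walk every IAM user, then every access key belonging to that user.
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            keys = iam.list_access_keys(UserName=user["UserName"])
            for key in keys["AccessKeyMetadata"]:
                age = datetime.now(timezone.utc) - key["CreateDate"]
                if age > MAX_AGE:
                    old_keys.append("%s: %d days old" % (user["UserName"], age.days))
    if old_keys:
        # SNS wants a plain string, so join the list before publishing.
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Access keys older than 90 days",
            Message="\n".join(old_keys),
        )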


Here is the code in my GitHub. The Python 2.7 code does exactly what is stated above. Also included is the JSON policy that grants Lambda access to IAM, SNS, and CloudWatch Logs; you will need to attach it to the Lambda execution role that must be created for this function to run.

Lambda has a default function timeout of 3 seconds. In one of the regions where I implemented this, the code took 5 seconds to run, so I had to increase the timeout to 5 seconds, FYI.

I set up a CloudWatch Events rule to trigger this Lambda using an AWS cron expression.
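As an example, a weekly schedule wired up with boto3 might look like this; the cron expression, rule name, and function ARN are illustrative, and the function also needs a resource-based permission allowing CloudWatch Events to invoke it:

import boto3

events = boto3.client("events")

# Run every Monday at 08:00 UTC; AWS cron expressions use six fields.
events.put_rule(
    Name="KeyAgeCheckerSchedule",  # placeholder rule name
    ScheduleExpression="cron(0 8 ? * MON *)",
    State="ENABLED",
)

# Point the rule at the Lambda function.
events.put_targets(
    Rule="KeyAgeCheckerSchedule",
    Targets=[
        {
            "Id": "key-age-checker",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:KeyAgeChecker",  # placeholder ARN
        }
    ],
)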

I don't claim to be a developer, but I do love automating things with code, and this script works, and works well, repeatedly. I hope it works well for you, too!

Oh, one last thing! This script could also easily be modified to disable keys that are older than 90 days, which is the next logical step after creating user awareness of key age and implementing and communicating a policy on key age.
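The change is small: where the sketch above appends a user to the report list, it could also flip the key to inactive. A helper like this would do it (verify it against your own key policy before turning it loose):

import boto3

iam = boto3.client("iam")


def disable_key(user_name, access_key_id):
    """Disable (not delete) an access key that has aged out."""
    iam.update_access_key(
        UserName=user_name,
        AccessKeyId=access_key_id,
        Status="Inactive",
    )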
