AWS re:Invent videos on YouTube are a goldmine for knowledge seekers. What follows are my notes from the AWS re:Invent 2017 "Which Database to Use When" presentation. I am using these videos to study for my AWS Architect Professional exam coming up at the end of May. I hope my notes help you, too.
Amazon Database Philosophy: purpose-build databases to satisfy particular workloads, for the best price, programmability, and performance for customers.
Self-Managed Database: You have full responsibility for upgrades, backups, and security. You have full control over server and DB parameters. Replication is expensive and requires significant engineering.
AWS Managed Database: AWS provides upgrades, backups, and fail-over as a packaged service. AWS provides infrastructure security, certifications, and tools for security. The DB is a managed appliance, so you can easily automate. You can leverage API calls to the DB / S3. "Everything is at the end of an API call."
Generalities - what are you doing with your DB?
Operational [transactional, system of record, content management]
Usually a good fit for caching. Small compute sizes; few rows, items, or documents per request. High throughput, high concurrency. Mission critical: HA, DR, data protection.
Things to consider: size at limit, bounded or unbounded? Rows, key-values, or documents? Need relational capabilities? Push-down compute to the DB? Change velocity (insert-only workload vs. update workload). Ingestion requirements.
Relational Store: really good if you need referential integrity, strong consistency, transactions, and hardened scale. Complex query support with SQL.
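Those three strengths (referential integrity, transactions, SQL joins) can be sketched locally with Python's built-in sqlite3 standing in for an RDS engine. The table and column names here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total REAL)""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

# Transactions: either every statement in the block commits, or none do.
with conn:
    conn.execute("INSERT INTO orders VALUES (10, 1, 19.99)")

# Referential integrity: an order for a nonexistent customer is rejected.
rejected = False
try:
    conn.execute("INSERT INTO orders VALUES (11, 999, 5.00)")
except sqlite3.IntegrityError:
    rejected = True

# Complex query support with SQL: a join plus an aggregate.
row = conn.execute("""SELECT c.name, SUM(o.total)
                      FROM customers c JOIN orders o ON o.customer_id = c.id
                      GROUP BY c.id""").fetchone()
print(row)  # ('Ada', 19.99)
```

The same DDL and queries run unchanged on an RDS PostgreSQL or MySQL instance; only the connection line changes.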
Key-Value Store: low-latency GETs and PUTs. High throughput, partitionable, fast ingestion of data. Simple query methods with filters.
Document Store: indexing and storing any document, with support for querying any property. Simple queries with filters, projections, and aggregates.
Graph Store: creating and navigating relations between data easily and quickly. Easily express queries in terms of relations.
RDS is a great general-purpose DB: start very small and grow with the business. Use it when you are building an application with common frameworks like Ruby on Rails, Django, etc., and you can choose a particular engine based on the skills on your team (e.g., Python skills and PostgreSQL) and app requirements. When bringing apps into the cloud, people start with RDS - SQL Server for bringing in IIS or .NET - to get out of the business of DB management and focus on the application. The next decision: Aurora vs. standard EBS-backed RDS storage.
Aurora is a fantastic storage environment because it is always Multi-AZ. For massive-scale relational workloads, choose Aurora, then choose features based on the application. RDS is bounded at a limit: you provision storage and grow with it, but ultimately hit limits. Aurora offers encryption at rest and in transit. Up to 5x better performance than standard MySQL, and cost efficient.
DynamoDB: break out of the limitations of relational DBs. Gives the ability to operate efficiently at much more scale. You can mix and match DynamoDB with RDS: put a shopping cart that needs high availability and high throughput on DynamoDB; if you have data with huge amounts of "push-down" computation, you may put that in RDS. The Database Migration Service moves data between sources. DynamoDB Streams listens for changes and updates data. Partitioned.
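"Partitioned" is doing a lot of work in that note: DynamoDB hashes the partition key to decide which physical partition holds an item, which is what makes it horizontally scalable. A toy local sketch of that routing, with invented key names and a made-up partition count:

```python
import hashlib

NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]  # stand-ins for physical partitions

def _partition_for(key):
    """Hash the partition key to pick which partition owns it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return partitions[int(digest, 16) % NUM_PARTITIONS]

def put_item(key, item):
    _partition_for(key)[key] = item

def get_item(key):
    return _partition_for(key).get(key)

# The talk's shopping-cart example, keyed by user.
put_item("user#42", {"cart": ["book", "lamp"]})
print(get_item("user#42"))
```

Because both reads and writes route by the same hash, any node can answer a request without coordinating with the others; this is why well-distributed partition keys matter so much in DynamoDB table design.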
DAX for caching gives unbounded low latency. With an application implemented on DynamoDB and DAX in front, you don't manage cache invalidation yourself: DAX is a "write-through cache". Massive acceleration in performance.
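The "write-through" behavior is the key difference from a bolt-on cache: writes pass through the cache into the backing store in the same path, so a later read from the cache is never stale. A minimal local sketch, with plain dicts standing in for DAX and the DynamoDB table:

```python
backing_store = {}   # stands in for the DynamoDB table
cache = {}           # stands in for DAX

def write_through(key, value):
    backing_store[key] = value  # the write hits the store...
    cache[key] = value          # ...and the cache in the same operation

def read(key):
    if key in cache:                 # fast path: served from cache
        return cache[key]
    value = backing_store.get(key)   # miss: fall through to the store
    if value is not None:
        cache[key] = value           # populate for the next read
    return value

write_through("session#1", {"user": "ada"})
print(read("session#1"))  # served from cache, consistent with the store
```

Contrast this with the ElastiCache pattern below, where the application writes to the store and the cache separately and owns keeping them consistent.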
ElastiCache: you add a cache on top of an operational store. The application takes responsibility for keeping the cache and the data consistent. Memcached and Redis interfaces, so you have an open approach.
If your data is bigger than the cache size and you are missing hits on the cache, then the cache does no good. Not everything is cacheable.
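That failure mode is easy to demonstrate. In the tiny LRU sketch below (sizes invented), a working set of five keys cycled through a three-entry cache evicts every entry before it is re-read, so the hit rate is zero:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)   # mark as most recently used
            return self.data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=3)
# Cycle a 5-key working set through a 3-entry cache, twice.
for _ in range(2):
    for key in ["a", "b", "c", "d", "e"]:
        if cache.get(key) is None:
            cache.put(key, f"value-{key}")
print(cache.hits, cache.misses)  # 0 hits: the cache "does no good"
```

The same arithmetic applies to a Redis node: if the hot working set doesn't fit in memory, adding the cache only adds latency.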
Neptune: Graph DB. Store billions of relations and query with millisecond latency. Six replicas of data across three AZs. Build queries with Gremlin or SPARQL. Relationships are persisted in the graph store. Push-down compute.
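The "express queries in terms of relations" idea can be sketched locally: store relationships as adjacency lists and answer queries by walking edges, roughly what a Gremlin traversal like `g.V('alice').out('follows').out('follows')` does in Neptune. The social-graph data here is invented:

```python
# Edges of a tiny "follows" graph, as adjacency lists.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def out(vertices, edges):
    """One traversal step: follow every outgoing edge from the frontier."""
    return sorted({target for v in vertices for target in edges.get(v, [])})

# "Who do the people alice follows, follow?" -- two hops from alice.
two_hops = out(out(["alice"], follows), follows)
print(two_hops)  # ['dave', 'erin']
```

In a graph store this hop-by-hop expansion is the native operation (the "push-down compute" the notes mention); in a relational DB the same query needs a self-join per hop.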
Analytics [retrospective, streaming, predictive]
Analytic Workloads - Columnar format: data in columns tends to repeat, so it is very compressible. Analytic workloads are large and usually partitioned. Large compute size. Heavy compute push-down. Little to no updates; need lots of memory and in-memory compute.
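A quick sketch of why that repetition makes columns so compressible: run-length encoding collapses a run of identical values into a single (value, count) pair. The column data below is invented for illustration.

```python
from itertools import groupby

# One column from a hypothetical events table -- note the long runs.
region_column = ["us-east-1"] * 5 + ["eu-west-1"] * 3 + ["us-east-1"] * 2

def run_length_encode(column):
    """Collapse each run of repeated values into one (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(column)]

encoded = run_length_encode(region_column)
print(encoded)  # 10 stored values collapsed to 3 pairs
```

Columnar engines like Redshift and formats like Parquet use this and related encodings per column, which is why the same data is far smaller (and scans far faster) than in a row-oriented layout.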
Analytic Workloads - primary decisions to consider: streaming or not? Latency requirements? ETL or no ETL? Serverless or dedicated compute? Always active or occasionally active? What is your data format?
Amazon Athena: interactive analysis product. Treat data at rest [structured or unstructured] like a DB and query it. Zero setup cost: point at S3 and start querying. Pay per query. ANSI SQL interface. Zero administration. Serverless. For doing retrospective analysis and developing trends.
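What "point at S3 and start querying" looks like in practice with boto3's Athena client: you submit ANSI SQL plus a result location. The database, table, and bucket names below are invented, and actually running the query needs AWS credentials, so this sketch only builds the request that `client.start_query_execution(**params)` accepts.

```python
params = {
    # ANSI SQL over files sitting in S3; table name is an assumption.
    "QueryString": "SELECT status, COUNT(*) FROM access_logs "
                   "GROUP BY status ORDER BY 2 DESC",
    "QueryExecutionContext": {"Database": "weblogs"},  # assumed Glue/Athena DB
    "ResultConfiguration": {
        # Athena writes query results back to S3; bucket is an assumption.
        "OutputLocation": "s3://my-athena-results/queries/"
    },
}
# With credentials configured, you would run:
#   import boto3
#   client = boto3.client("athena")
#   execution = client.start_query_execution(**params)
print(sorted(params))
```

There is no cluster to size or start; the cost is per query, driven by the bytes scanned, which is why partitioned and columnar data in S3 keeps Athena cheap.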
Redshift: fast, powerful data warehouse for when the schema and query patterns are understood; extremely fast queries at scale. Resize the cluster up and down. Data encrypted at rest and in transit; manage your own keys with KMS. Inexpensive. For doing retrospective analysis and developing trends.
Kinesis: for doing real-time analytics; process real-time data with SQL. An example would be a customer support case where you want to react in real time, or brake sensors on a train. You don't have time to index that into a data warehouse; you need it now.
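The brake-sensor example boils down to windowed aggregation over a stream as events arrive. A local sketch of a tumbling one-minute window, roughly what a Kinesis Data Analytics SQL window clause computes; the readings are invented:

```python
from collections import defaultdict

# Simulated stream: (timestamp_seconds, train_id, brake_temp_celsius)
stream = [
    (3, "train-7", 80), (45, "train-7", 95),
    (61, "train-7", 140), (90, "train-7", 150),
]

def tumbling_max(events, window_seconds=60):
    """Max reading per train per tumbling window."""
    windows = defaultdict(lambda: float("-inf"))
    for ts, train, temp in events:
        key = (train, ts // window_seconds)  # which window this event falls in
        windows[key] = max(windows[key], temp)
    return dict(windows)

result = tumbling_max(stream)
print(result)  # {('train-7', 0): 95, ('train-7', 1): 150}
```

The point of the talk's example is that this answer is available seconds after the readings arrive, instead of after a nightly load into a warehouse.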
Elasticsearch: good for log analysis, full-text search, and application monitoring. More flexible, more natural query patterns. It has Kibana bundled in.