
AWS-MLS-C01-Study-Guide

Note these are my own personal notes and are a work in progress as I study towards passing this exam. If this helps someone great, but I make no guarantees/promises.

Table of Contents

  1. Introduction
  2. Data Engineering
  3. Exploratory Data Analysis
  4. Modeling
  5. Machine Learning Implementation and Operations
  6. Acronyms

Introduction

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Guide

Exam Content Breakdown:

| Domain | % of Exam |
| --- | --- |
| Domain 1: Data Engineering | 20% |
| Domain 2: Exploratory Data Analysis | 24% |
| Domain 3: Modeling | 36% |
| Domain 4: Machine Learning Implementation and Operations | 20% |
| Total | 100% |

Data Engineering

Create data repositories for machine learning

Identify data sources (e.g., content and location, primary sources such as user data)(TBD)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Data Stores
Amazon Redshift:
  • fully managed, scalable cloud data warehouse, columnar instead of row based (no Multi-AZ, based on Postgres; OLAP [column based], not OLTP [row based])
  • Offers parallel SQL queries
  • Can be serverless or use cluster(s)
  • Uses SQL to analyze structured and semi-structured data across data warehouses, operational DBs, and data lakes
  • Integrates with Quicksight or Tableau
  • Leader node for query planning, results aggregation
  • Compute node(s) for performing queries to be sent back to the leader
  • Provision node sizes in advance
  • Enhanced VPC Routing
  • Forces all COPY and UNLOAD traffic between your cluster and data repositories through your VPC; otherwise that traffic is routed over the internet, including traffic to other AWS services
  • Can configure to automatically copy snapshots to other Regions
  • Large inserts are better (S3 copy, firehose)
Amazon Redshift Spectrum:
  • Resides on dedicated Amazon Redshift servers independent of your cluster
  • Can efficiently query and retrieve structured and semistructured data from files in S3 into Redshift Cluster tables (points at S3) without loading data in Redshift tables
  • Pushes many compute intensive tasks such as predicate filtering (ability to skip reading unnecessary data at storage level from a data set into RAM) and aggregation, down to the Redshift Spectrum layer
  • Redshift Spectrum queries use much less of the formal cluster's processing capacity than other queries
AWS RDS:
  • Autoscaling when running out of storage
  • OLTP based
  • Must be provisioned
  • Max read replicas: 5
  • Read replicas are not equal to a DB
  • Read replicas cross region/AZ incur $
  • IAM Auth
  • Integrates with Secrets Manager
  • Supports MySQL, MariaDB, Postgres, Oracle, Aurora
  • Fully customized => MS SQL Server or RDS Custom for Oracle => can use SSH or SSM Session Manager; full admin access to OS/DB
  • At rest encryption via KMS
  • Use SSL for data in transit to ensure secure access
  • Use permission boundary to control the maximum permissions employees can grant to the IAM principals (eg: to avoid dropped/deleted tables)
  • Multi-AZ:
    • Can be set at creation or live
    • Synchronous replication, at least 2 AZs in region, while Read replicas => asynchronous replication can be in an AZ, cross-AZ or cross Region
Aurora:
  • MySQL or Postgres
  • OLTP based
  • Better performance than RDS version
  • Lower price
  • At rest encryption via KMS
  • 2 copies in each AZ, with a minimum of 3 AZ => 6 copies
  • max read replica: 15 (autoscales)
  • Shareable snapshots with other accounts
  • Replicas: MySQL, Postgres, or Aurora
  • Replicas can autoscale
  • Cross region replication (< 1 second) support available
    • Aurora Global: multi region (up to 5)
    • Aurora Cloning: copy of production (faster than a snapshot)
  • Aurora multimaster (for write failover/high write availability)
  • Aurora serverless for cost effective option (pay per second) for infrequent, intermittent or unpredictable workloads
  • Non-serverless option must be provisioned
  • Automated backups
  • Automated failover with Aurora replicas
    • Fail over tiers: lowest ranking number first, then greatest size
  • Aurora ML: ML using SageMaker and Comprehend on Aurora
DynamoDB:
  • (Serverless) NoSQL Key-value and document DB that delivers single-digit millisecond performance at any scale. It's a fully managed, multi-region, multi-master, durable DB with built-in security, backup and restore, and in-memory caching for internet scale applications
  • Stored on SSD
  • Good candidate to store ML model served by application(s)
  • Stored across 3 geographically distinct data centers
  • Eventually consistent reads (default) or strongly consistent reads (1 sec or less)
  • Session storage alternative (TTL)
  • IAM for security, authorization, and administration
  • Primary key possibilities could involve creation time
  • On-Demand (pay per request pricing) => $$$
  • Provisioned Mode (default) is less expensive where you pay for provisioned RCU/WCU
  • Backup: optionally lasts 35 days and can be used to recreate the table
  • Standard and IA Table Classes are available
  • Max size of an item in DynamoDB Table: 400KB
  • Can be exported to S3 as DynamoDB JSON or ion format
  • Can be imported from S3 as CSV, DynamoDB JSON or ion format
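A minimal boto3 sketch of the "store model output in DynamoDB, serve it to an application" pattern noted above; the table name, key schema, and values are made up for illustration:

    import boto3

    # Hypothetical table holding per-user predictions served by an application
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("ml-predictions")          # assumed table name

    table.put_item(Item={
        "user_id": "u-123",                           # assumed partition key
        "created_at": "2024-01-01T00:00:00Z",         # assumed sort key
        "prediction": "0.87",                         # stored as a string to avoid float issues
    })

    item = table.get_item(Key={"user_id": "u-123", "created_at": "2024-01-01T00:00:00Z"})
    print(item.get("Item"))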
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
  • Service to search any field, even partial matches at petabyte scale
  • Common to use as a complement to another DB (conduct search in the service, but retrieve data based on indices from an actual DB)
  • Requires a cluster of instances (can also be Multi-AZ)
  • Doesn't support SQL (own query language)
  • Comes with Opensearch dashboards (visualization)
  • Built-in integrations: Kinesis Firehose, AWS IoT, λ, CloudWatch Logs for data ingest
  • Security through Cognito and IAM, KMS encryption, SSL and VPC
  • Can help efficiently store and analyze logs (amongst cluster) for uses such as Clickstream Analytics
  • Patterns:
sequenceDiagram
   participant Kinesis data streams
   participant Kinesis data firehose (near real time)
   participant OpenSearch
   Kinesis data streams->>Kinesis data firehose (near real time): 
   Kinesis data firehose (near real time)->>OpenSearch: 
   Kinesis data firehose (near real time)->>Kinesis data firehose (near real time): data transformation via λ
sequenceDiagram
    participant Kinesis data streams
    participant λ (real time)
    participant OpenSearch
    Kinesis data streams->>λ (real time): 
    λ (real time)->>OpenSearch: 
Amazon ElastiCache
  • Good to improve latency and throughput for read heavy applications or compute intensive workloads
  • Good for storing sessions of instances
  • Good for performance improvement of DB(s), though using it involves heavy application code changes
  • Must provision EC2 instance type(s)
  • IAM auth not supported
  • Redis versus Memcached:
    • Redis:
      • backup and restore features
      • read replicas to scale reads/HA
      • data durability using AOF persistence
      • multi AZ with failover
      • Redis sorted sets are good for leaderboards
      • Redis Auth tokens enable Redis to require a token (password) before allowing clients to execute commands, thus improving security (SSL/Inflight encryption)
      • Fast in-memory data store providing sub-millisecond latency, HIPAA compliant, replication, HA, and cluster sharding
    • Memcached:
      • Multinode partitioning of data (sharding)
      • No replication (HA)
      • Non persistence
      • No backup/restore
      • Multithreaded
      • Supports SASL auth
AWS DB Migration Service (AWS DMS):
  • Service to transition (no transformations) supported sources to relational DBs, data warehouses, streaming platforms, and other data stores in AWS without new code (or any?)
  • Sources:
    • On-premises and EC2 DBs: Oracle, MS SQL Server, MySQL, MariaDB, postgres, mongoDB, SAP, DB2
    • Azure: Azure SQL DB
    • Amazon RDS: all including Aurora
    • S3
  • Targets
    • On-premises and EC2 DBs: Oracle, MS SQL Server, MySQL, MariaDB, postgres, SAP
    • Amazon RDS: all including Aurora
    • Amazon Redshift
    • DynamoDB
    • S3
    • Amazon OpenSearch Service (formerly Elasticsearch Service)
    • Kinesis Data Streams
    • DocumentDB
  • Homogeneous migration: Oracle => Oracle
  • Heterogeneous: Oracle => Aurora
  • EC2 server runs replication software, as well as continuous data replication using Change Data Capture (CDC) [for new deltas] and DMS
  • Can pre-create target tables manually or use AWS Schema Conversion Tool (SCT) [runs on the same server] to create some/all of the target tables, indices, views, etc. (only necessary for heterogeneous case)
Data Lake
  • Offers centralized architecture within S3
  • Decouples storage (S3) from compute resources
  • As with S3 in general, any format is permitted, but typically: CSV, JSON, Parquet, ORC, Avro, and Protobuf
S3
Buckets:
  • Service to allow objects/files within a virtual "directory"
  • Bucket names must be globally unique
  • Buckets exist within AWS regions
  • Not a file system, and if a file system is needed, EBS/EFS/FSx should be considered
  • Not mountable like an NFS
  • Supports any file format
  • Name formalities:
    • Must not start with the prefix 'xn--'
    • Must not end with the suffix '-s3alias'
    • Not an IP address
    • Must start with a lowercase letter or number
    • Between 3-63 characters long
    • No uppercase
    • No underscores
Objects/Files
  • Each has a key: its full path within the S3 bucket, including the object/file name, separated by forward slashes ("/")
  • Each has a value: its content
  • Note there is no such thing as a true directory within S3, but the convention effectively serves as a namespace
  • Compression is good for cost savings concerning persistence
  • Max size is 5 TB
  • Multi-Part upload is recommended for uploads > 100 MB and required for uploads > 5 GB
  • S3 Transfer Acceleration also can be utilized to increase transfer rates (upload and download) by going through an AWS edge location that passes the object to the target S3 bucket (can work with Multi-Part upload)
  • Strong consistency model to reflect latest version/value upon write/delete to read actions
  • Version ID if versioning enabled at the bucket level
  • Metadata (list of key/val pairs)
  • Tags (Unicode key/val pair <= 10) handy for lifecycle/security
  • Endpoint offers HTTP (non encrypted) and HTTPS (encryption in flight via SSL/TLS)
Security (an IAM principal can access if either of the policy types below allows it and there is no explicit Deny):
  • Types
    • User Based: governed by IAM policies (eg: which user, within a given AWS account, via IAM should be allowed to access resources)
    • Resource Based:
      • Bucket Policies (JSON based statements)
        • Governing such things as:
          • (Blocking) public access [setting was created to prevent company data leaks, and can be set at the account level to ensure of inheritance]
          • Forced encryption at upload (necessitates encryption headers). Can alternatively be done by "default encryption" via S3, though Bucket Policies are evaluated first
          • Cross account access
        • Bucket policy statement attributes
          • SID: statement id
          • Resources: per S3, either buckets or objects
          • Effect: Allow or Deny
          • Actions: The set of API actions to apply the effect to
          • Principal: User/Account the policy applies to
      • Object Access Control List (ACL) - finer control of individual objects (eg: block public access)
      • Bucket Access Control List (ACL) - control at the bucket level (eg: block public access)
  • S3 object(s) are owned by the AWS account that uploaded them, not the bucket owner
  • Settings to block public access to bucket(s)/object(s) can be set at the account level
  • S3 is accessible to other AWS resources via:
    • VPC endpoint (private connection)
      • bucket policy tied to AWS:SourceVpce (for one endpoint)
      • bucket policy tied to AWS:SourceVpc (for all possible endpoint(s))
    • Public internet via an IGW => public IP tied to a bucket policy using AWS:SourceIP
  • S3 Access Logs can be stored to another S3 bucket (not the same to prevent infinite looping)
  • API calls can be sent to AWS CloudTrail
  • MFA Delete of object(s) within only versioned buckets to prevent accidental permanent deletions [only enabled/disabled by bucket owner via CLI]
S3 Storage Classes
| | Std | Intelligent-Tiering | Std-IA | One Zone-IA | Glacier Instant Retrieval | Glacier Flexible Retrieval | Glacier Deep Archive |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Durability | 99.999999999% | 99.999999999% | 99.999999999% | 99.999999999% | 99.999999999% | 99.999999999% | 99.999999999% |
| Availability | 99.99% | 99.9% | 99.9% | 99.5% | 99.9% | 99.99% | 99.99% |
| Availability SLA | 99.99% | 99% | 99% | 99% | 99% | 99.99% | 99.99% |
| AZs | >= 3 | >= 3 | >= 3 | 1 | >= 3 | >= 3 | >= 3 |
| Min Storage Duration Charge | None | None | 30 days | 30 days | 90 days | 90 days | 180 days |
| Min Billable Obj Size | None | None | 128 KB | 128 KB | 128 KB | 40 KB | 40 KB |
| Retrieval Fee | None | None | Per GB | Per GB | Per GB | Per GB | Per GB |
| Storage Cost (GB per month) | .023 | .0025 - .023 | .0125 | .01 | .004 | .0036 | .00099 |
| Retrieval Cost (per 1000 requests) | GET: .0004, POST: .005 | GET: .0004, POST: .005 | GET: .001, POST: .01 | GET: .001, POST: .01 | GET: .01, POST: .02 | GET: .0004, POST: .03; Expedited: $10, Std: .05, Bulk: free | GET: .0004, POST: .05; Std: .10, Bulk: .025 |
| Retrieval Time | immediate | immediate | immediate | immediate | immediate (milliseconds) | Expedited (1-5 mins), Std (3-5 hrs), Bulk (5-12 hrs) | Std (12 hrs), Bulk (48 hrs) |
| Monitoring Cost (per 1000 objects) | 0 | .0025 | 0 | 0 | 0 | 0 | 0 |

*Note: us-east-1 pricing for the sake of example; entire table subject to change by AWS
  • Durability: how unlikely a file is to be lost
  • Availability: how readily S3 buckets/files are available
S3 Standard:
  • Used for data that is frequently accessed
  • Provides high throughput and low latency
  • Good for mobile and gaming applications, pseudo CDN, big data/analytics
S3 Standard Infrequent Access
  • Good for data less frequently accessed that still needs immediate access
  • Cheaper than Standard
  • Good for Disaster Recovery and/or backups
S3 Intelligent-Tiering
  • Modest fee for monthly monitoring and auto-tiering
  • Moves objects between tiers based on usage
  • Access tiers include:
    • Frequent (automatic) default
    • Infrequent (automatic): not accessed for 30 days
    • Archive Instant (automatic): not accessed for 90 days
    • Archive (optional): configurable from 90 days up to 730 days
    • Deep Archive (optional): configurable from 180 days up to 730 days
S3 One Zone-IA
  • Data lost when AZ is lost/destroyed
  • Good for recreatable data or secondary backup copies of on-prem data
S3 Glacier
  • Never set up a transition to Glacier classes if access might need to be rapid
  • Good for archiving/backup
  • Glacier Instant Retrieval is a good option for accessing data once a quarter
  • Harness Glacier Vault Lock (WORM) to no longer allow future edits, which is great for compliance and data retention
  • Glacier or Deep Archive are good for infrequently accessed objects that don't need immediate access
S3 Lifecycle Transitions (can also be conducted manually via AWS Console)
graph LR
    A[Std] --> B[Std-IA]
    A --> C[Intelligent-Tiering]
    A --> D[One Zone-IA]
    A --> E[Glacier]
    A --> F[Glacier Deep Archive]
    B --> C
    B --> D
    B --> E
    B --> F
    C --> D
    C --> E
    C --> F
    D --> E
    D --> F
    E --> F
S3 Lifecycle Rules
  • Transition Actions: rules for when to transition objects between S3 classes (see S3 storage classes above)
  • Expiration Actions: rules for when to delete an object after some period of time
    • Good for deleting log files, deleting old versions of files (if versioning enabled), or incomplete multi-part uploads
  • Rules can be created for object prefixes (addresses) or associated object tags
S3 Data Partitioning
  • Harnesses distinct key [path] prefixes to speed up queries (eg: Athena)
  • Typical scenarios are:
    • time/date (eg: s3://bucket/datasetname/year/month/day/....)
    • product (eg: s3://bucket/datasetname/productid/...)
  • Partitioning handled by tools such as Kinesis, Glue, etc.
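A minimal boto3 sketch of the time/date partitioning convention above, using Hive-style key prefixes (year=/month=/day=) that Athena and Glue can prune; the bucket and dataset names are made up:

    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")
    now = datetime.now(timezone.utc)

    # Date-partitioned key so queries only scan the partitions they need
    key = f"datasetname/year={now.year}/month={now.month:02d}/day={now.day:02d}/events.json"
    s3.put_object(Bucket="my-data-lake-bucket", Key=key, Body=b'{"example": true}')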
Encryption
  • SSE-S3:
    • Encryption (keys) managed by AWS (S3)
    • Encryption type of AES-256
    • Encrypted server side via HTTP/S and Header containing "x-amz-server-side-encryption":"AES256"
    • Enabled by default for new buckets and objects
  • SSE-KMS
    • Encryption (KMS Customer Master Key [CMK]) managed by AWS KMS
    • Encrypted server side via HTTP/S and Header containing "x-amz-server-side-encryption":"aws:kms"
    • Offers further user control and audit trail via cloudtrail
    • May be impacted by KMS limits, though you can increase them via Service Quotas Console
      • Upload calls the GenerateDataKey KMS API (counts towards KMS quota 5500, 10000, or 30000 req/s based upon region)
      • Download calls the Decrypt KMS API (also counts towards KMS quota)
  • SSE-C:
    • Server side encryption via HTTPS only, using a customer-provided key that is fully managed outside of AWS and must be supplied in the HTTP headers of every request (the key isn't saved by AWS)
    • Objects encrypted with SSE-C are never replicated between S3 Buckets
  • Client Side Encryption:
    • Utilizes a client library such as Amazon S3 Encryption Client
    • Encrypted prior to sending to S3 and must be decrypted by clients when retrieving from S3 conducted over HTTP/S
    • Utilizes a customer-provided key fully managed outside of AWS
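A minimal boto3 sketch of requesting SSE-KMS on upload, which is what sets the "x-amz-server-side-encryption":"aws:kms" header mentioned above; the key alias is a placeholder and the bucket name reuses the one from the policy examples below:

    import boto3

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="random-bucket-o-stuff",
        Key="reports/2024/report.csv",
        Body=b"col1,col2\n1,2\n",
        ServerSideEncryption="aws:kms",        # S3 encrypts server side with a KMS key
        SSEKMSKeyId="alias/my-app-key",        # assumed KMS key alias
    )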
Encryption in transit (SSL/TLS) vs none
  • HTTP Endpoint - non-encrypted
  • HTTPS Endpoint - encrypted
  • To force encryption in transit, a bucket policy is in order; the following denies non-HTTPS GetObject requests
    • { "Version": "2012-10-17", "Statement": [ { "Effect": "Deny", "Principal": "*", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::random-bucket-o-stuff/*", "Condition": { "Bool": { "aws:SecureTransport": "false" } } } ] }
  • To force SSE-KMS encryption
    • { "Version": "2012-10-17", "Statement": [ { "Effect": "Deny", "Principal": "*", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::random-bucket-o-stuff/*", "Condition": { "StringNotEquals": { "s3:x-amz-server-side-encryption": "aws:kms" } } } ] }
  • To force SSE-C encryption
    • { "Version": "2012-10-17", "Statement": [ { "Effect": "Deny", "Principal": "*", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::random-bucket-o-stuff/*", "Condition": { "Null": { "s3:x-amz-server-side-encryption-customer-algorithm": "true" } } } ] }

EFS:
  • Linux based only
  • Can mount on many EC2(s)
  • Use SG(s) to control access
  • Connected via ENI
  • 10GB+ throughput
  • Performance mode (set at creation time):
    • General purpose (default); latency-sensitive; use cases (web server, CMS);
    • Max I/O-higher latency, throughput, highly parallel (big data, media processing)
  • Throughput mode:
    • Bursting (1 TB = 50 MiB/s and burst of up to 100 MiB/s)
    • Provisioned-set your throughput regardless of storage size (eg 1 GiB/s for 1 TB storage)
  • Storage Classes, Storage Tiers (lifecycle management=>move file after N days):
    • Standard: for frequently accessed files
    • Infrequent access (EFS-IA): cost to retrieve files, lower price to store
  • Availability and durability:
    • Standard: multi-AZ, great for production
    • One Zone: great for development, backup enabled by default, compatible with IA (EFS One Zone-IA)
EBS:
  • Volumes exist on EBS => virtual hard disk
  • Snapshots exist on S3 (point in time copy of disk)
  • Snapshots are incremental - only the blocks that have changed since the last snapshot are moved to S3
  • First snapshot might take more time
  • Best to stop root EBS device to take snapshots, though you don't have to
  • Provisioned IOPS (PIOPS [io1/io2])=> DB workloads/multi-attach
  • Multi-attach (EC2 =>rd/wr)=>attach the same EBS to multiple EC2 in the same AZ; up to 16 (all in the same AZ)
  • Can change volume size and storage type on the fly
  • Always in the same region as EC2
  • To move EC2 volume=>snapshot=>AMI=>copy to destination Region/AZ=>launch AMI
  • EBS snapshot archive (up to 75% cheaper to store, though 24-72 hours to restore)
AWS FSx:
  • Launch 3rd party high performance file system(s) on AWS
  • Can be accessed via FSx File Gateway for on-premises needs via VPN and/or Direct Connect
  • Fully managed
  • Accessible via ENI within Multi-AZ
  • Types include:
    • FSx for Windows FileServer
    • FSx for Lustre
    • FSx for NetApp ONTAP (NFS, SMB, iSCSI protocols); offering:
      • Works with most OSs
      • ONTAP or NAS
      • Storage shrinks or grows
      • Compression, dedupe, snapshot replication
      • Point in time cloning
    • FSx for OpenZFS; offering:
      • Works with most OSs
      • Snapshots, compression
      • Point in time cloning
Amazon FSx for Windows:
  • Fully managed Windows file system share drive
  • Supports SMB and Windows NTFS
  • Microsoft Active Directory integration, ACLs, user quotas
  • Can be mounted on Linux EC2 instances
  • Scale up to 10s of GB/s, millions of IOPS, 100s of PB of data
  • Storage Options:
    • SSD - latency sensitive workloads (DB, data analytics)
    • HDD - broad spectrum of workloads (home directories, CMS)
  • On-premises accessible (VPN and/or Direct Connect)
  • Can be configured to be Multi-AZ
  • Data is backed up daily to S3
  • Amazon FSx File Gateway allows native access to FSx for Windows from on-premises, local cache for frequently accessed data via Gateway
Amazon FSx for Lustre ("Linux" "Cluster"):
  • High performance, parallel, distributed file system designed for applications that require fast storage to keep up with your compute, such as ML, high performance computing, video processing, Electronic Design Automation, or financial modeling
  • Integrates with linked S3 bucket(s), making it easy to process S3 objects as files and allows you to write changed data back to S3
  • Provides ability to both process 'hot data' in parallel/distributed fashion as well as easily store 'cold data' to S3
  • Storage options include SSD or HDD
  • Can be used from on-premises servers (VPN and/or Direct Connect)
  • Scratch File System can be used for temporary or burst storage use
  • Persistent File System can be used for longer-term storage / data replicated within the same AZ
AWS Datasync:
  • A schedulable online data movement and discovery service that simplifies and accelerates data migration to AWS or moving data between on-premises storage, edge locations, other clouds and AWS storage (AWS to AWS, too)
  • A deployed AWS DataSync Agent VM conveys data from internal storage (via NFS, SMB, or HDFS protocols) to the DataSync service over the internet or AWS Direct Connect into AWS. The agent is unnecessary for AWS-to-AWS transfers
  • Directly within AWS => S3 / EFS / FSx for Windows File Server / FSx for Lustre / FSx for OpenZFS / FSx for NetApp ONTAP
  • File permissions and metadata are preserved
  • Transfers are encrypted and data validation is conducted

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Streaming:
  • Good scenarios include where timing is important such as Fraud Detection or IoT Streaming Sensors gathering readings (eg: weather)
  • A lot more technical to develop/maintain
Batch Load:
  • If there is an acceptable latency, run the batch load job(s) every n seconds/minutes/hours/days/weeks/etc.

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads)

Example Full Data Engineering Analytics pipeline
graph LR
   A[S3]-->B[AWS Glue Data Catalog]
   B---|Schema|C[Athena]
   A-->E["Redshift /(Redshift Spectrum)"]
   A-->D["EMR (Hadoop/Spark/Hive)"]
   E-->C
   E-->F[QuickSight]
   C-->F
AWS Data Pipeline (DP)
  • Data sources can be on-prem or AWS
  • Destinations: S3, RDS, DynamoDB, Redshift, EMR
  • Conducted with EC2 or EMR instances managed by DP
  • Manages task dependencies
  • Retries and notifies upon failure(s)
  • HA
AWS DP vs. Glue:
  • Glue:
    • focused on ETL
    • resources all managed by AWS
    • Data Catalog is there to make the data available to Athena or Redshift Spectrum
    • Runs on a serverless Apache Spark platform under the hood
  • DP:
    • Move data from one location to another
    • More control over environment, compute resources that run code and the code itself
    • EC2 or EMR instance based
Amazon Kinesis:
  • Platform to send streaming data (eg: IoT, metrics and logs), making it easy to load and analyze, as well as build your own custom applications for your business needs
  • Any mention of "streaming (system[s])" and/or "real time" (big) data is important; Kinesis is likely the best fit, as it makes it easy to collect, process, and analyze real-time streaming data and react quickly to incoming information
  • Output can be classic or enhanced fan-out consumers
  • Accessed via VPC
  • IAM access => Identity-based (used by users and/or groups)
  • Types:
    • Kinesis Data Streams
    • Kinesis Data Firehose
    • Kinesis Analytics
    • Kinesis Video Streams
Amazon Kinesis Data Streams:
  • Service to provide low latency, real-time streaming ingestion
  • On-demand capacity mode
    • 4 MB/s input, ??? output?
    • Scales automatically to accommodate up to double its previous peak write throughput observed in the last 30 days
    • Pay per stream per hour and data/in/out per GB
  • Provisioned mode (if throughput exceeded exception => add shard[s] manually or programmatically)
    • Streams are divided into ordered shards
    • 1 MB/s or 1k messages input per shard, else 'ProvisionedThroughputExceededException'
    • 2 MB/s output per shard
    • Pay per shard per hour
  • Can have up to 5 parallel consumers (5 consuming api calls per second [per shard])
  • Synchronously replicates streaming data across 3 AZs in a single Region and stores it from 24 hours up to 365 days in shard(s), to be consumed/processed/replayed by another service and stored elsewhere
  • Use fan-out if lag is encountered by stream consumers (~200ms vs ~70ms latency)
  • Requires code (producer/consumer)
  • Shards can be split or merged
  • 1 MB message size limit
  • TLS in flight or KMS at-rest encryption
  • Can't subscribe to SNS
  • Can't write directly to S3
  • Can output to:
    • Kinesis Data Firehose
    • Kinesis Data Analytics
    • Containers
    • λ
    • AWS Glue
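A minimal boto3 producer sketch for Kinesis Data Streams (the stream name and payload are made up); the partition key determines the shard, and records sharing a key keep their order:

    import json
    import boto3

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName="sensor-stream",                                    # assumed stream name
        Data=json.dumps({"sensor_id": "s1", "temp_c": 21.4}).encode("utf-8"),
        PartitionKey="s1",                                             # routes this sensor to one shard
    )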
Amazon Kinesis Data Analytics:
  • Fully Managed (serverless; scales automatically)
  • Perform real-time analytics on stream via SQL
  • Can utilize λ for preprocessing (near real-time)
  • Input stream can be joined with a ref table in S3
  • Output results include streams/errors
  • Can use either Kinesis Data Streams or Kinesis Data Firehose to analyze data in kinesis
  • Pay only for resources used, though that can end up not being cheap
  • Schema discovery
  • IAM permissions to access input(s)/output(s)
  • For SQL Applications: Input/Output: Kinesis Data Streams or Kinesis Data Firehose to analyze data
  • For Apache Flink (on a cluster):
    • Input: Kinesis Data Stream or Amazon MSK
    • Output: Sink (S3/Kinesis Data Firehose)
  • Use cases:
    • Streaming ETL (simple selections/translations)
    • Continuous metric generation (eg: leaderboard)
    • Responsive analytics to generate alerts when certain metrics encountered
  • ML use cases:
    • Random Cut Forest:
      • SQL function for anomaly detection on numeric columns in a stream
      • uses only recent history to generate model
    • HOTSPOTS:
      • locate and return info about relatively dense regions of data
      • uses more than only recent history
Amazon Kinesis Data Firehose:
  • Fully Managed (serverless) service, no administration, automatic scaling
  • Allows for custom code to be written for producer/consumer
  • Can use λ to filter/transform data prior to output (if filtering/transforming with a λ and landing in S3, Firehose is a better fit than Kinesis Data Streams)
  • Near real time: 60 seconds latency minimum for non-full batches
  • Minimum 1 MB of data at a time
  • Pay only for the data going through
  • Can subscribe to SNS
  • No data persistence and must be immediately consumed/processed
  • Sent to (S3 as a backup [of source records] or failed [transformations or delivery] case[s]):
    • S3
    • Amazon Redshift (copy through S3)
    • Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
    • 3rd party partners (datadog/splunk/etc.)
    • Custom destination (http[s] endpoint)
  • S3 Destination(s) (Error and/or output) allow for bucket prefixes:
    • output/year=!{timestamp:yyyy}/month=!{timestamp:MM}/
    • error/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/
  • Data Conversion from csv/json to Parquet/ORC using AWS Glue (only for S3)
  • Data Transformation through λ (eg: csv=>json)
  • Supports compression if target is S3 (GZIP/ZIP/SNAPPY)
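A minimal boto3 sketch of putting a record on a Firehose delivery stream (the stream name and payload are made up); Firehose buffers records and delivers them to whatever destination the stream is configured with (eg: S3):

    import json
    import boto3

    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName="clickstream-to-s3",        # assumed delivery stream name
        Record={"Data": (json.dumps({"page": "/home", "user": "u-123"}) + "\n").encode("utf-8")},
    )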
Amazon Kinesis Video Streams
  • Producers
    • used to capture, process, and store video streams in real-time, such as smartphone/security/body camera(s), AWS DeepLens, audio feeds, images, RADAR data, RTSP cameras
    • One producer per video stream
    • Video playback capability
  • Consumers
    • custom-built consumers (MXNet, Tensorflow, etc.)
      • These may pass the data to a DB (checkpointing the stream per processing status), decode the input frames and pass them to SageMaker, or even push inference results to Kinesis Streams => λ for downstream notifications
    • EC2 instances
    • AWS SageMaker
    • Amazon Rekognition Video
  • Retention between 1 hour and 10 years
MQTT
  • An IoT standard messaging protocol
  • Sensor data transferred to your model
  • The AWS IoT Device SDK can connect via MQTT
Job scheduling (TBD)

Identify and implement a data transformation solution.

  • Handle ML-specific data using map reduce (Hadoop, Spark, Hive)
  • Transforming data transit (ETL: Glue, EMR, AWS Batch)

AWS Step Functions:

  • A visual workflow service that helps developers use AWS services with λ to build distributed applications, automate processes, orchestrate micro services, or create data (ML) pipelines
  • JSON used to declare state machines under the hood
  • Advanced Error Handling and retry mechanism outside the code
  • Audit history of workflows is available
  • Able to "wait" for any length of time, though the max execustion time of a state machine is 1 year
  • Great for orchestrating and tracking and ordered flow of resources

EMR:

  • Service to create managed Hadoop framework clusters (Big Data) to analyze/process lots of data using (many) instances
  • Supports Apache Spark, HBase, Presto, Flink, Hive, etc.
  • Takes care of provisioning and configuration
  • Autoscaling and integrated with Spot Instances for cost savings
  • Use cases: Data processing, ML, Web Indexing, BigData
  • AWS Integration
    • Amazon EC2 for the instances that comprise the nodes in the cluster
    • Amazon VPC to configure the virtual network in which you launch your instances
    • Amazon S3 to store input and output data or HDFS (default)
    • Amazon CloudWatch to monitor cluster performance and configure alarms
    • AWS IAM to configure permissions
    • AWS CloudTrail to audit requests made to the service
    • AWS Data Pipeline to schedule and start your clusters
  • Node types:
    • Master Node:
      • single EC2 instance to manage the cluster
      • coordinates distribution of data and tasks
      • manages health-long running process
    • Core Node:
      • Hosts HDFS data, runs tasks, and stores data - long running process
      • can spin up/down as needed
    • Task Node (optional):
      • only to run tasks-usually Spot Instances are a best option
      • no hosted data, so no risk of data loss upon removal
      • can spin up/down as needed
  • Can have long-running cluster or transient (temporary) cluster
  • EMR Notebook
    • Similar concept to Zeppelin, with more AWS integration
    • Notebooks backed up to S3 only (not within the cluster)
    • Provision clusters from the notebook!
    • Able to use multiple Notebooks to share the multi-tenant EMR clusters
    • Hosted inside a VPC
    • Accessed only via AWS console
    • build Apache Spark apps and run queries against the cluster (Python, PySpark, Spark SQL, SparkR, Scala, and/or Anaconda-based open source graphical libraries)
  • Purchasing options:
    • On-demand: reliable, predictable, won't be terminated, good for long running cluster(s) [though you need to manually delete]
    • Reserved: cost savings (EMR will use if available), good for long running cluster(s) [though you need to manually delete]
    • Spot instances:
      • cheaper, can be terminated, less reliable
      • Good choice for task nodes (temporary capacity)
      • Only use on core & master if you're testing or very cost-sensitive; you're risking partial data loss
  • At installation of the cluster you need to select frameworks and applications to install
  • Connect to the master node through an EC2 instance and run jobs from the terminal or via ordered steps submitted via the console
  • Instance Type(s) selection
    • Master node:
      • m4.large if < 50 nodes, m4.xlarge if > 50 nodes
    • Core & task nodes:
      • m4.large is usually good
      • If cluster waits a lot on external dependencies (i.e. a web crawler), t2.medium
      • Improved performance: m4.xlarge
      • Computation-intensive applications: high CPU instances
      • Database, memory-caching applications: high memory instances
      • Network / CPU-intensive (NLP, ML) - cluster compute instances
      • Accelerated Computing / AI - GPU instances (g3, g4, p2, p3)
  • Storage
    • HDFS (distributed scalable file system for Hadoop)
      • distributes data that it stores across every instance in a cluster, as well as multiple copies of data on different instances to ensure no data is lost if instance(s) fail
      • each file stored as blocks
      • default block size is 128 MB
      • storage is ephemeral and is lost upon termination
      • performance benefit of processing data where stored to avoid latency
      • EBS serves as a backup for HDFS
    • EMRFS: access S3 as if it were HDFS
      • EMRFS Consistent View - Optional for S3 consistency
      • Uses DynamoDB to track consistency
    • Local file system
  • EMR promises
    • EMR charges by the hour
      • Plus EC2 charges
    • Provisions new nodes if a core node fails
    • Can add and remove tasks nodes on the fly
    • Can resize a running cluster's core nodes
  • Security
    • IAM policies: can be combined with tagging to control access on a cluster-by-cluster basis
    • Kerberos
    • SSH can use Kerberos or EC2 key pairs for client authentication
    • IAM roles:
      • Every cluster in EMR must have a service role and a role for EC2 instance profile(s). These roles, attached via policies, will provide permission(s) to interact with other AWS Services
      • If a cluster uses automatic scaling, an autoscaling role is necessary
      • Service linked roles can be used if service for EMR has lost ability to clean up EC2 resources
      • IAM roles can also be used for EMRFS requests to S3 to control user access to files within EMR based on users, groups, or location(s) within S3
    • Security configurations may be specified for Lake Formation (JSON)
    • Native integration with Apache Ranger to provide security for Hive data metastore and Hive instance(s) on EMR
      • For data security on Hadoop/Hive
  • How to use EMR
    • Within EMR, select Create studio instance, which is your environment for running workspaces/notebooks
    • Requires:
      • VPC access
      • 1-5 subnets
      • Security group(s)
      • Service role (IAM/IAM Identity Center)
      • S3 bucket
    • Within the studio instance, create workspaces.  The workspace will need to create/attach an EMR cluster
    • A notebook must select a kernel at initialization (relative to the technology stack one is using)
    • Good practice to delete your cluster when not in use so it isn't billed, though it's also good to have an automatic shutdown safeguard to avoid paying for idle clusters

AWS Glue:

  • Managed ETL service (fully serverless) used to prepare/transform data for analysis
    • upper limit of 5 minutes as it is serverless
    • Utilizes Python (PySpark) or Scala (Spark) scripts, but run on serverless Spark platform
    • Targets: S3, JDBC (RDS, Redshift), or in Glue Data Catalog
    • Jobs scheduled via Glue Scheduler
    • Jobs triggered by events=>Glue Triggers
    • Transformations:
      • Bundled Transformations
        • DropFields/DropNullFields
        • Filter records
        • Join data to make more interesting data
        • Map/Reduce
      • ML Transformations
        • FindMatches ML: identify duplicate or matching data, even when the records lack a common unique identifier, and no fields exactly match
        • K-Means
      • Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
      • Need an IAM role / credentials to access the TO/FROM data stores
  • Can be event driven (eg: λ triggered by S3 put object) to call Glue ETL
  • Glue Data Catalog:
    • Uses an AWS Glue Data Crawler scanning DBs/S3/data to write associated metadata utilized by Glue ETL, or data discovery on Athena, Redshift Spectrum or EMR
    • Can issue crawlers throughout a DP to be able to know what data is where in the flow
    • Metadata repo for all tables with versioned schemas and automated schema inference
  • Glue Crawlers go through your data to infer schemas and partitions (s3 based on organization [see S3 Data Partitioning])
    • formats supported: JSON, Parquet, CSV, relational stores
    • Crawlers work for: S3, Amazon Redshift, Amazon RDS
    • Can be schedule or On-Demand
    • Need an IAM role / credentials to access the data stores
  • Glue Job bookmarks prevent reprocessing old data
  • Glue DataBrew - clean/normalize data using pre-built transformations
  • Glue Studio - new GUI to create, run, and monitor ETL jobs in Glue
  • Glue Streaming ETL (built on Apache Spark Structured Streaming) - compatible with Kinesis Data Streams, Kafka, MSK
  • Glue Elastic Views:
    • Combine and replicate data across multiple data stores using SQL (View)
    • No custom code, Glue monitors for changes in the source data, serverless
    • Leverages a "virtual table" (materialized view)

AWS Batch:

  • Fully managed (serverless) batch processing at any scale using dynamically launched EC2 instances (spot or on-demand) managed by AWS for which you pay
  • Job with a start and an end (not continuous)
  • Can run 100,000s of computing batch jobs
  • You submit/schedule batch jobs and AWS Batch handles it
    • Can be scheduled using CloudWatch Events
    • Jobs can also be orchestrated using step functions
  • Provisions optimal amount/type of compute/memory based on volume and requirements
  • Batch jobs are defined as docker images and run on ECS
  • Helpful for cost optimization and focusing less on infrastructure
  • No time limit
  • Any run time packaged in docker image
  • Rely on EBS/instance store for disk space
  • Advantages over λ => λ has a time limit, limited runtimes, and limited disk space
  • Good for any compute-based job (must harness Docker); for any non-ETL based work, Batch is likely best (eg: periodically cleaning up S3 buckets)
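A minimal boto3 sketch of submitting a Batch job (the queue and job definition names are made up and assumed to already exist, with a Docker image behind the job definition):

    import boto3

    batch = boto3.client("batch")
    batch.submit_job(
        jobName="nightly-s3-cleanup",
        jobQueue="default-queue",             # assumed job queue
        jobDefinition="s3-cleanup:1",         # assumed job definition (containerized task)
    )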

Exploratory Data Analysis

Libraries to know at a high level:

  • Pandas:
    • used for slicing and mapping data (DataFrames, Series) and interoperates with numpy
    • DataFrame/Series are interchangeable with NumPy arrays, though the former are often converted to the latter to feed ML algorithms
  • Matplotlib (graphics might be good?)
    • boxplot (with whiskers)
    • histograms (binning: bins of results of similar measure)
  • Seaborn (graphics might be good?)
    • essentially Matplotlib extended
    • heatmap: demonstrates another dimension within the given plot axes
    • pairplot: good for attribute correlations
    • jointplot: scatterplot with histograms adjoining each axis
  • scikit-learn
    • toolkit for/to make ML models
    • X=>attributes
    • y=>labels
    • X and y are utilized in conjunction with the fit function to train the model(s)
    • predict function harnesses the model to output inferences based on input
    • good for preprocessing data (input data=>normal distribution)
      • to avoid unequal weightings, scale each column to be centered around its mean
  • Spark MLLib
    • Classification: logistic regression, naïve Bayes
    • Regression
    • Decision trees
    • Recommendation engine (ALS)
    • Clustering (K-Means)
    • LDA (topic modeling)
    • ML workflow utilities (pipelines, feature transformation, persistence)
    • SVD, PCA, statistics

Jupyter Notebooks

  • runs in the browser, communicating with a Python environment (eg: Anaconda) server

from sklearn import preprocessing

# StandardScaler rescales each feature to zero mean and unit variance
scaler = preprocessing.StandardScaler()

# input_data: a 2D array or DataFrame of numeric features
new_data = scaler.fit_transform(input_data)

General flow for analyzing data import at first glance (note consider merging this with the next topic):

  • import data
  • head()
  • Does the data have column names?
  • Are certain rows the wrong data type or NA values?
    • Can potentially drop such rows (*.dropna(inplace=True)), though this might introduce bias if the missing values aren't evenly distributed
  • describe() => are counts of all the columns equal?
  • If remapping the data, it is a good idea to check the mean/std of attributes from/to via describe()
  • To convert to a numpy array => *.values
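A minimal pandas sketch of that first-glance flow (the file and column names are made up):

    import pandas as pd

    df = pd.read_csv("data.csv")                # assumed local CSV

    print(df.head())                            # do columns have names? any odd rows?
    print(df.describe())                        # are the counts of all the columns equal?

    df = df.dropna()                            # or df.dropna(inplace=True); may introduce bias
    X = df.drop(columns=["label"]).values       # features as a NumPy array (assumed "label" column)
    y = df["label"].values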

Sanitize and prepare data for modeling

  • Identify and handle missing data, corrupt data, stop words, etc.
  • Formatting, normalizing, augmenting, and scaling data
  • Labeled data (recognizing when you have enough labeled data and identifying mitigation strategies [Data labeling tools (Mechanical Turk, manual labor)])

Imputing missing data

Mean Replacement
  • Replace missing values with the mean value from the rest of the column (single feature)
  • Fast & easy, won't affect mean or sample size of overall data set
  • Median may be a better choice than mean when outliers are present
  • But it's generally pretty terrible.
    • Only works on column level, misses correlations between features
    • Can't use on categorical features (imputing with the most frequent value can work in this case, though)
    • Not very accurate
Dropping
  • If not many rows contain missing Data
    • dropping those rows doesn't bias your data
    • you don't have a lot of time
    • maybe it's a reasonable thing to do?
  • But, it's never going to be the right answer for the "best" approach.
  • Almost anything is better. Can you substitute another similar field perhaps? (i.e., review summary vs. full text)
KNN: Find K "nearest" (most similar) rows and average their values
  • Assumes numerical data, not categorical
  • There are ways to handle categorical data (Hamming distance)
Deep Learning
  • Build a machine learning model to impute data for your machine learning model!
  • Works well for categorical data, though complicated.
Regression
  • Find linear or non-linear relationships between the missing feature and other features
  • Most advanced technique: MICE (Multiple Imputation by Chained Equations)
Get more data
  • What's better than imputing data? Getting more real data!
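A minimal scikit-learn sketch of two of the options above, mean replacement and KNN imputation, on a tiny made-up matrix:

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

    # Mean replacement: fill each gap with its column mean (strategy="median" if outliers are present)
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # KNN imputation: average the most similar rows instead of the whole column
    X_knn = KNNImputer(n_neighbors=2).fit_transform(X)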

Unbalanced data

  • Large discrepancy between "positive" and "negative" cases
    • i.e., fraud detection. Fraud is rare, and most rows will be not-fraud
    • Don't let the terminology confuse you; "positive" doesn't mean "good"
    • It means the thing you're testing for is what happened.
    • If your machine learning model is made to detect fraud, then fraud is the positive case.
  • Mainly a problem with neural networks 


Oversampling

  • Duplicate samples from the minority class
  • Can be done at random

Undersampling

  • Instead of creating more positive samples, remove negative ones
  • Throwing data away is usually not the right answer
    • Unless you are specifically trying to avoid "big data" scaling issues

SMOTE

  • Artificially generate new samples of the minority class using nearest neighbors
    • Run K-nearest-neighbors of each sample of the minority class
    • Create a new sample from the KNN result (mean of the neighbors)
  • Both generates new samples and undersamples majority class
  • Generally better than just oversampling
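A minimal sketch of oversampling with SMOTE, assuming the imbalanced-learn package is installed and using synthetic data:

    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = np.array([1] * 50 + [0] * 950)          # 5% positive (eg: fraud) class

    # Generate synthetic minority samples from nearest neighbors until classes are balanced
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(np.bincount(y_res))                   # both classes now have 950 samples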

Perform feature engineering

  • Identify and extract features from data sets, including from data sources such as text, speech,image, public datasets, etc.
  • Analyze/evaluate feature engineering concepts (binning, tokenization, outliers, synthetic features, 1 hot encoding, reducing dimensionality of data)

Analyze and visualize data for machine learning

  • Graphing (scatter plot, time series, histogram, box plot)
  • Interpreting descriptive statistics (correlation, summary statistics, p value)
  • Clustering (hierarchical, diagnosing, elbow plot, cluster size)

Amazon Athena:

  • Serverless ad-hoc query service enabling analysis and querying of data in S3 using standard SQL, while allowing more advanced queries (joins permitted)
  • Compress data for smaller retrieval
  • Use target files (> 128 MB) to minimize overhead and as a cost savings measure
  • $5.00 per TB scanned
  • Commonly used/integrated with Amazon Quicksight
  • Federated query allows SQL queries across relational, object, non-relational, and custom data sources (AWS or on-premises) using Data Source Connectors that run on λ, with results returned and stored in S3
  • Presto under the hood
  • supports: CSV, JSON, ORC, Parquet, Avro
  • able to query unstructured, semi-structured or structured data within the data lake
  • use cases
    • query web logs (CloudTrail, CloudFront, VPC, ELB)
    • query data prior to loading in DB
  • can integrate with Jupyter, Zeppelin, or RStudio notebooks
  • able to integrate with other visualization tools via ODBC/JDBC protocols
  • can harness Glue Data Catalog metadata for queries
  • Security:
    • Access control
    • IAM, ACLs, S3 bucket policies
    • AmazonAthenaFullAccess/AWSQuicksightAthenaAccess
    • Encrypt results at rest in S3 staging directory
      • Server-side encryption with S3-managed key (SSE-S3)
      • Server-side encryption with KMS key (SSE-KMS)
      • Client-side encryption with KMS key (CSE-KMS)
    • Cross-account access in S3 bucket policy possible
    • Transport Layer Security (TLS) encrypts in-transit (between Athena and S3)
  • anti-patterns:
    • Highly formatted reports / visualization=>That's what QuickSight is for
    • ETL=>Use Glue instead
Typical pipeline use case of Athena
graph LR
    A[S3] --> B[Glue]
    B --> C[Athena]
    C --> D[QuickSight]
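A minimal boto3 sketch of running an Athena query against a Glue Data Catalog table (the database, table, and results bucket are made up):

    import boto3

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString="SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page",   # assumed table
        QueryExecutionContext={"Database": "weblogs_db"},                         # assumed Glue database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},        # assumed results bucket
    )
    print(resp["QueryExecutionId"])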

Amazon Quicksight:

  • Serverless BI/analytics service with ML capabilities, used to build interactive visualizations (dashboards, graphs, charts, and reports) and perform ad-hoc analysis without paying for data integrations, leaving the data un-canned for exploration
  • Integrates with source both in and out of AWS (RDS)
  • In memory computation using Spice Engine
    • Data sets are imported into SPICE
      • Super-fast, Parallel, In-memory Calculation Engine
      • Uses columnar storage, in-memory, machine code generation
      • Accelerates interactive queries on large datasets
    • Each user gets 10GB of SPICE
    • Highly available / durable
    • Scales to hundreds of thousands of users
  • Column-Level security (CLS)
  • Can share analysis (if published) or the dashboard (read only) with users or groups
  • Available as an application anytime on any device (browsers [mobile])
  • Data Sources
    • Redshift
    • Aurora / RDS
    • Athena
    • EC2-hosted databases
    • Files (S3 or on-premises)
      • Excel
      • CSV, TSV
      • Common or extended log format
    • AWS IoT Analytics
    • Data preparation allows limited ETL
  • Quicksight Paginated Reports
    • Reports designed to be printed
    • May span many pages
    • Can be based on existing Quicksight dashboards
  • Q
    • Machine learning-powered
    • Answers business questions with NLP eg: "What are the top-selling items in Florida?"
    • Offered as an add-on for given regions
    • Personal training on how to use it is required
    • Must set up topics associated with datasets
    • Datasets and their fields must be NLP-friendly
    • How to handle dates must be defined
  • Security:
    • Multi-factor authentication on your account
    • VPC connectivity
      • Add QuickSight's IP address range to your database security groups
    • Row-level security
      • Column-level security too (CLS) - Enterprise edition only
    • Private VPC access (for on-prem access)
      • Elastic Network Interface, AWS DirectConnect
  • User Management
    • Users defined via IAM, or email signup
    • SAML-based single sign-on
    • Active Directory integration (Enterprise Edition)
    • MFA
  • Pricing
    • Annual subscription
      • Standard: $9 / user /month
      • Enterprise: $18 / user / month
      • Enterprise with Q: $28 / user / month
    • Extra SPICE capacity (beyond 10GB), otherwise more $
      • $0.25 (standard) $0.38 (enterprise) / GB / month
    • Month to month
      • Standard: $12 / user / month
      • Enterprise: $24 / user / month
      • Enterprise with Q: $34 / user / month
    • Additional charges for paginated reports, alerts & anomaly detection, Q capacity, readers, and reader session capacity.
    • Enterprise edition
      • Encryption at rest
      • Microsoft Active Directory integration
      • CLS
  • Use Cases:
    • Interactive ad-hoc exploration / visualization of data
    • Dashboards and KPI's
    • Analyze / visualize data from:
      • Logs in S3
      • On-premise databases
      • AWS (RDS, Redshift, Athena, S3)
      • SaaS applications, such as Salesforce
      • Any JDBC/ODBC data source
  • ML Insights feature (the only ML capabilities of QuickSight)
    • Anomaly detection (uses Random Cut Forest)
    • Forecasting, which removes anomalies to make forecasts (uses Random Cut Forest)
    • Autonarratives to build rich dashboards with embedded narratives
  • Anti-Patterns
    • Highly formatted canned reports
      • QuickSight is for ad hoc queries, analysis, and visualization
      • No longer true with paginated reports!
    • ETL
      • Use Glue instead, although QuickSight can do some transformations
  • Visual Types
    • AutoGraph - automatically selects chart based on input features to best display the data and relationships. Not 100% effective and might require intervention
    • Bar Charts
      • For comparison and distribution (histograms)
    • Line graphs
      • For changes/trends over time
      • [stacked] area line charts - allows visualization of different components added up to a change/trend
    • Scatter plots, heat maps
      • For correlation
    • Pie graphs, tree maps - hierarchical aggregation charts (eg: npm package map)
      • For aggregation
    • Pivot tables
      • For tabular data to aggregate in certain ways into other tables
      • applying statistical functions to (multi-dimensional) data
    • KPIs - chart detailing measurement(s) between current value(s) vs target(s)
    • Geospatial Charts (maps) - map with sized circles annotating certain amounts in certain areas
    • Donut Charts - when precision isn't important and few items in the dimension; show percentile/proportion of the total amount
    • Gauge Charts - compare values in a measure (eg: fuel left in a tank)
    • Word Clouds - word or phrase frequency within a corpus

Modeling

Frame business problems as machine learning problems.

  • Determine when to use/when not to use ML
  • Know the difference between supervised and unsupervised learning
  • Selecting from among classification, regression, forecasting, clustering, recommendation, etc.

Select the appropriate model(s) for a given machine learning problem

  • Xgboost, logistic regression, K-means, linear regression, decision trees, random forests, RNN, CNN, Ensemble, Transfer learning
  • Express intuition behind models

Train machine learning models

Train validation test split, cross-validation (TBD)

Optimizer, gradient descent, loss functions, local minima, convergence, batches, probability, etc.(TBD)

Loss Functions (aka Cost Function): seek to calculate/minimize the error (difference between actual and predicted value)

Compute choice (GPU vs. CPU, distributed vs. non-distributed, platform [Spark vs. non-Spark]) (TBD)

Model updates and retraining (TBD)

Batch vs. real-time/online (TBD)

Perform hyperparameter optimization.

  • Regularization
    • Drop out
    • L1 /L2
  • Cross validation
  • Model initialization
  • Neural network architecture (layers/nodes), learning rate, activation functions (see below as the notes need to be organized better w/ time)
  • Tree-based models (# of trees, # of levels)
  • Linear models (learning rate)

Activation functions: gating functions that determine how strongly a cell passes on its incoming value (eg: only above a threshold), introducing non-linearity; used within hidden/output layer cells in neural networks
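A minimal NumPy sketch of two common activation functions, ReLU and sigmoid, to make the idea concrete:

    import numpy as np

    def relu(x):
        # Passes positive values through unchanged and zeroes out the rest (non-linear)
        return np.maximum(0, x)

    def sigmoid(x):
        # Squashes any input into (0, 1); common on output layers for binary classification
        return 1 / (1 + np.exp(-x))

    print(relu(np.array([-2.0, 0.5, 3.0])))     # [0.  0.5 3. ]
    print(sigmoid(np.array([0.0])))             # [0.5]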

Evaluate machine learning models.

  • Avoid overfitting/underfitting (detect and handle bias and variance)

  • Metrics (AUC-ROC, RMSE)

  • Outliers

    • Variance measures how "spread-out" the data is.

      • Variance (σ^2) is simply the average of the squared differences from the mean
      • Example: What is the variance of the data set (1, 4, 5, 4, 8)?
        • First find the mean: (1+4+5+4+8)/5 = 4.4
        • Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
        • Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
        • Find the average of the squared differences:
          • σ^2= (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
    • Standard Deviation σ is just the square root of the variance.

      • σ^2 = 5.04
      • σ = (5.04)^.5 = 2.24
      • So the standard deviation of (1, 4, 5, 4, 8) is 2.24.
      • This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.
      • You can tell how extreme a data point is by asking about "how many sigmas" away from the mean it is?
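A minimal NumPy sketch reproducing the numbers above and flagging points more than one sigma from the mean:

    import numpy as np

    data = np.array([1, 4, 5, 4, 8])
    mean = data.mean()                          # 4.4
    variance = ((data - mean) ** 2).mean()      # 5.04 (population variance, as computed above)
    sigma = np.sqrt(variance)                   # ~2.24

    # Points more than one standard deviation from the mean are "unusual"
    outliers = data[np.abs(data - mean) > sigma]
    print(variance, sigma, outliers)            # 5.04 ~2.24 [1 8]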
    • Dealing with Outliers

      • Sometimes it's appropriate to remove outliers from your training data
      • Do this responsibly! Understand why you are doing this.
      • For example: in collaborative filtering a single user who rates thousands of movies could have a big effect on everyone else's ratings. That may not be desirable.
      • Another example: in web log data, outliers may represent bots or other agents that should be discarded.
      • But if someone really wants the mean income of US citizens for example, don't toss out billionaires just because you want to.
      • Our old friend standard deviation provides a principled way to classify outliers.
      • Find data points more than some multiple of a standard deviation in your training data.
      • What multiple? Use common sense.
      • Remember AWS's Random Cut Forest algorithm creeps into many of its services - it is made for outlier detection
        • Found within QuickSight, Kinesis Analytics, SageMaker, and more
    • Binning

      • Bucket observations together based on ranges of values.
      • Example: estimated ages of people
      • Put all 20-somethings in one classification, 30-somethings in another, etc.
      • Quantile binning categorizes data by their place in the data distribution
      • Ensures even sizes of bins
      • Transforms numeric data to ordinal data
      • Especially useful when there is uncertainty in the measurements
      • Helps to cover up imprecision in data collection(s)
    • Transforming

      • Applying some function to a feature to make it better suited for training
      • Feature data with an exponential trend may benefit from a logarithmic transform
      • Example: YouTube recommendations
        • A numeric feature x is also represented by x² and √x
        • This allows learning of super- and sub-linear functions
    • Encoding

      • Transforming data into some new representation required by the model
      • One-hot encoding
        • Create "buckets" for every category
        • The bucket for your category has a 1, all others have a 0
        • Very common in deep learning, where categories are represented by individual output "neurons"
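A minimal pandas sketch of one-hot encoding (the column and categories are made up):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # One "bucket" column per category; each row has a 1 in its category's column, 0 elsewhere
    one_hot = pd.get_dummies(df, columns=["color"])
    print(one_hot)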
    • Scaling / Normalization

      • Some models prefer feature data to be normally distributed around 0 (most neural nets)
      • Most models require feature data to at least be scaled to comparable values
      • Otherwise features with larger magnitudes will have more weight than they should
        • Example: modeling age and income as features - incomes will be much higher values than ages
        • scikit-learn's preprocessing module helps here (MinMaxScaler, StandardScaler, etc.)
      • Remember to scale your results back up
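A short sketch with the scikit-learn preprocessors mentioned above (the age/income values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and income live on very different scales
X = np.array([[25, 40_000], [35, 90_000], [50, 250_000]], dtype=float)

minmax = MinMaxScaler().fit(X)        # rescales each feature to [0, 1]
standard = StandardScaler().fit(X)    # centers each feature at 0 with unit variance

X_scaled = standard.transform(X)
print(X_scaled)

# "Scale your results back up": inverse_transform recovers the original units
print(standard.inverse_transform(X_scaled))
```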
    • Shuffling

      • Many algorithms benefit from shuffling their training data
      • Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
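A tiny shuffling example with scikit-learn (the arrays are placeholders):

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)   # pretend features, collected in time order
y = np.array([0, 0, 1, 1, 1])     # pretend labels

# Shuffle features and labels together so rows stay aligned
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)
print(X_shuffled, y_shuffled)
```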
  • Confusion Matrix:

    | Measure | Abbreviation | Formula |
    | --- | --- | --- |
    | Error Rate | ERR | (FP + FN)/(TP + TN + FN + FP) = (FP + FN)/(P + N) |
    | Accuracy | ACC | (TP + TN)/(TP + FP + TN + FN) |
    | Sensitivity, True positive rate, Recall | SN, TPR, REC | TP/(TP + FN) = TP/P |
    | Precision, Positive predictive value | PREC, PPV | TP/(TP + FP) |
    | Specificity, True negative rate | SP, TNR | TN/(TN + FP) = 1 - FPR |
    | False positive rate | FPR | FP/(FP + TN) = 1 - SP = 1 - TNR |
    | F1 Score (harmonic mean of precision and recall) | F1 | 2TP/(2TP + FN + FP) |
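A quick check of the precision/recall/F1 formulas in the table above, computed by hand and compared with scikit-learn (the counts are invented):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy confusion matrix: TP=8, FP=2, FN=4, TN=6
y_true = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6
y_pred = [1] * 8 + [1] * 2 + [0] * 4 + [0] * 6

tp, fp, fn, tn = 8, 2, 4, 6
precision = tp / (tp + fp)                 # 0.8
recall    = tp / (tp + fn)                 # ~0.667
f1        = 2 * tp / (2 * tp + fn + fp)    # harmonic mean of precision and recall

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```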
  • Offline and online model evaluation, A/B testing

  • Compare models using metrics (time to train a model, quality of model, engineering costs)

  • Cross validation (eg: from sklearn.model_selection import cross_val_score)

    • choose many (k-folds)=>train
    • choose remaining holdouts to validate against
    • average out the validation step results
    • good if lacking data
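A minimal k-fold cross validation sketch using the sklearn import mentioned above (the dataset and model are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds: train on 4, validate on the held-out fold, repeat, then average
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```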

Machine Learning Implementation and Operations

Build ML solutions for performance, availability, scalability, resiliency, and fault tolerance.

AWS environment logging and (error) monitoring

AWS CloudTrail:
  • Service that monitors and records account activity across AWS infrastructure (history of events/API calls)
  • Provides governance, compliance and audit for your AWS account:
  • Enabled by default
  • Trail can be applied to all regions (default) or a single region
CloudTrail Events:
  • Able to be separated into read/write events
  • Management events (default on)
  • Data events (default off due to volume, though can be turned on to trigger/invoke)
CloudTrail Insights:
  • Used to detect unusual activity in account (if enabled):
  • Inaccurate resource provisioning
  • Hitting service limits
  • Bursts of AWS IAM actions
  • Gaps in periodic maintenance
  • Analyzes normal management events to create a baseline, then continuously analyzes write events to detect unusual patterns (surfaced via S3/CloudTrail console/EventBridge events)
  • Cloudtrail Events are stored for 90 days, though can be sent to S3 and analyzed by Athena
Amazon EventBridge (aka Cloudwatch Events):
  • Service to provide connectivity between certain events and resultant services such as
    • CRON job triggering (via EventBridge) a λ
    • λ triggering (via EventBridge) SNS/SQS messages
    • S3 Event Notifications (via EventBridge) to trigger whatever service is required
    • Event Pattern: rules specified in AWS JSON rule configs react (eg: filter) to certain service action(s) (eg: check for external generated certs that are n days away from expiration, metadata, object sizes, names, etc.)
    • When an EventBridge rule runs, it needs permission on the target (eg: [λ, SNS, SQS, Cloudwatch Logs, API GW, etc.] resource based policy or [Kinesis Streams, Sys Mgr Run Command, ECS tasks etc.] IAM Role must allow EventBridge)
    • Externally available to 3rd party SAAS partners
    • Can analyze events and infer an associated schema (capable of versioning). This registered schema allows code generation so applications know the structure of the data coming into the event bus
    • EventBridge Event Buses Types:
      • Default (receive events from AWS services)
      • Partner (receive events from SAAS)
      • Custom (receive from custom applications)
    • EventBridge Event Buses:
      • Are accessible to other AWS accounts/Regions via Resource based policies
      • Events sent to a bus can be archived (for a set period or indefinitely), filtered, and even replayed
      • Multiple destinations at one time is possible
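A hedged boto3 sketch of an EventBridge rule with an event-pattern filter and a Lambda target (the rule name, bucket, and ARN are placeholders):

```python
import json
import boto3

events = boto3.client("events")

# React only to S3 "Object Created" events for a specific bucket
events.put_rule(
    Name="my-object-created-rule",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-example-bucket"]}},
    }),
    State="ENABLED",
)

# The Lambda target must also allow EventBridge via its resource-based policy
events.put_targets(
    Rule="my-object-created-rule",
    Targets=[{"Id": "invoke-fn",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-fn"}],
)
```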
Cloudwatch vs Cloudtrail:
  • Cloudwatch:
    • Cloudwatch Contributor Insights => helps analyze (VPC flow) logs
    • Performance monitoring and dashboards (metrics, CPU, network, etc.)
    • Events and Alerting
    • Log aggregation and analysis
    • Cloudwatch metric=>kinesis data firehose to S3 or 3rd parties in near real time
  • CloudTrail:
    • Record API calls made within Account by everyone
    • Can define trails for specific resources
    • Global service

Multiple regions, Multiple AZs

AMI/golden image

Docker containers

Auto Scaling groups

Rightsizing

  • Instances
  • Provisioned IOPS
  • Volumes

Load balancing

AWS best practices

Recommend and implement the appropriate machine learning services and features for a given problem.

ML on AWS (application services)

Amazon Polly:
  • Turn text into lifelike speech using deep learning (for talking applications)
  • Customize pronunciation of words with pronunciation lexicons that are applied by the SynthesizeSpeech operation
  • Can map stylized words and/or acronyms to resultant output
  • Generate more customized output from text marked up with SSML including:
    • breathing, whispering
    • emphasis on words
    • phonetic pronunciation
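A hedged Polly example with SSML markup (the voice and output file are arbitrary choices):

```python
import boto3

polly = boto3.client("polly")

ssml = "<speak>Please <emphasis level='strong'>listen</emphasis> carefully.</speak>"

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",      # tells Polly to interpret the SSML tags
    OutputFormat="mp3",
    VoiceId="Joanna",
)

with open("speech.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```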
Amazon Lex:
  • ASR to convert speech to text
  • Natural language understanding to recognize parts of speech/text
  • Helps to build chatbots, call center bots
Amazon Comprehend (Medical):
  • Serverless NLP service harnessing ML to uncover valuable insights and connections in text
  • Medical version detects PHI via DetectPHI API
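A small boto3 sketch of Comprehend and Comprehend Medical calls (the sample text is made up):

```python
import boto3

comprehend = boto3.client("comprehend")
text = "The service was fantastic and the staff were very helpful."

print(comprehend.detect_sentiment(Text=text, LanguageCode="en")["Sentiment"])
print(comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"])

# Comprehend Medical: detect protected health information (DetectPHI)
medical = boto3.client("comprehendmedical")
print(medical.detect_phi(Text="Patient John Smith, DOB 01/02/1970")["Entities"])
```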
Amazon Transcribe:
  • Automatically convert speech to text
  • Uses Deep Learning - Automatic Speech Recognition (ASR)
  • Use cases:
    • Transcribe customer calls
    • Automate closed captioning/subtitles
    • Generate metadata for media assets to create a fully searchable archive
  • Can remove PII using redaction
  • Supports automatic language identification for multi-lingual audio
DeepLens: AWS camera service
Amazon Rekognition:
  • Find objects, people, text, scenes in images and videos using ML
  • Facial analysis and search to perform user verification, people counting
  • Create a DB of "familiar faces" or compare against celebrities
  • Use cases:
    • Labeling
    • Text detection
    • Face detection and analysis (gender, emotions, age range, etc.)
    • Face search and verification
    • Celebrity recognition
    • Pathing (eg: for sports game analysis)
    • Content Moderation (inappropriate, unwanted, or offensive images/videos)
      • Social media/broadcast media/advertising/e-commerce
      • Confidence level of content flags/gates (threshold configuration based)
      • Flag sensitive content for manual review in A2I
      • Help comply with regulations
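A hedged Rekognition sketch for label detection and content moderation (bucket, key, and thresholds are placeholders):

```python
import boto3

rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "my-example-bucket", "Name": "photos/team.jpg"}}

# Objects, people, text, scenes
labels = rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=80)
print([label["Name"] for label in labels["Labels"]])

# Content moderation with a confidence threshold; low-confidence hits could be
# routed to Amazon A2I for human review
flags = rekognition.detect_moderation_labels(Image=image, MinConfidence=60)
print([m["Name"] for m in flags["ModerationLabels"]])
```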
Amazon Textract:
  • Extracts text, handwriting and data from any scanned documents (eg: forms, tables, etc.) using ML
  • Read from any type of document (PDFs, images, etc.)
  • Good for invoices, financial reports, medical records, insurance claims, tax forms, IDs, passports
Amazon Translate:
  • Natural and accurate language translation
  • Allows localization of content (eg applications/websites) for international users, and to easily translate large volumes of text efficiently

AWS service limits(TBD)

Build your own model vs. SageMaker built-in algorithms(TBD)

AWS SageMaker:

  • File Mode:
    • Useful for small files that fit in memory and where the algorithm has a large number of epochs
    • Can leverage the file system cache on subsequent epochs (though overall I/O throughput with Pipe mode is still faster)
  • Pipe Mode:
    • Recommended for large datasets
    • Overall I/O throughput with Pipe mode is still faster than file mode
    • Can stream dataset directly to your training instances where data is fed on-the-fly without using any disk I/O or downloading the complete file prior to execution.
    • Shorter startup times because the data is being streamed instead of being downloaded to your training instances.
    • Higher I/O throughput due to a high-performance streaming agent (no disk I/O usage).
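A hedged sketch of choosing Pipe mode with the SageMaker Python SDK (the image URI, role ARN, and S3 paths are placeholders):

```python
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",        # e.g. a built-in algorithm container
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                       # stream from S3 instead of downloading first
    output_path="s3://my-example-bucket/output/",
    sagemaker_session=sagemaker.Session(),
)

estimator.fit({"train": "s3://my-example-bucket/train/"})
```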

AWS Sagemaker Ground Truth:

  • Sometimes you don't have training data at all, and it needs to be generated by humans first. (eg: training an image classification model. Somebody needs to tag a bunch of images with what they are images of before training a neural network)
  • Ground Truth manages humans who will label your data for training purposes
  • Ground Truth creates its own model as images are labeled by people
  • As this model learns, only images the model isn't sure about are sent to human labelers.
  • This can reduce the cost of labeling jobs by 70%
  • Human labelers include:
    • Mechanical Turk
    • internal team (your company?)
    • Professional labeling companies
  • Alternatives to generate training labels:
    • Rekognition=>Automatically classify images
    • Comprehend=>Automatically classify text by topics, sentiment
    • Any pre-trained model or unsupervised technique that may be helpful

AWS Sagemaker Ground Truth Plus:

  • Turnkey solution managing the workflow and team of labelers

  • You fill out an intake form
  • They contact you and discuss pricing
  • You track progress via the Ground Truth Plus Project Portal
  • Get labeled data from S3 when done

Infrastructure: (spot, instance types), cost considerations(TBD)

Using spot instances to train deep learning models using AWS Batch (TBD)

Apply basic AWS security practices to machine learning solutions(TBD)

Identity and Access Management (IAM)

Allow vs Deny: An explicit Deny in any applicable policy overrides any Allow. The default behavior is an implicit deny, so resources must be explicitly allowed.
LDAP: software protocol for enabling the location of data about organizations, individuals and other resources in a network.
Identity federation: a system of trust between two parties for the purpose of authenticating users and conveying information needed to authorize their access to resources.
User groups can only contain users
S3 Bucket Policies vs Access permissions:
  • Used to add or deny permissions across some or all S3 objects in a bucket, enabling central management of permissions
  • Can grant users within an AWS account or other AWS accounts to S3 resources
  • Can restrict based on request time (Date condition), request sent using SSL (Boolean condition), requester IP Address (Ip address condition) using policy keys
  • User access to S3 => IAM permissions
  • Instance (EC2) access => IAM role
  • Public access to S3 => bucket policy
| Type of Access Control | Account-Level Control | User-Level Control |
| --- | --- | --- |
| IAM Policies | No | Yes |
| ACLs | Yes | No |
| Bucket Policies | Yes | Yes |
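A hedged example of one of the policy conditions above (the Boolean aws:SecureTransport key) applied with boto3; the bucket name is a placeholder:

```python
import json
import boto3

bucket = "my-example-bucket"

# Deny any request to the bucket that is not sent over SSL/TLS
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```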
IAM Credentials Report: IAM security tool that lists all your AWS accounts, IAM users and the status of their various credentials; good for auditing permissions at the account level
IAM Access Advisor: shows the service permissions granted to a user and when those services were last used; can use this information to revise policies at the user level
AWS Policy Simulator: used to test and troubleshoot IAM policies that are attached to users, user groups, or resources.
IAM Access Analyzer: service to identify unintended access to resources in an organization and accounts, such as Amazon S3 buckets or IAM roles, shared with an external entity to avoid security risk(s)
IAM Policy Evaluation Logic: an explicit Deny is evaluated first, then explicit Allows; if neither applies, the implicit deny wins.

Amazon Cognito:
  • Web Identity federation service/identity broker handling interactions between application(s)/resource(s) and Web IdPs.
  • Capable of synchronizing data across multiple devices, using SNS to push notifications to all devices associated with a given user when data changes (IAM policies can be scoped to user identities).
  • User pool: user based; handling user registration, authentication and account recovery.
    • Compatible IdPs: Facebook, Amazon, Google, Apple, OpenID Connect providers, SAML
  • Identity pool: receives authentication token to authorize access to resources directly or through the API GW.
    • Maps to IAM role(s)
    • default IAM role(s) for authenticated/guest users
AWS Resource Access Manager (RAM):
  • Share AWS resources that you own with other AWS accounts (within OU or any account)
  • Aids in avoiding resource duplication by sharing things such as:
    • VPC subnets (owner can share one or more subnets with other accounts in the same OU):
      • Participants can launch their own resources (EC2, etc.) into the shared VPC subnets
      • Must be from the same OU
      • Can't share SGs and default VPC
      • Participants can manage their own resources, but can't modify, view, or delete others' resources
      • VPC by itself can't be shared
    • AWS Transit Gateway
    • Route 53 Resolver Rules
    • License Manager configurations across accounts (shared resources are accessed using private IPs)

Security Groups (SGs):

  • Stateful connection, allowing inbound traffic to the necessary ports, thus enabling the connection
  • If adding an Internet Gateway, ensure the SG allows traffic in
  • SG => EC2 instances level, LBs, EFS, DBs, Elasticache
  • Allow rules only

NACL Groups:

  • Stateless: return traffic is not automatically allowed; a response to inbound traffic typically leaves on an ephemeral port that must be explicitly allowed
  • Great way of allowing/blocking ip addresses at the subnet level
  • Like a firewall controlling to/from subnet traffic
  • One NACL per subnet
  • New Subnet automatically set to default NACL which denies all inbound/outbound traffic
  • Do not modify default NACL, instead create custom NACL(s)
  • Internet traffic is accepted when routed via an Internet Gateway
  • VPN or AWS Direct Connect traffic is accepted when routed via a Virtual Private Gateway
  • NACL rules:
    • Range from 1-32766, with a higher precedence placed on lower numbers
    • Allow and Deny rules
    • First rule match drives acceptance/denial
    • The last rule is a catch-all (*) that denies a request if no other rule matches
    • AWS recommends adding rules by an increment of 100

VPC

VPC Endpoint:
  • Every AWS service is publicly exposed (public url)
  • VPC Endpoints (using AWS PrivateLink) allows connections to AWS service(s) using a private network instead of public internet
  • Redundant and scales horizontally
  • Removes the need for IGW, NATGW, etc. to access AWS service(s)
  • In case of issues:
    • Check DNS setting resolution in VPC
    • Check Route tables
  • Types of Endpoints:
    • Interface Endpoints: provisions an ENI (private ip) as an entry point (must attach a SG); supports most AWS services; powered by Private Link
    • Gateway Endpoints: provisions a gateway and must be used as a target in Route table; supports both S3 and DynamoDB
  • Gateway Endpoints are preferred most of the time over Interface Endpoints as the former is free and the latter costs $
  • Interface endpoint is preferred if access is required from on-premises (site-to-site VPN or Direct Connect), a different VPC or a different region
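A hedged boto3 sketch of creating both endpoint types (the VPC, route table, subnet, and security group IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint for S3: free, added as a target in the route table(s)
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# Interface endpoint (PrivateLink): an ENI with a private IP, protected by a SG
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.sagemaker.runtime",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
```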

Encryption/anonymization(TBD)

Deploy and operationalize machine learning solutions.

  • Exposing endpoints and interacting with them
  • ML model versioning
  • A/B testing
  • Retrain pipelines
  • ML debugging/troubleshooting
    • Detect and mitigate drop in performance
    • Monitor performance of the model

Acronyms

Acronym Definition
AOF Append-only file
AZ Availability Zones
CLS Column Level Security
DB Database
DP Data Pipeline
EBS Elastic Block Store
ECS Elastic Container Service
EFS Elastic File System
EMR Elastic Map Reduce
EMRFS Elastic Map Reduce File System
ENI Elastic Network Interface
ETL Extract, Transform, Load
FN False Negative
FP False Positive
IA Infrequent Access
IAM Identity and Access Management
IGW Internet Gateway
IOPS Input/Output operations per second
IOT Internet of Things
KMS Key Management Service
KPI Key Performance Indicator
ML Machine Learning
MQTT Message Queuing Telemetry Transport
MSK Managed Streaming for Apache Kafka
NFS Network File System
OLAP Online Analytical Processing
OLTP Online Transaction Processing
RCU Read Capacity Units
RDS Relational Database Service
S3 Simple Storage Service
SCT AWS Schema Conversion Tool
SG Security Group
SMOTE Synthetic Minority Over-sampling Technique
SNS Simple Notification Service
SQS Simple Queue Service
SSD Solid State Drive
SSE Server Side Encryption
SSH Secure Shell
SSL Secure Sockets Layer
SSM Systems Manager
TN True Negative
TP True Positive
TTL Time to live
VPC Virtual Private Cloud
VPN Virtual Private Network
WCU Write Capacity Units
