Kinesis¶
Overview¶
- Managed alternative to Kafka.
- Data is replicated across 3 AZ.
- Can quickly become expensive.
Use Cases¶
- application logs, metrics IoT, Clickstreams.
- "Real-time" big data.
- Streaming processing frameworks (Spark, NiFi).
- Shard level metrics can be in 1minute intervals.
Kinesis Streams¶
- Low latency streaming ingestion at scale.
- Streams are divided into ordered Shards/Partitions. Increase shards to increase throughput.
- Data retention is 1 day by default. Can be up to 7 days.
- Can reprocess/replay data.
- Multiple applications can consume the same stream.
- Real-time processing with scale of throughput.
- Data inserted into Kinesis can't be deleted (immutable).
Shards¶
- Write 1MB/s, or 1000 messages per shard.
- Read 2MB/s per shard.
- Can provision up to 200 shards.
- Billing is per shard (provisioned).
- Messages can be batches.
- Can reshard/merge shards over time.
- Records are ordered per shard.
Kinesis Analytics¶
- Real-time analytics on streams using SQL
Kinesis Firehose¶
- Load streams into S3, Redshift, ElasticSearch etc.
Producers¶
- Put records are used to send data to Kinesis.
- Need to provide the partition key.
- The message key is hashed to determine the shard id.
- The same key goes to the same partition (helps with ordering for a specific key).
- Message data must be base64 encoded.
- Each message is given a sequence number (sequential).
- Use a partition key that's highly distributed to avoid a "hot partition", which would result in all the data being sent to the same shard.
- user_id if many users.
- country_id not a good choice is all the users are in one country.
- Use Batching to reduce cost and increase throughput.
Handling ProvisionedThroughputExceeded Exceptions¶
- Happens when MB/s or TPS are exceeded on a shard.
- Three solutions to the problem:
- Retries with backoff.
- Increase the shards (scaling).
- Use a more distributed (unique) partition key.
Consumers¶
- Can use the SDK, CLI or Kinesis Client Library.
- Kinesis Client Library uses DynamoDB to checkpoint offsets, track other workers, and share the work amongst available shards.
- Pass the shart iterator to
get records --shard-iterator
to retrieve the messages for a given shard id.
Kinesis Client Library (KCL)¶
- Java library for distributed applications that helps read records from Kinesis Streams, to share the read workload across multiple nodes.
- Each shard can only be ready by one KCL instance. ie: if there's 4 shards, the maximum number of KCL instances is 4.
- DynamoDB is used to checkpoint progress and co-ordinate consumption of shards across the KCL instances (requires IAM access to write to DynamoDB).
- KCL can run on EC2, Elastic Beanstalk, On-Prem applications etc.
- Record are read in the order that they were placed into the shard.
Security¶
- IAM policies to control access/authorization.
- VPC endpoints are available to allow access to Kinesis from within a VPC.
Encryption¶
- Encryption in transit via HTTPS.
- Encryption at rest using KMS
- Possible to use client side encryption.
Kinesis Data Analytics¶
- Real-time analytics on Kinsesis Streams using SQL.
- Auto-scales.
- Fully managed, no services to provision.
- Continuous analytics (real time).
- Pay for actual consumption.
- Create streams out of real-time queries.
Kinesis Firehose¶
- Fully managed, no administration.
- Near real time (60 seconds delay).
- Used for ETLs. Load data into Redshift/S3/ElasticSearch/Splunk.
- Automatic scaling.
- Supports alot of data formats. There's a conversion fee to convert between formats.
- Pay for the amount of data going through Firehose.
SQS vs SNS vs Kinesis¶
SQS¶
- Consumers pull data.
- Data is deleted once consumed.
- Can have as many consumers as you want.
- Throughput isn't provisioned.
- Ordering is not guaranteed unless you use a FIFO queue.
- Messages can be delayed if required.
SNS¶
- Data is pushed to subscribers (pub/sub model).
- Up to 10m subscribers.
- Data is not persisted, it's lost if it isn't delivered.
- Up to 10,000 topics.
- Throughput is not provisioned.
- Integrates with SQS for fan-out architecture.
Kinesis¶
- Consumers pull data.
- Can have as many consumers are you want.
- Data can be replayed.
- Meant for real-time big data, analytics and ETL.
- Records within the shard are ordered according to partition key.
- Data expires after X days.
- Throughput is provisioned.
Kinesis Data Ordering vs SQS FIFO Queue¶
- Use the partition key to control the order of messages in a Kinesis Stream.
- The same partition key will always go to the same shard in a Kinesis Stream.
- SQS standard isn't ordered at all.
- SQS FIFO messages without a group id will be consumed by a single consumer in the order they're sent.
- SQS FIFO messages with a group id, will be consumed by multiple consumers (one consumer per group, messages processed in the order they arrive for that group. Very similar to a Kinesis partition key).
Example¶
There are 100 trucks sending GPS data, 5 kinesis shards and 1 SQS FIFO.
Kinesis Data Stream¶
- On average, 20 trucks per shard (x 5 shards).
- Data is ordered within each shard.
- Maximum of 5 consumers (one per shard).
- Up to 5MB/s of data (1MB/s per shard).
SQS FIFO¶
- 100 Group IDs (one per truck).
- Data is ordered within each group id (or in the order they arrive into the queue if no group id).
- Maximum of 100 consumers (one per Group ID).
- Up to 300 messages/second, or 3,000 if using batching.
Last update: June 30, 2021