
Data Analytics on AWS - What, Why & How

By Vikas Solegaonkar

"Data is the new oil". That is a common statement we have all heard. Yet unlike traditional oil, data accumulates ceaselessly, expanding by the second. With the advent of 5G and the Internet of Things (IoT), the magnitude of big data is escalating exponentially. To accommodate such vast quantities of data and facilitate efficient processing, cloud services emerge as the preferred infrastructure, with Amazon Web Services (AWS) reigning supreme as the preeminent choice. This discourse provides a comprehensive overview of key AWS offerings pertinent to big data applications, elucidating typical solutions devised through their utilization.


Topics covered:

  1. Data Analytics Pipeline

  2. Analytics on Cloud

  3. Data Analytics Architecture Principles

  4. Temperature of Data

  5. Data Collection

  6. Data Storage

  7. Data Processing

  8. Data Analysis

  9. Data Consumption

  10. Sum up


Data Analytics Pipeline

Any data analytics use case typically entails processing data across four stages of a pipeline: data collection, storage within a data lake, data processing to derive actionable insights, and subsequent analysis to generate valuable outcomes. However, each of these stages represents a substantial domain in its own right, presenting unique challenges and complexities.


Within our array of use cases, distinct requirements arise regarding the speed, volume, and variety of data being processed. For instance, an application geared towards providing real-time insights into network security demands swift processing to ensure timely responses—no utility lies in discovering a network breach after the fact. Conversely, data pertaining to social media posts may not necessitate immediate analysis, yet poses significant challenges due to its diverse nature.


Moreover, certain scenarios mandate handling colossal volumes of data, spanning diverse types and arriving at rapid velocities, while concurrently demanding instantaneous processing. Consider, for instance, a defense drone tasked with border surveillance: such a system would continuously generate vast quantities of videos, images, and audio recordings, alongside metadata concerning geolocation, environmental conditions, and more—all of which require instantaneous processing to facilitate timely decision-making.


Analytics on Cloud

AWS offers a variety of services tailored to each phase of the data analytics pipeline, encompassing diverse architecture patterns to cater to different use cases, including batch, interactive, and stream processing, as well as machine learning-driven insight extraction.


Primarily, there exist four distinct approaches for implementing these pipelines:

  1. Virtualized: This approach, although least favored, serves as a straightforward initial step for migrating the data analytics pipeline to AWS. It involves provisioning powerful EC2 instances and deploying either open-source or licensed data analytics frameworks.

  2. Managed Services: Essentially, these are EC2 instances managed by AWS, with the analytics framework also managed by AWS. This alleviates much operational burden, allowing focus on data tasks. AWS furnishes a suite of managed services for big data analytics, encompassing both open-source and proprietary frameworks.

  3. Containerized: In this paradigm, applications are deployed within Docker containers, offering enhanced cost-effectiveness compared to virtualized instances as they eliminate the need for underlying EC2 infrastructure. AWS provides a range of services and Docker images to facilitate adoption of containerized solutions, while also accommodating custom containers.

  4. Serverless: Representing the most dynamic and encouraged approach by AWS, serverless architectures offer high cost efficiency and scalability. AWS advocates transitioning to native serverless architectures, albeit this approach ties the solution closely to AWS, potentially limiting portability. Nonetheless, for those prioritizing cost-effectiveness and scalability, serverless architectures stand as the optimal choice within the AWS ecosystem.


Architecture Principles

AWS recommends some architecture principles that can improve the deployment of a data analytics pipeline on the cloud. They are tailored towards the AWS cloud, but may be extended to any other cloud provider as well.


Build decoupled systems

Decoupling stands out as a paramount architectural tenet, regardless of domain or architectural style. This holds especially true when constructing a data analytics pipeline on AWS. The six steps comprising the analytics process (Collect, Store, Process, Store, Analyze, and Answer) must be sufficiently decoupled to enable individual steps to be replaced or scaled independently of one another.


Right Tool for Right Job

AWS prescribes distinct services for each stage of the data processing pipeline, contingent upon factors such as data structure, latency, throughput, and access patterns. These considerations are discussed in the sections that follow. Adhering to these recommendations can yield substantial cost reductions and performance enhancements for the pipeline. Thus, it is imperative to comprehensively grasp the nuances of each service and its corresponding use case.


Leverage Managed & Serverless services

The core architectural tenet for any application deployed on the AWS cloud is to prioritize services over servers. While it remains feasible to provision a fleet of EC2 instances and deploy an open-source analytics framework, the overarching goal is to eschew reinventing the wheel and instead capitalize on the offerings provided by AWS. This approach not only facilitates deployment but also ensures optimization and scalability within the AWS ecosystem.


Use event-journal design pattern

The adoption of an event-journal design pattern is strongly advocated for constructing a data analytics pipeline on AWS. In this pattern, data is amassed into an S3 bucket, which serves as the authoritative source and remains unaltered by any other service. Consequently, disparate services can access this data autonomously, obviating the necessity for synchronization. Given the rapid influx of data, preserving a singular source of truth is paramount to ensure resilience against potential component failures.


S3 also provides efficient data lifecycle management, allowing us to transition data to Glacier over time, which yields a significant cost reduction.
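
As an illustration, a lifecycle rule along the following lines would archive the event journal automatically. This is a minimal boto3 sketch; the bucket name, prefix, and day thresholds are assumptions for illustration only.

```python
# Minimal sketch: transition raw events to Glacier as they cool down.
# Bucket name, prefix, and day thresholds are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-event-journal-bucket",            # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-events",
                "Filter": {"Prefix": "raw/"},     # only the raw event journal
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```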


Be cost conscious

Once more, this emphasizes a fundamental aspect that transcends the realm of AWS or Big Data. Cost-saving measures are integral considerations in any application architecture. However, AWS offers an array of opportunities and diverse techniques to address this concern, including auto-scaling, pay-as-you-go (PAYG) models, and serverless computing. Leveraging these resources is essential when operating within the AWS environment to optimize costs effectively.


Enable AI Services

Data holds little value if we cannot extract insights and derive utility from it. AWS offers a comprehensive suite of machine learning-based services, ranging from SageMaker to Comprehend and Alexa. Each of these services presents unique capabilities that can be integrated into data analytics workflows to extract meaningful insights and facilitate actionable outcomes from the analyzed data.


Once more, if cloud lock-in is a concern, you have the option to host your own Jupyter Notebooks on a provisioned EC2 instance. However, it's important to note that AWS services such as SageMaker and Comprehend offer significant utility and can enhance the value of the data pipeline considerably.


Temperature of Data

We can select the most suitable solution for our problem by evaluating the use case, including the type of data, required processing, and the value derived from insights. To comprehend these factors effectively, it's crucial to grasp the concept of "Temperature of the Data," which serves as an indicator of the volume, velocity, and variety of the data undergoing processing.


| | Hot | Warm | Cold |
| --- | --- | --- | --- |
| Volume | MB-GB | GB-TB | PB-EB |
| Item Size | B-KB | KB-MB | KB-TB |
| Latency | Microseconds to milliseconds | Milliseconds to seconds | Minutes to hours |
| Durability | Low | High | Very high |
| Request Rate | Very high | High | Low |
| Cost/GB | $$-$ | $$-c | c |

Hot data exhibits high velocity, often arriving in small, frequent chunks to ensure minimal latency and immediate processing. However, it typically possesses low durability as its relevance diminishes rapidly over time. In contrast, cold data is characterized by its immense volume, often received in large batches, and is commonly associated with offline processing of archived data.


The temperature of the data dictates the selection of techniques, services, and architectural patterns employed for its processing. For instance, insights derived from a missile interceptor must be processed with extremely low latency, as even a second's delay could render them meaningless. Conversely, insights gleaned from analyzing bulk videos received from Mars, pertaining to potential discoveries of alien life, can tolerate a degree of delay.


With the foundational concepts established, let's delve into the key AWS services utilized within the data analytics pipeline.


AWS proposes a six-step data pipeline: Collect, Store, Process, Store (again), Analyze, and Answer. We will now explore each of these steps in detail, alongside an overview of the various AWS services relevant to each stage, highlighting their respective significance and applications.


Collect

Data input can be categorized into three types of sources. Depending on the nature of the data source and its inherent characteristics, it's essential to select an appropriate storage solution for housing this raw, unprocessed data.

  • Data originating from sources such as Web/Mobile Apps and Data Centers typically exhibits structured and transactional attributes. These data streams may be received through platforms like Amplify or via standard Web Service calls facilitated by API Gateway, or similar low-volume transactional sources. They can be efficiently pushed and stored in SQL or NoSQL databases. Additionally, the utilization of in-memory databases like Redis can also be advantageous for managing this type of data efficiently.

  • Migration data and application logs primarily consist of file-based content, often comprising media or log files. These files tend to be substantial in size and are typically received from AWS migration services or from CloudWatch Logs. S3 (Amazon Simple Storage Service) emerges as the optimal solution for storing such data, providing the scalability and durability to accommodate large volumes of file-based content efficiently.

  • For data originating from IoT devices, sensors, mobile tracking, and multimedia sources, the data often arrives as continuous streams of events. Managing such data involves handling events and pushing them into stream storage systems such as Kafka, Kinesis Streams, or Kinesis Firehose. Kafka is well-suited for high-throughput distributed platforms, offering robust capabilities for managing streaming data. Kinesis Streams, on the other hand, provides a managed stream storage solution, while Kinesis Firehose excels in managed data delivery, simplifying the process of ingesting and delivering streaming data to designated destinations (see the minimal ingestion sketch below).
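
For example, pushing a single device event into a Kinesis data stream can be as simple as the following boto3 sketch; the stream name and payload shape are assumptions for illustration.

```python
# Minimal sketch: push a sensor event into a Kinesis data stream.
# Stream name and payload fields are illustrative assumptions.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2024-01-01T00:00:00Z"}

kinesis.put_record(
    StreamName="iot-ingest-stream",          # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],         # keeps one device's events ordered within a shard
)
```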

Store

The subsequent step involves storing the input data, for which AWS offers a diverse array of options tailored to various use cases. Each storage solution comes with its own set of advantages and drawbacks, which must be carefully evaluated based on the specific requirements of the use case at hand.


S3 is perhaps the most popular of the lot.

  • It is natively supported by big data frameworks (Spark, Hive, Presto, and others)

  • S3 offers the advantage of decoupling storage from compute, eliminating the necessity to deploy compute clusters solely for storage purposes, as is the case with HDFS. This proves beneficial when running transient EMR (Elastic MapReduce) clusters on EC2 spot instances. Moreover, S3 facilitates the provision of multiple heterogeneous analysis clusters, enabling various services to access and utilize the same dataset efficiently.

  • S3 provides very high durability: eleven nines (99.999999999%).

  • Utilizing S3 within a single region proves highly cost-effective since there's no charge for data replication within that region. This cost-saving aspect enhances the affordability of storing and accessing data within a localized AWS environment.

  • And above all, S3 is secure. It supports SSL/TLS encryption in transit as well as server-side encryption at rest (see the upload sketch below).
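
As a minimal sketch of landing a log file in the data lake with encryption at rest, the following uses boto3; the bucket name, object key, and choice of KMS-managed keys are assumptions.

```python
# Minimal sketch: store a log file in S3 with server-side encryption at rest.
# Bucket and key names are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

with open("app.log.gz", "rb") as f:
    s3.put_object(
        Bucket="my-data-lake-bucket",          # hypothetical bucket
        Key="logs/2024/01/01/app.log.gz",
        Body=f,
        ServerSideEncryption="aws:kms",        # encrypt at rest with a KMS-managed key
    )
```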


In addition to S3, AWS offers a variety of database services, including both managed and serverless options, to cater to diverse storage requirements. These database services provide scalable and reliable solutions for storing and managing data efficiently within the AWS ecosystem.

  • ElastiCache - Managed Memcached or Redis service

  • DynamoDB - Managed Key-Value / Document DB (a minimal access sketch follows this list)

  • DynamoDB Accelerator (DAX) - Managed in-memory cache for DynamoDB

  • Neptune - Managed Graph DB

  • RDS - Managed Relational Database
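
As an example of the key/value access pattern these services target, here is a minimal DynamoDB sketch; the table name and attributes are assumptions, and the table is assumed to use session_id as its partition key.

```python
# Minimal sketch: key/value access against DynamoDB.
# Table name and attributes are illustrative assumptions.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-sessions")      # hypothetical table keyed on "session_id"

# Write an item, then read it back by key.
table.put_item(Item={"session_id": "abc-123", "user": "alice", "ttl": 1735689600})

resp = table.get_item(Key={"session_id": "abc-123"})
print(resp.get("Item"))
```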


Given the extensive array of solutions at our disposal, the pertinent question arises: which one should be employed? AWS advises leveraging the following criteria to discern the most suitable solution for our specific requirements. Central to this analysis are considerations regarding the volume, variety, and velocity of the data, along with the pertinent access patterns.

  • Relational databases excel in maintaining robust referential integrity through strongly consistent transactions and scalable architecture. They accommodate complex queries using SQL, offering versatility in data retrieval and manipulation.

  • Key-value databases prioritize low-latency performance, facilitating high throughput and swift data ingestion via key-based queries. While they excel in simplicity, they typically support straightforward queries with filters.

  • Document databases specialize in indexing and storing documents, offering flexible querying capabilities across various properties. They support queries with filters, projections, and aggregates, making them suitable for diverse document-centric applications.

  • In-memory databases and caches offer ultra-low latency in the range of microseconds, ideal for time-sensitive applications. These systems support key-based queries and often employ specialized data structures for optimized performance. They facilitate simple query methods with filters, making them efficient for rapid data retrieval.

  • Graph databases are particularly advantageous for modeling and traversing complex relationships between data entities. They excel in expressing queries in terms of these intricate relations, providing a robust framework for analyzing interconnected datasets.


We can summarize this in the two tables below. First, based on the data structure:

| Data Structure | Database |
| --- | --- |
| Fixed schema | SQL, NoSQL |
| No schema | NoSQL, Search |
| Key-value | In-memory, NoSQL |
| Graph | GraphDB |

And based on the data access patterns:

| Data Access Pattern | Database |
| --- | --- |
| Put/Get (key-value) | In-memory, NoSQL |
| Simple relationships (1:N, M:N) | NoSQL |
| Multi-table joins, transactions | SQL |
| Faceting, search | Search |
| Graph traversal | GraphDB |

As anticipated, it's rare for the selection criteria based on data structure and access pattern to align perfectly. In such instances, where there's a mismatch between storage choice dictated by data structure versus access pattern, a decision must be made based on which criterion holds greater significance. This necessitates identifying the more prominent factor and making an informed trade-off to select the most suitable storage solution.


Based on the use case, we can choose a particular database using the below chart:


| | ElastiCache | DAX | Aurora | RDS | Elasticsearch | Neptune | S3 + Glacier |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Use Cases | In-memory caching | Key/value lookups, document store | OLTP, transactional | OLTP, transactional | Log analysis, reverse indexing | Graph | File store |
| Performance | Ultra-high request rate, ultra-low latency | Ultra-high request rate, ultra-low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High throughput |
| Data Shape | Key/value | Key/value and document | Relational | Relational | Documents | Nodes/edges | Files |
| Data Size | GB | TB, PB | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB |
| Cost/GB | $$ | cc-$$ | cc | cc | cc | cc | cc |
| Availability | 2 AZ | 3 AZ | 3 AZ | 2 AZ | 1-2 AZ | 3 AZ | 3 AZ |
| VPC Support | Inside VPC | VPC endpoint | Inside VPC | Inside VPC | Inside VPC | Inside VPC | VPC endpoint |


Process

The subsequent stage in the pipeline involves processing the available data. AWS furnishes an extensive array of options for data processing, presenting us with a multitude of choices and decisions to be made.


We have three major use cases when processing big data:


Interactive & Batch Processing

In interactive or batch analytics, the data is relatively less "hot" than in real-time or streaming scenarios. Although interactive analytics might seem "hot", the data volumes involved in an interactive session are typically modest, and the responsiveness a user perceives as immediate is still slow by the standards of real-time stream processing. For such use cases, AWS recommends one of the following services.

  • Amazon Elasticsearch Service - Managed Elasticsearch

  • Redshift & Redshift Spectrum - Managed data warehouse, Spectrum enables querying S3

  • Athena - Serverless interactive query service (a minimal query sketch follows this list)

  • EMR - Managed Hadoop framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase and others
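
As an illustration of the interactive style, the following minimal boto3 sketch submits an Athena query over data already sitting in S3; the database, table, and result location are assumptions.

```python
# Minimal sketch: run an interactive SQL query over S3 data with Athena.
# Database, table, and output location are illustrative assumptions.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},                 # hypothetical Glue/Athena database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution / get_query_results for the outcome
```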


Streaming and Realtime Analytics

Conversely, when dealing with continuous data streams, such as those originating from IoT devices and sensors, and necessitating real-time processing, a distinct set of processing services must be considered.

  • Spark Streaming on EMR

  • Kinesis Data Analytics - Managed service for running SQL on Streaming Data

  • Kinesis Client Library

  • Lambda - Run code serverless; services such as S3 can publish events to Lambda, and Lambda can poll events from a Kinesis stream (a minimal handler sketch follows this list)
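
To make the Lambda option concrete, here is a minimal handler sketch for a batch of Kinesis records; the payload shape and the alerting rule are assumptions.

```python
# Minimal sketch: a Lambda handler invoked with a batch of Kinesis records.
# Payload fields and the alert threshold are illustrative assumptions.
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("temperature", 0) > 80:          # hypothetical real-time rule
            print(f"ALERT: {payload['device_id']} overheating")
    return {"processed": len(event["Records"])}
```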

| | EMR (Spark Streaming) | KCL Application | Kinesis Data Analytics | Lambda |
| --- | --- | --- | --- | --- |
| Managed Service | Yes | No | Yes | Yes |
| Serverless | No | No | Yes | Yes |
| Scale / Throughput | No limits, depends on node count | No limits, depends on node count | No limits, scales automatically | No limits, scales automatically |
| Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ |
| Sliding Window Functions | Built-in | App needs to implement | Built-in | No |
| Reliability | Spark checkpoints | KCL checkpoints | Managed by Kinesis Data Analytics | Managed by Lambda |

Predictive Analysis

Both scenarios outlined above may necessitate predictive analysis based on the available data. AWS offers a diverse array of AI services that can be leveraged across various levels to fulfill these predictive analysis requirements. Within the realm of AI services, AWS offers a hierarchical structure catering to different levels of abstraction:

  • Application Services: High-level SaaS (Software as a Service) offerings such as Rekognition, Comprehend, Transcribe, Polly, Translate, and Lex streamline data processing by delivering output through a single service call (a minimal Comprehend sketch follows this list).

  • Platform Services: AWS provides platforms like Amazon SageMaker, Amazon Mechanical Turk, and Amazon Deep Learning AMIs, empowering users to leverage custom AI models. These platforms facilitate the creation of tailored models capable of processing input data and generating meaningful insights. Additionally, users can harness generic AI frameworks like TensorFlow, PyTorch, or Caffe2 to power these platforms.

  • Infrastructure: At the foundational level, AWS grants users the flexibility to select hardware resources beneath containers or EC2 instances, supporting the chosen platform. Options include NVIDIA Tesla V100 GPU acceleration for AI/ML training, compute-intensive instances for AI/ML inference, and services like Greengrass ML for edge computing.
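
As a taste of the application-services tier, the following minimal boto3 sketch runs sentiment analysis on a piece of collected text with Comprehend; the input text is an assumption for illustration.

```python
# Minimal sketch: call an application-level AI service (Comprehend) on collected text.
# The input text is an illustrative assumption.
import boto3

comprehend = boto3.client("comprehend")

resp = comprehend.detect_sentiment(
    Text="The checkout flow keeps timing out and support never responds.",
    LanguageCode="en",
)
print(resp["Sentiment"], resp["SentimentScore"])
```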


Once more, the pivotal question arises: Which analytics solution should I utilize? This decision can be guided by considering the data type and the selected mode of processing. By aligning these factors with the appropriate analytics tools, the optimal solution can be determined. For different modes of processing, various analytics solutions are tailored to meet specific timing requirements:

  • Batch Processing: This mode may take minutes to hours to complete and is suitable for generating daily, weekly, or monthly reports. Preferred services include EMR (utilizing MapReduce, Hive, Pig, or Spark).

  • Interactive Processing: Tasks in this mode typically take seconds to complete, such as self-service dashboards. Recommended services comprise Redshift, Athena, and EMR (utilizing Presto or Spark).

  • Stream Processing: This mode necessitates milliseconds to seconds for processing, catering to use cases like fraud alerts or real-time metrics. Services like EMR (with Spark streaming), Kinesis Data Analytics, KCL (Kinesis Client Library), and Lambda are suitable for such scenarios.

  • Predictive Analytics: This mode requires real-time (milliseconds) or batch (minutes) processing and encompasses tasks like fraud detection, demand forecasting, and speech recognition. Implementations can leverage services such as SageMaker, Polly, Rekognition, Transcribe, Translate, EMR (utilizing Spark ML), and Deep Learning AMI (supporting MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe2).


Analysis

Preparing data for consumption involves the crucial step of ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load). AWS offers a range of tools to facilitate ELT/ETL processes. Below is a summary table providing an overview of these services and their implications:

| | Glue | Data Pipeline | Data Migration Service | EMR | Apache NiFi | Partner Solutions |
| --- | --- | --- | --- | --- | --- | --- |
| Use Case | Serverless ETL | Data workflow | Migrate databases (to/from data lakes) | Custom-developed Hadoop/Spark | ETL; automate the flow of data between systems | Rich partner ecosystem for ETL |
| Scale / Throughput | ~DPUs | ~Nodes, through EMR cluster | EC2 instance type | ~Nodes | Self-managed | Self-managed or through partner |
| Managed Service | Clusterless | Managed | Managed EC2 on your behalf | Managed EC2 on your behalf | Self-managed on EMR or Marketplace | Self-managed or through partner |
| Data Sources | S3, RDBMS, Redshift, DynamoDB | S3, JDBC, custom RDBMS, data warehouses | S3, various | Managed Hadoop/Spark | Various, through rich processor framework | Various |
| Skills Needed | Wizard for simple mapping, code snippets for advanced ETL | Wizard and code snippets | Wizard and drag/drop | Hadoop/Spark coding | NiFi processors and some coding | Self-managed or through partner |
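
For instance, once a Glue ETL job has been defined, it can be triggered from code with a minimal boto3 sketch like the following; the job name and its arguments are assumptions for illustration.

```python
# Minimal sketch: start a pre-defined Glue ETL job from code.
# Job name and arguments are illustrative assumptions.
import boto3

glue = boto3.client("glue")

resp = glue.start_job_run(
    JobName="raw-to-parquet",                                    # hypothetical Glue job
    Arguments={
        "--input_path": "s3://my-data-lake-bucket/raw/",
        "--output_path": "s3://my-data-lake-bucket/curated/",
    },
)
print(resp["JobRunId"])
```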

Consume

Ultimately, the processed data is consumed by services capable of deriving meaningful insights or presenting the information in a user-friendly format. These consuming services can range from AI services that analyze the data to make decisions, to user interface platforms that render insights in an accessible manner. Possible consuming services include AI applications, Jupyter notebooks, Anaconda, R Studio, Kibana, Quicksight, Tableau, Looker, MicroStrategy, Qlik, and more. Each of these platforms offers unique capabilities and interfaces tailored to different user preferences and requirements.


Sum Up

The following diagram sums up the entire process of data analytics, along with the various choices available to us.

Sample Architecture

Let's examine a sample architecture for a real-time streaming analytics pipeline, which leverages a suite of services for data processing and storage. Upon data stream ingestion, Kinesis Data Analytics conducts initial processing. Subsequently, the processed data is routed to various streaming data processing applications tasked with extracting and categorizing different data facets. This processed data is then directed to AI services for real-time predictive analysis as required.


The remaining data is stored in diverse data storage services, contingent upon the type of data extracted and segregated from the input stream. These stored datasets are subsequently utilized for generating notifications and insights. Moreover, the refined data stream is forwarded to downstream applications for further processing, should the need arise.

