
"Data is the new oil". That is a common statement we have all heard. Yet unlike traditional oil, data accumulates ceaselessly, expanding by the second. With the advent of 5G and the Internet of Things (IoT), the magnitude of big data is escalating exponentially. To accommodate such vast quantities of data and facilitate efficient processing, cloud services emerge as the preferred infrastructure, with Amazon Web Services (AWS) reigning supreme as the preeminent choice. This discourse provides a comprehensive overview of key AWS offerings pertinent to big data applications, elucidating typical solutions devised through their utilization.
Topics covered:
Data Analytics Pipeline
Analytics on Cloud
Data Analytics Architecture Principles
Temperature of Data
Data Collection
Data Storage
Data Processing
Data Analysis
Data Consumption
Sum up
Data Analytics Pipeline
Any data analytics use case typically entails processing data across four stages of a pipeline: data collection, storage within a data lake, data processing to derive actionable insights, and subsequent analysis to generate valuable outcomes. However, each of these stages represents a substantial domain in its own right, presenting unique challenges and complexities.
Within our array of use cases, distinct requirements arise regarding the speed, volume, and variety of data being processed. For instance, an application geared towards providing real-time insights into network security demands swift processing to ensure timely responses—no utility lies in discovering a network breach after the fact. Conversely, data pertaining to social media posts may not necessitate immediate analysis, yet poses significant challenges due to its diverse nature.
Moreover, certain scenarios mandate handling colossal volumes of data, spanning diverse types and arriving at rapid velocities, while concurrently demanding instantaneous processing. Consider, for instance, a defense drone tasked with border surveillance: such a system would continuously generate vast quantities of videos, images, and audio recordings, alongside metadata concerning geolocation, environmental conditions, and more—all of which require instantaneous processing to facilitate timely decision-making.
Analytics on Cloud
AWS offers a variety of services tailored to each phase of the data analytics pipeline, encompassing diverse architecture patterns to cater to different use cases, including batch, interactive, and stream processing, as well as machine learning-driven insight extraction.
Primarily, there exist four distinct approaches for implementing these pipelines:
Virtualized: This approach, although least favored, serves as a straightforward initial step for migrating the data analytics pipeline to AWS. It involves provisioning powerful EC2 instances and deploying either open-source or licensed data analytics frameworks.
Managed Services: Essentially, these are EC2 instances managed by AWS, with the analytics framework also managed by AWS. This alleviates much operational burden, allowing focus on data tasks. AWS furnishes a suite of managed services for big data analytics, encompassing both open-source and proprietary frameworks.
Containerized: In this paradigm, applications are deployed within Docker containers, which are generally more cost-effective than plain virtualized instances and reduce the need to manage the underlying EC2 infrastructure directly. AWS provides a range of services and Docker images to facilitate adoption of containerized solutions, while also accommodating custom containers.
Serverless: This is the approach AWS encourages most strongly, offering high cost efficiency and scalability. AWS advocates transitioning to native serverless architectures, although this ties the solution closely to AWS and can limit portability. Nonetheless, for those prioritizing cost-effectiveness and scalability, serverless architectures are the optimal choice within the AWS ecosystem; a minimal sketch follows.
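To make the serverless approach concrete, below is a minimal sketch (not a production implementation) of a Lambda function triggered whenever a new object lands in an S3 bucket; the bucket name, event wiring, and downstream processing step are assumptions made purely for illustration.

```python
# Minimal serverless sketch: a Lambda handler fired by S3 "ObjectCreated" events.
# The bucket/key come from the event payload; downstream processing is a placeholder.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()
        # Placeholder: parse, enrich, or forward the payload to the next pipeline stage.
        print(f"Processed {len(payload)} bytes from s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("ok")}
```

Wiring S3 event notifications to such a function, and pairing it with other serverless services (Kinesis, Athena, Glue), keeps the whole pipeline free of servers to manage.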
Architecture Principles
AWS recommends some architecture principles that can improve the deployment of a data analytics pipeline on the cloud. They are tailored towards the AWS cloud, but may be extended to any other cloud provider as well.
Build decoupled systems
Decoupling stands out as a paramount architectural tenet, regardless of domain or architectural style. This holds especially true when constructing a data analytics pipeline on AWS. The six steps comprising the analytics process (Collect, Store, Process, Store, Analyze, and Answer) must be sufficiently decoupled so that individual steps can be replaced or scaled independently of one another.
Right Tool for Right Job
AWS prescribes distinct services for each stage of the data processing pipeline, depending on factors such as data structure, latency, throughput, and access patterns. These considerations are discussed in more detail later in this post. Following these recommendations can yield substantial cost reductions and performance improvements for the pipeline, so it is important to understand the nuances of each service and its corresponding use case.
Leverage Managed & Serverless services
The core architectural tenet for any application deployed on the AWS cloud is to prioritize services over servers. While it remains feasible to provision a fleet of EC2 instances and deploy an open-source analytics framework, the overarching goal is to eschew reinventing the wheel and instead capitalize on the offerings provided by AWS. This approach not only facilitates deployment but also ensures optimization and scalability within the AWS ecosystem.
Use event-journal design pattern
The adoption of an event-journal design pattern is strongly advocated for constructing a data analytics pipeline on AWS. In this pattern, data is amassed into an S3 bucket, which serves as the authoritative source and remains unaltered by any other service. Consequently, disparate services can access this data autonomously, obviating the necessity for synchronization. Given the rapid influx of data, preserving a singular source of truth is paramount to ensure resilience against potential component failures.
S3 also provides efficient data lifecycle management, allowing us to transition data to Glacier as it ages, which helps achieve significant cost reductions. A minimal sketch of such a lifecycle rule is shown below.
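For example, a lifecycle rule along these lines (a hedged sketch; the bucket name, the raw/ prefix, and the 90-day threshold are assumptions) transitions aging journal data to Glacier:

```python
# Sketch: attach a lifecycle rule that moves raw journal data to Glacier after 90 days.
# "my-event-journal" and the "raw/" prefix are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-event-journal",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-events",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```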
Be cost conscious
Once more, this emphasizes a fundamental aspect that transcends the realm of AWS or Big Data. Cost-saving measures are integral considerations in any application architecture. However, AWS offers an array of opportunities and diverse techniques to address this concern, including auto-scaling, pay-as-you-go (PAYG) models, and serverless computing. Leveraging these resources is essential when operating within the AWS environment to optimize costs effectively.
Enable AI Services
Data holds little value if we cannot extract insights and derive utility from it. AWS offers a comprehensive suite of machine learning-based services, ranging from SageMaker to Comprehend and Alexa. Each of these services presents unique capabilities that can be integrated into data analytics workflows to extract meaningful insights and facilitate actionable outcomes from the analyzed data.
Once more, if cloud lock-in is a concern, you have the option to host your own Jupyter Notebooks on a provisioned EC2 instance. However, it's important to note that AWS services such as SageMaker and Comprehend offer significant utility and can enhance the value of the data pipeline considerably.
Temperature of Data
We can select the most suitable solution for our problem by evaluating the use case, including the type of data, required processing, and the value derived from insights. To comprehend these factors effectively, it's crucial to grasp the concept of "Temperature of the Data," which serves as an indicator of the volume, velocity, and variety of the data undergoing processing.
|  | Hot | Warm | Cold |
| --- | --- | --- | --- |
| Volume | MB-GB | GB-TB | PB-EB |
| Item Size | B-KB | KB-MB | KB-TB |
| Latency | Microseconds to milliseconds | Milliseconds to seconds | Minutes to hours |
| Durability | Low | High | Very high |
| Request Rate | Very high | High | Low |
| Cost/GB | $$-$ | $$-c | c |
Hot data exhibits high velocity, often arriving in small, frequent chunks to ensure minimal latency and immediate processing. However, it typically possesses low durability as its relevance diminishes rapidly over time. In contrast, cold data is characterized by its immense volume, often received in large batches, and is commonly associated with offline processing of archived data.
The temperature of the data dictates the selection of techniques, services, and architectural patterns employed for its processing. For instance, insights derived from a missile interceptor must be processed with extremely low latency, as even a second's delay could render them meaningless. Conversely, insights gleaned from analyzing bulk videos received from Mars, pertaining to potential discoveries of alien life, can tolerate a degree of delay.
With the foundational concepts established, let's delve into the key AWS services utilized within the data analytics pipeline.
AWS proposes a six-step data pipeline: Collect, Store, Process, Store (again), Analyze, and Answer. We will now explore each of these steps in detail, alongside an overview of the various AWS services relevant to each stage, highlighting their respective significance and applications.
Collect
Data input can be categorized into three types of sources. Depending on the nature of the data source and its inherent characteristics, it's essential to select an appropriate storage solution for housing this raw, unprocessed data.
Data originating from sources such as Web/Mobile Apps and Data Centers typically exhibits structured and transactional attributes. These data streams may be received through platforms like Amplify or via standard Web Service calls facilitated by API Gateway, or similar low-volume transactional sources. They can be efficiently pushed and stored in SQL or NoSQL databases. Additionally, the utilization of in-memory databases like Redis can also be advantageous for managing this type of data efficiently.
Migration data and application logs primarily consist of file-based content, often comprising media or log files. These files tend to be large and are typically received from AWS migration services or from CloudWatch Logs. S3 (Amazon Simple Storage Service) is the natural choice for storing such data, providing the scalability and durability needed to accommodate large volumes of file-based content efficiently.
For data originating from IoT devices, sensors, mobile tracking, and multimedia sources, the data often arrives as continuous streams. Managing such data involves handling events and pushing them into stream storage systems such as Kafka, Kinesis Data Streams, or Kinesis Firehose. Kafka is well suited as a high-throughput distributed streaming platform. Kinesis Data Streams provides managed stream storage, while Kinesis Firehose excels at managed data delivery, simplifying the process of ingesting streaming data and delivering it to designated destinations.
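As a rough sketch of this ingestion path, the snippet below pushes a single sensor reading into a Kinesis data stream; the stream name and record shape are assumptions for illustration only.

```python
# Sketch: pushing one IoT sensor reading into a Kinesis data stream.
import json

import boto3

kinesis = boto3.client("kinesis")

reading = {"device_id": "sensor-42", "temperature_c": 21.7, "ts": "2024-01-01T00:00:00Z"}

kinesis.put_record(
    StreamName="iot-telemetry",           # hypothetical stream name
    Data=json.dumps(reading).encode("utf-8"),
    PartitionKey=reading["device_id"],     # shard records by device
)
```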
Store
The subsequent step involves storing the input data, for which AWS offers a diverse array of options tailored to various use cases. Each storage solution comes with its own set of advantages and drawbacks, which must be carefully evaluated based on the specific requirements of the use case at hand.
S3 is perhaps the most popular of the lot.
It is natively supported by big data frameworks (Spark, Hive, Presto, and others).
S3 offers the advantage of decoupling storage from compute, eliminating the need to deploy compute clusters solely for storage purposes, as is the case with HDFS. This proves beneficial when running transient EMR (Elastic MapReduce) clusters on EC2 spot instances. Moreover, S3 allows multiple heterogeneous analysis clusters and services to access and utilize the same dataset efficiently.
S3 provides very high durability: eleven nines (99.999999999%).
Utilizing S3 within a single region proves highly cost-effective since there's no charge for data replication within that region. This cost-saving aspect enhances the affordability of storing and accessing data within a localized AWS environment.
And above all, S3 is secure. It provides SSL/TLS encryption in transit as well as server-side encryption at rest.
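As a small illustration of the security point, the sketch below uploads a log file with server-side encryption (SSE-KMS) enabled; the bucket, key, and file names are placeholders.

```python
# Sketch: upload a file to S3 with server-side encryption (SSE-KMS) applied.
import boto3

s3 = boto3.client("s3")

with open("app.log", "rb") as f:
    s3.put_object(
        Bucket="my-data-lake",                 # hypothetical bucket
        Key="logs/2024/01/01/app.log",
        Body=f,
        ServerSideEncryption="aws:kms",        # encrypt the object at rest
    )
```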
In addition to S3, AWS offers a variety of database services, including both managed and serverless options, to cater to diverse storage requirements. These database services provide scalable and reliable solutions for storing and managing data efficiently within the AWS ecosystem.
ElastiCache - Managed Memcached or Redis service
DynamoDB - Managed Key-Value / Document DB
DynamoDB Accelerator (DAX) - Managed in memory cache for DynamoDB
Neptune - Managed Graph DB
RDS - Managed Relational Database
Given the extensive array of solutions at our disposal, the pertinent question arises: which one should be employed? AWS advises leveraging the following criteria to discern the most suitable solution for our specific requirements. Central to this analysis are considerations regarding the volume, variety, and velocity of the data, along with the pertinent access patterns.
Relational databases excel in maintaining robust referential integrity through strongly consistent transactions and scalable architecture. They accommodate complex queries using SQL, offering versatility in data retrieval and manipulation.
Key-value databases prioritize low-latency performance, facilitating high throughput and swift data ingestion via key-based queries. While they excel in simplicity, they typically support straightforward queries with filters.
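To make the key-value access pattern concrete, here is a minimal DynamoDB sketch; the table name and key schema are hypothetical.

```python
# Sketch of the key-value pattern on DynamoDB: single-item put and get by key.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("sessions")  # hypothetical table keyed on "session_id"

table.put_item(Item={"session_id": "abc-123", "user": "alice", "ttl": 1735689600})

response = table.get_item(Key={"session_id": "abc-123"})
print(response.get("Item"))
```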
Document databases specialize in indexing and storing documents, offering flexible querying capabilities across various properties. They support queries with filters, projections, and aggregates, making them suitable for diverse document-centric applications.
In-memory databases and caches offer ultra-low latency in the range of microseconds, ideal for time-sensitive applications. These systems support key-based queries and often employ specialized data structures for optimized performance. They facilitate simple query methods with filters, making them efficient for rapid data retrieval.
Graph databases are particularly advantageous for modeling and traversing complex relationships between data entities. They excel in expressing queries in terms of these intricate relations, providing a robust framework for analyzing interconnected datasets.
We can summarize this in the two tables below. First, based on the data structure:
| Data Structure | Database |
| --- | --- |
| Fixed schema | SQL, NoSQL |
| No schema | NoSQL, Search |
| Key-value | In-memory, NoSQL |
| Graph | GraphDB |
And based on the data access patterns:
| Data Access Pattern | Database |
| --- | --- |
| Put/Get (key-value) | In-memory, NoSQL |
| Simple relationships (1:N, M:N) | NoSQL |
| Multi-table joins, transactions | SQL |
| Faceting, search | Search |
| Graph traversal | GraphDB |
As anticipated, it's rare for the selection criteria based on data structure and access pattern to align perfectly. In such instances, where there's a mismatch between storage choice dictated by data structure versus access pattern, a decision must be made based on which criterion holds greater significance. This necessitates identifying the more prominent factor and making an informed trade-off to select the most suitable storage solution.
Based on the use case, we can choose a particular database using the below chart:
|  | ElastiCache | DAX | Aurora | RDS | Elasticsearch | Neptune | S3+Glacier |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Use Cases | In-memory caching | Key/value lookups, document store | OLTP, transactional | OLTP, transactional | Log analysis, reverse indexing | Graph | File store |
| Performance | Ultra-high request rate, ultra-low latency | Ultra-high request rate, ultra-low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High throughput |
| Data Shape | Key/value | Key/value and document | Relational | Relational | Documents | Nodes/edges | Files |
| Data Size | GB | TB, PB | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB |
| Cost/GB | $$ | cc-$$ | cc | cc | cc | cc | cc |
| Availability | 2 AZ | 3 AZ | 3 AZ | 2 AZ | 1-2 AZ | 3 AZ | 3 AZ |
| VPC Support | Inside VPC | VPC endpoint | Inside VPC | Inside VPC | Inside VPC | Inside VPC | VPC endpoint |
Process
The subsequent stage in the pipeline involves processing the available data. AWS furnishes an extensive array of options for data processing, presenting us with a multitude of choices and decisions to be made.
We have three major use cases when processing big data:
Interactive & Batch Processing
When engaging in interactive or batch analytics processing, the anticipated data activity level is relatively lower, implying a lower "heat" compared to real-time or streaming scenarios. Despite expectations of interactive analytics being "hot," the data volumes involved in interactive sessions are typically modest, thus not meeting the threshold for being classified as "hot" data. Additionally, the responsiveness required for user perception in interactive analytics may not necessarily align with the rapid pace of real-time data processing from a data analytics standpoint. For such use cases, AWS recommends one of the following services.
AWS Elasticsearch - Managed service for Elasticsearch
Redshift & Redshift Spectrum - Managed data warehouse; Spectrum enables querying data in S3
Athena - Serverless interactive query service (see the sketch after this list)
EMR - Managed Hadoop framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, and others
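As an illustration of the serverless interactive option, the sketch below runs an ad-hoc Athena query from Python; the database, table, and results bucket are placeholder assumptions.

```python
# Sketch: run an interactive Athena query over data in S3 and print the result rows.
import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},                  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```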
Streaming and Realtime Analytics
Conversely, when dealing with continuous data streams, such as those originating from IoT devices and sensors, and necessitating real-time processing, a distinct set of processing services must be considered.
Spark Streaming on EMR
Kinesis Data Analytics - Managed service for running SQL on Streaming Data
Kinesis Client Library (KCL) - Library for building custom stream-processing applications
Lambda - Run code serverless; services such as S3 can publish events to Lambda, and Lambda can poll events from a Kinesis stream (a minimal handler sketch follows the comparison table below)
|  | EMR (Spark Streaming) | KCL Application | Kinesis Data Analytics | Lambda |
| --- | --- | --- | --- | --- |
| Managed Service | Yes | No | Yes | Yes |
| Serverless | No | No | Yes | Yes |
| Scale/Throughput | No limits, depends on node count | No limits, depends on node count | No limits, scales automatically | No limits, scales automatically |
| Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ |
| Sliding Window Functions | Built-in | App needs to implement | Built-in | No |
| Reliability | Spark checkpoints | KCL checkpoints | Managed by Kinesis Data Analytics | Managed by Lambda |
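To illustrate the Lambda column of the table above, here is a minimal sketch of a handler invoked with a batch of Kinesis records (the data arrives base64-encoded); the alerting rule is a made-up example.

```python
# Sketch: Lambda handler consuming a batch of Kinesis records in near real time.
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads are delivered base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Placeholder rule: flag unusually high readings as they stream in.
        if payload.get("temperature_c", 0) > 80:
            print(f"ALERT: {payload['device_id']} reported {payload['temperature_c']} C")
```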
Predictive Analysis
Both scenarios outlined above may necessitate predictive analysis based on the available data. AWS offers a diverse array of AI services that can be leveraged across various levels to fulfill these predictive analysis requirements. Within the realm of AI services, AWS offers a hierarchical structure catering to different levels of abstraction:
Application Services: High-level SaaS (Software as a Service) offerings such as Rekognition, Comprehend, Transcribe, Polly, Translate, and Lex streamline data processing by delivering output through a single service call (a minimal invocation sketch follows this list).
Platform Services: AWS provides platforms like Amazon SageMaker, Amazon Mechanical Turk, and Amazon Deep Learning AMIs, empowering users to leverage custom AI models. These platforms facilitate the creation of tailored models capable of processing input data and generating meaningful insights. Additionally, users can harness generic AI frameworks like TensorFlow, PyTorch, or Caffe2 to power these platforms.
Infrastructure: At the foundational level, AWS grants users the flexibility to select hardware resources beneath containers or EC2 instances, supporting the chosen platform. Options include NVIDIA Tesla V100 GPU acceleration for AI/ML training, compute-intensive instances for AI/ML inference, and services like Greengrass ML for edge computing.
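As a small sketch of the application-services level, the calls below ask Comprehend for sentiment and Rekognition for image labels, each through a single API call; the sample text and the S3 image reference are illustrative assumptions.

```python
# Sketch: single-call AI services returning insights directly.
import boto3

comprehend = boto3.client("comprehend")
rekognition = boto3.client("rekognition")

# Sentiment of a piece of text (sample text is made up).
sentiment = comprehend.detect_sentiment(
    Text="The new dashboard is fast and the insights are genuinely useful.",
    LanguageCode="en",
)
print(sentiment["Sentiment"], sentiment["SentimentScore"])

# Labels detected in an image stored in S3 (bucket and key are hypothetical).
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-data-lake", "Name": "images/drone-frame-001.jpg"}},
    MaxLabels=5,
)
print([label["Name"] for label in labels["Labels"]])
```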
Once more, the pivotal question arises: Which analytics solution should I utilize? This decision can be guided by considering the data type and the selected mode of processing. By aligning these factors with the appropriate analytics tools, the optimal solution can be determined. For different modes of processing, various analytics solutions are tailored to meet specific timing requirements:
Batch Processing: This mode may take minutes to hours to complete and is suitable for generating daily, weekly, or monthly reports. Preferred services include EMR (utilizing MapReduce, Hive, Pig, or Spark).
Interactive Processing: Tasks in this mode typically take seconds to complete, such as self-service dashboards. Recommended services comprise Redshift, Athena, and EMR (utilizing Presto or Spark).
Stream Processing: This mode necessitates milliseconds to seconds for processing, catering to use cases like fraud alerts or real-time metrics. Services like EMR (with Spark streaming), Kinesis Data Analytics, KCL (Kinesis Client Library), and Lambda are suitable for such scenarios.
Predictive Analytics: This mode requires real-time (milliseconds) or batch (minutes) processing and encompasses tasks like fraud detection, demand forecasting, and speech recognition. Implementations can leverage services such as SageMaker, Polly, Rekognition, Transcribe, Translate, EMR (utilizing Spark ML), and Deep Learning AMI (supporting MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe2).
Analysis
Before data can be consumed, it typically needs to go through ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load). AWS offers a range of tools to facilitate ELT/ETL processes. The table below provides an overview of these services and their implications; a minimal Glue job sketch follows it.
|  | Glue | Data Pipeline | Database Migration Service | EMR | Apache NiFi | Partner Solutions |
| --- | --- | --- | --- | --- | --- | --- |
| Use Case | Serverless ETL | ETL data workflow | Migrate databases (to/from data lakes) | Custom-developed Hadoop/Spark | Automate the flow of data between systems | Rich partner ecosystem for ETL |
| Scale/Throughput | ~DPUs | ~Nodes, through EMR cluster | EC2 instance type | ~Nodes | Self-managed | Self-managed or through partner |
| Managed Service | Clusterless | Managed | Managed EC2 on your behalf | Managed EC2 on your behalf | Self-managed on EMR or Marketplace | Self-managed or through partner |
| Data Sources | S3, RDBMS, Redshift, DynamoDB | S3, JDBC, Custom RDBMS, data warehouses | S3, various | Managed Hadoop/Spark | Various through rich processor framework | Various |
| Skills Needed | Wizard for simple mapping, code snippets for advanced ETL | Wizard and code snippets | Wizard and drag/drop | Hadoop/Spark coding | NiFi processors and some coding | Self-managed or through partner |
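To give a feel for the Glue option, below is a hedged skeleton of a Glue PySpark job that reads a catalogued table, remaps a couple of columns, and writes Parquet back to S3; the database, table, column names, and output path are assumptions for illustration.

```python
# Sketch of a Glue ETL job script (runs inside the Glue job environment).
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (names are hypothetical).
source = glueContext.create_dynamic_frame.from_catalog(database="raw_db", table_name="events")

# Rename/retype a couple of fields as a simple transform step.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "string"), ("ts", "string", "event_time", "timestamp")],
)

# Write the curated output back to the data lake as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/events/"},
    format="parquet",
)
job.commit()
```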
Consume
Ultimately, the processed data is consumed by services capable of deriving meaningful insights or presenting the information in a user-friendly format. These consuming services can range from AI services that analyze the data to make decisions, to user interface platforms that render insights in an accessible manner. Possible consuming services include AI applications, Jupyter notebooks, Anaconda, R Studio, Kibana, Quicksight, Tableau, Looker, MicroStrategy, Qlik, and more. Each of these platforms offers unique capabilities and interfaces tailored to different user preferences and requirements.
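For instance, consuming curated output in a notebook can be as simple as the sketch below, which pulls processed Parquet files from the data lake into pandas for exploration (this assumes the s3fs package is installed and uses a hypothetical bucket and prefix).

```python
# Sketch: load curated Parquet output from S3 into pandas inside a notebook.
import pandas as pd

df = pd.read_parquet("s3://my-curated-bucket/curated/daily_metrics/")
print(df.describe())
```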
Sum Up
The following diagram sums up the entire process of data analytics, along with the various choices available to us.

Sample Architecture
Let's examine a sample architecture for a real-time streaming analytics pipeline, which leverages a suite of services for data processing and storage. Upon data stream ingestion, Kinesis Data Analytics conducts initial processing. Subsequently, the processed data is routed to various streaming data processing applications tasked with extracting and categorizing different data facets. This processed data is then directed to AI services for real-time predictive analysis as required.
The remaining data is stored in diverse data storage services, contingent upon the type of data extracted and segregated from the input stream. These stored datasets are subsequently utilized for generating notifications and insights. Moreover, the refined data stream is forwarded to downstream applications for further processing, should the need arise.
