

# Data ingestion and processing
<a name="data-ingestion-and-processing"></a>

Data ingestion and processing involves the collection, centralization, and analysis of data from multiple sources. When data is effectively ingested and processed, it helps teams understand the availability, security, performance, and reliability of their systems in real time. By streamlining data ingestion and processing, teams can make quicker and more effective decisions, enhancing the overall agility and reliability of their systems.

**Topics**
+ [Indicators for data ingestion and processing](indicators-for-data-ingestion-and-processing.md)
+ [Anti-patterns for data ingestion and processing](anti-patterns-for-data-ingestion-and-processing.md)
+ [Metrics for data ingestion and processing](metrics-for-data-ingestion-and-processing.md)

# Indicators for data ingestion and processing
<a name="indicators-for-data-ingestion-and-processing"></a>

The collection, centralization, and analysis of data from various sources. With this capability, teams can make quicker, more effective decisions, enhancing their systems' agility, security, and reliability.

**Topics**
+ [[O.DIP.1] Aggregate logs and events across workloads](o.dip.1-aggregate-logs-and-events-across-workloads.md)
+ [[O.DIP.2] Centralize logs for enhanced security investigations](o.dip.2-centralize-logs-for-enhanced-security-investigations.md)
+ [[O.DIP.3] Implement distributed tracing for system-wide request tracking](o.dip.3-implement-distributed-tracing-for-system-wide-request-tracking.md)
+ [[O.DIP.4] Aggregate health and status metrics across workloads](o.dip.4-aggregate-health-and-status-metrics-across-workloads.md)
+ [[O.DIP.5] Optimize telemetry data storage and costs](o.dip.5-optimize-telemetry-data-storage-and-costs.md)
+ [[O.DIP.6] Standardize telemetry data with common formats](o.dip.6-standardize-telemetry-data-with-common-formats.md)

# [O.DIP.1] Aggregate logs and events across workloads
<a name="o.dip.1-aggregate-logs-and-events-across-workloads"></a>

 **Category:** FOUNDATIONAL 

 Logs and events should be aggregated across multiple workloads to provide a comprehensive view of the entire system. This enables teams to troubleshoot, identify patterns, and resolve operational issues. 

 Implement a log aggregation solution that supports collecting logs from various sources and provides functions for filtering, searching, visualizing, and alerting. Make sure the solution provides real-time data collection, supports necessary data sources, and offers visualization options. The tool should be accessible to application teams, allowing them to monitor and troubleshoot their system as needed. 
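As an illustrative sketch of what an aggregation layer does, the following merges per-workload log streams into one time-ordered stream and runs a single query across it. The record fields, workload names, and messages are assumptions for illustration, not any specific agent's schema:

```python
from datetime import datetime

# Hypothetical log records as they might arrive from two separate workloads.
web_logs = [
    {"ts": "2024-01-01T00:00:02+00:00", "workload": "web", "level": "ERROR", "msg": "upstream timeout"},
    {"ts": "2024-01-01T00:00:00+00:00", "workload": "web", "level": "INFO", "msg": "request served"},
]
worker_logs = [
    {"ts": "2024-01-01T00:00:01+00:00", "workload": "worker", "level": "WARN", "msg": "retrying job"},
]

def aggregate(*sources):
    """Merge per-workload log streams into a single time-ordered stream."""
    merged = [record for source in sources for record in source]
    return sorted(merged, key=lambda r: datetime.fromisoformat(r["ts"]))

def search(stream, level=None, text=None):
    """Filter the aggregated stream, mimicking a log tool's query step."""
    return [r for r in stream
            if (level is None or r["level"] == level)
            and (text is None or text in r["msg"])]

stream = aggregate(web_logs, worker_logs)
print([r["workload"] for r in stream])        # interleaved across workloads
print(search(stream, level="ERROR")[0]["msg"])
```

A managed solution adds real-time collection, visualization, and alerting on top of this same merge-and-query pattern.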

**Related information:**
+  [AWS Well-Architected Reliability Pillar: REL11-BP01 Monitor all components of the workload to detect failures](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_monitoring_health.html)
+  [Cross-account cross-Region CloudWatch console](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Cross-Account-Cross-Region.html)
+  [Collect, analyze, and display Amazon CloudWatch Logs in a single dashboard with the Centralized Logging on AWS solution](https://docs.aws.amazon.com/solutions/latest/centralized-logging-on-aws/welcome.html)
+  [Centralized Logging with OpenSearch](https://aws.amazon.com/solutions/implementations/centralized-logging-with-opensearch/)
+  [Sending Logs Directly to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Sending-Logs-Directly-To-S3.html)
+  [One Observability Workshop](https://observability.workshop.aws/) 

# [O.DIP.2] Centralize logs for enhanced security investigations
<a name="o.dip.2-centralize-logs-for-enhanced-security-investigations"></a>

 **Category:** FOUNDATIONAL  

 Effective security investigations require the aggregation, standardization, and centralization of logs and events so they are readily accessible to investigation teams. Centralized logs and event data enhance the ability of security teams to conduct effective investigations, improve threat detection, and accelerate incident response times. 

 Use cloud-native tools or Security Information and Event Management (SIEM) solutions to aggregate, standardize, and centralize logs and event data while respecting regional boundaries and data sovereignty requirements. These tools are designed to collect and analyze logs and security events from various sources to provide a centralized view of an organization's security posture. Centralizing, normalizing, deduplicating, and removing unnecessary data allows security teams to use automation and scripted investigation tools, which leads to a faster and more efficient response process. 

 Given the sensitivity of this data, verify that it is accessible only to authorized security personnel and that strong access controls are in place to maintain data security and confidentiality. Grant least-privilege permissions so that users receive only the minimum level of access required to perform investigations. For instance, restrict the ability to overwrite this data. 
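The normalize-and-deduplicate step can be sketched as follows. The source formats, field names, and common schema here are illustrative assumptions, not a real SIEM's schema:

```python
# Normalize source-specific security events into one common shape, then
# deduplicate so investigators and scripted tools see each event once.
def normalize(event, source):
    """Map source-specific field names onto a common (illustrative) schema."""
    return {
        "time": event.get("timestamp") or event.get("eventTime"),
        "source": source,
        "action": (event.get("action") or event.get("eventName", "")).lower(),
        "principal": event.get("user") or event.get("userIdentity") or "unknown",
    }

def dedupe(events):
    """Drop exact duplicates before central storage."""
    seen, unique = set(), []
    for e in events:
        key = (e["time"], e["source"], e["action"], e["principal"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

raw = [
    normalize({"timestamp": 100, "action": "LOGIN_FAILED", "user": "alice"}, "idp"),
    normalize({"eventTime": 100, "eventName": "login_failed", "userIdentity": "alice"}, "idp"),  # duplicate
    normalize({"timestamp": 101, "action": "DELETE", "user": "bob"}, "app"),
]
central = dedupe(raw)
print(len(central))  # the duplicate login failure collapses to one record
```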

**Related information:**
+  [AWS Well-Architected Performance Pillar: PERF07-BP02 Analyze metrics when events or incidents occur](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_monitor_instances_post_launch_review_metrics.html)
+  [Collect, analyze, and display Amazon CloudWatch Logs in a single dashboard with the Centralized Logging on AWS solution](https://docs.aws.amazon.com/solutions/latest/centralized-logging-on-aws/welcome.html)
+  [Cross-account cross-Region CloudWatch console](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Cross-Account-Cross-Region.html)
+  [AWS Well-Architected Framework - Security Pillar - Detection](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec-detection.html)
+  [Amazon Security Lake](https://aws.amazon.com/security-lake/)
+  [Centralized Logging on AWS](https://aws.amazon.com/solutions/implementations/centralized-logging/)
+  [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/)
+  [Centralized Logging with OpenSearch](https://aws.amazon.com/solutions/implementations/centralized-logging-with-opensearch/)
+  [AWS Marketplace: SIEM](https://aws.amazon.com/marketplace/solutions/security/siem)

# [O.DIP.3] Implement distributed tracing for system-wide request tracking
<a name="o.dip.3-implement-distributed-tracing-for-system-wide-request-tracking"></a>

 **Category:** RECOMMENDED 

 Distributed tracing is a method to track requests as they move through distributed systems. It provides insights into system interactions across multiple services and applications, enabling quicker issue identification and resolution. 

 Use a tracing solution that is scalable, provides real-time data collection, and supports comprehensive visualization of tracing data. Integrate this solution with the log and event aggregation tools to enhance system-wide visibility. This gives a comprehensive view of the entire system and its dependencies, facilitating quick identification and resolution of issues. 
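The core mechanic of distributed tracing can be sketched in a few lines: a trace ID generated at the edge is propagated through every hop, and each hop records a span carrying that ID so the backend can reassemble the request path. The service names below are hypothetical:

```python
import uuid

spans = []  # stands in for a tracing backend's span store

def record_span(trace_id, service, name, parent=None):
    """Record one hop of the request; parent links spans into a tree."""
    span_id = uuid.uuid4().hex[:16]
    spans.append({"trace_id": trace_id, "span_id": span_id,
                  "parent": parent, "service": service, "name": name})
    return span_id

trace_id = uuid.uuid4().hex  # generated once, at the edge
root = record_span(trace_id, "api-gateway", "GET /orders")
child = record_span(trace_id, "orders-service", "fetch_order", parent=root)
record_span(trace_id, "payments-service", "charge", parent=child)

# Grouping spans by trace id reconstructs the request's path system-wide.
path = [s["service"] for s in spans if s["trace_id"] == trace_id]
print(" -> ".join(path))
```

Real tracing tools such as AWS X-Ray or OpenTelemetry SDKs automate the ID propagation (via request headers) and add timing, sampling, and visualization.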

**Related information:**
+  [AWS Well-Architected Reliability Pillar: REL06-BP07 Monitor end-to-end tracing of requests through your system](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_monitor_aws_resources_end_to_end.html)
+  [Distributed Tracing System – AWS X-Ray](https://aws.amazon.com/xray/)
+  [Amazon CloudWatch ServiceLens](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html)
+  [Amazon Managed Grafana](https://aws.amazon.com/grafana/)
+  [AWS X-Ray integration with Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/x-ray-data-source.html)

# [O.DIP.4] Aggregate health and status metrics across workloads
<a name="o.dip.4-aggregate-health-and-status-metrics-across-workloads"></a>

 **Category:** RECOMMENDED 

 Aggregate health and status metrics across all workloads for a unified view of the system's overall health. Aggregated health metrics provide a snapshot of the system's overall health and performance, aiding in proactive issue detection and efficient resource management. 

 Use a monitoring solution that allows aggregation of health metrics across all applications, supports real-time data collection, and provides intuitive visualization of metrics data. Integration with the logging, events, and tracing tools can provide a comprehensive view of overall system health. 
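A minimal sketch of this rollup: per-workload health samples are aggregated into one system-level view. The field names and the 5% error-rate threshold are illustrative assumptions, not a prescribed standard:

```python
def rollup(samples, threshold=0.05):
    """Aggregate per-workload health samples into a system-level summary."""
    per_workload = {}
    for s in samples:
        totals = per_workload.setdefault(s["workload"], {"errors": 0, "requests": 0})
        totals["errors"] += s["errors"]
        totals["requests"] += s["requests"]
    summary = {}
    for workload, totals in per_workload.items():
        rate = totals["errors"] / totals["requests"] if totals["requests"] else 0.0
        summary[workload] = {"error_rate": rate, "healthy": rate < threshold}
    system_healthy = all(v["healthy"] for v in summary.values())
    return summary, system_healthy

samples = [
    {"workload": "web", "errors": 2, "requests": 1000},
    {"workload": "web", "errors": 1, "requests": 1000},
    {"workload": "batch", "errors": 30, "requests": 100},
]
summary, system_healthy = rollup(samples)
print(summary["batch"]["healthy"], system_healthy)  # one unhealthy workload flags the system
```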

**Related information:**
+  [Amazon Managed Grafana](https://aws.amazon.com/grafana/)
+  [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/)
+  [Application Monitoring with Amazon CloudWatch](https://aws.amazon.com/solutions/implementations/application-monitoring-with-cloudwatch/)
+  [AWS Health Aware](https://github.com/aws-samples/aws-health-aware/) 

# [O.DIP.5] Optimize telemetry data storage and costs
<a name="o.dip.5-optimize-telemetry-data-storage-and-costs"></a>

 **Category:** RECOMMENDED 

 Optimize costs associated with storing and processing large amounts of telemetry data by using techniques like data filtering and compression. When dealing with non-security related telemetry data, data sampling can also be an effective method to reduce costs. 

 Select cost-effective solutions and consumption-based resources for data storage. Be strategic about data retention—remove unused or unnecessary data from storage regularly. Also, be selective about which data sources are ingested and ensure they are required for effective analysis to avoid unnecessary spend. Always remember that while managing costs is important, it should not compromise the integrity and completeness of your data, especially when it comes to security. 
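Two of the techniques above, sampling non-security telemetry and compressing what is kept, can be sketched as follows. The 10% sampling rate and event fields are illustrative assumptions; note that security events are deliberately exempt from sampling, per the guidance above:

```python
import gzip
import json
import random

def sample(events, rate, seed=0):
    """Keep all security events, and roughly `rate` of everything else."""
    rng = random.Random(seed)  # fixed seed keeps this sketch deterministic
    return [e for e in events if e.get("security") or rng.random() < rate]

events = [{"security": False, "msg": f"debug line {i}"} for i in range(1000)]
events.append({"security": True, "msg": "auth failure for admin"})

kept = sample(events, rate=0.1)          # drop ~90% of non-security noise
raw = json.dumps(kept).encode()
packed = gzip.compress(raw)              # then compress before storage
print(len(kept), len(packed) < len(raw))
```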

**Related information:**
+  [AWS Well-Architected Performance Pillar: PERF03-BP01 Understand storage characteristics and requirements](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_right_storage_solution_understand_char.html)
+  [AWS Well-Architected Sustainability Pillar: SUS04-BP05 Remove unneeded or redundant data](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sus_sus_data_a6.html)

# [O.DIP.6] Standardize telemetry data with common formats
<a name="o.dip.6-standardize-telemetry-data-with-common-formats"></a>

 **Category:** RECOMMENDED 

 Normalize telemetry data using a common format or standard schema to enhance consistency in data collection and reporting. This facilitates seamless correlation and analysis across multiple facets of observability, such as system performance, user behaviors, and security events, improving the overall speed and accuracy of detection and response in any of these areas. 

 Two notable open-source projects supporting this goal are OpenTelemetry and the Open Cybersecurity Schema Framework (OCSF). OpenTelemetry provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application and send them to any observability platform. OCSF, on the other hand, is an extensible, vendor-agnostic project designed to simplify data ingestion and normalization specifically for cybersecurity events. 

 Utilize a common telemetry format to streamline these processes, reduce associated costs of data processing, and allow teams to focus more on detecting and responding to actionable events. Guidelines should be established for the collection and reporting of data, enforcing consistency across all teams. Adopting and effectively using standard schemas or frameworks like OpenTelemetry and OCSF can provide considerable advantages in achieving comprehensive observability. 
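The payoff of a common schema is that one query spans formerly incompatible sources. The sketch below uses per-source mappers to translate native events into a shared shape; every field name here is an illustrative assumption, loosely in the spirit of OCSF-style normalization, not a real standard's schema:

```python
# Per-source mappers translate native fields into one common shape.
MAPPERS = {
    "nginx": lambda e: {"time": e["ts"], "category": "http",
                        "outcome": "failure" if e["status"] >= 500 else "success"},
    "auditd": lambda e: {"time": e["when"], "category": "auth",
                         "outcome": "failure" if e["result"] == "denied" else "success"},
}

def to_common(source, event):
    record = MAPPERS[source](event)
    record["source"] = source
    return record

records = [
    to_common("nginx", {"ts": 1, "status": 502}),
    to_common("auditd", {"when": 2, "result": "denied"}),
    to_common("nginx", {"ts": 3, "status": 200}),
]
# One query now correlates failures across both sources.
failures = [r for r in records if r["outcome"] == "failure"]
print(len(failures))
```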

**Related information:**
+  [OCSF Schema](https://schema.ocsf.io/) 
+  [OCSF GitHub](https://github.com/ocsf) 
+  [AWS Distro for OpenTelemetry](https://aws.amazon.com/otel/)
+  [OpenTelemetry](https://opentelemetry.io/) 

# Anti-patterns for data ingestion and processing
<a name="anti-patterns-for-data-ingestion-and-processing"></a>
+  **Over-reliance on ETL Tools:** Over-relying on ETL (Extract, Transform, Load) tools for data processing can lead to inflexibility and difficulties adapting to data source changes. Where possible, use tools with native integrations that allow ETL-free data processing and analysis pipelines, enabling a more flexible and scalable way to integrate data from multiple sources without introducing additional operational overhead. 
+  **Ignoring event correlation:** Ignoring the correlation of multiple alerts can hide broader issues. Incorporate event correlation into the observability strategy to quickly identify and resolve problems across multiple tools and systems. Utilize distributed tracing tools to trace requests across multiple services and dependencies to identify bottlenecks or issues, centralized logs and events for security investigations, and use normalized data formats to enable correlation of telemetry from multiple sources. 
+  **Inefficient data analysis:** Relying on monolithic or manual data processing methods leads to inefficient data analysis. Monolithic data processing of large volumes leads to long wait times, slow detection and reaction times, and potentially increased cost. Manual data processing, on the other hand, is error-prone and time-consuming. Overcome these inefficiencies by adopting scalable and distributed architectures like serverless computing, capable of handling large data volumes in parallel. Data processing should be automated wherever possible to ensure consistent, error-free, and efficient data analysis. 
+  **Lack of data governance:** Poor data governance practices can lead to inaccurate data, poor decision-making, and compliance risks. Establish and enforce data governance policies, including data quality checks, granular access control, and data provenance tracking. 
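The event-correlation anti-pattern above can be made concrete with a small sketch: alerts on the same resource within a short window are grouped into one incident instead of being triaged separately. The 60-second window and alert fields are illustrative assumptions:

```python
def correlate(alerts, window=60):
    """Group alerts sharing a resource within `window` seconds of each other."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["t"]):
        for group in groups:
            last = group[-1]
            if last["resource"] == alert["resource"] and alert["t"] - last["t"] <= window:
                group.append(alert)
                break
        else:
            groups.append([alert])
    return groups

alerts = [
    {"t": 0,   "resource": "db-1",    "msg": "high latency"},
    {"t": 30,  "resource": "db-1",    "msg": "connection errors"},
    {"t": 45,  "resource": "db-1",    "msg": "replica lag"},
    {"t": 500, "resource": "cache-2", "msg": "evictions spike"},
]
incidents = correlate(alerts)
print(len(incidents))  # three db-1 alerts collapse into one incident
```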

# Metrics for data ingestion and processing
<a name="metrics-for-data-ingestion-and-processing"></a>
+  **Data ingestion rate**: The amount of data ingested by monitoring systems in a given time period. A high, sustained ingestion rate indicates that the system can effectively process large volumes of telemetry data, leading to more accurate insights. Measure this metric by calculating the volume of data ingested by the monitoring systems per unit of time. 
+  **Data processing latency**: The time it takes for telemetry data to be processed and made available for analysis. Lower data processing latency enables teams to quickly assess and act on insights from telemetry data. Measure the time elapsed between data ingestion and the availability of processed data for analysis. 
+  **Data cost efficiency**: The cost of collecting, storing, and processing telemetry data compared to the number of actionable insights generated or decisions made based on those insights. This metric helps ensure that resources are used effectively and unnecessary expenses are minimized. Calculate the total cost of data collection, storage, and processing, and compare it with the actionable insights it provides. 
+  **Anomaly detection rate**: The percentage of anomalies detected by the monitoring systems. A higher anomaly detection rate indicates that the system is effective in identifying potential issues, enabling teams to proactively address them. Measure this metric by calculating the number of anomalies detected by the monitoring systems, divided by the total number of events, then multiply by 100 for the percentage. 
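The four metrics above reduce to simple formulas; the sketch below computes each from hypothetical counter values (all inputs are illustrative assumptions):

```python
def ingestion_rate(bytes_ingested, seconds):
    """Volume of data ingested per unit of time (bytes per second)."""
    return bytes_ingested / seconds

def processing_latency(ingested_at, available_at):
    """Seconds between ingestion and availability for analysis."""
    return available_at - ingested_at

def cost_per_insight(total_cost, actionable_insights):
    """Total collection/storage/processing cost per actionable insight."""
    return total_cost / actionable_insights

def anomaly_detection_rate(anomalies, total_events):
    """Detected anomalies as a percentage of all events."""
    return anomalies / total_events * 100

print(ingestion_rate(600_000_000, 60))     # 10 MB ingested per second
print(processing_latency(100.0, 102.5))    # 2.5 s until data is queryable
print(cost_per_insight(1200.0, 48))        # $25 per actionable insight
print(anomaly_detection_rate(12, 4800))    # 0.25% of events flagged
```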