Cloud Monitoring Solution Architect (AWS/Azure/GCP)

Jul 29, 2025 - Senior

$2,500.00 Fixed

Overview:

We are seeking a Senior Cloud Monitoring & Observability Architect to design, implement, and optimize a comprehensive monitoring strategy for our cloud-native applications and infrastructure on one or more major cloud platforms (AWS, Azure, or GCP). The primary objective is clear visibility into system health, performance, security, and cost efficiency. The project aims to establish a proactive monitoring framework that enables rapid incident detection, efficient troubleshooting, and data-driven optimization of our cloud resources. Depending on the agreed scope, the work can range from a foundational setup for basic visibility to an enterprise-grade solution for complex multi-cloud or hybrid environments, including advanced security monitoring and cost optimization.

Responsibilities:

Phase 1: Cloud Environment Assessment & Strategy:

  • Conduct a thorough review of our existing cloud infrastructure, applications, and services on the designated cloud platform(s).
  • Collaborate with engineering and operations teams to understand specific monitoring requirements, KPIs, and pain points.
  • Design a holistic cloud monitoring architecture leveraging native cloud services (e.g., AWS CloudWatch, Azure Monitor, Google Cloud Monitoring) augmented by third-party tools (e.g., Datadog, New Relic, Splunk) where beneficial.
  • Define monitoring standards, naming conventions, and best practices for consistency across the cloud environment.
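
As an illustration of what an enforceable naming convention could look like, here is a minimal Python sketch. The pattern `<team>.<service>.<metric>_<unit>`, the allowed units, and the function name `invalid_metric_names` are hypothetical assumptions made for this example only; the actual convention would be defined during Phase 1.

```python
import re

# Hypothetical convention (an assumption, not a project standard):
#   <team>.<service>.<metric>_<unit>, all lowercase,
#   with the unit limited to ms, bytes, count, or percent.
METRIC_NAME = re.compile(
    r"^[a-z][a-z0-9]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*_(ms|bytes|count|percent)$"
)


def invalid_metric_names(names):
    """Return the names that do not follow the convention, in input order."""
    return [name for name in names if not METRIC_NAME.fullmatch(name)]


if __name__ == "__main__":
    candidates = [
        "payments.api.latency_ms",
        "Payments.API.Latency",
        "ops.queue.depth_count",
    ]
    for bad in invalid_metric_names(candidates):
        print(f"non-conforming metric name: {bad}")
```

A check like this could run in CI against dashboards and alert definitions kept in version control, so that inconsistencies are caught before deployment.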

Phase 2: Metrics, Logs & Tracing Implementation:

  • Implement robust metrics collection for all cloud resources (EC2, Lambda, S3, RDS, AKS, GKE, etc.), custom application metrics, and business-level metrics.
  • Configure centralized log aggregation and management solutions (e.g., CloudWatch Logs, Azure Log Analytics, Google Cloud Logging) with appropriate retention policies.
  • Implement distributed tracing (e.g., AWS X-Ray, Azure Application Insights, Google Cloud Trace) for end-to-end visibility into microservices interactions.
  • Set up synthetic monitoring and real user monitoring (RUM) for critical application endpoints.

Phase 3: Alerting, Dashboards & Automation:

  • Configure highly effective and actionable alerting rules with appropriate thresholds and notification channels (e.g., PagerDuty, Slack, email).
  • Develop intuitive, role-specific dashboards in native cloud consoles or Grafana for real-time operational visibility.
  • Implement automated remediation actions triggered by specific alerts (e.g., auto-scaling, self-healing).
  • Establish cost monitoring and alerting to optimize cloud spending and provide detailed reporting.
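
To make "actionable alerting rules with appropriate thresholds" more concrete, the sketch below shows one common way to reduce alert noise: fire only when a minimum number of recent datapoints breach the threshold, similar in spirit to the "M out of N datapoints" setting offered by some cloud alarm services. The class `AlertRule`, its field names, and the sample values are hypothetical and not tied to any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when enough recent datapoints breach the threshold."""
    name: str
    threshold: float
    datapoints_to_alarm: int  # M: breaches required
    evaluation_periods: int   # N: size of the trailing window


def is_firing(rule, samples):
    """Return True if at least M of the last N samples exceed the threshold.

    Returns False when fewer than N samples are available, so a rule
    does not fire on incomplete data.
    """
    window = samples[-rule.evaluation_periods:]
    if len(window) < rule.evaluation_periods:
        return False
    breaches = sum(1 for value in window if value > rule.threshold)
    return breaches >= rule.datapoints_to_alarm


if __name__ == "__main__":
    cpu_high = AlertRule("cpu_high", threshold=80.0,
                         datapoints_to_alarm=3, evaluation_periods=5)
    print(is_firing(cpu_high, [50, 85, 90, 70, 95]))
```

Requiring several breaches within a window trades a slightly slower alert for fewer false pages caused by short spikes; the right values of M and N depend on the metric and the on-call tolerance agreed with the team.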


Phase 4: Optimization, Security Monitoring & Knowledge Transfer:

  • Continuously optimize monitoring configurations for performance, accuracy, and cost-efficiency.
  • Integrate security monitoring with cloud security services and SIEM systems for comprehensive threat detection.
  • Provide comprehensive documentation of the cloud monitoring architecture, configurations, dashboards, alerting policies, and best practices for ongoing management.
  • Conduct in-depth training sessions and provide strategic advice to our operations and development teams, and potentially leadership, on using the monitoring platform for troubleshooting, performance analysis, security insights, and cost optimization.
  • Perform post-implementation performance and cost reviews with optimization recommendations.

Required Qualifications:

  • At least 4 years of hands-on experience in cloud architecture or DevOps roles, with at least 4 years specifically focused on designing and implementing comprehensive monitoring solutions on AWS, Azure, or GCP.
  • Expert-level proficiency with the monitoring services of at least one major cloud provider (e.g., AWS CloudWatch, AWS X-Ray, AWS Config; Azure Monitor, Azure Application Insights; Google Cloud Monitoring, Google Cloud Logging, Google Cloud Trace).
  • Strong understanding of cloud architecture patterns, serverless computing, containers (Docker, Kubernetes), and their monitoring implications.
  • Experience with third-party APM/observability tools (e.g., Datadog, New Relic, Splunk, Dynatrace) is a significant plus.
  • Proficiency in scripting (e.g., Python, PowerShell, Bash) and Infrastructure as Code (e.g., CloudFormation, Terraform, ARM templates) for automating monitoring deployments.
  • Deep understanding of metrics, logs, and traces, and how to effectively aggregate and analyze them.
  • Excellent analytical skills to interpret complex cloud data and identify performance/cost optimization opportunities.
  • Exceptional communication and presentation skills to convey complex monitoring strategies to various stakeholders.

Key Skills:

  • Cloud Monitoring
  • Observability
  • AWS CloudWatch
  • Azure Monitor
  • Google Cloud Monitoring
  • AWS X-Ray

Expectations for Support from Freelancer:

  • Responsiveness: Prompt communication and response to inquiries (within 24 hours on weekdays, with faster response for critical issues as agreed upon).
  • Availability: Willingness to be available for urgent issues or critical updates, potentially outside standard business hours, with prior arrangement.
  • Troubleshooting: Ability to quickly diagnose and resolve any post-implementation issues that may arise.
  • Documentation Updates: Keep documentation current with any changes or optimizations made during the support phase.
  • Advisory: Provide expert advice on future scaling, security enhancements, or new feature implementations.

Project Goals:

  • Unparalleled Visibility: Ensure deep, real-time visibility into cloud system health, performance, and security across all deployed resources and applications.
  • Rapid Incident Detection & Resolution: Enable proactive identification and efficient troubleshooting of cloud-related issues, significantly reducing Mean Time To Resolution (MTTR).
  • Cost Optimization: Drive efficiency in cloud spending through intelligent monitoring, detailed cost reporting, and resource utilization analysis.
  • Enhanced Reliability & Performance: Improve the stability, responsiveness, and overall performance of cloud-native applications and infrastructure.
  • Data-Driven Decision Making: Empower engineering, operations, and leadership teams with actionable insights derived from comprehensive cloud data.
  • Strengthened Cloud Security: Integrate monitoring with cloud security services to enhance threat detection and compliance.

  • Brazil
  • Proposals: 2
  • Verified
  • Less than a week

Lucas Pereira
São Paulo, Brazil
Member since Oct 26, 2024
Total jobs: 6