$55.00 Hourly
Overview:
We are a fintech startup, and our platform must have high uptime. While our incident response is decent, we lack a structured process for learning from failures.
The Challenge:
We frequently experience service disruptions, but our post-incident analysis is informal and inconsistent. We do not have a standard procedure to conduct a thorough Root Cause Analysis (RCA), which means we often fix the symptom rather than the underlying problem.
Problems Caused:
This lack of a formal process leads to recurring incidents and a failure to improve our system's reliability over time. It prevents us from implementing long-term fixes and building a more resilient infrastructure.
Proposed Method:
The freelancer will be responsible for creating and documenting a formal RCA process. This includes developing a template for incident reports, defining a timeline for analysis, and establishing a clear chain of communication and accountability for follow-up actions.
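As a rough illustration of what such a template might capture, here is a minimal sketch in Python. All class names, field names, and the 5-day analysis window are hypothetical assumptions for illustration, not requirements of this posting; the actual structure is for the freelancer to propose.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ActionItem:
    """A follow-up action with a single accountable owner (hypothetical shape)."""
    description: str
    owner: str
    due: date
    done: bool = False


@dataclass
class IncidentReport:
    """Illustrative incident report template; field names are assumptions."""
    title: str
    occurred_on: date
    symptom: str                      # what users or monitoring observed
    root_cause: str = ""              # filled in once the analysis is finished
    actions: list[ActionItem] = field(default_factory=list)
    analysis_window_days: int = 5     # assumed deadline, not from the posting

    def analysis_due(self) -> date:
        """Date by which the root cause analysis should be finished."""
        return self.occurred_on + timedelta(days=self.analysis_window_days)

    def is_complete(self) -> bool:
        """Complete only when a root cause and at least one action are recorded."""
        return bool(self.root_cause) and bool(self.actions)

    def overdue_actions(self, today: date) -> list[ActionItem]:
        """Open follow-up actions whose due date has passed."""
        return [a for a in self.actions if not a.done and a.due < today]
```

A report that records only the symptom would fail `is_complete()`, which is one way a template could discourage closing an incident before the underlying problem is identified.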
Required Skills:
Experience in Incident Management and Site Reliability Engineering (SRE).
Knowledge of ITIL or other incident management frameworks.
Strong technical writing and communication skills.
Experience Required:
3-5 years of experience in a role where you have led or participated in post-incident reviews.
Delivery:
A comprehensive, documented RCA process including templates and guidelines.
Support:
We require 2 weeks of post-delivery support to help our team adopt the new process and answer questions.
- Spain
- Proposal: 0
- Not Verified
- Less than a month
- Estimated Hours: 40
