SRE Roles & Responsibilities
Site Reliability Engineer: the term was coined at Google, and the role is gaining more attention day by day. It is a role dedicated to OBSERVABILITY, RESILIENCY, RELIABILITY AND MONITORING.
Even though SRE and DevOps engineers can be given a generic set of roles & responsibilities, organizations write their own job descriptions according to their current requirements and environment, and also take the developers' requirements into account. That is because SRE engineers are there to support developers and make sure applications are delivered smoothly.
Recently I met with Muthu, an SRE lead at a reputed MNC. According to Muthu, SRE roles are defined by the environment we work in. Responsibilities are added or removed according to the requirements and skill sets of the surrounding teams. For example, if the development team says it has the bandwidth to work on the AWS EKS setup, the SRE team stands down, lets the developers explore AWS EKS, and just provides suggestions on demand.
The core responsibilities of an SRE engineer
- Maintain high reliability and availability for software applications
- Participate in ~15% of production incidents and look for every possible way to fix the issues permanently
- Automate mundane tasks to avoid human error. Examples of such tasks: restarting services when an event is reported, manually running a log rotation script when a threshold issue is reported, rebooting a server, etc.
- Set up a robust monitoring, logging & alerting system. Capture all logs, analyse and monitor them, and take proactive action to avoid issues or application degradation. Track metrics such as availability, uptime, performance, latency and error count.
- Define SLIs & SLOs in collaboration with product owners. SLI (Service Level Indicator): a measurement of how the service behaves, for example the number of successful requests out of total requests. SLO (Service Level Objective): the target value you want an SLI to meet. You can set SLOs once you have determined the baseline system performance.
- Run proofs of concept on existing tools to add new features that will help improve the current system. Compare existing tools with new ones, explore the options and their advantages over the current tools, and decide which tools are right for the environment
- Incident post-mortems: write the incident root cause analysis, find the core reason behind the issue and prevent it from happening again
- Collaborate across departments: work closely with developers to understand what their applications need from the platform, understand the blockers, and provide solutions that make life easier for developers.
- Shift left to the L1 operations team: identify the mundane tasks the team performs and find an easy way to implement or deploy them with a one-click tool such as Rundeck, TeamCity, Jenkins or Concourse. After implementation, hand the task over to the L1 Ops team, who can then handle it without engineering intervention. This frees up the engineering team to work on product development
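The SLI and SLO definitions in the list above can be made concrete with a small Python sketch. It is only an illustration: the function names, the request counts and the 99.9% target are all hypothetical, and averaging past windows is just one simple way to estimate a baseline, not a prescribed method.

```python
from statistics import mean


def availability_sli(successful: int, total: int) -> float:
    """SLI from the list above: successful requests out of total requests."""
    if total <= 0:
        raise ValueError("total must be a positive request count")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return successful / total


def baseline_sli(window_counts: list[tuple[int, int]]) -> float:
    """Estimate baseline performance as the mean SLI over past windows.

    Assumption: a plain average is good enough for illustration.
    """
    return mean(availability_sli(ok, total) for ok, total in window_counts)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Hypothetical (successful, total) request counts for three past days.
history = [(99_940, 100_000), (99_980, 100_000), (99_960, 100_000)]
baseline = baseline_sli(history)

# Hypothetical target, chosen with product owners after looking at the baseline.
slo_target = 0.999

# Hypothetical counts for the current day.
today = availability_sli(99_850, 100_000)

print(f"baseline SLI: {baseline:.4%}")
print(f"today's SLI:  {today:.4%}")
print(f"SLO met today: {meets_slo(today, slo_target)}")
```

Measuring the baseline first matters because a target set far above what the system already delivers will be breached constantly, while one set far below it will never prompt any action.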
The SRE role is like ice cream flavors: each company has its own unique flavor according to its environment setup and requirements. OBSERVABILITY, RELIABILITY, RESILIENCY, MONITORING