Devsu - Engineering
Site Reliability Engineer (SRE) - GCP
Remote (US)
Posted 19 days ago
Seeking a Site Reliability Engineer (SRE) with expertise in monitoring, observability, and reliability engineering to support systems running on-premises and on Google Cloud Platform (GCP). The role involves designing, operating, and improving monitoring and observability platforms, and providing secondary (backup) support for application issues.
Location: Remote (US)
Responsibilities
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
- Establish observability standards and best practices across teams
- Improve visibility into system health, performance, and reliability
- Apply SRE principles to improve availability, performance, and resilience
- Define and track SLIs, SLOs, and error budgets
- Participate in on-call rotations and SEV incident response
- Lead or contribute to incident investigations and root cause analysis (RCA)
- Drive preventative actions to reduce repeat incidents
- Support and monitor Kubernetes environments (GKE and on-prem clusters)
- Monitor cluster health, capacity, and resource utilization
- Troubleshoot platform-level issues impacting application reliability
- Collaborate with Platform and Engineering teams on reliability improvements
- Provide L2/L3 application support during resource shortages, high-severity incidents, and peak periods
- Triage and troubleshoot application issues using runbooks and dashboards
- Collaborate with Application Support and Engineering teams during incidents
- Document actions, findings, and resolutions in ServiceNow
Requirements
- Strong experience as a Site Reliability Engineer or Reliability Engineer
- Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)
- Solid experience with monitoring and observability systems
- Production experience operating Kubernetes environments
- Experience supporting systems in GCP and on-prem environments (mandatory)
- Strong Linux systems and troubleshooting skills
- Fluent English (written and spoken)
- Ability to work in the PST time zone
- Ability to participate in an on-call rotation, including weekend coverage
Benefits
- Stable, long-term contract with career growth opportunities
- Private health insurance
- Remote-friendly culture promoting work-life balance
- Continuous training, mentorship, and learning programs
- Free access to AI training resources and tools
- Flexible Paid Time Off (PTO) policy and paid holidays
- Challenging software projects for clients in the US and LatAm
- Collaboration with talented software engineers in Latin America and the US