QAD | DevOps
Senior Site Reliability Engineer
Remote | Posted today
The Senior Site Reliability Engineer ensures the reliability, scalability, and performance of mission-critical services, drives automation, and shapes SRE practices.
Location: Remote
Responsibilities
- Design, implement, and maintain highly available, scalable, and resilient systems.
- Define, implement, and enforce best practices for monitoring, alerting, logging, tracing, and synthetic testing within AWS using Datadog.
- Develop robust, well-tested software and tooling for automation and reliability.
- Contribute to incident management, post-mortems, and reliability metrics.
- Manage infrastructure as code using Terraform and GitHub Actions.
- Provide expertise in system design reviews, architecture, and scalability.
- Share knowledge through documentation, runbooks, and mentorship.
Requirements
- Experience operating and improving production systems at scale.
- Ability to understand complex distributed systems.
- Troubleshooting skills and incident response experience.
- Experience with SLIs, SLOs, and error budgets.
- Strong communication skills.
Additional Information
- This role is fully remote.
- The role involves working with AWS, Kubernetes, Datadog, Terraform, and other observability and automation tools.
- The company values diversity, equity, and inclusion.