Site Reliability Engineer 3 (SRE3)
Not Accepting Applications
About the Job
Job Title: Site Reliability Engineer 3 (SRE3)
Experience Required: 4 to 6 years of experience
Location: Bangalore
Role:
Site Reliability Engineers at Flipkart are developers with an excellent operations mindset. As a Site Reliability Engineer, you will build solutions that scale our platforms and applications reliably for high availability and make sure Service Level Objectives (SLOs) are met. You will own all the SLOs of various Flipkart services across tiers. You will work directly with our Software Development teams to reduce the toil of developing, deploying and maintaining our software by adopting engineered solutions and reliability engineering best practices. You will be responsible for solving greenfield problems in reliability engineering and benchmarking, at scale.
Responsibilities:
- Help our engineers adopt the Flipkart Reliability Engineering playbook by abstracting away the context and complexities of a hybrid cloud.
- Build, coach and mentor teams of Site Reliability Engineers
- Ensure that availability, reliability, security and similar considerations are built in, reviewed and adhered to at every stage of product development.
- Monitor and resolve issues in all environments. Ensure SLOs are met. Alert appropriately, build self-healing capabilities in the platforms, involve people when needed, and log tickets. Participate in a 24x7 on-call rotation.
- Run periodic resilience (chaos) experiments and continuously verify the state of reliability.
- Build and improve configuration and automation tools to remove toil in developing, deploying and maintaining software
- Own the RCA lifecycle for platform issues and be answerable to stakeholders (internal and external) on most service internals.
- Have a viewpoint on distributed systems’ performance and be able to drive capacity plans and scale requirements.
- Identify bottlenecks and areas for tuning, as long as major code changes are not necessary. For example, if you are working on a Hive benchmark and the MySQL connection pool is not externally configurable and its expansion policy is becoming a problem, you should be able to make the code change, build it, expose the config and continue the benchmark.
- Partner with the developer and DevOps teams in sharing the on-call load, and handle 24/7 platform support.
Qualifications:
- BTech or MTech in CS or equivalent, with 5+ years of experience working with highly available platforms in web-scale organizations. Around 1-2 years of demonstrated experience as a developer is good to have.
- Good troubleshooting skills for always-available, high-scale systems.
- Ability to collect all relevant data points and debugging artefacts/snapshots so that later debugging is as effective as possible.
- Expert level knowledge of at least one configuration management system (Ansible, Puppet, etc.).
- Understanding of networking basics such as HTTP, DNS, TCP/IP, ICMP, the OSI model, subnetting and load balancing, as well as DB sharding, partitioning, etc.
- Excellent written and verbal communication skills.
- Understanding of CI/CD and the ability to architect a workflow or a deployment plan.
- Ability to write software to automate API-driven tasks at scale using Python, Go, etc., and to develop application components wherever required using Scala, Python, C++ and Java.
About the company
Industry: Human Resources Services
Company Size: 51-200 Employees
Headquarters: Bangalore