R&D Manager

Date:  Apr 12, 2024
Location:  Hyderabad, India

Job Category:  R&D
Department:  Product & Technology

Manager – SRE – L1

Hyderabad, India

 

About CyberArk:

CyberArk, the global leader in privileged access management, helps organizations transform their business through improved security and reduced risk. As a trusted partner for thousands of companies around the world, CyberArk consistently sets the bar – driving innovation and helping our customers stay one step ahead of attackers.

 

Job Overview:

 

We are the Infrastructure Team of CyberArk. Our group provides back-end services to all CyberArk applications. Our group's mission is to protect sensitive assets using an access control layer (authentication & authorization), cryptography, and secure communication while practicing the highest security standards.

 

Our goal is to build and maintain a scalable, fault-tolerant, high-load, distributed system. We are searching for an outstanding SRE manager to lead and manage a team of Site Reliability Engineers, with a focus on ensuring the reliability, performance, and scalability of CyberArk’s SaaS services and AWS infrastructure. This role combines technical expertise, leadership, and collaboration to meet the organization's reliability and availability goals.

 

Responsibilities:

 

  • Team Leadership: Lead, mentor, and develop a team of Site Reliability Engineers, providing guidance and support in their daily work. Set clear performance goals and expectations for the team, and conduct regular performance reviews.
     
  • Incident Management: Develop and maintain an incident response plan to minimize downtime and service disruptions. Lead the team in responding to and resolving critical incidents, and ensure post-incident reviews are held to prevent recurrence. Participate in an on-call rotation to provide 24/7 support for critical incidents and escalations.
     
  • Monitoring and Alerting: Implement and maintain robust monitoring and alerting systems to proactively identify and address performance and reliability issues. Continuously refine alerting thresholds to minimize false alarms and maintain service reliability.
     
  • Cost Optimization: Conduct regular cost optimization audits. Work closely with infrastructure and development teams to forecast and plan future capacity needs so that the infrastructure can handle traffic growth.
     
  • SOPs: Maintain comprehensive documentation for SRE processes and procedures. Provide training and knowledge-sharing sessions to improve the overall technical expertise of the team.
     
  • Strategy and Planning: Define and execute a strategic vision for site reliability engineering, including setting objectives, goals, and key performance indicators. Collaborate with cross-functional teams to align SRE efforts with business objectives and priorities.
     

 

Qualifications:

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
  • 10+ years in Site Reliability Engineering with strong knowledge of AWS services. AWS Certification is a plus.
  • 5+ years of experience leading a team of SREs and collaborating with various stakeholders.
  • Knowledge of defining and monitoring system quality measures, including SLOs and SLAs.
  • Experience building tooling to improve system reliability, automate remediation of issues, or improve scalability.
  • Hands-on experience collecting and analyzing performance data, troubleshooting, and tuning. Exposure to APM tools is a plus.
  • Proficiency in system administration, cloud infrastructure, and release deployment.
  • Excellent problem-solving skills and the ability to work well under pressure.

 

 

 
