Site Reliability Engineer Lead - Hybrid

Mandaluyong, National Capital Region

Posted more than 30 days ago


Company: Nityo Infotech Services Philippines
Company Description: Nityo Infotech Corporation is the fastest-growing global IT Services & Solutions Company, headquartered in New Jersey, USA. Our services range from Application Management Outsourcing, Packaged Application Services, Remote Infrastructure Management, Product Development, and Support to higher value-added offerings, including Managed Platform and Product Engineering Services.
Contract Type: Full Time
Experience Required: 5 to 10 years
Education Level: Bachelor’s Degree
Number of vacancies: 1

Job Description

As the SRE Lead, you will lead and mentor the SRE Team at DFI Retail Group. You will be responsible for upholding the stability, resilience, and scalability of our services through automation, strategic testing, and robust engineering practices. You will leverage your software engineering expertise to automate operations, optimize system performance, and develop solutions that prevent recurring issues.

Qualifications:
•Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
•At least 5 years of proven experience as an SRE, DevOps Engineer, or in a similar role, demonstrating a strong understanding of software engineering principles, and a total of 8 years of work experience in IT.
•Hands-on experience in the administration of the Atlassian product suite (Jira, Confluence, and Bitbucket).
•In-depth knowledge of cloud platforms such as AWS, Azure, or GCP, including services related to compute, storage, networking, and databases.
•Proficiency in scripting languages like Python or PowerShell and experience with automation tools such as Terraform or Ansible.
•Familiarity with monitoring and logging systems (Prometheus, Zabbix, Grafana, ELK, Azure Monitor, Google Cloud Monitoring, New Relic).
•Hands-on experience with containerization technologies like Docker and container orchestration tools like Kubernetes.
•Strong understanding of networking concepts and protocols.
•Experience with CI/CD pipelines and tools for continuous integration, continuous delivery, and infrastructure automation.
•Solid understanding of security best practices for cloud environments.
•Strong analytical and problem-solving skills, with the ability to identify root causes and implement effective solutions.
•Excellent communication and collaboration skills, with the ability to work effectively within a team and communicate technical details to both technical and non-technical audiences.

Responsibilities:
•Team Leadership: Provide guidance and mentorship to the team, fostering their professional development and cultivating a high-performing, collaborative team environment.
•Technical Expertise: Act as a subject matter expert, providing technical guidance and championing industry best practices. You will drive the adoption of new technologies and methodologies to continuously enhance system reliability.
•Design and Implement Solutions for Reliability and Scalability: Develop and implement highly scalable and available system architectures to meet growing user demands without compromising performance.
•Automate Operations: Design, build, and integrate software tools to automate operational processes, including system monitoring, incident response, and deployment procedures.
•Optimize System Performance: Proactively monitor system performance, identify bottlenecks, and implement optimization strategies to ensure efficient resource utilization and service delivery.
•Implement and Manage Monitoring and Observability: Establish comprehensive service metrics and implement robust monitoring systems to track, analyze, and report on system reliability, performance, and efficiency, including, but not limited to, New Relic, Azure Monitor, and Google Cloud Monitoring. Utilize observability tools to gain deeper insights into system behavior and identify potential issues proactively.
•Incident Response and Resolution: Develop and implement strategies for rapid incident detection and response. Troubleshoot and resolve complex system issues, minimizing downtime and mitigating service disruptions.
•Capacity Planning and Performance Tuning: Conduct capacity planning analyses to anticipate future resource needs and ensure system scalability. Proactively tune system performance to optimize resource utilization and maintain service level agreements (SLAs).
•Collaboration with Development Teams: Work closely with software development teams to integrate reliability considerations throughout the software development lifecycle. Participate in code reviews, design discussions, and post-incident reviews to enhance system reliability and prevent recurring issues.
•Drive Continuous Improvement: Continuously evaluate existing processes and tools, identifying areas for improvement and automation. Research and implement new technologies and best practices to enhance system reliability and operational efficiency.
•Documentation and Knowledge Sharing: Create and maintain comprehensive documentation for systems, processes, and incident responses. Actively share knowledge and best practices with the team and organization.
•Administer Atlassian Product Suite: Manage and maintain the Atlassian product suite, including Jira, Confluence, and Bitbucket, ensuring seamless operation and integration with existing workflows. Provide user support and training as needed.
