Changing the world through digital experiences is what Adobe’s all about. We give everyone from emerging artists to global brands everything they need to design and deliver exceptional digital experiences! We’re passionate about empowering people to create beautiful and powerful images, videos, and apps, and transform how companies interact with customers across every screen.
We’re on a mission to hire the very best and are committed to creating exceptional employee experiences where everyone is respected and has access to equal opportunity.
We realize that new ideas can come from everywhere in the organization, and we know the next big idea could be yours!
The Challenge :
Can you automate problem resolution to improve uptime? In other words, can you reduce mean time to detect (MTTD) and mean time to recovery (MTTR)?
Adobe needs a hands-on Site Reliability Engineer (SRE) with broad knowledge of services at scale who can be flexible, efficient, and quick to adapt in an environment that looks vastly different from siloed, traditional teams.
Our mission is to reduce MTTD and MTTR through automatic recovery and self-healing, thereby providing an outstanding experience to Adobe customers.
SRE is an engineering mindset that focuses on building highly reliable systems and eliminating toil through automation.
Our SRE and Engineering teams are distributed, split between San Jose, California, and Bangalore, India. We rely heavily on tools like Slack, JIRA, Git, and video conferencing to collaborate.
Flexibility to join meetings with colleagues around the world is encouraged. The successful candidate must be able to prioritize tasks and work independently.
What you will do :
Ensure the highest level of uptime and Quality of Service (QoS) through operational excellence
Embed with teams to nurture strong collaboration and find opportunities for automation and self-healing
Identify areas to improve service resiliency through chaos engineering, performance / load testing, etc.
Support and maintain globally distributed multi-cloud (public and / or private) environments
Automate common, repeatable tasks at a large scale to streamline operational procedures
Tackle performance and stability issues using a wide variety of tools
Evaluate and manage application and environment security
Use and maintain version control for application infrastructure
Follow organizational change processes during implementations
Cross-train with other distributed team members and promote the DevOps / SRE approach
Participate in an on-call rotation as required
Determine the root cause of all production-level incidents and write high-quality root cause analysis (RCA) reports
What you need to succeed :
3-5 years of production-level experience with distributed applications in the public cloud (AWS and / or Azure)
Experience in one (and preferably more) of the following languages : Python, Go, Java
Strong working knowledge of event-driven automation focused on automated remediation, preferably with experience on the StackStorm platform
Experience in developing bots and Slack-based apps
Expertise in container orchestration engines (e.g., Kubernetes)
Strong working knowledge of modern, continuous development techniques and pipelines (CI / CD, Jenkins, Git, Artifactory)
Strong written and oral communication skills
Consistent track record of adapting to new technologies and learning quickly
Curiosity to ask difficult questions and the ability to lead change in a diverse organizational landscape
B.S. degree in Computer Science or a related technical field, or equivalent practical experience
Desired Qualities :
Self-starter requiring minimal mentorship
Ability to learn quickly and adapt to changing priorities and requirements