Cloud Reliability Engineer

Ampcus, Inc
United States, South Carolina, Charleston
Dec 18, 2024
Overview

Cloud Reliability Engineer

Role and Responsibilities

Reporting to the Head of Cloud / API Engineering, the Cloud Reliability Engineer will play a critical role in driving innovation and growth for the Banking Solutions business. In this role, the candidate will have the opportunity to make a lasting impact on the company's digital transformation journey, drive customer-centric innovation and automation, and position the organization as a leader in the competitive digital banking landscape.

Specifically, the Cloud Reliability Engineer will be responsible for the following:


  • Strategize and drive the building blocks of reliability engineering as we make the transition from private to public cloud.
  • Ensure the reliability, availability, and performance of applications and services, focusing on minimizing downtime, optimizing response times, and maintaining high availability for users.
  • Lead incident response, including identification, triage, resolution, and post-incident analysis, to prevent recurrence and improve system resilience.
  • Develop and maintain monitoring solutions and alerting mechanisms for infrastructure, application performance, and user experience metrics, enabling proactive issue detection and mitigation.
  • Implement automation tools and processes to automate routine tasks, scale infrastructure, and ensure seamless deployments, updates, and rollbacks with minimal user impact.
  • Conduct capacity planning, performance tuning, and resource optimization for environments, collaborating with development and operations teams to meet scalability and performance goals.
  • Collaborate with security teams to implement security best practices, perform vulnerability assessments, and ensure compliance with security standards and regulatory requirements for applications.
  • Manage deployment pipelines, release processes, and configuration management for app deployments, ensuring consistency, reliability, and version control across environments.
  • Identify areas for improvement in reliability, performance, and efficiency through data analysis, root cause analysis, and trend analysis, and drive initiatives to enhance system reliability and operational efficiency.
  • Create and maintain documentation, runbooks, and knowledge base articles for operational procedures, troubleshooting guides, and best practices, and promote knowledge sharing within the team.
  • Develop and test disaster recovery plans, backup strategies, and failover mechanisms for app services, ensuring business continuity and data integrity in case of failures or disasters.
  • Collaborate with development, QA, DevOps, and product teams to ensure alignment on reliability goals, performance metrics, release schedules, and incident response processes.
  • Participate in on-call rotations and provide 24/7 support for critical incidents, troubleshoot issues, and coordinate with teams for resolution, escalation, and follow-up actions as per defined SLAs.


Professional Qualifications


  • Specific experience in reliability engineering for a large-scale transition from private to public cloud, including strategies for such a transition.
  • Proficient in development technologies, architectures, and platforms (web, API) to understand system complexities and performance considerations.
  • Experience in cloud platforms (e.g., AWS, Azure, Google Cloud) and infrastructure as code (IaC) tools for managing app infrastructure and deployments.
  • Knowledge of monitoring tools (e.g., Dynatrace, LogRocket, Datadog) and logging frameworks (e.g., ELK Stack) for real-time visibility into system health, performance metrics, and user experience.
  • Experience in incident management, including incident response, triage, root cause analysis (RCA), and post-mortem reviews to prevent recurring issues.
  • Strong troubleshooting skills to diagnose complex technical issues in app environments, infrastructure, networking, and performance bottlenecks.
  • Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Ansible, Terraform) for automating routine tasks, deployments, and infrastructure management.
  • Experience in implementing continuous integration/continuous deployment (CI/CD) pipelines for apps using tools like Jenkins, GitLab CI/CD, or Azure DevOps.
  • Expertise in setting up monitoring solutions, configuring alerts, and creating dashboards to monitor system performance, application metrics, and user experience.
  • Familiarity with APM (Application Performance Monitoring) tools to analyze app performance, identify bottlenecks, and optimize resource utilization.
  • Commitment to continuous learning, staying updated with industry trends, new technologies, and best practices in app reliability, performance, and operations.
  • Adaptability to evolving requirements, technologies, and business needs, with a focus on driving continuous improvement and operational excellence.


Personal Characteristics


  • Demonstrates judgment and flexibility; thinks about issues and develops solutions that thoughtfully take the broader context into account; deals positively with shifting demands on time, changing priorities, and rapidly changing environments.
  • Takes an ownership approach to engineering and product outcomes.
  • Action-oriented self-starter who can set strategy and drive execution with a roll-up-the-sleeves approach.
  • Excellent interpersonal communication, negotiation and influencing skills to work effectively with all stakeholders (internal & external), making information-based decisions.
  • Penchant for excellence, both personally and professionally, demonstrated by intellectual curiosity, record of accomplishment, and reputation; shows strong attention to detail and implementation of best practices with an inclination for continuous improvement.
  • Ability to quickly establish strong credibility with employees, business partners and external resources.
  • Embodies and delivers the firm's values and culture towards colleagues, clients, and communities:
      • Win as one team
      • Lead with integrity
      • Be the change


Summary: Responsible for providing high-level consulting services to clients and preparing programming assignments. Designs, plans and supervises implementation of complex, large-scale system projects. Reviews, analyzes, and modifies programming systems including encoding, testing, debugging and installing for a complex, large-scale computer system. Assists in supervising the daily activities of the project team members.

Essential Duties and Responsibilities:

Provides high-level consulting services to client personnel (e.g., advises client on complex issues involving new regulation, technology or system functionality; evaluates various technical and business solutions and makes recommendations to client; troubleshoots errors and inefficiencies related to the application(s) and related processes; advises client on technical direction and specific business issues).

Maintains project estimates and project management timelines for multiple major projects.

Verifies completeness and accuracy of specifications for multiple major projects to be estimated (e.g., report changes, control file changes, file fixes).

Determines programming requirements for multiple major projects (e.g., product updates, conversions).

Researches and designs system modules, program enhancements and modifications to existing programs or modules.

Creates documents to communicate complex technical information to audiences of all levels.

Conducts research and documents findings and recommendations by using analytical problem solving.

Provides client support, training, testing and vendor relations.

Develops technical designs that will meet system objectives and minimize the impact on operations.

Maintains and develops on-line and batch application programs.

Codes programs that interface with multiple applications.

Trains new employees on all aspects of an application or system product.

Develops complex procedural language routines.

Provides applications development and support and utilizes troubleshooting and diagnostic tools.

Monitors, measures, and optimizes individual and combined utilization of hardware, software, and telecommunications components.

Responsible for software installation and maintenance. May act as project leader.

Develops and implements a disaster recovery plan.

Performs other related duties as assigned.

Qualifications:

To perform this job successfully, an individual must be able to perform each essential duty satisfactorily. The requirements listed below are representative of the knowledge, skills, and/or abilities required. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.

Complexity of Work:

Moderately routine; general policies applied. Some decision-making.

Education:

Bachelor's degree from a four-year college or university in a related area.

Experience:

7-10 years of experience, including 6-8 years of full life cycle development experience and 5-7 years of programming and system design experience in financial services or a related industry, in directly related, progressively responsible positions; or an equivalent combination of education and experience.

Knowledge, Skills and Abilities:

Thorough knowledge of structured programming technology for a structured language environment

Thorough knowledge of applications/development methodologies

Thorough knowledge of appropriate operating systems, programming languages, and development tools

Considerable knowledge of performance tuning

Skill in interpersonal communication and team building

Skill in project management

Skill in operating independently

Skill in exhibiting solid analysis, decision-making, and problem solving

Skill in understanding and focusing on the clients' needs and goals, establishing credibility and building relationships with clients

Ability to assess requirements, alternatives, and risks/benefits for low- to high-impact projects

Ability to develop a mid-size project plan (i.e., a plan that affects a single system or family and multiple resources) using approved project management software

Ability to communicate effectively verbally and in writing

Ability to establish and maintain effective working relationships with employees, clients, vendors, and the public
