DevOps Support Engineer

Abu Dhabi, United Arab Emirates
Full Time
Experienced

About AI Factory

The AI Factory operates sovereign AI infrastructure including:

  • GPU clusters

  • Cloud subscriptions

  • Containerized workloads

  • API gateways

  • Multi-environment deployments (Sandbox → Staging → Production)

The DevOps Support Engineer ensures:

  • Infrastructure stability

  • Deployment reliability

  • Operational continuity for AI workloads


Role Overview


Employment Type: Full-time

Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)


The DevOps Support Engineer is responsible for supporting:

  • Cloud infrastructure

  • CI/CD pipelines

  • Containerized AI workloads

  • API gateways

  • Production environments

The role focuses on:

  • Platform stability

  • Environment health

  • Deployment reliability

  • Infrastructure troubleshooting

  • Structured incident management

  • Environment discipline

  • Production governance

This is an operational reliability role aligned with modern DevOps, SRE, and AIOps practices.

The engineer acts as the L1 operational responder for infrastructure and platform incidents, ensuring that:

  • Issues are diagnosed, contained, and escalated appropriately

  • Incidents are resolved within defined service levels


Key Responsibilities


1. Infrastructure, Cloud & Environment Support

  • Support Azure subscriptions, resource groups, networking, and access control

  • Monitor GPU environments, container clusters, and AI runtime environments

  • Troubleshoot deployment failures across Sandbox, Staging, and Production


2. DevOps & CI/CD Support

  • Monitor CI/CD pipelines and resolve build/deployment issues

  • Support Git workflows, version control issues, and release rollouts

  • Ensure environment configuration consistency

  • Validate infrastructure changes post-deployment

  • Perform rollback support when required


3. GPU & AI Runtime Operations Support

  • Monitor GPU utilization and allocation

  • Identify memory saturation and CUDA/container runtime errors

  • Support AI model deployment on GPU nodes

  • Detect performance bottlenecks affecting inference services


4. API Gateway, WAF & Integrations

  • Troubleshoot API gateway routing issues and throttling policies

  • Monitor rate limiting and traffic control mechanisms

  • Investigate WAF-related blocking incidents

  • Support secure external integrations

  • Support integrations with enterprise systems:

    • Microsoft 365

    • SharePoint

    • Teams

    • Oracle

    • Jira

  • Troubleshoot authentication issues, webhook failures, and API timeouts


5. Observability & Incident Response

  • Monitor service availability, CPU/GPU utilization, memory, storage, and logs

  • Detect infrastructure bottlenecks affecting AI workloads

  • Act as first-line responder for infrastructure and platform-related incidents (P0–P3)

  • Perform triage using logs, metrics, system databases, and environment diagnostics

  • Classify incidents by severity and business impact in line with defined SLAs

  • Contain and mitigate production-impacting issues

  • Coordinate with L2/L3 teams and vendors

  • Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)

  • Track each incident through to closure and ensure SLAs are not breached


6. Documentation & Knowledge Management

  • Maintain and improve:

    • Infrastructure runbooks

    • Deployment troubleshooting guides

    • Environment configuration documentation

    • FAQs

  • Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)

  • Handle ITSM/ticketing documentation

  • Capture and publish Root Cause Analysis (RCA) summaries for major incidents

  • Update environment diagrams and operational checklists after changes


7. Platform Reliability

  • Support Kubernetes clusters, Docker containers, and orchestration layers

  • Validate scaling, failover, and resilience mechanisms

  • Ensure uptime SLAs for AI products, platforms, and APIs


8. Security & Compliance Coordination

  • Support IAM, access control, WAF, and network configurations

  • Coordinate with security teams for incident remediation

  • Ensure adherence to environment governance policies


Required Technical Skills

  • Strong hands-on experience with Azure (equivalent AWS/GCP experience is acceptable)

  • Experience supporting Kubernetes and Docker environments

  • Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)

  • Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)

  • Understanding of networking, IAM, API gateways, and WAF

  • Experience supporting production cloud environments under SLA constraints

  • Familiarity with Infrastructure-as-Code concepts (ARM/Terraform)


Experience

  • 4–7 years in DevOps, Cloud Operations, Platform Support, or SRE-aligned roles

  • Experience supporting containerized or AI workloads is preferred

  • Exposure to regulated or government environments is advantageous

  • Arabic language skills are a plus
