DevOps Support Engineer
About AI Factory
The AI Factory operates sovereign AI infrastructure including:
GPU clusters
Cloud subscriptions
Containerized workloads
API gateways
Multi-environment deployments (Sandbox → Staging → Production)
The DevOps Support Engineer ensures:
Infrastructure stability
Deployment reliability
Operational continuity for AI workloads
Role Overview
Employment Type: Full-time
Work Arrangement: Onsite (applicants based outside the UAE are required to relocate)
The DevOps Support Engineer is responsible for supporting:
Cloud infrastructure
CI/CD pipelines
Containerized AI workloads
API gateways
Production environments
The role focuses on:
Platform stability
Environment health
Deployment reliability
Infrastructure troubleshooting
Structured incident management
Environment discipline
Production governance
This is an operational reliability role aligned with modern DevOps, SRE, and AIOps practices.
The engineer acts as the L1 operational responder for infrastructure/platform incidents, ensuring that:
Issues are diagnosed, contained, and escalated appropriately
Incidents are resolved within defined service levels
Key Responsibilities
1. Infrastructure, Cloud & Environment Support
Support Azure subscriptions, resource groups, networking, and access control
Monitor GPU environments, container clusters, and AI runtime environments
Troubleshoot deployment failures across Sandbox, Staging, and Production
2. DevOps & CI/CD Support
Monitor CI/CD pipelines and resolve build/deployment issues
Support Git workflows, version control issues, and release rollouts
Ensure environment configuration consistency
Validate infrastructure changes post-deployment
Perform rollback support when required
3. GPU & AI Runtime Operations Support
Monitor GPU utilization and allocation
Identify memory saturation and CUDA/container runtime errors
Support AI model deployment on GPU nodes
Detect performance bottlenecks affecting inference services
4. API Gateway, WAF & Integrations
Troubleshoot API gateway routing issues and throttling policies
Monitor rate limiting and traffic control mechanisms
Investigate WAF-related blocking incidents
Support secure external integrations
Support integrations with enterprise systems:
Microsoft 365
SharePoint
Teams
Oracle
Jira
Troubleshoot authentication issues, webhook failures, and API timeouts
5. Observability & Incident Response
Monitor service availability, CPU/GPU utilization, memory, storage, and logs
Detect infrastructure bottlenecks affecting AI workloads
Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
Perform triage using logs, metrics, system databases, and environment diagnostics
Classify incidents by severity and business impact in line with defined SLAs
Contain and mitigate production-impacting issues
Coordinate with L2/L3 teams and vendors
Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
Track each incident through its lifecycle to closure and ensure SLAs are not breached
6. Documentation & Knowledge Management
Maintain and improve:
Infrastructure runbooks
Deployment troubleshooting guides
Environment configuration documentation
FAQs
Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
Handle ITSM/ticketing documentation
Capture and publish Root Cause Analysis (RCA) summaries for major incidents
Update environment diagrams and operational checklists after changes
7. Platform Reliability
Support Kubernetes clusters, Docker containers, and orchestration layers
Validate scaling, failover, and resilience mechanisms
Ensure uptime SLAs for AI products, platforms, and APIs
8. Security & Compliance Coordination
Support IAM, access control, WAF, and network configurations
Coordinate with security teams for incident remediation
Ensure adherence to environment governance policies
Required Technical Skills
Strong hands-on experience with Azure (AWS/GCP acceptable)
Experience supporting Kubernetes and Docker environments
Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
Understanding of networking, IAM, API gateways, and WAF
Experience supporting production cloud environments under SLA constraints
Familiarity with Infrastructure-as-Code concepts (ARM/Terraform)
Experience
4–7 years in DevOps, Cloud Operations, Platform Support, or SRE-aligned roles
Experience supporting containerized or AI workloads preferred
Exposure to regulated or government environments advantageous
Arabic language proficiency is a plus