AI Support Engineer
About AI Factory
The AI Factory is the product and technology arm of GovAI, responsible for:
Building and operating sovereign AI platforms, models, and services across Abu Dhabi Government entities
Scaling AI services from pilot to production
Strengthening AI operations to ensure reliability, governance, and high-quality support for AI-powered workloads
Role Overview
Employment Type: Full-time
Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)
The AI Support Engineer is:
The first line of operational support for AI platforms, AI models, APIs, and AI-enabled solutions
Focused on supporting AI workloads, model consumption, RAG pipelines, and AI-driven applications running in production
An AI-focused L1 operations role, aligned with AIOps practices (not a traditional IT helpdesk)
Responsible for:
Triaging AI-related incidents
Monitoring model behavior and API performance
Supporting AI service integrations
Escalating issues to engineering, vendors, or platform teams
Key Responsibilities
AI Model & API Operations Support
Support production AI model consumption (LLMs, embeddings, OCR, STT/TTS, inference APIs)
Troubleshoot inference failures, latency spikes, malformed payloads, and API errors
Diagnose authentication failures (OAuth, tokens, API keys, quota limits)
Validate request structures and integration configurations
Monitor token consumption trends and detect abnormal usage spikes
Support quota management and controlled usage increases
RAG & AI Pipeline Operations
Monitor RAG pipelines and retrieval workflows
Troubleshoot embedding generation failures and indexing issues
Identify ingestion failures affecting vector databases
Validate document connector and data pipeline integrity
Diagnose relevance or response degradation caused by configuration issues
Escalate data-layer or infrastructure-level issues to DevOps support
AI Governance & Guardrail Monitoring
Ensure AI service consumption complies with defined access controls and governance policies
Validate rate limiting, usage policies, and guardrail configurations
Detect abnormal usage patterns or policy violations
Support enforcement of entity-level quotas and access restrictions
Escalate governance breaches to appropriate stakeholders
Incident Triage & SLA Management
Act as first responder for AI-layer incidents (P0–P3)
Perform structured triage using logs, API traces, and monitoring dashboards
Classify incidents based on severity and business impact
Contain and mitigate AI service disruptions and coordinate with vendors when needed
Escalate complex issues to L2/L3 engineering with complete diagnostic context
Track incidents through their full lifecycle and ensure SLA adherence
Participate in Root Cause Analysis (RCA) for major AI service failures
Release Validation & Change Support
Perform smoke validation after AI model updates or API releases
Monitor regression risks following deployments
Identify post-release anomalies and escalate early
Support controlled rollout monitoring for new AI capabilities
Enterprise Integration & Connector Support
Support integrations with enterprise systems (Microsoft 365, SharePoint, Teams, Oracle, Jira, etc.)
Troubleshoot API integration failures, webhook errors, and data exchange issues
Validate secure connectivity and authentication configurations
Coordinate with DevOps support for infrastructure-related integration failures
Observability & Operational Monitoring
Monitor AI API performance metrics (latency, error rates, throughput)
Track token usage, consumption trends, and service availability
Identify recurring failure patterns and propose preventive actions
Maintain visibility dashboards for AI service health
Documentation & Knowledge Management
Maintain AI troubleshooting runbooks and support playbooks
Update known-issue repositories and FAQs
Document recurring AI API and RAG-related issues
Capture structured RCA documentation for major incidents
Contribute to operational documentation for new AI services
Log, update, and manage tickets in the ITSM/ticketing system
Required Technical Skills
Experience supporting REST APIs and API-based platforms
Understanding of LLM consumption patterns (RAG, embeddings, inference APIs)
Familiarity with authentication mechanisms (OAuth2, API keys, token-based access)
Ability to troubleshoot using logs, traces, and monitoring dashboards
Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
Basic understanding of cloud environments (Azure preferred)
Familiarity with enterprise system integrations
Understanding of rate limiting, quotas, and API governance
Experience
3–8 years of experience in:
AI platform support
API support
SaaS support
Application operations
Experience supporting AI/ML services or developer platforms preferred
Exposure to regulated or government environments advantageous
Experience working with external vendors and enterprise stakeholders
Arabic language skills are a plus