AI Support Engineer

Abu Dhabi, United Arab Emirates
Full Time
Experienced

About AI Factory

The AI Factory is the product and technology arm of GovAI, responsible for:

  • Building and operating sovereign AI platforms, models, and services across Abu Dhabi Government entities

  • Scaling AI services from pilot to production

  • Strengthening AI Operations to ensure:

    • Reliability

    • Governance

    • High-quality support for AI-powered workloads


Role Overview

Employment Type: Full-time

Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)



The AI Support Engineer is:

  • The first line of operational support for AI platforms, AI models, APIs, and AI-enabled solutions

  • Focused on supporting:

    • AI workloads

    • Model consumption

    • RAG pipelines

    • AI-driven applications running in production

  • An AI-focused L1 operations role, aligned with AIOps practices (not a traditional IT helpdesk)

  • Responsible for:

    • Triaging AI-related incidents

    • Monitoring model behavior and API performance

    • Supporting AI service integrations

    • Escalating issues to engineering, vendors, or platform teams


Key Responsibilities

AI Model & API Operations Support

  • Support production AI model consumption (LLMs, embeddings, OCR, STT/TTS, inference APIs)

  • Troubleshoot inference failures, latency spikes, malformed payloads, and API errors

  • Diagnose authentication failures (OAuth, tokens, API keys, quota limits)

  • Validate request structures and integration configurations

  • Monitor token consumption trends and detect abnormal usage spikes

  • Support quota management and controlled usage increases


RAG & AI Pipeline Operations

  • Monitor RAG pipelines and retrieval workflows

  • Troubleshoot embedding generation failures and indexing issues

  • Identify ingestion failures affecting vector databases

  • Validate document connector and data pipeline integrity

  • Diagnose relevance or response degradation caused by configuration issues

  • Escalate data-layer or infrastructure-level issues to DevOps support


AI Governance & Guardrail Monitoring

  • Ensure AI service consumption complies with defined access controls and governance policies

  • Validate rate limiting, usage policies, and guardrail configurations

  • Detect abnormal usage patterns or policy violations

  • Support enforcement of entity-level quotas and access restrictions

  • Escalate governance breaches to appropriate stakeholders


Incident Triage & SLA Management

  • Act as first responder for AI-layer incidents (P0–P3)

  • Perform structured triage using logs, API traces, and monitoring dashboards

  • Classify incidents based on severity and business impact

  • Contain and mitigate AI service disruptions and coordinate with vendors when needed

  • Escalate complex issues to L2/L3 engineering with complete diagnostic context

  • Track incidents through full lifecycle and ensure SLA adherence

  • Participate in Root Cause Analysis (RCA) for major AI service failures


Release Validation & Change Support

  • Perform smoke validation after AI model updates or API releases

  • Monitor regression risks following deployments

  • Identify post-release anomalies and escalate early

  • Support controlled rollout monitoring for new AI capabilities


Enterprise Integration & Connector Support

  • Support integrations with enterprise systems (Microsoft 365, SharePoint, Teams, Oracle, Jira, etc.)

  • Troubleshoot API integration failures, webhook errors, and data exchange issues

  • Validate secure connectivity and authentication configurations

  • Coordinate with DevOps support for infrastructure-related integration failures


Observability & Operational Monitoring

  • Monitor AI API performance metrics (latency, error rates, throughput)

  • Track token usage, consumption trends, and service availability

  • Identify recurring failure patterns and propose preventive actions

  • Maintain visibility dashboards for AI service health


Documentation & Knowledge Management

  • Maintain AI troubleshooting runbooks and support playbooks

  • Update known-issue repositories and FAQs

  • Document recurring AI API and RAG-related issues

  • Capture structured RCA documentation for major incidents

  • Contribute to operational documentation for new AI services

  • Manage tickets in the ITSM/ticketing system


Required Technical Skills

  • Experience supporting REST APIs and API-based platforms

  • Understanding of LLM consumption patterns (RAG, embeddings, inference APIs)

  • Familiarity with authentication mechanisms (OAuth2, API keys, token-based access)

  • Ability to troubleshoot using logs, traces, and monitoring dashboards

  • Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)

  • Basic understanding of cloud environments (Azure preferred)

  • Familiarity with enterprise system integrations

  • Understanding of rate limiting, quotas, and API governance


Experience

  • 3–8 years of experience in:

    • AI platform support

    • API support

    • SaaS support

    • Application operations

  • Experience supporting AI/ML services or developer platforms is preferred

  • Exposure to regulated or government environments is advantageous

  • Experience working with external vendors and enterprise stakeholders

  • Arabic language skills are a plus
