Senior AI Engineer

Israel · Full-time

About The Position

ScaleOps, the leader in real-time automated cloud resource management, is revolutionizing how DevOps teams manage their cloud-native application infrastructures. Backed by venture capital and software industry titans, ScaleOps’ platform removes the organizational friction between application owners and DevOps teams by fully automating the resource management process to meet real-time demand.

The ScaleOps platform dynamically manages the application’s resource allocation, eliminating the need for manual intervention. The result is improved application performance, 60%-80% cloud cost savings, and a fully automated allocation process.

With well over $80 million in backing, ScaleOps has seen tremendous business growth, attracting global industry leaders to its customer base. ScaleOps automatically manages the production environments of over 50 enterprises, including Wiz, CATO Networks, Outreach, SentinelOne, Maxar, Playtika, Orca Security, EQ Bank, Outbrain, PayU, and Noname.

About the Role

We are forming a new AI Group and are looking for a Senior AI Engineer to play a central role in shaping it from an early stage. This is a greenfield opportunity with a huge roadmap ahead: we are at the beginning of a long-term, strategic effort to evolve how decisions are made across the ScaleOps platform using AI-driven systems.

You won’t just be integrating APIs or building demos. You’ll be building the AI brain that works alongside (and increasingly drives) our core automation engine. This role offers a true “startup within a startup” environment, with real production impact from day one, high strategic importance, and clear room to grow technically and into technical leadership as the AI Group scales.


What You’ll Build & Own

  • Agentic AI Architecture: Design and build autonomous AI agents that analyze infrastructure in real time and make intelligent decisions. Work with modern agentic frameworks (LangGraph, PydanticAI) and conversational AI to create multi-agent systems - including troubleshooting agents, optimization agents, FinOps agents, how-to agents, and more. Leverage core LLM capabilities - tool-use, memory, and retrieval - to operate safely in production environments.
  • Platform Integration & Intelligent Decision Systems: Develop MCPs to expose ScaleOps capabilities to AI agents that reason over infrastructure environments, metrics, configurations, and cost signals. Build systems that integrate with tools like Slack, Jira, and AI-powered IDEs (Cursor, Windsurf) to deliver context-aware insights, from "why is this pod not scheduling?" to "how can we reduce costs by 30% safely?"
  • AI Model Development & MLOps: Build and deploy machine learning models that learn from infrastructure patterns - automatically detecting the right resource policies for workloads, predicting optimal scaling triggers, and recommending GPU configurations. Own the complete ML pipeline from training to production deployment, ensuring models are reliable, monitored, and continuously improving.
  • R&D AI Tools Development & Adoption: Build and embed internal AI tools that accelerate engineering, the development lifecycle, research, and support.
  • AI Tools for Business Impact: Develop AI-powered tools that help Sales and Support teams demonstrate value instantly. Build agents that analyze customer infrastructure, generate cost optimization reports automatically, and provide intelligent recommendations that turn technical data into clear business outcomes.
  • End-to-End Ownership: Own AI systems from concept to production - ensuring they're fast (sub-2-second responses), reliable, safe, and cost-effective. Build evaluation frameworks to measure quality, implement security controls, and balance performance tradeoffs in real-world production environments.
  • Technical Leadership: Define the AI architecture and best practices as a founding member of the AI team. Make key technical decisions - choosing frameworks, designing multi-agent systems, establishing data governance - and shape how ScaleOps evolves from AI-enhanced internal tools to customer-facing AI products.

Requirements


  • Core Engineering: Significant software engineering experience (typically 4+ years) with strong Python skills and solid backend engineering fundamentals.
  • Production Experience: Experience building and operating production systems in cloud environments.
  • Real-World GenAI Experience: Practical experience bringing LLM-based systems into production, including handling complex challenges such as latency, cost control, and failure modes. Familiarity with additional agentic frameworks (e.g., LangChain, MetaGPT) and evaluation frameworks.
  • Builder Mentality: Strong ownership and the ability to operate independently while collaborating closely across teams. You have the motivation to grow into technical leadership as the group expands.
  • Data & RAG (Advantage): Experience enabling LLMs to consume structured or operational data (configurations, logs, metrics), and experience with retrieval systems (RAG) or vector databases.


Apply for this position
