AI Infrastructure & Experience Engineer
FocusKPI is seeking an AI Infrastructure & Experience Engineer to join one of our clients, a high-tech SaaS company.
Work Location: Mountain View, CA (Onsite role, 5 days/week onsite)
Duration: 4-month contract
Pay Range: $70 - 79/hr
**No C2C resumes are considered**
Position Responsibilities:
- Inference Optimization: Deploy and tune multiple LLMs and generative multimodal models on local inference hardware. Optimize performance metrics (TTFT, tokens/sec) via model quantization, caching strategies, and architecture-specific adjustments.
- Systems Engineering & CUDA: Leverage deep knowledge of the CUDA environment to build custom kernels, ensuring maximum utilization of the low-cost GPU compute.
- Orchestration & Integration: Seamlessly bridge inference backends with orchestration layers (LiteLLM, Ollama, etc.) and frontends like OpenWebUI.
- Rapid Prototyping: Build functional, high-fidelity demos showcasing model memory capabilities, agentic workflows, and context-aware web search.
- Peripheral Connectivity: Implement communication protocols to bridge local AI compute with peripheral devices, including smart TVs, household appliances, and XR hardware.
- Recent experience in model optimization is required
- Hardware & Compute: Proven experience with NVIDIA ecosystems and ARM64 architecture.
- Systems Programming: Advanced proficiency in C++, Python, and Rust. Deep familiarity with CUDA and the ability to author/debug custom CUDA kernels for compute-intensive tasks.
- AI/ML Frameworks: Extensive experience with modern inference engines (llama.cpp, TensorRT-LLM, Ollama) and orchestration frameworks (LiteLLM).
- Software Engineering: Robust understanding of asynchronous programming (FastAPI), containerization (Docker/Kubernetes), sandbox environments, and API design for low-latency communication.
- Full-Stack Prototyping: Ability to quickly spin up modern frontend UIs (React, Next.js, or similar) to present AI-driven intelligence to end users.
- Communication Protocols: Familiarity with WebSockets, gRPC, and REST for device-to-device communication in a local network environment.
- Overall Mandatory skills required: Model optimization recent exparience, Interference Optimization, NVIDIA ecosystems, Custom CUDA Kernel Development, ARM64 architecture, Python
- A minimum of 3 years of relevant industry experience is required
- The "Builder" Mindset: You are energized by the prospect of building proofs-of-concept in days rather than months. You thrive in environments where speed and creativity are paramount.
- Problem Solver: You approach unsolved, messy engineering challenges with enthusiasm rather than trepidation.
- Architectural Vision: You see the "big picture" of how AI becomes part of consumers' daily lives, not just how the model generates text.
- Agile & Adaptable: You are comfortable working in a fast-paced environment where priorities shift based on the results of rapid experimentation.
- Degree in Computer Science, Machine Learning, or Artificial Intelligence Specialization preferred, but not required
**No C2C resumes are considered**
Thank you!
FocusKPI Hiring Team
Founded in 2010, FocusKPI, Inc. (FocusKPI) is a data science and technology firm specializing in predictive analytics practice and methodologies. FocusKPI is a US company headquartered in Silicon Valley, California, with an East Coast office in Boston, Massachusetts.
NOTICE: Please be aware of fraudulent emails regarding job postings, job offers and fake checks. FocusKPI's recruiting team will strictly reach out via @focuskpi.com email domain. If you have received fraudulent emails now or in the past, please report it to https://reportfraud.ftc.gov/ .
The domain @focuskpijobs.com is fraudulent and not related to FocusKPI. Please do not not reply or communicate to anyone with @focuskpijobs.com.