Senior Python Engineer / LLM Evaluation – Part-Time, Remote
We are hiring experienced Python engineers for part-time, task-based work focused on evaluating and testing Large Language Models (LLMs).
This is neither traditional QA nor junior AI labeling work.
We are looking for senior engineers who can reason deeply about system behavior, ambiguity, and real-world usage — not just write code.
What You'll Do
• Design structured test cases that simulate real human workflows
• Define gold-standard outputs and expected behaviors
• Analyze LLM failure modes such as hallucinations, bias, and context limitations
• Work directly with Git repositories and existing codebases
• Navigate incomplete documentation and ambiguous requirements
• Apply engineering judgment to determine what 'good' looks like
Who You Are
• 3+ years of software development experience, with Python as your primary language
• Strong hands-on Git experience in real projects
• Comfortable reading and debugging code you didn't write
• Able to reason about edge cases, trade-offs, and ambiguity
• Strong written and spoken English (B2+)
Nice to Have
• QA or structured testing experience (hands-on with code, not manual-only)
• Experience evaluating AI or LLM systems
• Familiarity with evaluation metrics such as precision, recall, and coverage
• Experience working with Docker
• Consulting or freelance engineering background
What We're Looking For
We value engineers who can explain why something fails — not just that it fails.
If you naturally think in terms of scenarios, assertions, failure modes, and user expectations, you'll thrive here.
This role suits senior backend Python engineers, ML engineers who still code regularly, and technically strong evaluators with real production experience.
Fully remote.
Flexible schedule.
Task-based delivery.
If you're interested in applying your engineering judgment to real-world AI system evaluation, we'd love to hear from you.