Agentic Code-Generation Loop Research Intern

€2K - €3K EUR / monthly • Paris, Île-de-France, FR
Job type: Internship
Role: Engineering, Machine learning
School year: Junior and above
Visa: US citizenship/visa not required
Skills: Git, TypeScript, Deep Learning, Natural Language Processing, Prompt Engineering

About the role

The intern will design, evaluate and bring to the state of the art the internal Windmill agentic loop for generating scripts, flows and full-stack apps, and build the benchmarking system that measures its progress. The work tackles several open questions: how to objectively evaluate a generated workflow or app beyond "it compiles" (functional tests, end-to-end execution, UX quality, semantic correctness); how an agent should decompose a natural-language specification into coherent atomic steps; how to efficiently inject Windmill-specific context (hub, types, resource schemas) without saturating the context window; how to exploit execution feedback for self-correction; how to keep a dependency graph of scripts, flows and apps coherent across iterative multi-file edits; and how to detect hallucinations, silent regressions and "fake successes" where tests pass for the wrong reasons.

Expected deliverables:

the Windmill benchmark (corpus, harness, tracking dashboard)
an improved agentic loop shipped to production with documented progression metrics
a weekly lab notebook
the final thesis report
possibly a publication or open-source release

The intern works directly with Ruben Fiszel (co-founder & CEO) and the Windmill R&D / AI team, with daily interaction, weekly reviews and full access to the codebase, to anonymized usage data, to frontier-model API budgets and to GPU infrastructure for fine-tuning experiments.

State of the art

Code-generation agents:

Inline assistants: Copilot, Cursor, Codeium - local completion and editing, short context
Autonomous agents: Claude Code, Aider, SWE-agent, OpenHands, Devin - planning, execution, self-correction
RL / fine-tuning approaches: AgentCoder, Reflexion, Self-Refine, agent tuning on execution traces
Retrieval methods: RAG over documentation, code embeddings, graph-RAG

Reference benchmarks:

SWE-bench / SWE-bench Verified - resolving GitHub issues (Python); now saturated on frontier models
HumanEval, MBPP, APPS, BigCodeBench - generation of isolated functions
LiveCodeBench - anti-contamination, temporally controlled tasks
WebArena, AppWorld - agents on simulated environments
TAU-bench, AgentBench - agent evaluation with tool use

Limitations of these benchmarks for our use case:

none covers workflow generation (step composition, branching, parallelism, state management)
none tests generation of full-stack apps with interactive UI
none integrates the specifics of Windmill (type system, resources and variables, hub, multi-language runtime)

Scientific and technical challenges:

1. Evaluation: how to objectively measure the quality of a generated workflow or app beyond mere "it compiles / it passes a unit test"?

2. Decomposition: how should an agent break a natural-language specification into coherent atomic scripts/steps?

3. Contextualization: how to efficiently feed the agent with Windmill context without exploding the context window?

4. Iteration loop: how to optimally exploit execution feedback for self-correction?

5. Multi-file editing: how to keep the dependency graph between scripts, flows and apps coherent during iterative editing?

6. Robustness: how to detect hallucinations, silent regressions and "fake successes"?

Work plan (5–6 months)

Phase 1 - Mapping & state of the art (weeks 1–3): audit of Windmill's current agentic loop (architecture, prompts, tool use); systematic review of the existing literature and benchmarks; selection and reproduction of 2–3 reference baselines.

Phase 2 - Benchmark (weeks 3–8): design of the evaluation task corpus (isolated scripts, multi-step flows, full-stack apps); design of the evaluation harness (sandboxed execution, multi-criteria scoring); setup of continuous regression tracking; an open-source release of the benchmark is envisioned.

Phase 3 - Improvement of the agentic loop (weeks 8–20): iterative experimentation on prompts, planning strategies, tool design, retrieval and execution feedback; comparison of frontier models vs open-weight models; targeted exploration of supervised fine-tuning and RL approaches; progressive production deployment.

Phase 4 - Consolidation & deliverables (weeks 20–24): writing of the thesis / final-year report and of internal technical documentation; possible paper submission.

Who we're looking for

M2 / final-year student in computer science or applied mathematics, with solid programming foundations (Python, TypeScript; Rust is a bonus), a strong interest in LLMs, agents and evaluation methodology, and an empirical, rigorous approach.

Required skills:

Python and TypeScript
concrete understanding of LLMs (tokenization, context window, prompting, tool use, function calling)
hands-on experience with at least one agentic assistant
ability to design controlled experiments with reproducible metrics
Git, testing, code review, CI
fluent English

Nice-to-have:

Rust
Svelte / modern frontend
fine-tuning or RL experience (SFT, DPO, RLHF, RLAIF)
agent or benchmark evaluation experience
prior publication or significant open-source contribution
Docker, PostgreSQL, sandboxing, observability

Education: Master's student (M2) or final-year student (PFE) in computer science or applied mathematics: MPRI, École Polytechnique (X), École Normale Supérieure (ENS) (Ulm / Paris-Saclay / Lyon), Télécom Paris, CentraleSupélec, Mines, ENSIMAG, EPITA, 42, EPFL or equivalent.

About the interview

1. Apply here or email jobs@windmill.dev
2. 30 min interview with a founder
3. 1h case study with a team member
4. Hired

This is a six-month internship; a permanent contract (CDI) is offered upon successful completion.

About Windmill

Windmill is an open-source developer platform (16k+ GitHub stars, Paris-based) that turns scripts into workflows, internal tools and full-stack apps. It sits in the sweet spot between Retool and Temporal: we compete with both, plus Airflow and n8n, and win on DX and performance.

How it works: write scripts in Python, TypeScript, Go, Bash, Rust, SQL, etc. Windmill analyzes their parameters and auto-generates UIs, turning each script into a standalone app and a reusable no-code module. Chain scripts into flows (branching, parallelism, retries, error handling) to build powerful workflows. Trigger anything via cron, webhook, the auto-generated UI, or custom dashboards built with the app builder. Under the hood, Windmill is an all-in-one queue/worker runtime, script editor, flow builder, secret manager and OAuth platform, with enterprise-grade permissions, groups and audit logs.

The insight: no-code tools are intuitive but not extensible, and writing code is only 10% of the work; you then have to deal with credentials, CI/CD, permissions, UIs and error handling. Windmill handles all of that so developers can focus on the logic that matters.

Stack: Rust, TypeScript + Svelte, PostgreSQL. Easy to deploy, works out of the box, and replaces entire categories of internal infra.

Founded by Ruben Fiszel (ex-Palantir), who saw the need for enterprise workflow infrastructure and realized it had to be open-source, developer-focused and built for best-in-class performance and DX, not another Spark wrapper.

300+ enterprise customers, 3M ARR, profitable, Tier 1 investors (YC, Google, Bessemer), and a team of 10 engineers. Customers run between 2 and 600 workers, with 1 to 1k+ seats.

Revenue model: the core platform is fully open-source, with thousands of individuals and companies using it for free. About 5% of features are enterprise-scoped (SSO, audit logs, dedicated workers, etc.) and live in a private repo that companies pay to access, along with premium support. All revenue is generated with zero outbound sales and modest marketing: purely inbound and product-led.