HumanEval measures whether generated Python functions satisfy given docstrings and pass rigorous unit tests.[1]

References

[1] arxiv.org
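As a rough illustration of the docstring-plus-unit-tests check described above, the sketch below builds one toy task and runs a candidate completion against its tests. The task, the `passes_tests` helper, and the variable names are hypothetical and are not taken from the benchmark. A real harness would also run generated code in a sandbox with a timeout rather than calling `exec` directly.

```python
# Hypothetical HumanEval-style task. The prompt, completion, and tests below
# are illustrative assumptions, not items from the actual benchmark.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# A model would generate this function body from the prompt above.
CANDIDATE_BODY = "    return a + b\n"

# Unit tests that decide whether the completion is functionally correct.
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)      # define the candidate function
        exec(test, namespace)                     # define check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


print(passes_tests(PROMPT, CANDIDATE_BODY, TEST, "add"))
```

A completion counts as correct only if every assertion passes. Swapping the body for `"    return a - b\n"` makes the same call return `False`.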