Eval Quadratic Python Code Tutorial

RefineBench: Evaluating Refinement Capability of Language Models via Checklists

👋 Welcome to RefineBench — a comprehensive evaluation library for testing refinement capabilities of language models across multiple settings and domains. To reproduce the full results reported in ...

GitHub

CATArena: Engineering-Level Tournament Evaluation Platform for LLM-Driven Code Agents

CATArena (Code Agent Tournament Arena) is an open-ended environment where LLMs write executable code agents to battle each other and then learn from each other. CATArena is an engineering-level ...

IEEE

Evaluating Python Static Code Analysis Tools Using FAIR Principles

Abstract: The quality of modern software relies heavily on the effective use of static code analysis tools. To improve their usefulness, these tools should be evaluated using a framework that ...

IEEE

Model-Agnostic Empirical Evaluation of Test-Driven Prompt Engineering on Improving Accuracy and Efficiency in Large Language Models Python Code Generation

Abstract: Although Large Language Models (LLMs) are widely adopted for code generation, the generated code can be semantically incorrect, requiring iterations of evaluation and refinement. Test-driven ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results