You can now run prompt evaluations from the command line using the new gh models eval command. This evaluates prompts defined in a .prompt.yml file using the same built-in evaluators available in the GitHub Models UI, including string match, similarity to expected outputs, custom LLM-as-a-judge evaluators, and more.
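For reference, a minimal .prompt.yml might look like the sketch below. This is an illustrative assumption, not the authoritative schema (see the GitHub Models documentation for that); the model identifier, test data, and evaluator names are placeholders.

# Illustrative sketch only — consult the GitHub Models docs for the
# authoritative .prompt.yml schema.
name: Summarizer test
description: Checks that summaries stay short and on-topic
model: openai/gpt-4o-mini
messages:
  - role: system
    content: You are a concise assistant.
  - role: user
    content: "Summarize this text: {{input}}"
testData:
  - input: "GitHub Models now supports prompt evaluations from the CLI."
    expected: "You can evaluate prompts from the command line."
evaluators:
  - name: mentions-cli            # string-match evaluator
    string:
      contains: "command line"
  - name: similarity-to-expected  # built-in similarity evaluator
    uses: github/similarity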

This makes it easier to test model quality early and often, right from your terminal or CI workflow.

gh models eval my_prompt.prompt.yml

You’ll get a summary of test results for each case, including model output and evaluation scores.

For programmatic use, you can output results in JSON format:

gh models eval my_prompt.prompt.yml --json

The JSON output includes detailed test results, evaluation scores, and summary statistics that can be processed by other tools or CI/CD pipelines.
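As a sketch of that kind of post-processing, you could pipe the output through a tool like jq in a shell step. Note that the summary field name below is an assumption about the JSON shape, not a documented contract, so inspect your actual output before depending on it.

# Run the evaluation and capture the JSON report.
gh models eval my_prompt.prompt.yml --json > results.json

# Pull out the summary statistics (field name assumed for illustration).
jq '.summary' results.json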

This new release also improves compatibility with the existing GitHub Actions integration for GitHub Models, making automated evaluations simpler to run as part of your Actions workflow. For example, you can run evaluations automatically whenever a .prompt.yml file changes, as in the sketch below.
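This workflow is a sketch under assumptions: the gh-models extension install step, the models: read permission, and the trigger paths are illustrative, so adapt them to your repository.

# Sketch of a possible workflow; adjust paths, permissions, and file
# names for your repository.
name: Evaluate prompts
on:
  push:
    paths:
      - "**/*.prompt.yml"
permissions:
  contents: read
  models: read   # assumed permission for calling GitHub Models from Actions
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install the GitHub Models CLI extension
        run: gh extension install github/gh-models
        env:
          GH_TOKEN: ${{ github.token }}
      - name: Evaluate the prompt
        run: gh models eval my_prompt.prompt.yml
        env:
          GH_TOKEN: ${{ github.token }}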

Start building AI apps with GitHub Models today

GitHub Models and all of our AI development tooling, including prompt editing and lightweight evaluations, are now available to all GitHub users in public preview. Try them out by enabling them in your repository or organization, or learn more in our documentation.

Help us shape what’s next

The Models CLI is open source on GitHub. Check out the code, file issues, or contribute!

We’re just getting started, and your feedback helps guide our roadmap. Join the community discussion to share your thoughts and connect with other developers building the future of AI on GitHub.