Llama 2 vs. Mistral (vs. GPT-3.5, OpenHermes 2.5, and Zephyr): comparing fine-tuning performance

This talk compares the fine-tuning performance of Llama 2, Mistral, OpenHermes 2.5, Zephyr, and GPT-3.5 through detailed benchmark evaluations.

Overview

We’ve run a series of benchmarks on fine-tuned open-source models and compared them head to head. These include the strongest base models (Llama 2 and Mistral) as well as several open-source instruct-tuned variants that have recently made a splash, such as OpenHermes 2.5 and Zephyr. We have also evaluated them against a fine-tuned OpenAI GPT-3.5, and got some surprising results!

Links

https://github.com/openpipe/openpipe
OpenPipe turns expensive prompts into fine-tuned LLMs for cheaper, OpenAI-compatible inference.

Tech stack