The Battle of the LLMs: Llama 3 vs. GPT-4 vs. Gemini
We’ve come a long way since GPT first took the world by storm. These days, many organizations routinely incorporate large language models (LLMs) into their daily processes to improve productivity, and dozens of LLMs are now in use worldwide.
Users can even vote for their favorite models on public chatbot leaderboards.
But in this rapidly evolving landscape, three models from technology heavyweights stand out: Llama 3, GPT-4, and Gemini. Below, we’ll explore the nuances and performance comparisons of these top LLMs.
Understanding LLM Versions
Meta’s Llama 3 launched in late April 2024, a little less than a year after the debut of Llama 2 in July 2023. Meta says the model shows more diversity in its answers, understands instructions better, and writes superior code compared to previous iterations.
Google DeepMind’s Gemini launched in December 2023, and Gemini 1.5 followed in February 2024. The company offers four main versions of Gemini: Ultra, Pro, Flash, and Nano. Med-Gemini, introduced shortly after, is designed specifically for healthcare applications, and a paid Gemini Advanced tier is also available.
OpenAI’s GPT-3.5 debuted in November 2022. Since then, the organization has launched GPT-4 in March 2023, GPT-4 Turbo in December 2023, and GPT-4o (Omni) in early May 2024; GPT-5 is expected to follow in the summer of 2024.
Here’s how GPT-4, Llama 3, and Gemini stack up against each other.
Benchmark Performance
Here’s how select variants of the three models perform when data scientists measure them against various LLM benchmarks, including HellaSwag, MMLU, MATH, and HumanEval (the top score in each benchmark is highlighted below).
In this evaluation, GPT-4 Omni takes the top spot in four out of six benchmarks, with Llama 3 400B and GPT-4 Turbo taking the others. Neither of the Gemini models in these tests took the top spot in any of the benchmarks.
What is Llama 3?
Meta’s flagship LLM comes in variants with 8B, 70B, or 400B parameters (the more parameters, the more powerful the model). It’s especially suited to complex tasks involving creativity and problem-solving.
It has also become known for an oddly endearing sense of humor, creative and nuanced responses, and the ability to generate engaging storytelling and entertainment content.
Llama 3 is especially good at coding (or helping human devs write code), and Meta provides APIs and tooling to help users build and scale generative AI applications on top of the model.
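Because Llama 3’s weights are open, developers typically run it through a serving library rather than a single hosted endpoint. Under the hood, requests are rendered into Meta’s published instruct prompt template. The `format_llama3_prompt` helper below is a hypothetical sketch that builds that template by hand, just to show what the model actually sees; in practice a tokenizer’s chat-template utilities would do this for you.

```python
def format_llama3_prompt(system: str, user: str) -> str:
    """Build a raw Llama 3 instruct prompt using Meta's published
    chat template. Normally a library (e.g. a tokenizer's
    apply_chat_template) handles this; shown here for illustration.
    """
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # The trailing assistant header tells the model to start replying.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_llama3_prompt(
    "You are a helpful coding assistant.",
    "Write a Python function that reverses a string.",
)
```

The special `<|eot_id|>` tokens mark turn boundaries, and the final empty assistant header is what cues the model to generate its reply.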
While it only offers textual inputs and outputs (unlike GPT-4 and Gemini), Meta has indicated that a multimodal version of Llama 3 is in the works. Llama 3 performs very well across a range of tasks: Meta claims Llama 3 70B outperformed Gemini Pro 1.5 on the MMLU benchmark, which gauges a model’s general knowledge.
What is GPT-4?
Nearly everyone has heard of ChatGPT, the chat interface built on top of OpenAI’s Generative Pre-trained Transformer (GPT) LLMs. But some may not realize that several newer versions of the underlying model are much more potent than the original.
GPT-4 Turbo, for example, offers significant improvements over GPT-4, including better performance and accuracy and a knowledge cutoff extended to April 2023.
OpenAI says Omni is twice as fast as Turbo, half the price, and has five times higher rate limits, along with a knowledge cutoff of October 2023. And don’t forget: GPT-4 Omni is the champion of the benchmark performance comparison above.
In general, however, GPT-4 is known for its strong natural language understanding, including its ability to discern context and appreciate nuance in conversations. Its inputs are primarily text-based, but it can also accept image inputs via GPT-4 with vision (GPT-4V). It provides text-only outputs.
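To mix text and image inputs, OpenAI’s chat completions format lets a user message carry a list of content parts instead of a plain string. The `build_vision_request` helper below is a hypothetical sketch that assembles such a request body (no network call is made, and the image URL is a placeholder):

```python
def build_vision_request(question: str, image_url: str,
                         model: str = "gpt-4-turbo") -> dict:
    """Assemble a chat-completions payload mixing text and image input.

    This is the JSON body you would send to OpenAI's chat completions
    endpoint (directly or via the official SDK); nothing is sent here.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                # A list of typed parts instead of a plain string
                # is what enables multimodal input.
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_vision_request(
    "What is shown in this chart?",
    "https://example.com/chart.png",  # placeholder URL
)
```

The response still comes back as text only, matching the text-only outputs described above.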
That doesn’t mean it’s perfect, however. Data scientist Austin Zaccor of Databricks says GPT-4 Turbo “almost never” gives a straightforward answer, for example.
“It will say something NPR-esque like ‘while most scientists believe knives are made of steel, some have argued that wet napkin based knives could make a sustainable alternative as they would require less mining and metal refining,’” he says. “That’s not a real example but it illustrates my grievance.”
But he adds that GPT-4 is the only one of these model services that lets users customize some of its behavior for each user, which is a pretty handy feature.
What is Gemini?
Most users agree that one of Gemini’s main advantages is its willingness to pull from multiple data sources, such as Google Search, when composing responses. That’s an improvement over GPT-4, which tends to default to its training data unless specifically asked to search the web (and even then, it sometimes refuses).
Previously known as Bard, Gemini also features several tools to enhance response quality, including letting users give feedback to improve its answers over time. At the same time, however, Gemini has been accused of refusing to answer queries, and of being somewhat dishonest about why.
Users can also easily tailor Gemini’s responses, making them shorter or longer, more or less detailed, or more casual or professional, and Gemini offers built-in ways to fact-check its responses against the web.
In terms of multimodality, Gemini is the clear winner here, offering text, image, and audio inputs along with text outputs.
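In Google’s Generative Language REST API, those non-text inputs travel as base64-encoded `inline_data` parts alongside text parts. The `build_gemini_request` helper below is a hypothetical sketch of such a request body, built from placeholder bytes; field names follow the public REST docs, but treat it as illustrative rather than definitive:

```python
import base64

def build_gemini_request(prompt: str, media_bytes: bytes,
                         mime_type: str = "image/png") -> dict:
    """Assemble a multimodal request body for the Gemini REST API.

    Images and audio are sent as base64-encoded inline_data parts
    next to text parts; the model responds with text.
    """
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(media_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }

# Placeholder bytes stand in for a real image file.
body = build_gemini_request("Describe this image.", b"\x89PNG fake bytes")
```

Swapping the MIME type (e.g. to an audio type) is all it takes to send audio instead of an image, which is what makes Gemini’s input side so flexible.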
Conclusion
While GPT-4, Llama 3, and Gemini are all powerful LLMs that provide significant value, it’s impossible to declare a clear-cut No. 1 because each has different strengths, weaknesses, and features.
No matter which model you choose to experiment with, however, you can depend on the AI and data science experts at CapeStart to help you ideate, develop, and deploy your next LLM-based application. Contact us today to set up a one-on-one discovery call and let us help you scale your next innovative project.