PaLM 2 vs GPT-4 | Is Google Lagging Behind OpenAI?

Google says AI a billion times in their I/O stream.

Google releases PaLM 2 as the answer to GPT-4.

The research paper, at first glance, seems to confirm that PaLM 2 is incredible, but some things don’t add up…

Sam Altman cryptically tweets

“We are so back”

Vicious.

To understand what is actually happening, let’s look at Google’s 92-page paper on PaLM 2…

BETTER THAN GPT-4?

Some results in this paper seem to suggest that PaLM 2 is at least as good as GPT-4 on many of the benchmarks it was tested on.

For example, on reasoning tasks, here is a quote from the Google paper:

“The ability of large models to reason, to combine multiple pieces of information, and to make logical inferences is one of their most important capabilities.”

Google then shows that their model is similar to, and in some cases better than, GPT-4.

Comparing the two papers, it is a bit hard to do an apples-to-apples comparison.

I also need to study up on some specifics about how these tests are done. Please comment if you know some of this stuff in detail.

CODING

So let’s take coding, for example.

Before PaLM 2 came out, here were the existing scores on HumanEval, a Python coding benchmark.

All the scores are percentage points, so 100 would be a perfect score.

PaLM 1 reached a score of 26.2.

GPT-3.5 got a score of 48.1.

GPT-4 got a 67.

———

All of these were marked as 0-shot, meaning that no examples were given: the model had to answer without being shown examples of similar problems, relying only on its own existing skill set.
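To make the zero-shot vs. few-shot distinction concrete, here is a minimal sketch in Python. The task, the worked examples, and the prompt format are all made up for illustration; they are not taken from HumanEval or from either paper.

# Minimal illustration of zero-shot vs. few-shot prompting.
# The task and the worked examples below are hypothetical.

task = "Write a Python function that returns the sum of the even numbers in a list."

# Zero-shot: the model only sees the task itself.
zero_shot_prompt = task

# Few-shot: the model first sees worked examples of similar problems.
examples = [
    ("Write a Python function that returns the largest number in a list.",
     "def largest(nums):\n    return max(nums)"),
    ("Write a Python function that counts the empty strings in a list.",
     "def count_empty(strings):\n    return sum(1 for s in strings if s == '')"),
]

few_shot_prompt = ""
for problem, solution in examples:
    few_shot_prompt += f"Problem: {problem}\nSolution:\n{solution}\n\n"
few_shot_prompt += f"Problem: {task}\nSolution:\n"

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)

The point is simply that a few-shot score and a zero-shot score are not measuring the same thing: the few-shot model gets to pattern-match against solved examples sitting right in its prompt.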

Here are the HumanEval results for PaLM 2, Google’s latest model.

Notice it splits the results into pass@1 and pass@k, in this case pass@100.

So, for example, on the Python coding tasks it comes in at a staggering 88.4.

But pass@100 means that it gets 100 tries to get it right.

Basically when asked a question, it produces 100 possible responses and if ONE of those is correct, then it gets marked as correct.

If it has to get the answer right on the first try, that number drops to 37.6.
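For reference, here is a small sketch of how pass@k is typically computed, following the unbiased estimator popularized by OpenAI’s Codex paper. The sample counts below are made up, just to show how far apart pass@1 and pass@100 can be.

from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator for a single problem:
    #   n = total samples generated, c = samples that pass the unit tests,
    #   k = number of attempts allowed.
    # Returns the probability that at least one of k randomly chosen samples is correct.
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples generated for one problem, 15 of them pass.
print(pass_at_k(200, 15, 1))    # pass@1  is about 0.075
print(pass_at_k(200, 15, 100))  # pass@100 is very close to 1.0

The benchmark score is the average of this value over all the problems, which is why a pass@100 number can look spectacular even when the pass@1 number, the one that matters for a model answering you once, is far lower.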

Also, this isn’t the basic PaLM 2; this model has been further trained on additional code-related tokens.

https://paperswithcode.com/sota/code-generation-on-humaneval

Here is a ranking of these LLMs from the website paperswithcode.com

GPT-4, zero-shot (again, meaning no examples given; it had to figure it out with just its existing knowledge), gets a 67, which puts it at number 1.

PaLM 2 comes in at #7, with a score of 37.3.

But it’s not even the basic model; it’s the -S variant, which has been specialized for code.

And it’s not zero-shot; it’s few-shot, meaning it was given examples of how to solve similar problems.

The only way it could beat GPT-4 is by allowing it to try answering 100 times and then seeing if ONE of those is correct.

The CodeT result, which is built on top of an OpenAI model from 2022, gets a similar score with just a tenth of the tries.

This doesn’t make any sense.

Please let me know if I’m being extra dense and not getting something here, but this seems to say that Google, with all its resources and brain power and capital and its massive head start…

Keep in mind that Google was the one to publish the paper “Attention Is All You Need”, which was the breakthrough behind a lot of the AI progress since 2017.

Google releasing that paper was an important part of the massive AI progress we are seeing today. 

But now.. 

Google can’t make anything that even remotely comes close to ChatGPT’s ability to code right out of the box?

Is this real?

Keep in mind too that, apparently, many Google AI researchers walked out because Google trained its AI on GPT-4’s outputs as a way to catch up.

___

Here’s the main point of the paper, I think, or at least the point Google seems to want to make: here they show that PaLM 2 is beating GPT-4 on several metrics.

But there seems to be skullduggery afoot. Meaning, I’m not sure this is an apples-to-apples comparison.

Notice that PaLM 2 is using a few things to improve its results: an “instruction-tuned variant,” “chain-of-thought,” and “self-consistency.”

Self-consistency means they generate multiple responses and then pick the answer that appears most consistently across those responses.

Chain-of-thought is basically asking the model to think through its response step by step. Studies have shown that this often produces better results.
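To make those two tricks concrete, here is a minimal sketch of chain-of-thought prompting combined with self-consistency voting. The dummy_generate function just fakes a language-model call so the example runs; in a real setup it would sample a fresh completion from an LLM at a nonzero temperature each time.

import re
import random
from collections import Counter

def dummy_generate(prompt):
    # Stand-in for a language-model call. Here we just return one of a few
    # canned reasoning chains so the voting step has something to count.
    return random.choice([
        "There are 3 boxes with 4 apples each, so 3 * 4 = 12. Answer: 12",
        "3 boxes times 4 apples is 12 apples in total. Answer: 12",
        "Adding 4 + 4 + 4 gives 11. Answer: 11",  # a deliberately wrong chain
    ])

def solve_with_self_consistency(question, generate, num_samples=20):
    # Chain-of-thought: ask the model to reason step by step before answering.
    prompt = f"{question}\nLet's think step by step, then finish with 'Answer: <number>'."

    answers = []
    for _ in range(num_samples):
        completion = generate(prompt)
        # Pull the final answer out of the reasoning chain.
        match = re.search(r"Answer:\s*(-?\d+)", completion)
        if match:
            answers.append(match.group(1))

    # Self-consistency: the answer that shows up most often across the
    # independently sampled reasoning chains wins.
    answer, _ = Counter(answers).most_common(1)[0]
    return answer

print(solve_with_self_consistency(
    "There are 3 boxes with 4 apples in each box. How many apples are there?",
    dummy_generate))

Neither trick changes what the underlying model knows; they spend more compute per question, which is why comparing a self-consistency score against a single-pass score isn’t apples-to-apples.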

In the GPT-4 paper, OpenAI mentions using chain-of-thought for one of the tests, and its results are much better than PaLM 2’s or Flan-PaLM 2’s.

The “Flan” means that this version has been instruction-tuned on additional task data that would help it do better on this kind of exam.

(table 7)

So I think the best comparison is between the two highlighted results.

On MATH, GPT-4 scored 42.5 and PaLM 2 scored 34.3.

On GSM8K, GPT-4 scored 92 and PaLM 2 scored 80.7.

Where PaLM 2 seems to do better, I don’t think the comparisons are accurate.

As far as I can tell, the numbers that most closely reflect the same test being run are the highlighted numbers.

MISGENDERING

One big thing that stood out to me was how much of the paper is devoted to Responsible AI.

And responsible AI, as Google defines it, is different from what other AI leaders focus on.

For example, others in the space are concerned about AI risk to humanity, the dangers of AI for warfare, how AI and automation will replace human workers and how to create some sort of an economic system to support people whose jobs are permanently replaced by AI.

These are the problems that Elon Musk, Sam Altman, Satya Nadella, Microsoft, Bill Gates, Max Tegmark, Eliezer Yudkowsky, etc. are discussing. Each one has their own take, but in general, those are the things they are concerned about.

For example, the “Godfather of AI,” Dr. Geoffrey Hinton, who recently quit his job at Google, is warning about the risks of AI.

Here is a quote from The New York Times:

“But gnawing at many industry insiders is a fear that they are releasing something dangerous into the wild. Generative A.I. can already be a tool for misinformation. Soon, it could be a risk to jobs. Somewhere down the line, tech’s biggest worriers say, it could be a risk to humanity.”

“It is hard to see how you can prevent the bad actors from using it for bad things,” Dr. Hinton said.

So we’ve heard many concerns about the dangers of AI.

But in this Google paper, I wasn’t able to find references to any of these concerns; however, a very large portion of the paper dives deep into the harm of misgendering people.

Around 25 pages of this 92-page document talk about misgendering and using the wrong pronouns.

Developers using this AI to build products are cautioned that, while Google did a lot of work on preventing toxic outputs, they still have to further fine-tune the outputs to make sure that no “unsafe” language is generated.

Looking at how many times each word gets used in the document, “gender” is in the top 20 most commonly used words, right after the word “toxicity”.
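For anyone who wants to reproduce that kind of count, here is a rough sketch. It assumes the paper’s text has already been extracted to a plain-text file; the filename and the small stop-word list are placeholders, and the exact ranking will depend on how common words are filtered.

import re
from collections import Counter

# Placeholder filename; the PDF's text would need to be extracted first.
with open("palm2_paper.txt", encoding="utf-8") as f:
    text = f.read().lower()

# A tiny, ad-hoc stop-word list just to push "the", "and", etc. out of the way.
stop_words = {"the", "and", "for", "that", "with", "are", "this", "from", "our", "can", "not"}

words = re.findall(r"[a-z]+", text)
counts = Counter(w for w in words if len(w) > 2 and w not in stop_words)

for word, count in counts.most_common(20):
    print(word, count)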

Even other sections on stereotypes and bias seem to take a backseat to specifically talking about gender pronouns.

Again, I’m not making any statement about this… I don’t want this to become an argument, I know this is a hot button topic for a lot of people. 

I’m simply pointing out that this paper seems to have a very large amount of text, charts and references that talk about gender, non-binary people and pronouns.

For context, here is a very similar paper from OpenAI about GPT-4, the model that PaLM 2 is competing with.

OpenAI’s paper also has multiple pages with examples of the steps they are taking to improve the safety of GPT-4.

https://arxiv.org/pdf/2303.08774.pdf

(page 12)

But the examples of unsafe outputs are things like GPT-4 explaining how to synthesize dangerous chemicals at home, using relatively simple ingredients and basic kitchen supplies.

It explains how to create a bomb and also how to get cheaper products that are age-restricted.

OpenAI then creates responses that politely decline to answer.

OpenAI concluded with:

“We are collaborating with external researchers to improve how we understand and assess potential impacts, as well as to build evaluations for dangerous capabilities that may emerge in future systems. We will soon publish recommendations on steps society can take to prepare for AI’s effects and initial ideas for projecting AI’s possible economic impacts.”

So OpenAI seems to define safety as the potential for its models to cause physical harm, for people to be able to hurt themselves and others, as well as the economic impacts of the technology.

This, along with AI alignment, seems to be the definition of “AI safety” for most of the players in the space.

Google’s big focus on safety seems to be centered on gender and pronouns, as well as, to a slightly lesser degree, identity groups in general.

Now, I’ll be honest, I don’t know too much about this and I’m not qualified to comment on this in one way or another.

I’ve been really impressed with the knowledge that viewers of this channel have on a wide variety of issues.

If you can shed some light on this issue, I would love to hear your take on this. 

Please leave a comment, obviously be civil even if you have strong feelings on this matter.

This isn’t meant to start an argument.

It’s simply for my own edification, and I’m sure there are many others watching who may be interested to know more.

What is driving this?

Why does Google, and no other AI company that I’m aware of, place this as the top safety concern for responsible AI use?

Please comment. Please be cool.

Ok.. back to AI..

What does all this mean?

So, from where I’m sitting, this looks like the result of a year-plus of work (keep in mind that GPT-4 existed internally long before it was released to the public).

But after a long time, after the Google CEO declared a code red because of ChatGPT, after a massive focus on AI by this company that was supposed to be the #1 player in AI…

The big product of all of that work and investment and focus… that’s PaLM 2.

And PaLM 2 seems to *not* be as good as GPT-4.

Google Bard got updated to PaLM 2 a day or two ago, it seems. I’ve tested its answers, and they ARE better.

It’s better than before. The paper states that PaLM 2 is a massive improvement over PaLM 1, and yes, that seems true.

But both in the published tests and in my own brief testing with the updated Bard, it seems like they are still far, far away from GPT-4.

To try to make everything seem better than it is, I think they really focused on being multilingual.

They have multiple sizes of the model to fit different tasks, the ability to translate better between languages, and strength across all the languages they are trained on.

Those are all great things, and they are ahead of OpenAI on them.

But as far as its ability to code, reason, and write goes, it seems that GPT-4 is better right out of the box.

PaLM 2 can be outfitted with various tools and add-ons to help it improve, and some tests seem to have been picked to show the best possible outcome for PaLM 2 instead of a fair side-by-side comparison.

That’s how I read Sam Altman’s tweet: “we are so back.”

I think that’s his subtle dig at Google.

Having watched interviews with Sam Altman, I’d say he seems to deliver his killing blows in quiet tones.

Google’s stock is up about 10% from their I/O presentation.

I’m not one to try to predict stock market moves, but I can’t help but wonder if Google’s price will take a hit once this information is fully digested?

Let me know what you think, subscribe for more spicy AI content.

My name is Wes Roth, thank you for watching.