Google's PaLM 2 uses nearly five times more text data than its predecessor

Google’s PaLM 2 large language model uses nearly five times as much text data for training as its predecessor LLM, CNBC has learned.
In announcing PaLM 2 last week, Google said the model is smaller than the earlier PaLM but uses a more efficient “technique.”
The lack of transparency about training data in AI models has become an increasingly hot topic among researchers.

Sundar Pichai, CEO of Alphabet Inc., during the Google I/O Developers Conference in Mountain View, Calif., on Wednesday, May 10, 2023.

David Paul Morris | Bloomberg | Getty Images

Google’s new large language model, which the company announced last week, uses nearly five times as much training data as its predecessor from 2022, allowing it to perform more advanced coding, math and creative writing tasks, CNBC has learned.

PaLM 2, the company’s new general-use large language model (LLM) unveiled at Google I/O, was trained on 3.6 trillion tokens, according to internal documents seen by CNBC. Tokens, which are strings of words, are an important building block for training LLMs, because they teach the model to predict the next word that will appear in a sequence.

Google’s previous version of PaLM, which stands for Pathways Language Model, was released in 2022 and trained on 780 billion tokens.

While Google has been eager to showcase the power of its AI technology and how it can be embedded into search, email, word processing and spreadsheets, the company has been unwilling to publish the size or other details of its training data. OpenAI, the Microsoft-backed creator of ChatGPT, has also kept secret the specifics of its latest LLM, GPT-4.

The companies say the reason for the lack of disclosure is the competitive nature of the business. Google and OpenAI are rushing to attract users who might want to search for information using chatbots instead of traditional search engines.

But as the AI arms race rages on, the research community is calling for more transparency.

Since unveiling PaLM 2, Google has said the new model is smaller than prior LLMs, which is significant because it means the company’s technology is becoming more efficient while accomplishing more complex tasks. PaLM 2 has 340 billion parameters, an indication of the complexity of the model, according to internal documents. The initial PaLM has 540 billion parameters.
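The reported figures can be checked with quick arithmetic. The sketch below is illustrative only: it uses the token and parameter counts cited above from internal documents seen by CNBC, and the variable names are made up for this example. The tokens-per-parameter figure is a rough derived ratio, not a number Google has published.

```python
# Figures as reported in this article (internal documents seen by CNBC).
# The dictionary layout and names are hypothetical, chosen for illustration.
MODELS = {
    "PaLM": {"tokens": 780_000_000_000, "params": 540_000_000_000},
    "PaLM 2": {"tokens": 3_600_000_000_000, "params": 340_000_000_000},
}

old, new = MODELS["PaLM"], MODELS["PaLM 2"]

# How much more training text PaLM 2 saw compared with PaLM.
token_ratio = new["tokens"] / old["tokens"]

# How PaLM 2's parameter count compares with PaLM's.
param_ratio = new["params"] / old["params"]

# Training tokens per parameter: a rough, derived measure of how much
# data each model saw relative to its size.
tokens_per_param = {
    name: m["tokens"] / m["params"] for name, m in MODELS.items()
}

print(f"Training tokens, PaLM 2 vs. PaLM: {token_ratio:.2f}x")
print(f"Parameters, PaLM 2 vs. PaLM: {param_ratio:.2f}x")
for name, value in tokens_per_param.items():
    print(f"{name}: {value:.1f} tokens per parameter")
```

On these numbers, PaLM 2 was trained on roughly 4.6 times as many tokens as PaLM while having about 63% as many parameters, which is consistent with the "nearly five times" description and with Google's claim of a smaller model.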

Google did not immediately provide comment for this story.

Google said in a blog post about PaLM 2 that the model uses a “new technique” called “compute-optimal scaling.” That makes the LLM “more efficient with overall better performance, including faster inference, fewer parameters to serve, and a lower serving cost.”

In announcing PaLM 2, Google confirmed CNBC’s previous reporting that the model is trained on 100 languages and performs a broad range of tasks. It’s already being used to power 25 features and products, including the company’s experimental chatbot Bard. It’s available in four sizes, from smallest to largest: Gecko, Otter, Bison and Unicorn.

PaLM 2 is more powerful than any existing model, based on public disclosures. Facebook’s LLM, called LLaMA, which it announced in February, was trained on 1.4 trillion tokens. The last time OpenAI shared ChatGPT’s training size was with GPT-3, when the company said it was trained on 300 billion tokens. OpenAI released GPT-4 in March, and said it shows “human-level performance” on several professional tests.

LaMDA, a conversation LLM that Google introduced two years ago and touted in February alongside Bard, was trained on 1.5 trillion tokens, according to the latest documents seen by CNBC.

As new AI applications quickly reach the mainstream, so does the debate over the underlying technology.

El Mahdi El Mhamdi, a senior Google Research scientist, resigned in February over the company’s lack of transparency. On Tuesday, OpenAI CEO Sam Altman testified at a hearing of the Senate Judiciary Subcommittee on Privacy and Technology, and agreed with lawmakers that a new system is needed to deal with AI.

“For a technology that is so new, we need a new framework,” Altman said. “Certainly companies like ours have a lot of responsibility for the tools we put out into the world.”

CNBC’s Jordan Novet contributed to this report.

Avery Kensington