Jon Goodey
The Mysterious Dataset Behind Google's Advanced AI System, Bard: Unravelling the Known and Unknown

Introduction to Google's Bard
Google's Bard is one of the company's most advanced AI systems, designed to engage in dialogue with people. However, little is known about Infiniset, the dataset of internet content used to train the underlying model. Only 12.5% of the training data came from C4, a public dataset of crawled web content, while another 12.5% came from Wikipedia. Google remains vague about the sources of the remaining scraped data, though there are hints about which sites comprise it.
Bard and the LaMDA Language Model
Bard is based on the LaMDA language model, which was trained on a dataset called Infiniset. Infiniset is a blend of internet content deliberately chosen to enhance the model's ability to engage in dialogue. The research paper explains the reasoning behind this composition: "...this composition was chosen to achieve a more robust performance on dialog tasks...while still keeping its ability to perform other tasks like code generation." (The research paper uses the spellings "dialog" and "dialogs," which are conventional within computer science.)

Composition of the Infiniset Dataset
The dataset comprises the following mix:
- 12.5% C4-based data
- 12.5% English-language Wikipedia
- 12.5% code documents from programming Q&A websites, tutorials, and other sources
- 6.25% English web documents
- 6.25% non-English web documents
- 50% dialogue data from public forums
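To make the proportions concrete, here is a minimal sketch in Python. The source names and percentages come from the research paper; the grouping into "named" and "unnamed" sources is added here purely for illustration:

```python
# Infiniset composition as reported in the LaMDA research paper.
infiniset = {
    "C4-based data": 12.5,
    "English Wikipedia": 12.5,
    "Code documents (Q&A sites, tutorials, etc.)": 12.5,
    "English web documents": 6.25,
    "Non-English web documents": 6.25,
    "Dialogue data from public forums": 50.0,
}

assert sum(infiniset.values()) == 100.0  # the mix accounts for the full dataset

# Only C4 and Wikipedia are named sources; the rest is unspecified scraped content.
named = infiniset["C4-based data"] + infiniset["English Wikipedia"]
print(f"Named sources: {named}%")          # Named sources: 25.0%
print(f"Unnamed sources: {100 - named}%")  # Unnamed sources: 75.0%
```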
It is striking that only 25% of the data comes from named sources, with the remaining 75% consisting of words scraped from the internet. The research paper does not reveal how the data was obtained, which websites it came from, or any other details about the scraped content; Google offers only generalised descriptions like "Non-English web documents." The term "murky" best describes the 75% of data that Google used to train Bard.

C4 and Wikipedia: Known Data Sources
The first two parts of Infiniset (C4 and Wikipedia) consist of known data. The C4 dataset is a specially filtered version of Common Crawl, an open-source dataset. Common Crawl is a registered non-profit organisation that crawls the internet every month to create free datasets that anyone can use. To limit C4 to the main content of pages, the raw Common Crawl data is cleaned up by removing thin content, obscene words, lorem ipsum placeholder text, and navigational menus, and by deduplicating repeated text. The purpose of this filtering was to remove gibberish and retain examples of natural English.
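As a rough illustration of the kinds of heuristics described for C4, here is a simplified sketch. This is not Google's actual pipeline; the thresholds and the blocklist are placeholders:

```python
def clean_page(text: str) -> str | None:
    """Apply simplified C4-style filters to one crawled page.

    Returns the cleaned text, or None if the page should be dropped.
    """
    # Placeholder blocklist; the real C4 used a much larger "bad words" list.
    blocklist = {"obscene-word-1", "obscene-word-2"}
    lowered = text.lower()

    # Drop pages containing placeholder or blocklisted content.
    if "lorem ipsum" in lowered:
        return None
    if any(word in lowered for word in blocklist):
        return None

    # Keep only lines that end in terminal punctuation, which tends to
    # discard navigational menus and other boilerplate fragments.
    lines = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith((".", "!", "?", '"'))
    ]

    # Drop pages left with too little content ("thin content").
    if len(lines) < 3:
        return None
    return "\n".join(lines)


def deduplicate(pages: list[str]) -> list[str]:
    """Remove exact duplicate pages (C4 deduplicated at a finer granularity)."""
    seen, unique = set(), []
    for page in pages:
        if page not in seen:
            seen.add(page)
            unique.append(page)
    return unique
```

The real filters operate at a far larger scale and with more rules, but the sketch conveys how mechanical the cleaning is: pages are kept or dropped based on simple surface features of the text.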
Anomalies in the C4 Dataset
A second research paper discovered anomalies in the original C4 dataset: the filtering disproportionately removed web pages aligned with Hispanic and African American English. Another finding was that 51.3% of the C4 dataset consisted of web pages hosted in the United States. When building a dataset from a web scrape, reporting the domains from which the text is scraped is integral to understanding the dataset, because the data collection process can lead to a significantly different distribution of internet domains than expected.
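Reporting that distribution is straightforward once the source URLs are retained. Here is a minimal sketch, assuming a list of scraped URLs is available (the URLs below are hypothetical):

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical URLs standing in for the source list of a web scrape.
urls = [
    "https://example.com/article/1",
    "https://example.com/article/2",
    "https://en.wikipedia.org/wiki/Common_Crawl",
    "https://blog.example.co.uk/post",
]

# Tally pages by host so the dataset's domain distribution can be reported.
hosts = Counter(urlsplit(url).netloc for url in urls)

total = sum(hosts.values())
for host, count in hosts.most_common():
    print(f"{host}: {count} pages ({100 * count / total:.1f}%)")
```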
Concerns and the Need for Transparency
Some publishers might feel uneasy about their websites being used to train AI systems, fearing that these systems could render their websites obsolete and ultimately cause them to disappear. While the validity of these concerns is yet to be determined, they reflect genuine apprehensions expressed by publishers and members of the search marketing community. Google should increase transparency regarding the websites used to train their AI or, at the very least, publish an easily accessible transparency report about the data utilised.
The Future of AI and Web Content
As AI systems like Google's Bard continue to advance, it is essential to understand their training processes and the sources of the data they use. Greater transparency can help address concerns about the potential impact of AI on the online ecosystem and foster trust between tech companies, publishers, and the public. By shedding light on the origins of datasets like Infiniset and addressing potential biases or anomalies, we can work towards a future where AI systems are more accountable, better understood, and coexist harmoniously with web content creators.