In the past I've written on the idea that we're all "Expert Language Model Trainers". The basic idea: large language models rely on our blog posts, Wikipedia articles, Reddit votes, arXiv papers, and more, so it's very likely that much of the Internet-using population has contributed in some fashion to the training data underlying these models. Of course, any one person's "marginal impact" is small (most models would be nearly unchanged if a single person's data disappeared), but many people are undoubtedly contributors.

In this post, I'm going to trace the dataset breadcrumbs for the impressive (and impressively viral) new ChatGPT from OpenAI. The main goal: to highlight specific ways that Internet users have likely contributed. My agenda here is not to detract from the mind-boggling accomplishments of the OpenAI team, but rather to highlight the immense amount of upstream human activity and labor these new outputs draw upon, and to describe how we can lean on this dependency as a force for good. Many people may find ChatGPT scary, but I think this data-centric, contribution-centric lens suggests a plausible path for global-scale "governance" of ChatGPT and its inevitable successors.

A Table of Links

To start, here's a table with all the links I'll discuss in this post:

| Source | Description |
| --- | --- |
| The ChatGPT blog post | Details about ChatGPT |
| OpenAI's Model Index | Describes the family of "GPT-3.5" models, to which ChatGPT belongs |
| InstructGPT blog post | Details about building InstructGPT, which ChatGPT follows closely |
| InstructGPT paper | More details than the blog post |
| InstructGPT model card | A summary of the training data, and more |
| Common Crawl data | A key data source |
| WebText | Another key data source, leveraging Reddit votes |

Starting from the ChatGPT Blog Post

Now let's walk through the information we have about the data, to make a rough guess about: (1) whether you deserve some credit (admission of editorial bias: I'm inclined to say "Yes"), (2) which data sources are particularly important, and (3) what this means for the impact of AI systems like GPT-3.5 on society.

We start with the ChatGPT blog post. We learn that the model was trained "using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup." This points us to the InstructGPT blog post, paper, and model card. The InstructGPT blog post provides more detail about how OpenAI was able to solicit valuable human feedback, i.e. get people to share specifically how they feel about GPT's responses (we'll return to this, but first we'll focus on the text data).
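Before turning to the text data, it's worth making "human feedback" slightly more concrete. The comparisons that labelers provide are typically used to train a reward model with a pairwise loss, along the lines of the sketch below. This is a minimal illustration in PyTorch, not OpenAI's code, and the numbers are made up.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise comparison loss for a reward model: push the reward of the
    response a labeler preferred above the reward of the one they rejected."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards for two prompt/response pairs -- made-up numbers,
# just to show the shape of the objective.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(reward_model_loss(chosen, rejected))
```

The policy model is then fine-tuned (with reinforcement learning) to produce responses that score well under this learned reward, which is where the labelers' preferences, and indirectly the customers' prompts, shape the final system.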

Data Sources in the InstructGPT Model Card

Looking now at the InstructGPT model card, we learn that "InstructGPT models are initialized from GPT-3 models, whose training dataset is composed of text posted to the internet or uploaded to the internet (e.g., books)." Four specific sources are named: Common Crawl, WebText (an expanded version), two internet-based book corpora (one of which is perhaps BookCorpus), and English-language Wikipedia. We also learn that just 40 labelers participated in InstructGPT (along with some unknown number of customers who provided ecologically valid prompts).

The Common Crawl is, in short, almost everything on the Internet that doesn't explicitly ask not to be scraped (via a robots.txt file). So if you've posted text to the Internet, you may have already contributed. We also learn from the model card that this corpus was "filtered based on similarity to high-quality reference corpora", so whoever contributed high-quality content to those reference corpora is likely playing an amplified role. Unfortunately, we can't be sure exactly what these high-quality reference corpora look like (though we might guess they include news articles, scientific articles, and the like). This also means that if contributions to the Common Crawl were deemed low-quality, they may not have made it into ChatGPT.
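As an aside, it's easy to check whether a given site even lets Common Crawl's crawler in. Here's a small sketch using Python's standard library; "CCBot" is Common Crawl's user agent, the example domain is arbitrary, and passing this check of course says nothing about whether a particular page survived the quality filtering.

```python
from urllib.robotparser import RobotFileParser

def allows_ccbot(site: str, path: str = "/") -> bool:
    """Return True if the site's robots.txt permits Common Crawl's crawler
    ("CCBot") to fetch the given path."""
    parser = RobotFileParser(f"https://{site}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return parser.can_fetch("CCBot", f"https://{site}{path}")

# Example: check whether a site you've written for even lets the crawler in.
print(allows_ccbot("en.wikipedia.org"))
```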

If you're curious about what specifically is in the Common Crawl, the operators provide a list of common domains. A quick glance shows quite a few opportunities for user-generated content (i.e., your contributions): blogspot, wordpress, pinterest, stackexchange, github, youtube, etc. Of course, there are also a ton of wiki projects and a lot of ".edu" domains corresponding to university webpages. If anyone is aware of more tools that help web users explore Common Crawl and their contributions to it, please do share and I'll be sure to update this section.
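In the meantime, one starting point is Common Crawl's own index API, which lets you look up whether particular URLs were captured in a given monthly crawl. Below is a rough sketch; the crawl ID is just one example (the current list lives at index.commoncrawl.org), and the API returns a 404 when nothing matches, so treat this as a starting point rather than a polished tool.

```python
import json
import requests

def commoncrawl_captures(url_pattern: str, crawl_id: str = "CC-MAIN-2023-06") -> list[dict]:
    """Look up captures matching a URL pattern (e.g. "myblog.example.com/*")
    in one monthly crawl's index. Returns one record per captured page."""
    resp = requests.get(
        f"https://index.commoncrawl.org/{crawl_id}-index",
        params={"url": url_pattern, "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()  # note: the API responds with 404 if there are no matches
    return [json.loads(line) for line in resp.text.splitlines() if line]

for capture in commoncrawl_captures("example.com/*")[:5]:
    print(capture["timestamp"], capture["url"])
```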

It's also important to note that the Common Crawl has been shown to contain quite a bit of "undesirable content" (see this paper from Luccioni and Viviano). This is certainly true of the other data sources we'll describe as well!

Next, we have the WebText dataset (described here). OpenAI wanted another "filtered" version of the web, so they grabbed the content of every webpage linked from a Reddit submission with 3 or more "karma" points. In other words, Reddit users voted on which training data to include. This means that even if you've never written a single sentence of text on the web, if you've voted (up or down) on Reddit, you've contributed to what we might call a "collective intelligence" process for keeping GPT's data high quality.
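The filtering rule itself is remarkably simple, something like the sketch below, where the submission data is made up for illustration and stands in for the real Reddit dumps.

```python
# A minimal sketch of the WebText-style filter: keep only the outbound links
# from Reddit submissions that earned at least 3 karma. The submissions below
# are made-up illustrative data, not the real pipeline.
submissions = [
    {"url": "https://example.com/thoughtful-essay", "karma": 57},
    {"url": "https://example.com/spammy-page", "karma": 1},
    {"url": "https://example.com/solid-tutorial", "karma": 3},
]

KARMA_THRESHOLD = 3
webtext_urls = [s["url"] for s in submissions if s["karma"] >= KARMA_THRESHOLD]
print(webtext_urls)  # only the pages that Reddit voters (implicitly) endorsed
```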

WebText is not available for direct download, but it has been replicated as part of an open-source effort (OpenWebText). It's also worth noting that if you want an open dataset along these lines, one of the best choices may be The Pile from EleutherAI.

The third source mentioned in the InstructGPT model card is the combination of "two internet-based book corpora". One could be BookCorpus, a popular dataset of free books scraped from smashwords.com. It could also include books from Google Books, Project Gutenberg, or somewhere else entirely. Looking back at the GPT-3 paper, we see these datasets described only as "books1" (12 billion tokens) and "books2" (55 billion tokens).

Finally, the fourth -- and perhaps least surprising -- source in the model card is English-language Wikipedia. Much of my doctoral research has discussed how Wikipedia is one of the most important training sources for data-dependent computing in general (see some examples here). Another interesting note from the GPT-3 paper is that Wikipedia is intentionally "over-weighted": the content of English Wikipedia was seen 3.4 times during training, whereas some of the filtered Common Crawl was never seen at all.
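To see what that over-weighting means mechanically, here's a toy sketch of mixture sampling using the per-dataset weights reported in the GPT-3 paper (rounded percentages); the sampling loop is purely illustrative and not how the actual training pipeline is implemented.

```python
import random

# Toy illustration of "over-weighting": documents are drawn from the datasets
# according to fixed mixture weights, so a small, high-quality source like
# Wikipedia gets revisited several times while much of the enormous filtered
# Common Crawl is never seen at all.
mixture = {
    "common_crawl_filtered": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

rng = random.Random(0)
draws = rng.choices(list(mixture), weights=list(mixture.values()), k=10_000)
print({name: draws.count(name) for name in mixture})
```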

To summarize so far: if you have written any high-quality text that appeared in the Common Crawl, submitted a link to Reddit, voted on a Reddit submission, written a book, or edited English Wikipedia, you're almost certainly a contributor to ChatGPT.

Other Info

We have one other surprising claim about ChatGPT's training... a tweet from Elon Musk. Here's the text for posterity: