Common Crawl

01 · About

Founded 2007 · Beverly Hills, United States

Common Crawl Foundation is a California-registered 501(c)(3) non-profit organization, founded in 2007 by Gil Elbaz with the mission of democratizing access to web data. It builds and maintains an open repository of web crawl data that is available free of charge to anyone. Data is collected by the CCBot crawler (based on Apache Nutch), and the archive is stored on Amazon S3 through the AWS Open Data Sponsorship Program. The dataset comprises more than 250 billion pages and grows by several billion new pages each month.

By 2024, Common Crawl data had been cited in more than 10,000 scientific papers. The organization has long been a key provider of training data for leading language models, including GPT-3 and GPT-4 (OpenAI), Gemini (Google DeepMind), LLaMA (Meta), and Anthropic's models.

The registered office is located in Beverly Hills, California, USA. For a long time Common Crawl employed a single person; since 2023 the organization has been expanding its team and actively seeking grants. The Executive Director is Rich Skrenta and the Board Chair is Gil Elbaz.

Founders

Gil Elbaz
Founder
American entrepreneur, co-founder of Applied Semantics (acquired by Google); founded Common Crawl in 2007 as an open repository of web crawl data.

02 · What they do

Domain of activity

03 · Geography

Global presence

HQ: Beverly Hills

04 · Metrics

Scale & funding

Founded: 2007

Employees: fewer than 20 (small non-profit organization)

Total funding: Funded by the Elbaz Family Foundation Trust (primary sponsor for the first ~15 years) and donations from AI companies: OpenAI and Anthropic (each contributing USD 250,000 in 2023) and other entities from the AI industry. Data storage on AWS is covered under the AWS Open Data Sponsorship Program.

Development stage: mature

Listing: Private (not listed)

Status: active

06 · Relations

Organizational relations

Strategic relations

Funded by
Amazon Web Services sponsors the storage of Common Crawl data under the AWS Open Data Sponsorship Program.
Funded by
OpenAI made a donation of $250,000 in 2023.
Funded by
Anthropic donated $250,000 in 2023.

07 · Classification

Profile & metadata

Classification

Status: active
Development stage: mature
Listed on: No (private)

External links