Robots Atlas

Common Crawl

ACTIVE
CommonCrawl · CCF · Common Crawl Foundation

Common Crawl Foundation is an American 501(c)(3) non-profit that builds and maintains an open, publicly available web crawl data repository used by researchers and AI companies to train language models.

Founded: 2007 · United States
HQ: Beverly Hills · United States
Total funding: Funded by the Elbaz Family Foundation Trust (primary sponsor for the first ~15 years) and donations from AI companies: OpenAI and Anthropic (each contributing USD 250,000 in 2023) and other entities from the AI industry. Data storage on AWS is covered under the AWS Open Data Sponsorship Program.
01 · About

Common Crawl

Founded 2007 · Beverly Hills, United States

Common Crawl Foundation is a California-registered 501(c)(3) non-profit organization, founded in 2007 by Gil Elbaz with the mission of democratizing access to web data. It builds and maintains an open repository of web crawl data that is available free of charge to anyone. Data is collected by the CCBot crawler (based on Apache Nutch), and the archive is stored on Amazon S3 through the AWS Open Data Sponsorship Program. The dataset comprises more than 250 billion pages and grows by several billion new pages each month.

By 2024, Common Crawl data had been cited in more than 10,000 scientific papers. The organization remains a key provider of training data for leading language models, including GPT-3 and GPT-4 (OpenAI), Gemini (Google DeepMind), LLaMA (Meta), and Anthropic's models.

The registered office is in Beverly Hills, California, USA. For a long time Common Crawl employed a single person; since 2023 the organization has been expanding its team and actively seeking grants. The Executive Director is Rich Skrenta and the Board Chair is Gil Elbaz.

  • Gil Elbaz

    Founder

    American entrepreneur, co-founder of Applied Semantics (acquired by Google); founded Common Crawl in 2007 as an open repository of web crawl data.

02 · What they do

Domain of activity

03 · Geography

Global presence

HQ Beverly Hills
04 · Metrics

Scale & funding

Founded: 2007
Employees: fewer than 20 (small non-profit organization)
Total funding: Funded by the Elbaz Family Foundation Trust (primary sponsor for the first ~15 years) and donations from AI companies: OpenAI and Anthropic (each contributing USD 250,000 in 2023) and other entities from the AI industry. Data storage on AWS is covered under the AWS Open Data Sponsorship Program.
Development stage: mature
Listing: private (not listed)
Status: active
06 · Relations

Organizational relations

3 strategic relations

Strategic relations

  • Funded by
    Amazon Web Services sponsors the storage of Common Crawl data under the AWS Open Data Sponsorship Program.
  • Funded by
    OpenAI made a donation of $250,000 in 2023.
  • Funded by
    Anthropic donated $250,000 in 2023.
07 · Classification

Profile & metadata

3 classification · 7 external links

Classification

  • Status: active
  • Development stage: mature
  • Listed on: No (private)