
Common Crawl
ACTIVE
Common Crawl Foundation is an American 501(c)(3) non-profit that builds and maintains an open, publicly available web crawl data repository used by researchers and AI companies to train language models.
Founded 2007 · Beverly Hills, United States
Common Crawl Foundation is a California-registered 501(c)(3) non-profit organization, founded in 2007 by Gil Elbaz with the mission of democratizing access to web data. The organization builds and maintains an open repository of web crawl data that is available free of charge to anyone. Data is collected by the CCBot crawler (based on Apache Nutch), and the archive is stored on Amazon S3 through the AWS Open Data Sponsorship Program. The dataset comprises more than 250 billion pages and grows by several billion new pages each month.
By 2024, Common Crawl data had been cited in more than 10,000 scientific papers. The organization remains a key provider of training data for leading language models, including GPT-3 and GPT-4 (OpenAI), Gemini (Google DeepMind), LLaMA (Meta), and Anthropic's models.
The registered office is located in Beverly Hills, California, USA. For a long time Common Crawl employed a single person; since 2023 the organization has been expanding its team and actively seeking grants. The Executive Director is Rich Skrenta and the Board Chair is Gil Elbaz.
Founders
Gil Elbaz: American entrepreneur, co-founder of Applied Semantics (acquired by Google); founded Common Crawl in 2007 as an open repository of web crawl data.
Domain of activity
Global presence
HQ Beverly Hills
Scale & funding
Founded 2007
Organizational relations
3 strategic relations
Profile & metadata
3 classifications · 7 external links
Classification
- Status: active
- Development stage: mature
- Listed on: No