Odys Podcast: The High Stakes Growth Show
35% of the internet is blocking AI crawlers right now, most of it by accident, through default CDN settings that site owners never touched. Stephen Burns is the Web Intelligence Lead at Common Crawl Foundation, the nonprofit that crawls 2.3 billion pages per month and provides the training data used by GPT, Claude, Llama, and most major LLMs. He covers harmonic centrality, the algorithm that determines which sites get crawled and end up in AI training sets, and why it operates completely differently from Google's PageRank. Sites with JavaScript-heavy builds, slow load times, or CDN defaults that block AI bots may not exist as far as LLMs are concerned. This episode also covers the EU AI Act's August disclosure deadline, which will require AI companies to publish the top 1,000 domains they trained on, giving SEOs their first way to verify AI visibility. Common Crawl has been cited in over 10,000 academic research papers, and its data underpins over 80% of the training tokens in GPT-3. Burns works at the intersection of web-scale data and search infrastructure, the part of the pipeline that most SEOs have never had access to before.
30 episodes