diff --git a/markdown/LLMs, LRMs, and Scraping Are Killing the Internet and Must Stay in the Lab.md b/markdown/LLMs, LRMs, and Scraping Are Killing the Internet and Must Stay in the Lab.md
index cd26556..1279e9e 100644
--- a/markdown/LLMs, LRMs, and Scraping Are Killing the Internet and Must Stay in the Lab.md
+++ b/markdown/LLMs, LRMs, and Scraping Are Killing the Internet and Must Stay in the Lab.md
@@ -14,7 +14,7 @@ The AI landscape in 2025 is a dystopian fever dream, a chaotic gold rush where t
 The transformer architecture, a statistical trick for predicting text, has been inflated into a godlike entity, worshipped with fanatical zeal while ignoring the wreckage it leaves behind. This obsession with scale is collective madness. Models are trained on datasets so colossal - trillions of tokens scraped from the internet’s cesspool, books, and corporate sludge - that even their creators can’t untangle the mess.
 
-Every company, every startup, every wannabe AI guru is unleashing armies of scrapers to plunder the web, hammering servers and destabilizing the digital ecosystem. A March 2025 McKinsey survey reveals that over 80% of organizations deploying generative AI see no tangible enterprise-level impact, suggesting this rush is driven by hype, not results [McKinsey, 2025](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai). High-profile failures - like Samsung banning ChatGPT after code leaks, Google’s Bard hallucinating, Zillow’s AI pricing flop costing millions, and IBM Watson Health’s erroneous cancer recommendations - underscore the chaos [Lakera, 2024](https://www.lakera.ai/blog/risks-of-ai). We’re not building progress; we’re orchestrating a digital apocalypse.
+Every company, every startup, every wannabe AI guru is unleashing armies of scrapers to plunder the web, hammering servers and destabilizing the digital ecosystem. High-profile failures - like Samsung banning ChatGPT after code leaks, Google’s Bard hallucinating, Zillow’s AI pricing flop costing millions, and IBM Watson Health’s erroneous cancer recommendations - underscore the chaos [Lakera, 2024](https://www.lakera.ai/blog/risks-of-ai). We’re not building progress; we’re orchestrating a digital apocalypse.
 
 ## Apple’s *Illusion of Thinking*: A Flamethrower to AI’s Lies
 
@@ -26,7 +26,7 @@ These models are erratic, nailing one puzzle only to choke on a near-identical o
 The “thinking processes” LRMs boast are a marketing stunt, revealed by Apple as a chaotic mess of incoherent leaps, dead ends, and half-baked ideas - not thought, but algorithmic vomit. LRMs fail to use explicit algorithms, even when essential, faking it with statistical sleight-of-hand that collapses under scrutiny. This brittleness isn’t theoretical: IBM Watson Health’s cancer AI made erroneous treatment recommendations, risking malpractice, and Google’s Bard hallucinated inaccurate information [Lakera, 2024](https://www.lakera.ai/blog/risks-of-ai).
 
-A January 2025 McKinsey report notes that 50% of employees worry about AI inaccuracy, 51% fear cybersecurity risks, and many cite data leaks, aligning with Apple’s findings of unreliable outputs [McKinsey, 2025](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-work). A 2023 study in *Frontiers* highlights AI’s inability to ethically solve problems or explain results, further questioning its readiness for critical applications [Frontiers, 2023](https://www.frontiersin.org/articles/10.3389/frai.2023.1148154/full). This isn’t a warning - it’s a guillotine.
+A January 2025 McKinsey report notes that 50% of employees worry about AI inaccuracy, 51% fear cybersecurity risks, and many cite data leaks, aligning with Apple’s findings of unreliable outputs [McKinsey, 2025](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-work). This isn’t a warning - it’s a guillotine.
 
 ## Amodei’s Confession: We’re Flying Blind
 
@@ -44,11 +44,11 @@ The AI industry’s data addiction is a digital plague, and web scraping is its
 But Scrapy’s non-sane defaults and aggressive concurrency are a death sentence for servers. Its defaults prioritize speed over ethics, overwhelming servers with relentless requests. Small websites, blogs, and forums - run by individuals or small businesses - crash or rack up crippling bandwidth costs. The X post brags about handling “tens of thousands of pages,” but each page is a sledgehammer to someone’s infrastructure.
 
-The internet thrives on open access, but scraping is strangling it. Websites implement bot protections, paywalls, or IP bans, locking out legitimate users. The X post admits to “bot protections” and “IP bans” as challenges, but Scrapy’s workarounds escalate this arms race, turning the web into a walled garden. A June 2025 *Nature* article reports that AI-driven scraping is overwhelming academic websites, with sites like DiscoverLife receiving millions of daily hits, slowing them to unusability [Nature, 2025](https://www.nature.com/articles/d41586-025-01743-9).
+The internet thrives on open access, but scraping is strangling it. Websites implement bot protections, paywalls, or IP bans, locking out legitimate users. The X post admits to “bot protections” and “IP bans” as challenges, but Scrapy’s workarounds escalate this arms race, turning the web into a walled garden.
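The “non-sane defaults” complaint above is concrete: out of the box, Scrapy allows up to 16 concurrent requests with no download delay, auto-throttling disabled, and - at the library level - no robots.txt enforcement. A minimal sketch of what the opposite looks like follows; the user-agent string and contact address are placeholders, not a recommendation of any specific values:

```python
# Hypothetical "polite" overrides for Scrapy's speed-first defaults
# (stock defaults: CONCURRENT_REQUESTS=16, DOWNLOAD_DELAY=0,
# AUTOTHROTTLE_ENABLED=False, ROBOTSTXT_OBEY=False at the library level).
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # honor robots.txt instead of ignoring it
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one in-flight request per site
    "DOWNLOAD_DELAY": 5.0,                # pause 5 seconds between requests to a domain
    "AUTOTHROTTLE_ENABLED": True,         # back off automatically when a server slows
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    # Identify the crawler so site owners can reach (or block) it;
    # the name and address here are illustrative placeholders.
    "USER_AGENT": "example-research-bot (contact@example.org)",
}
```

Dropped into a project’s `settings.py` or a spider’s `custom_settings`, overrides like these trade raw speed for restraint - precisely the trade the defaults refuse to make.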
 Scrapers plunder content without consent, stealing intellectual property, leaving creators - writers, artists, publishers - with no compensation. The X post’s “clean datasets” fantasy ignores the dirty truth: this data is pilfered. A February 2025 EPIC report calls this the “great scrape,” noting that companies like OpenAI allegedly scraped New York Times articles, copyrighted books, and YouTube videos without permission, violating privacy and IP rights [EPIC, 2025](https://epic.org/scraping-for-me-not-for-thee-large-language-models-web-data-and-privacy-problematic-paradigms/).
 
-Scraping collects sensitive personal information without consent, raising privacy concerns. A 2024 OECD report highlights how scraping violates privacy laws and the OECD AI Principles, risking identity fraud and cyberattacks [OECD, 2024](https://oecd.ai/en/data-scraping-challenge). A May 2025 Simplilearn article notes that scraping exacerbates AI’s privacy violations, advocating for GDPR and HIPAA compliance [Simplilearn, 2025](https://www.simplilearn.com/challenges-of-artificial-intelligence-article).
+Scraping collects sensitive personal information without consent, raising privacy concerns. A 2024 OECD report highlights how scraping violates privacy laws and the OECD AI Principles, risking identity fraud and cyberattacks. A May 2025 Simplilearn article notes that scraping exacerbates AI’s privacy violations, advocating for GDPR and HIPAA compliance [Simplilearn, 2025](https://www.simplilearn.com/challenges-of-artificial-intelligence-article).
 
 Millions of scrapers clog networks, slowing access, while data centers strain, driving up energy costs and carbon emissions. Websites lose faith, shutting down or going offline, shrinking the internet’s diversity. Scrapy’s defenders claim it’s “essential” for LLMs, but that’s a lie. This data hunger is a choice. By glorifying server-killing tools, we’re murdering the internet’s soul. Deploying LLMs built on this stolen foundation isn’t reckless - it’s immoral.
 
@@ -60,7 +60,7 @@ Scraped datasets are a toxic stew of biases, errors, and garbage. Models inherit
 Transformers’ attention mechanisms “hallucinate” nonexistent connections, sparking errors that could mean lawsuits or worse in production. Models overfit to scraped data quirks, brittle in real-world scenarios - a context shift, and they’re lost. LRMs burn obscene resources for negligible gains. Apple showed their “reasoning” doesn’t scale, yet we torch energy grids to keep the farce alive.
 
-A 2024 *ScienceDirect* article lists LLM vulnerabilities: prompt injection manipulates outputs, training data poisoning introduces biases, PII breaches occur during training, insecure output handling generates harmful content, and denial-of-service attacks disrupt availability [ScienceDirect, 2024](https://www.sciencedirect.com/science/article/pii/S2666659024000130). A 2021 arXiv paper outlines six risk areas for LLMs, including discrimination, toxicity, misinformation, and environmental harms, all amplified by flawed data [arXiv, 2021](https://arxiv.org/abs/2112.04359). This isn’t a system - it’s a house of horrors, and deploying it on stolen, server-killing data is lunacy.
+A 2021 arXiv paper outlines six risk areas for LLMs, including discrimination, toxicity, misinformation, and environmental harms, all amplified by flawed data [arXiv, 2021](https://arxiv.org/abs/2112.04359). This isn’t a system - it’s a house of horrors, and deploying it on stolen, server-killing data is lunacy.
 
 ## Why Deployment Is a Betrayal
 
@@ -80,7 +80,7 @@ Governments are asleep, with no framework to govern AI or scraping’s risks, no
 Autonomous AI in critical systems, powered by flawed LRMs, is a death sentence - Apple’s research shows failures go unchecked without human oversight, amplifying harm exponentially. Scraping’s data theft, glorified by the X post, steals from creators, undermining the web’s creative ecosystem. Deploying LLMs built on this is endorsing piracy at scale. Scraping’s server attacks are killing the open web, forcing websites behind paywalls or offline, shrinking the internet’s diversity. LLMs are complicit in this murder.
 
-Scraped data fuels LLMs that churn soulless text, drowning human creativity and turning culture into algorithmic sludge, disconnecting us from authenticity. A 2024 MIT Press article on ChatGPT warns of its potential for malicious misuse, privacy violations, and bias propagation, exacerbated by unregulated data practices [MIT Press, 2024](https://direct.mit.edu/dint/article/6/1/150/119771/The-Limitations-and-Ethical-Considerations-of). A 2020 Harvard Gazette report notes that AI’s lack of oversight risks societal harm, with regulators ill-equipped to keep pace [Harvard Gazette, 2020](https://news.harvard.edu/gazette/story/2020/10/ethical-concerns-mount-as-ai-takes-bigger-decision-making-role/).
+Scraped data fuels LLMs that churn soulless text, drowning human creativity and turning culture into algorithmic sludge, disconnecting us from authenticity. A 2020 Harvard Gazette report notes that AI’s lack of oversight risks societal harm, with regulators ill-equipped to keep pace [Harvard Gazette, 2020](https://news.harvard.edu/gazette/story/2020/10/ethical-concerns-mount-as-ai-takes-bigger-decision-making-role/).
 
 ## Toxic Incentives: Profit Over Existence
 
@@ -116,7 +116,7 @@ These steps aren’t optional - they’re the only way to save ourselves from th
 The AI industry’s peddling a fairy tale, and we’re the suckers buying it. LLMs and LRMs aren’t saviors - they’re ticking bombs wrapped in buzzwords, built on a dying internet’s ashes. Apple’s *The Illusion of Thinking* and Amodei’s confession are klaxons blaring in our faces. Scrapy’s server-killing rampage, glorified on X, is the final straw - we’re not just risking failure; we’re murdering the digital world that sustains us.
 
-From high-profile deployment failures - Samsung, Google, Zillow, IBM - to the ethical quagmire of web scraping, from AI’s environmental toll to its persistent opacity, the evidence is overwhelming. Over 80% of organizations see no tangible AI impact, yet the rush continues [McKinsey, 2025](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai). IBM warns of escalating risks like data leakage [IBM, 2025](https://www.ibm.com/think/insights/ai-agents-2025-expectations-vs-reality). Lakera documents privacy violations from scraping, amplifying harm [Lakera, 2024](https://www.lakera.ai/blog/risks-of-ai). This isn’t a mistake - it’s a betrayal of humanity’s trust.
+From high-profile deployment failures - Samsung, Google, Zillow, IBM - to the ethical quagmire of web scraping, from AI’s environmental toll to its persistent opacity, the evidence is overwhelming. IBM warns of escalating risks like data leakage [IBM, 2025](https://www.ibm.com/think/insights/ai-agents-2025-expectations-vs-reality). Lakera documents privacy violations from scraping, amplifying harm [Lakera, 2024](https://www.lakera.ai/blog/risks-of-ai). This isn’t a mistake - it’s a betrayal of humanity’s trust.
 
 Deploying LLMs and LRMs, fueled by scraping’s destruction, isn’t just dumb - it’s a crime against our survival. Lock them in the lab, crack the code, and stop the internet’s slaughter, or brace for the apocalypse. The clock’s ticking, and we’re out of excuses.
 
@@ -125,18 +125,12 @@
 - Shojaee, Parshin, et al. “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” Apple Machine Learning Research, June 2025, https://machinelearning.apple.com/research/illusion-of-thinking.
 - Amodei, Dario. “Essay on AI Interpretability.” Personal website, 2025, quoted in Futurism, https://futurism.com/anthropic-ceo-admits-ai-ignorance.
 - Anonymous. “The web scraping tool Scrapy.” X post, 2025, https://x.com/birgenbilge_mk/status/1930558228590428457?s=46
-- McKinsey & Company. “The state of AI: How organizations are rewiring to capture value.” March 2025, https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai.
 - Lakera. “AI Risks: Exploring the Critical Challenges of Artificial Intelligence.” 2024, https://www.lakera.ai/blog/risks-of-ai.
 - McKinsey & Company. “AI in the workplace: A report for 2025.” January 2025, https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-work.
 - IBM. “AI Agents in 2025: Expectations vs. Reality.” March 2025, https://www.ibm.com/think/insights/ai-agents-2025-expectations-vs-reality.
 - Simplilearn. “Top 15 Challenges of Artificial Intelligence in 2025.” May 2025, https://www.simplilearn.com/challenges-of-artificial-intelligence-article.
-- Nature. “Web-scraping AI bots cause disruption for scientific databases and journals.” June 2025, https://www.nature.com/articles/d41586-025-01743-9.
 - EPIC. “Scraping for Me, Not for Thee: Large Language Models, Web Data, and Privacy-Problematic Paradigms.” February 2025, https://epic.org/scraping-for-me-not-for-thee-large-language-models-web-data-and-privacy-problematic-paradigms/.
-- OECD. “The AI data scraping challenge: How can we proceed responsibly?” 2024, https://oecd.ai/en/data-scraping-challenge.
-- ScienceDirect. “A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly.” 2024, https://www.sciencedirect.com/science/article/pii/S2666659024000130.
 - arXiv. “Ethical and social risks of harm from Language Models.” 2021, https://arxiv.org/abs/2112.04359.
-- Frontiers. “Specific challenges posed by artificial intelligence in research ethics.” 2023, https://www.frontiersin.org/articles/10.3389/frai.2023.1148154/full.
-- MIT Press. “The Limitations and Ethical Considerations of ChatGPT.” 2024, https://direct.mit.edu/dint/article/6/1/150/119771/The-Limitations-and-Ethical-Considerations-of.
 - Harvard Gazette. “Ethical concerns mount as AI takes bigger decision-making role.” 2020, https://news.harvard.edu/gazette/story/2020/10/ethical-concerns-mount-as-ai-takes-bigger-decision-making-role/.
 - TechTarget. “Generative AI Ethics: 11 Biggest Concerns and Risks.” March 2025, https://www.techtarget.com/searchenterpriseai/feature/Generative-AI-Ethics-11-Biggest-Concerns-and-Risks.
 - WIRED. “The Dark Risk of Large Language Models.” 2022, https://www.wired.com/story/dark-risk-large-language-models/.