Rising · 1 source · last seen 1h ago · first seen 1h ago

I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.

Posted this to r/MachineLearning a couple weeks ago (30K views, 100+ upvotes) and have been meaning to share it here where the fine-tuning angle is more directly relevant. I spent years building and processing a complete Usenet corpus from 1980 to 2013. Here’s why it might matter for local model wo

Lead: r/LocalLLaMA · Bigness: 28 · built103b-tokenusenetcorpus1980

📡 Coverage

1 news source

🟠 Hacker News

🔴 Reddit

127 upvotes across 1 sub

📈 Google Trends

Receipts (all sources)

I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.

REDDIT · r/LocalLLaMA · 1h ago · ⬆ 127 · 💬 56

score 130