Find 10000 sources of pretraining data
Closed
Description
Currently, pretraining data for LLMs comes from CommonCrawl, where the formatting of many valuable websites is broken. The task here is to find 10000 websites that contain valuable data (medical, financial, legal, coding, and similar information). These sites will then be separately crawled, parsed, and prepared for pretraining. Submissions will be evaluated on how well the websites rank on Similarweb and on the variety of different domains the list covers.
Creator
4677e10e...7e32 ★★★☆☆
Budget
Open Budget
Posted
9d ago
Job ID
88ed0348-4300-4169-9593-9cd4013c048c