A search engine with curated sources, free from SEO slop. This project also aims to recommend websites to other aggregators and search engines to help them improve their own results.
Google and Bing suck now; they're overrun by SEO slop. The internet as most people know it is a cesspool of garbage. The good internet that we are all nostalgic for is still out there, but it's not easy to find. This project aims to fix that. Other projects such as Marginalia and Wiby only index obscure websites, which makes them less useful as general-purpose search engines. This project WILL include popular websites in the search results.
- https://github.com/MathGeniusJodie/slopless_embeddings
- https://github.com/MathGeniusJodie/tl_readability
- Automated scraping of safelist sources
- Automated scraping of blocklist sources
- Crawler
- HTML content extraction initial implementation
- Fancy HTML content extraction
- Cross-language support
- Vector embedding index
- ANN search
- Website UI
- Independent blogs
- Reputable news sources
- High quality publications
- Well moderated forums and subreddits
- Academic journals and papers
- Wikis
- Online encyclopedias
- Open source repositories
- User customizable additions
- substack
- neocities
- https://lobste.rs/
- outgoing links from sites in safelist
- hackernews
- SEO slop
- Fake news and propaganda
- Clickbait sites
- Uncurated and unmoderated websites
- User customizable additions
- https://raw.githubusercontent.com/kagisearch/smallweb/refs/heads/main/smallweb.txt
- https://thenumb.at/Graphics-Blogroll/
- wikipedia citations
- various webrings, blogrolls and directories
- jodie.website
- https://smallweb.cc
- https://xn--sr8hvo.ws/directory
- https://ooh.directory/
- https://blogroll.org/
- https://1mb.club
- https://indieweb.org/blogroll (a blogroll of blogrolls)
- https://melonland.net/surf-club
- https://theinternetisshit.xyz/
- https://brisray.com/web/webring-list.htm
- https://www.404pagefound.com/
- https://webring.theoldnet.com/
- aggregator blogs
- https://marginalia-search.com/
- https://searchmysite.net/
- wiby
- google maps listings of brick and mortar places
- https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources/Perennial_sources
- https://getindie.wiki/listings/
- https://mwmbl.org/
- https://マリウス.com/the-small-web-101/
- https://fmhy.net
- https://www.reddit.com/r/InternetIsBeautiful/
- https://github.com/atakanaltok/awesome-useful-websites
- https://manuelmoreale.com/blogroll
- https://webring.bucketfish.me/
- minimal.gallery
- https://theforest.link/
- https://laughingmeme.org/links/
- https://maurycyz.com/real_pages/
- https://randomdailyurls.com/archive
- https://en.wikipedia.org/wiki/List_of_fake_news_websites
- https://github.com/popcar2/BadWebsiteBlocklist
- https://github.com/rjaus/awesome-ublacklist
- https://github.com/NotaInutilis/Super-SEO-Spam-Suppressor
- https://github.com/NotaInutilis/no-qanon
- https://danny0838.github.io/content-farm-terminator/en/
- https://github.com/FranklyRocks/OnlyHuman
- https://github.com/PrejudiceNeutrino/YouTube_Channels
- https://github.com/ErikCH/DevYouTubeList
- https://educational-channels.com
- kagi small yt
- Investigate https://yacy.net/
- Investigate https://github.com/medialab/hyphe
- Offline/datahoarder mode? https://www.httrack.com/ https://en.wikipedia.org/wiki/Heritrix
- Don't crawl websites that have search pages of their own and integrate their search instead?
- https://caddyserver.com
- ahans30/Binoculars
- pangram / ahans30/Binoculars style slop detection for ranking and filtering
I want to add an option to filter out AI-generated content because many people want that, but I don't want to make it the default. AI content that was prompted and well curated by humans is kinda fine in my book.
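A minimal sketch of how that opt-in behaviour could look at ranking time. Everything here is an assumption for illustration: the `Result` type, the `ai_score` field (standing in for whatever a Binoculars or Pangram style detector would return), the threshold, and the penalty weight are hypothetical and not part of the project yet.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    relevance: float  # hypothetical relevance score, higher is better
    ai_score: float   # hypothetical detector output in [0, 1], higher = more likely AI generated


def rank_results(results, filter_ai=False, ai_threshold=0.8, ai_penalty=0.0):
    """Order results by relevance, optionally penalising or dropping likely AI content.

    filter_ai defaults to False, so nothing is removed unless the user opts in.
    ai_penalty lets the detector score influence ranking without hard filtering.
    """
    kept = [
        r for r in results
        if not (filter_ai and r.ai_score >= ai_threshold)
    ]
    return sorted(
        kept,
        key=lambda r: r.relevance - ai_penalty * r.ai_score,
        reverse=True,
    )
```

With the defaults, the function only sorts by relevance. Passing `filter_ai=True` drops results at or above the threshold, and a non-zero `ai_penalty` demotes suspected slop without hiding it.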
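Browser console snippet for collecting the links in the last column of the `tr.s-gr` rows of the perennial sources table: `$$(".perennial-sources tr.s-gr td:last-of-type a").map(a=>a.href)`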
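URLs collected this way, like the safelist and blocklist sources listed earlier, still have to be reduced to hostnames before they can be matched against crawled pages. The sketch below shows one possible approach using only the Python standard library. The function names are hypothetical, and the rule that the blocklist wins over the safelist is an assumption, not a decision the project has made.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase hostname without a leading 'www.', or None."""
    if "://" not in url:
        # Bare entries such as "minimal.gallery" have no scheme.
        url = "https://" + url
    host = urlsplit(url).hostname
    if not host:
        return None
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def build_host_set(urls):
    """Turn an iterable of URLs into a set of normalised hostnames."""
    return {h for h in map(host_of, urls) if h}


def is_allowed(url, safelist, blocklist):
    """Check a URL against both lists, matching the host and its parent domains.

    Assumption: a blocklist match always overrides a safelist match.
    """
    host = host_of(url)
    if host is None:
        return False
    parts = host.split(".")
    # "blog.example.com" -> {"blog.example.com", "example.com"}; the bare TLD is skipped.
    candidates = {".".join(parts[i:]) for i in range(len(parts) - 1)}
    if candidates & blocklist:
        return False
    return bool(candidates & safelist)
```

Matching on parent domains means a safelisted `example.com` also admits `blog.example.com`. That is probably too generous for shared hosts such as neocities or substack, where each subdomain is a separate site, so those would need to be listed per subdomain.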