├── requirements.txt
├── LICENSE.md
├── submit-dialog.webp
├── test
    └── test.txt
├── export.py
├── readme.md
└── sites.txt

/requirements.txt:
--------------------------------------------------------------------------------
1 | validators
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | I don't really care what you do with this data. Boil it, mash it, stick it in a stew.
2 |
--------------------------------------------------------------------------------
/submit-dialog.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0PandaDEV/submit-site-to-marginalia-search/master/submit-dialog.webp
--------------------------------------------------------------------------------
/test/test.txt:
--------------------------------------------------------------------------------
1 | # This file is for loading specific crawl sets into a test environment
2 | # for debugging the crawler. Its content may change over time.
3 |
4 | oddduck.neocities.org
5 | www.cambridgeclarion.org
6 |
--------------------------------------------------------------------------------
/export.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | import validators
4 |
5 | # Matches empty or whitespace-only lines.
6 | blank_line = re.compile(r'^\s*\n?$')
7 | # Raw string avoids invalid-escape warnings; '.' and a trailing '-' need no
8 | # escaping inside a character class, so the accepted characters are unchanged.
9 | domain_regex = re.compile(r'[a-zA-Z0-9.-]+')
10 |
11 | def has_domain_name_string(line):
12 |     # Prefix match only: the line must start with a domain-like character.
13 |     return domain_regex.match(line) is not None
14 |
15 | def has_url_string(line):
16 |     # validators.url() returns a falsy failure object for invalid input.
17 |     return bool(validators.url(line))
18 |
19 | def is_url(line):
20 |     # Skip blank lines and '#' comment lines.
21 |     if blank_line.match(line) or line.startswith('#'):
22 |         return False
23 |
24 |     return has_url_string(line) or has_domain_name_string(line)
25 |
26 |
27 | # sites.txt contains non-ASCII domain names, so read it as UTF-8 explicitly.
28 | with open('sites.txt', 'r', encoding='utf-8') as f:
29 |     stripped_lines = [line.strip(' \t\n') for line in f]
30 |
31 | sites = [line for line in stripped_lines if is_url(line)]
32 |
33 | for site in sites:
34 |     print(site)
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
1 | # Submit websites to be crawled by Marginalia Search
2 |
3 | There are a few options for submitting websites for crawling by Marginalia Search.
4 |
5 | ## Option A:
6 |
7 | If the website is already known, you can submit it directly from the search engine website.
8 |
9 | Search for your website on Marginalia Search using the search bar at the top of the page.
10 | Prefix the domain name with `site:`, like `site:yoursite.com`. Also try various subdomains, like `site:www.yoursite.com`.
11 | If it is already known by the search engine, a button will appear where you can add the site manually.
12 |
13 | ![](submit-dialog.webp)
14 |
15 | ## Option B:
16 |
17 | Fork this repository, add your site to the `sites.txt` file, and submit a pull request.
The site will be added to the search engine when the next crawl is run, which may take a month or more. Be patient.
18 |
19 | ## Option B 1/2:
20 |
21 | Create an issue in this repository with the URL of your website. I'll add it to the list when I have the time.
22 |
23 | Don't be alarmed if this takes a while. It doesn't really matter when sites are loaded into the search engine database, as long as it happens before the next crawling cycle begins.
24 |
25 | ## Option C:
26 |
27 | If you do not want to mess around with GitHub, you can also send me an email with the URL of your website. Email the URL to `contact@marginalia-search.com` and I'll add it to this list.
28 |
29 |
30 | # Why do I have to submit my website manually?
31 |
32 | This is to prevent abuse. Otherwise it would be possible to use the search engine's crawler essentially as a weapon to disrupt a website. Adding obstacles makes this much harder.
33 |
34 | The 2020s Internet is sadly very adversarial this way.
35 |
--------------------------------------------------------------------------------
/sites.txt:
--------------------------------------------------------------------------------
1 | # Add your sites here
2 | # Lines starting with '#' as well as blank lines are ignored
3 |
4 | # If your site listens on an unusual port, add it in the form
5 | #
6 | # https://www.example.com:8081/
7 | # else
8 | # www.example.com
9 | # ... is sufficient
10 |
11 | # Note: Feel free to add the domain anywhere in the list, preferably not at the beginning or the end,
12 | # as this reduces the likelihood of a conflict on GitHub and makes it easier to merge.
13 | 14 | # Sites will be fetched and crawled from this list 15 | 16 | # Sites To Add: 17 | farthergate.com 18 | mitatefe.bearblog.dev 19 | sofoundit.com 20 | ultracrepidarian.phfactor.net 21 | notfromhere.fyi 22 | www.johnwalker.nl 23 | llmlabs.net 24 | h.cjh0613.com 25 | trude.dev 26 | manualdousuario.net 27 | tuxilio.codeberg.page 28 | devintheshell.com 29 | secu.pages.dev 30 | robkohr.com 31 | propagandahat.com 32 | kaimac.org 33 | matth-ijs.nl 34 | matthijszwinderman.nl 35 | 1001ideas.org 36 | getpenny.app 37 | www.winstoncooke.com 38 | p-a-w.net 39 | posttext.pl 40 | firous.com 41 | ellacollier.com 42 | thatonecoder.codeberg.page 43 | immibis.com 44 | priro.pro 45 | gracekwak.me 46 | openwhen.site 47 | laihoconsulting.com 48 | blog.lovsund.eu 49 | robsplace.dev 50 | robsplace.thsite.top 51 | cogilabs.eu 52 | samfeldstein.xyz 53 | notebook.samfeldstein.xyz 54 | isought.info 55 | larsbarnabee.com 56 | teifiwoodworker.co.uk 57 | 10maurycy10.github.io 58 | www.ninjazumbi.com 59 | garrido.io 60 | www.eladnarra.com 61 | invariance.ceu.su 62 | nizzlay.com 63 | mrtno.com 64 | bb89.nl 65 | fragglejazz.nl 66 | blog.philz.dev 67 | philz.dev 68 | paper-review.net 69 | www.getitfree.nl 70 | discountdiskz.com 71 | borne.nl 72 | groenlinks.nl 73 | tubantia.nl 74 | bablr.org 75 | fsolt.es 76 | borneboeit.nl 77 | bornebeweegt.nl 78 | adminlibre.org 79 | beybladeanime.com 80 | www.cosmicbagel.com 81 | itsjake.me 82 | robinkay.uk 83 | blog.nawaz.org 84 | chinafake.wiki 85 | http://198.50.213.215:8141/ 86 | eugentoptic44.codeberg.page 87 | www.hackerspace.cl 88 | www.aylett.co.uk 89 | wronex.com 90 | kanoe.yuuko.eu 91 | phantastic.me 92 | italianpoetry.it 93 | omustardo.com 94 | chaotic.ninja 95 | bartekmuracki.com 96 | reggaenet.org 97 | libera.solucioneslibres.com 98 | ethanleitch.dev 99 | heonian.org 100 | serradas.org 101 | zewaren.net 102 | agney.dev 103 | blog.movimento-centrale.it 104 | www.stegriff.co.uk 105 | carlseleborg.com 106 | sugradh.com 107 | danielking.dev 108 | 
simpatico.io 109 | marcusb.org 110 | www.madhadron.com 111 | ports.serenityos.net 112 | mkr.pw 113 | costacoders.es 114 | neter.fi 115 | rukii.net 116 | covacha.net.ar 117 | galaxy.click 118 | justapa.thologi.st 119 | balt.sno.mba 120 | sno.mba 121 | teamakes.games 122 | kolesnikov.se 123 | elisioneffect.com 124 | tea.wtf 125 | red-white.neocities.org 126 | humanknowledge.neocities.org 127 | longesttextever.neocities.org 128 | wonderland.gay 129 | sylv.gay 130 | ozxy.xyz 131 | koerismo.digital 132 | stratasource.org 133 | momentum-mod.org 134 | blog.momentum-mod.org 135 | dawnfelstar.com 136 | alpha.shark.tobot.dev 137 | dev.shark.tobot.dev 138 | shark.tobot.dev 139 | tobot.dev 140 | paperclipmaximizer.net 141 | www.2uo.de 142 | spitlo.com 143 | blog.plain.technolog 144 | lunabee.space 145 | nthia.dev 146 | thenitropower.neocities.org 147 | gonna.surf 148 | kaiortlieb.com 149 | blog.nthia.dev 150 | git.nthia.dev 151 | tools.nthia.dev 152 | music.nthia.dev 153 | wolfgirl.dev 154 | phoki.neocities.org 155 | skylarhill.me 156 | gwenpri.me 157 | www.viblo.se 158 | blog.vanloo.ch 159 | rgort.nl 160 | cyberia.club 161 | capsul.org 162 | the-blackboard.org 163 | nullhex.com 164 | www.stavros.io 165 | krishadi.com 166 | nichobi.com 167 | kypath.sdf.org 168 | tourdeniu.vonneudeck.com 169 | zwieb.insomnia247.nl 170 | nhobb.com 171 | autisticasfxxk.com 172 | godsped.com 173 | blog.neuenet.com 174 | sglmr.com 175 | expara.social 176 | refereeingandreflection.wordpress.com 177 | www.quaddicted.com 178 | diyarciftci.xyz 179 | gooseandquill.blog 180 | disinfo.zone 181 | www.givefood.org.uk 182 | divination.disinfo.zone 183 | macroexpand.net 184 | gallery-mental.neocities.org 185 | robotsinplainenglish.com 186 | holzer.online 187 | laravista.altervista.org/L8x/searchdown 188 | www.bytefish.de 189 | bouncepaw.com 190 | jak2k.schwanenberg.name 191 | garden.bouncepaw.com 192 | links.bouncepaw.com 193 | evrim.zone 194 | mycorrhiza.wiki 195 | betula.mycorrhiza.wiki 196 | 
forwardemail.net 197 | bhankas.org 198 | xnacly.me 199 | pbement.com 200 | philosophyforchange.wordpress.com 201 | mylitjourney.wordpress.com 202 | dgroshev.com 203 | blog.waseigo.com 204 | wilkinson.graphics 205 | purserclub.com 206 | 0x85.org 207 | arunwadhwa.com 208 | srijan.ch 209 | ottomannews.com 210 | pub.colonq.computer 211 | andrejradovic.com 212 | kyudosudoku.timwi.de 213 | captainslog.me 214 | www.amity.city 215 | blog.reciperadar.com 216 | www.reciperadar.com 217 | hazybridge.com 218 | rowanbird779.nekoweb.org 219 | bawcast.com 220 | zadnyspe.ch 221 | jentak.co 222 | mrevil.asvachin.com 223 | verzettelung.com 224 | alexsci.com 225 | robalexdev.com 226 | www.getlocalcert.net 227 | grahameger.com 228 | regattapages.com 229 | benlacroix.com 230 | hidersout.com 231 | shapesinrealti.me 232 | www.bumps.cafe 233 | oddduck.neocities.org 234 | basspistol.com 235 | v.basspistol.org 236 | git.basspistol.org 237 | do.basspistol.org 238 | pathoplexus.org 239 | loculus.org 240 | nextstrain.org 241 | setto.basspistol.com 242 | paxnion.basspistol.com 243 | radio.basspistol.com 244 | txt.basspistol.org 245 | sakrecoer.com 246 | blackilykat.dev 247 | recursewithless.net 248 | comma.directory 249 | mccd.space 250 | mattstein.com 251 | offthehook.cc 252 | www.gross.sh 253 | segments.zhan.science 254 | chrisritchie.org 255 | phinjensen.com 256 | newsonaut.com 257 | hotgarba.ge 258 | hecko.my.to 259 | elerosv.pet 260 | dlants.me 261 | durakiconsulting.com 262 | deviltux.thedev.id 263 | www.billdietrich.me 264 | oracle333.blog.fc2.com 265 | www.mormoroi.com 266 | mormoroi.com 267 | jorisbukala.com 268 | wildflower.work 269 | rayberger.org 270 | ikesau.co 271 | fanty.xyz 272 | www.ziritione.org 273 | kaka.farm 274 | alexalejandre.com 275 | hackersphere.space 276 | tsxyz.site 277 | www.archaeoramblings.com 278 | grantlemons.com 279 | blog.burakcankus.com 280 | burakcankus.com 281 | btxx.org 282 | matthewfells.com 283 | voxel.wiki 284 | bauer.codes 285 | notes.bauer.codes 286 | 
lukaswerner.com 287 | jnthn.me 288 | darrennathanael.com 289 | blog.darrennathanael.com 290 | www.makeworld.space 291 | lovelydumpling.nekoweb.org 292 | mopacic.net 293 | mcpar.land 294 | graphicalmethods.com 295 | redterminal.org 296 | www.listenfaster.com 297 | cobycat.neocities.org 298 | fatbuffalo.neocities.org 299 | dvdnerd.neocities.org 300 | a-blue-in-a-sea-of-reds.neocities.org 301 | releases.bruta.link 302 | www.cerealously.net 303 | bionicle.gay 304 | www.nightsintodreams.com 305 | www.themanequest.com 306 | tanookisite.com 307 | ckhollidayplans.com 308 | frockflicks.com 309 | moth.monster 310 | asozial.org 311 | hugolee.xyz 312 | beefox.xyz 313 | informatik-hhx.pages.dev 314 | b-ring.nl 315 | rooiratel.red 316 | sacred.neocities.org 317 | linkpantry.com 318 | svilendobrev.com 319 | yso.blue 320 | volpeon.ink 321 | www.lexiqqq.com 322 | sanitarium.se 323 | ktibow.github.io 324 | jade.ellis.link 325 | jcd.pub 326 | luke.hsiao.dev 327 | www.opentierboy.com 328 | opentierboy.com 329 | ben.companjen.name 330 | thoughtsofmine.ca 331 | personal.thoughtsofmine.ca 332 | webring.thoughtsofmine.ca 333 | recomp.eco 334 | ideophone.org 335 | ideophone.org 336 | www.thomas-huehn.de 337 | www.thomas-huehn.com 338 | www.schoene-kinderbuecher.de 339 | remblanc.nekoweb.org 340 | dotbun.com 341 | arcades.agency 342 | cascading.space 343 | aajonus.net 344 | neilensperch.withtank.com 345 | agnieszka.dev 346 | runtimeterror.dev 347 | benstokman.me 348 | wiggle.monster 349 | cellio.org 350 | eugene-andrienko.com 351 | michaelwelford.com 352 | lifewaza.com 353 | cotcli.com 354 | bugwhisperer.dev 355 | aikidomontreal.com 356 | www.nepaliukhan.com 357 | www.nepalilekh.com 358 | jonathan-frere.com 359 | lanzani.nl 360 | blog.lanzani.nl 361 | johnskinnerportfolio.com 362 | jonblack.gg 363 | oddlypresent.com 364 | lmao.ovh 365 | whatdidyouexpect.eu 366 | moxk.net 367 | darren.me 368 | gautampk.com 369 | musicgames.wikidot.com 370 | bendun.cc 371 | newdegeneration.xyz 372 | 
blog.newdegeneration.xyz 373 | relint.de 374 | popnyc.org 375 | www.terracrypt.net 376 | lunareclipse.zone 377 | exilian.co.uk 378 | academia.fzrw.info 379 | crowoak.com 380 | middleagesinmoderngames.net 381 | medievalcaucasus.org 382 | www.cvennevik.no 383 | freesteamtables.com 384 | linuxpréinstallé.com 385 | thermodynamique.fr 386 | jacobfilipp.com 387 | birla.io 388 | kusoneko.moe 389 | www.kusoneko.moe 390 | ticonoce.xyz 391 | blog.kusoneko.moe 392 | git.kusoneko.moe 393 | redlibrary.info 394 | blog.nicolas-guruphat.com 395 | 512b.dev 396 | indigo.re 397 | vaclavzoubek.eu 398 | maxwellforbes.com 399 | max.gripe 400 | tinyweatherforecastgermanygroup.frama.io 401 | tinyweatherforecastgermanygroup.gitlab.io 402 | thedabbler.patatas.ca 403 | dylanjava.com 404 | zaclloyd.net 405 | realites-paralleles.com 406 | bellacoolaharbour.ca 407 | www.bellacoolafoodshed.org 408 | dziban.net 409 | janela.digital 410 | shiftingedges.com 411 | tchan.lol 412 | 1500chan.org 413 | chingchan.org 414 | monero.land 415 | lolifox.moe 416 | vastalauta.org 417 | bronnen.net 418 | ptchan.org 419 | usagi.reisen 420 | dospuntostr.es 421 | thefrogpond.org 422 | soygem.party 423 | saracean.com 424 | szymonnastaly.com 425 | mirohlichan.net 426 | alterchan.net 427 | staffas.org 428 | 4get.ch 429 | imstillthinking.net 430 | lunar.icu 431 | hbubli.cc 432 | ggtyler.dev 433 | zzls.xyz 434 | nadeko.net 435 | michaelkupietz.com 436 | mint.lgbt 437 | sudovanilla.org 438 | ducks.party 439 | cat-boop.com 440 | arunkd13.github.io 441 | www.cat-boop.com 442 | sijh.net 443 | neco.lol 444 | edmateo.site 445 | www.articulatecode.dev 446 | kohlchan.top 447 | dietchan.org 448 | vidlii.top 449 | vidlii.nu 450 | kirill.korins.ky 451 | dings.tech 452 | piturnah.xyz 453 | blog.passwordclass.xyz 454 | blog.strus.guru 455 | blog.moubou.com 456 | ajkprojects.com 457 | cneira.github.io 458 | janelaarquitetos.com.br 459 | unixdigest.com 460 | axxuy.xyz 461 | www.slatecave.net 462 | slatecave.net 463 | eggware.xyz 464 | 
blog.eggware.xyz 465 | boykisser.nl 466 | morgan.zoemp.be 467 | kenhv.com 468 | ermageeerd.neocities.org 469 | old.parts 470 | crtv.pages.dev 471 | mayas-archive.com 472 | superdarke.neocities.org 473 | litew.pages.dev 474 | www.bobek.cz 475 | tarotmancer.com 476 | digitaltibetan.github.io/DigitalTibetan 477 | theraa.net 478 | lancache.net 479 | uklans.net 480 | weirdwaves.net 481 | www.aketawi.space/home 482 | www.write-on.org 483 | www.eressea.de 484 | basi.nya.pub 485 | wiki.eressea.de 486 | www.enno.horse 487 | helenchong.dev 488 | leilukin.com 489 | www.write-on.orgster 490 | tinyglitch.net 491 | noahwbaldwin.me 492 | davejansen.com 493 | tgm.happyngreen.fr 494 | lgrando1.github.io 495 | locrian.zone 496 | kabirk.com 497 | maxbo.me 498 | parallel-experiments.github.io 499 | triptechwiki.github.io 500 | triptechwiki.neocities.org 501 | www.whysf.xyz 502 | bobbyhiltz.com 503 | blog.mbirth.uk 504 | dylanjuran.me 505 | ianreppel.org 506 | r0b.bearblog.dev 507 | tombrandis.uk 508 | blog.shr4pnel.com 509 | andrewpakhomov.com 510 | benchristel.com 511 | tonotes.com 512 | haborym.bearblog.dev 513 | shanefera.com 514 | alanmckay.blog 515 | www.bryan-kaperick.me 516 | terminal.pink 517 | blog.terminal.pink 518 | dovel.email 519 | knazarov.com 520 | jaytaylor.com 521 | brandonpittman.com 522 | sillon-fictionnel.club 523 | pvv.ntnu.no 524 | wiki.pvv.ntnu.no 525 | akshitgaur2005.github.io 526 | http://bluepixelcomputing.com:8141/ 527 | oxyn.org 528 | brainlane.bearblog.dev 529 | sanderium.codeberg.page 530 | # Please add new sites to a random line in the list, 531 | # not all at the end, doing this helps avoid tedious 532 | # merge conflicts 533 | --------------------------------------------------------------------------------
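The entry format described in the `sites.txt` header (blank lines and `#` comments are ignored, a bare domain is the default, and a full URL is used when the site listens on an unusual port) can be sketched with the standard library alone. This is an illustration under stated assumptions, not part of the repository: `parse_entries` and `sample` are hypothetical names, and how the crawler itself interprets each entry is not shown here.

```python
from urllib.parse import urlsplit


def parse_entries(lines):
    """Yield (host, port) for each entry; port is None when not given.

    Hypothetical helper for illustration only. It mirrors the rules in the
    sites.txt header comments, not the crawler's actual parsing.
    """
    for raw in lines:
        line = raw.strip()
        # Blank lines and '#' comment lines carry no entry.
        if not line or line.startswith("#"):
            continue
        # A bare domain has no scheme; prefixing '//' makes urlsplit treat
        # the text as a network location instead of a path.
        parts = urlsplit(line if "://" in line else "//" + line)
        yield parts.hostname, parts.port


# Sample lines taken from the forms shown in the sites.txt header.
sample = [
    "# Add your sites here",
    "",
    "www.example.com",
    "https://www.example.com:8081/",
]

for host, port in parse_entries(sample):
    print(host, port)
```

A helper like this could be used before opening a pull request to check that a new line parses to the host and port you expect.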