├── .github
    └── workflows
    │   └── main.yml
├── .gitignore
├── LICENSE.txt
├── MANIFEST.in
├── README.md
├── public_domains.py
├── release.sh
├── requirements.txt
├── screenshot.gif
├── setup.cfg
├── setup.py
├── test-data
    └── moby-dick.txt
└── test_public_domains.py

/.github/workflows/main.yml:
--------------------------------------------------------------------------------
name: tests
on:
  push:
    paths-ignore:
      - 'README.md'
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10"]
    steps:

    - uses: actions/checkout@v2
      with:
        submodules: recursive

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Test with pytest
      run: python setup.py test

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.eggs
Pipfile
*.log
public_domains.egg-info/
dist
build

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
Creative Commons Legal Code

CC0 1.0 Universal

    CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
    LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
    ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
    INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
    REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
    PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
    THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
    HEREUNDER.

Statement of Purpose

The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").

Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.

For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.

1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:

  i. the right to reproduce, adapt, distribute, perform, display,
     communicate, and translate a Work;
 ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
     likeness depicted in a Work;
 iv. rights protecting against unfair competition in regards to a Work,
     subject to the limitations in paragraph 4(a), below;
  v. rights protecting the extraction, dissemination, use and reuse of data
     in a Work;
 vi. database rights (such as those arising under Directive 96/9/EC of the
     European Parliament and of the Council of 11 March 1996 on the legal
     protection of databases, and under any national implementation
     thereof, including any amended or successor version of such
     directive); and
vii. other similar, equivalent or corresponding rights throughout the
     world based on applicable law or treaty, and any national
     implementations thereof.

2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.

3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.

4. Limitations and Disclaimers.

 a. No trademark or patent rights held by Affirmer are waived, abandoned,
    surrendered, licensed or otherwise affected by this document.
 b. Affirmer offers the Work as-is and makes no representations or
    warranties of any kind concerning the Work, express, implied,
    statutory or otherwise, including without limitation warranties of
    title, merchantability, fitness for a particular purpose, non
    infringement, or the absence of latent or other defects, accuracy, or
    the present or absence of errors, whether or not discoverable, all to
    the greatest extent permissible under applicable law.
 c. Affirmer disclaims responsibility for clearing rights of other persons
    that may apply to the Work or any use thereof, including without
    limitation any person's Copyright and Related Rights in the Work.
    Further, Affirmer disclaims responsibility for obtaining any necessary
    consents, permissions or other rights required for any use of the
    Work.
 d. Affirmer understands and acknowledges that Creative Commons is not a
    party to this document and has no duty or obligation with respect to
    this CC0 or use of the Work.

--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
include requirements.txt
include README.md

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# public_domains

[![Build Status](https://github.com/thisisparker/public_domains/workflows/tests/badge.svg)](https://github.com/thisisparker/public_domains/actions/workflows/main.yml)

*public_domains* searches through a text (such as a novel like [Moby Dick](https://www.gutenberg.org/files/2701/2701-0.txt)) and looks for possible host names to use. For more context about why such a thing was created see:

https://parkerhiggins.net/2022/11/public-sub-domains

## Install

```
$ pip3 install public_domains
```

## Use

Get a text, e.g. from [Project Gutenberg](https://www.gutenberg.org/):

```
wget https://www.gutenberg.org/files/2701/2701-0.txt
```

Run:

```bash
$ public_domains 2701-0.txt

tattooing.burned.like
fishermen.technically.call
violent.scraping.contact
irregular.between.here
verbally.developed.here
mizzen.rigging.like
eepeningly.contracted.like
tropic.whaling.life
trailing.behind.like
certain.fragmentary.parts
redundant.yellow.hair
personality.stands.here
wicked.miserable.world
...
```

Or if you prefer to do a "feeling lucky" search of gutenberg.org you can enter a title:

```bash
$ public_domains "Treasure Island"

famous.buccaneer.here
maroon.wriggling.like
keeping.better.watch
little.mountain.stream
admiral.benbow.black
breathing.loudly.like
breath.hanging.like
shirts.thrown.open
promontory.without.fail
feverish.unhealthy.spot
something.almost.like
following.important.news
lookout.shouted.land
spirits.eating.like
resumed.silver.here
stranded.beyond.help
trebly.worthless.life
including.checks.online
canvas.cracking.like
almost.entirely.exposed
flowers.ablowing.like
counting.hawkins.here
little.stranger.here
foliage.compact.like
strange.collection.like
seamen.aboard.here
creature.flitted.like
nighhand.fainting.doctor
crying.johnny.black
enough.little.place
```


If you'd like *public_domains* to check whether each domain is available, use the `--check` option.

--------------------------------------------------------------------------------
/public_domains.py:
--------------------------------------------------------------------------------
import os
import re
import sys
import nltk
import time
import tqdm
import whois
import string
import logging
import argparse
import requests

# this should be a no-op if it has already been run
nltk.download('punkt', quiet=True)

known_unavailable = [

    # these are known to not openly accept registrations
    'active',
    'amazon',
    'apple',
    'arte',
    'audi',
    'audible',
    'bank',
    'baseball',
    'basketball',
    'boots',
    'case',
    'dell',
    'drive',
    'fast',
    'fire',
    'fly',
    'gallo',
    'globo',
    'infiniti',
    'like',
    'museum',
    'natura',
    'origins',
    'post',
    'prime',
    'silk',
    'smile',
    'star',
    'va',
    'vana',
    'visa',
    'viva',
    'vivo',
    'weather',
    'windows',

    # these appear to be proposed but are not available yet
    'data',
    'latino',
    'mobile'
]

nic = whois.NICClient()

def main():
    args = get_args()
    logging.basicConfig(filename=args.log, level=logging.INFO)
    hosts = get_hosts(args.text_file, args.quiet)
    if args.check:
        hosts = available_hosts(hosts, args.quiet, args.sleep)
    for host in hosts:
        print(host)

def get_args():
    parser = argparse.ArgumentParser(
        prog="public_domains",
        description="Get possible hostnames from a text"
    )
    parser.add_argument("text_file", help="A text file to look for hostnames in or a title to look up in gutenberg.org")
    parser.add_argument("--check", action="store_true", help="Check if the domain is actually available")
    parser.add_argument("--quiet", action="store_true", help="Silence diagnostic messages on the console.")
    parser.add_argument("--sleep", type=float, default=0.5, help="Time to sleep between whois requests")
    parser.add_argument("--log", default="public_domains.log", help="Log file to write to.")
    return parser.parse_args()

def get_hosts(input_text, quiet=False):
    tlds = get_tlds()

    # if it's a file use it as input or look up the text in gutenberg
    if os.path.isfile(input_text):
        with open(input_text, 'r') as f:
            md = ' '.join([l.strip() for l in f.readlines()])
    else:
        md = gutenberg(input_text)
        if md is None:
            sys.exit('No Gutenberg results for "%s"' % input_text)

    md_sents = nltk.tokenize.sent_tokenize(md)

    possible_domains = set()

    for s in md_sents:
        wl = nltk.tokenize.word_tokenize(s)
        wl = [w.lower() for w in wl]
        wl = [''.join([c for c in w if c in string.ascii_lowercase]) for w in wl]
        wl = [w for w in wl if w]
        for i, w in enumerate(wl):
            if (i > 1 and w in tlds and len(w) > 3
                    and len(wl[i-1]) > 5 and len(wl[i-2]) > 5):
                full_domain = '.'.join([wl[i-2], wl[i-1], w])
                possible_domains.add(full_domain)

    return possible_domains

def get_tlds():
    tlds = []
    r = requests.get("https://data.iana.org/TLD/tlds-alpha-by-domain.txt")

    for d in r.text.splitlines():
        if d.startswith("#") or d.startswith('XN--'):
            continue
        d = d.lower()
        if d not in known_unavailable:
            tlds.append(d)
    return tlds

def available_hosts(hosts, quiet, sleep):
    av = []
    with tqdm.tqdm(disable=quiet, total=len(hosts)) as progress:
        for host in hosts:
            progress.update()
            progress.set_postfix_str("found %s" % len(av))
            if available(host) in [None, True]:
                av.append(host)
            time.sleep(sleep)
    return av

def available(hostname):
    domain = hostname.split('.', 1)[1]
    logging.info("whois look up for %s" % domain)
    try:
        response = nic.whois_lookup(None, domain, flags=0, quiet=True)
        entry = whois.WhoisEntry.load(domain, response)
        logging.info("got: %s" % entry)
        return entry['domain_name'] is None
    except whois.parser.PywhoisError as e:
        logging.warning("whois parse error: %s", e)
        return None

def gutenberg(title):
    params = {"query": title, "format": "json"}
    results = requests.get('https://www.gutenberg.org/ebooks/search/', params).json()

    if len(results) != 4 or len(results[3]) <= 1:
        logging.warning("No Gutenberg results for %s" % title)
        return None

    match = re.match(r'^/ebooks/(\d+)\.json$', results[3][1])
    if not match:
        logging.warning("Unexpected JSON from Gutenberg")
        # bail out here; match.group() below would raise on a failed match
        return None

    guten_id = match.group(1)
    url = "https://www.gutenberg.org/cache/epub/%s/pg%s.txt" % (guten_id, guten_id)

    logging.info("fetching text from %s" % url)
    return requests.get(url).text

if __name__ == "__main__":
    main()


--------------------------------------------------------------------------------
/release.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# release to pypi!

rm -rf dist
python3 setup.py sdist bdist_wheel
twine upload dist/*


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
nltk
tqdm
requests
python-whois

--------------------------------------------------------------------------------
/screenshot.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thisisparker/public_domains/5fb18ee92d2d45bf9c46d4939baf7f039b587bc6/screenshot.gif

--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
[aliases]
test=pytest

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
import sys
import setuptools

with open("README.md") as f:
    long_description = f.read()

with open("requirements.txt") as f:
    dependencies = f.read().split()

if __name__ == "__main__":
    setuptools.setup(
        name="public_domains",
        version="0.0.8",
        url="https://github.com/edsu/public_domains",
        author="Ed Summers",
        author_email="ehs@pobox.com",
        py_modules=["public_domains"],
        description="Find possible host names in a source text",
        long_description=long_description,
        long_description_content_type="text/markdown",
        license="MIT",
        classifiers=[
            "License :: OSI Approved :: MIT License",
        ],
        python_requires=">=2.7",
        install_requires=dependencies,
        setup_requires=["pytest-runner"],
        tests_require=[
            "pytest",
        ],
        entry_points={
            "console_scripts": [
                "public_domains = public_domains:main"
            ]
        },
    )

--------------------------------------------------------------------------------
/test_public_domains.py:
--------------------------------------------------------------------------------
import logging
import public_domains

logging.basicConfig(filename="test.log", level=logging.DEBUG)

def test_get_hosts():
    hosts = public_domains.get_hosts('test-data/moby-dick.txt', quiet=True)
    assert len(hosts) == 145
    assert "almost.looked.next" in hosts

def test_available():
    assert public_domains.available("this.parkerhiggins.net") == False
    assert public_domains.available("choo.choo.choo") == True

def test_get_tlds():
    tlds = public_domains.get_tlds()
    assert len(tlds) > 0
    assert "org" in tlds

def test_gutenberg():
    text = public_domains.gutenberg('origin of the species')
    assert len(text) > 0
    assert "I shall discuss the complex and little known laws" in text

--------------------------------------------------------------------------------
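
The core of `get_hosts` in `public_domains.py` is a three-word sliding window: whenever a word in a sentence is a usable TLD of at least four letters, the two preceding words (each at least six letters) become the other two labels. The following is a minimal sketch of that step in isolation, not the module's own code. It assumes a regex tokenizer in place of NLTK's `sent_tokenize` and `word_tokenize` (so apostrophes and abbreviations split differently), it takes the TLD set as an argument instead of downloading it from IANA, and the name `candidate_hosts` and its `min_label`/`min_tld` parameters are hypothetical.

```python
import re


def candidate_hosts(text, tlds, min_label=6, min_tld=4):
    """Return three-label host names found by sliding over each sentence.

    Hypothetical helper: a regex stands in for NLTK tokenization, and
    `tlds` is passed in rather than fetched over the network.
    """
    tlds = set(tlds)
    found = set()
    # Crude sentence split so a window never spans a sentence boundary.
    for sentence in re.split(r"[.!?]+", text):
        # Keep only runs of ASCII letters, lowercased.
        words = re.findall(r"[a-z]+", sentence.lower())
        for i in range(2, len(words)):
            tld = words[i]
            if (tld in tlds and len(tld) >= min_tld
                    and len(words[i - 1]) >= min_label
                    and len(words[i - 2]) >= min_label):
                found.add(".".join([words[i - 2], words[i - 1], tld]))
    return found


if __name__ == "__main__":
    sample = "It was a wicked miserable world. He kept a better watch."
    # "better watch" is rejected because the word before it ("a") is too short.
    print(sorted(candidate_hosts(sample, {"world", "watch", "like"})))
    # prints ['wicked.miserable.world']
```

Passing the TLD set in keeps the sketch testable offline; the thresholds mirror the `len(w) > 3` and `len(...) > 5` checks in `get_hosts`.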