├── README.txt
├── TODO.txt
├── browse.php
├── config.php
├── crawl.php
├── create-tables.sql
├── export.php
├── includes
│   ├── cookie.txt
│   ├── functions.php
│   └── mysql_functions.php
├── query.php
├── sitemap.php
└── stats.php


/README.txt:
--------------------------------------------------------------------------------
TO USE:

1. Edit config.php with the appropriate database and domain information
2. (for now) In phpMyAdmin, insert the seed URL into the urls table.
    * URL should be something like: www.fcc.gov/
    * URL should have a trailing slash
    * (for now) May also want to set clicks to '0' to avoid problems
3. Open crawl.php
4. (optional) Open stats.php to watch progress

TIPS:
Changes to php.ini
    1. Increase memory limit (1GB)
    2. Remove execution time limit
Changes to the MySQL config (my.cnf / my.ini)
    * Increase max query size (max_allowed_packet) to avoid the "MySQL server has gone away" error

Additional documentation (source code) in (/source)


--------------------------------------------------------------------------------
/TODO.txt:
--------------------------------------------------------------------------------
# TO DO

- Review/improve handling of confirmed in/out domains and also silly links like ?font=large
- Check that all queries for urls include the crawl_tag


--------------------------------------------------------------------------------
/browse.php:
--------------------------------------------------------------------------------

0) { ?>

No Links on this page

--------------------------------------------------------------------------------
/config.php:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
/crawl.php:
--------------------------------------------------------------------------------
echo "STARTED: " . date('Y-m-d H:i:s');
echo "Domains: $domains";
echo "crawl_tag: $crawl_tag";
echo "database: $mysql_db";
echo "Crawling...";

/*
 * Grab list of uncrawled URLs, repeat while there are still URLs to crawl
 */
while ($urls = uncrawled_urls($crawl_tag)) {

    /**
     * Loop through the array of uncrawled URLs
     */
    foreach ($urls as $id => $url_data) {

        /**
         * If we're in debug mode, indicate that we are beginning to crawl a new URL
         */
        if (isset($_GET['debug']))
            echo "Starting to crawl " . urldecode($url_data['url']) . "