├── .gitignore
├── README.md
├── bin
    ├── index_lookup_local
    ├── index_lookup_remote
    └── remote_copy
├── docs
    ├── data_block.png
    ├── file_format.graffle
    │   ├── data.plist
    │   └── image1.pdf
    ├── file_overview.png
    ├── header.png
    ├── index_block.png
    ├── project_header.png
    └── tree.png
└── lib
    ├── __init__.py
    ├── adaptor.py
    ├── pbtree.py
    ├── prefix.py
    ├── sorted_urls
    ├── test.py
    ├── test_map.py
    └── test_pbtree.py

/.gitignore:
--------------------------------------------------------------------------------
*.py[co]
.DS_Store

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![Trivio Logo](/docs/project_header.png?raw=true)

==================

The Common Crawl data is composed of billions of pages randomly crawled from the internet. Each page is archived with thousands of other pages into an archive file, commonly known as an ARC file. There are hundreds of thousands of these archives, and the pages are distributed essentially at random, unordered, across them.


Without an index, even if you know a page's URL, you're forced to download, uncompress and scan each of the archives until you locate the pages you're interested in.

Using the index described here, you can find all the archive files that contain pages for a given URL prefix, subdomain or top-level domain with no more than 3 small network requests. There is also a utility script, bin/remote_copy
(documented at the bottom of this readme), which uses the index to copy all the webpages from a specified list of domains
to an S3 location of your choosing.


Challenges
---------
To understand the design of the index, let's first look at the challenges unique to the Common Crawl Project.

Specifically:


1. It's huge
    * 5 billion unique URLs
    * Average URL is 66 bytes
    * A pointer to an individual page requires an additional 32 bytes
      (see file format for details)

    Just storing this information uncompressed requires a file of
    roughly 490 GB (5 x 10^9 x (66 + 32) bytes).

2. The size alone prevents constructing the entire index in memory

3. Or even on one modest machine

4. There's a large community with a variety of tools interested in accessing the index.

5. Common Crawl is a non-profit, so it's especially important to keep processing and hosting costs down while ensuring the data is highly available.

Since 2012, the Common Crawl data has been hosted for free by Amazon on the Amazon Public Data Sets. The Amazon Public Data Set program is a boon to everyone who uses data - it lowers the overhead costs for organizations that want to share data, lowers the cost for users to download open data, and makes valuable data sets more discoverable.

Putting this all together leads to the following goals.


Goals:
------

* Store and share the entire index from S3 without any other services.
* The entire index should be usable without downloading the entire thing.
* The number of network calls needed should be minimized.
* You can search for any URL, URL prefix, subdomain or top-level domain.
* Once you've determined the approximate location of the URL you're
  interested in, you can sequentially read other similar URLs.
* It should be easy to access the index from any programming language. *This is the main reason we opted to roll our own format rather than rely on a third-party library.*


File Format
------------

The file format is based on a [Prefixed B-tree](http://ict.pue.udlap.mx/people/carlos/is215/papers/p11-bayer.pdf).
The preceding link takes you to a paper that gives an in-depth overview of this data structure; hopefully, though, the information provided here will be enough for you to use the index without reading the whole paper.

Conceptually the index is organized into a B+ tree like this.

![Tree](/docs/tree.png?raw=true)

To access any given URL in the index, you start by reading the root block in the tree and then follow the pointers to zero or more other index blocks and finally to a data block. The URLs in the data block are stored in lexicographic order, so a URL of `http://example.com/page1.html` will come before `http://example.com/page2.html`. Because of this property you
can find all the pages that share a common prefix by sequentially reading
each URL in the data portion of the file.


###File overview
The entire index plus data are stored in one file that has 3 major parts: **the header**, **index blocks** and **data blocks**, as depicted below.

![File Overview](/docs/file_overview.png?raw=true)


###Header
The header is exactly 8 bytes long. The first 4 bytes represent the **block size** used in the file and the second 4 bytes represent the number of index blocks, or **block count**, contained in the file.

All numbers are encoded in little-endian order.

![Header](/docs/header.png?raw=true)

Once you have the block size and block count you can randomly access any block by following the instructions in the Operations section of this guide.

Interpreting a block depends on whether it is an **index block** or a **data block**.


###Index block
Any block number that is less than the block count in the header is considered an index block.
It is interpreted as follows:

![Index Block](/docs/index_block.png?raw=true)

An index block always starts with a 4 byte `<block number>`, followed by one or more `<prefix> <null> <block number>` triplets, until the next triplet cannot fit within the block, at which point the block is padded with additional `<null>` bytes.

As a result, should you ever encounter a `<null>` immediately after reading a `<block number>`, you know you've reached the end of the block.

See the section on searching the index to see how to use the index blocks to find the first data block appropriate for your search.

###Data block

Data blocks consist of one or more `<url> <null> <location pointer>` sequences, each collectively known as an **item**. The number of items in a block varies with the length of the URLs.

Just like index blocks, data blocks are padded with `<null>` bytes when, during construction, the next item cannot fit within the given block size.

![Data Block](/docs/data_block.png?raw=true)


A `<url>` consists of one or more characters terminated by the null byte.

The location pointer is 32 bytes long and can be interpreted as follows. The first 8 bytes represent the segment id, the next 8 bytes represent the ARC file creation date, followed by 4 bytes that represent the ARC file partition, followed by 8 bytes that represent the offset within the ARC file, and
finally the last 4 bytes represent the size of the compressed data stored inside the ARC file.

See the section on retrieving a page to put all this information together to access a page.


Operations
----------

###Read a block

Once you've read the header you can use this information to randomly read any block given its block number. Keep in mind that the header is exactly 8 bytes long.
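
The header and item layouts described above can be decoded with Python's `struct` module. The following is only a sketch under the stated layout (little-endian, a 4+4 byte header, and an 8+8+4+8+4 byte location pointer); the names `parse_header`, `parse_data_block` and `Location` are made up for this illustration and are not part of the repository.

```python
import struct
from collections import namedtuple

# Assumed layouts, taken from the field sizes described in this README.
HEADER = struct.Struct("<II")       # block size, block count
POINTER = struct.Struct("<QQIQI")   # 8 + 8 + 4 + 8 + 4 = 32 bytes

# Hypothetical field names for the five numbers in a location pointer.
Location = namedtuple(
    "Location", "segment_id file_date partition offset compressed_size"
)


def parse_header(raw):
    """Return (block_size, block_count) from the first 8 bytes of the file."""
    return HEADER.unpack(raw[:HEADER.size])


def parse_data_block(block):
    """Yield (url, Location) items until the null padding is reached."""
    pos = 0
    while pos < len(block):
        end = block.find(b"\x00", pos)
        # An empty url (a null right at pos) means we have hit the padding.
        if end == -1 or end == pos:
            break
        if end + 1 + POINTER.size > len(block):
            break  # truncated item, stop rather than misread
        url = block[pos:end]
        yield url, Location(*POINTER.unpack_from(block, end + 1))
        pos = end + 1 + POINTER.size
```

Each yielded `Location` carries the five numbers needed later to locate the page inside an ARC file.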

First determine the offset of the block in the file

```
block offset = (block number * block size) + header size
```

Then read one block's worth of bytes starting at the *block offset*.

For example, imagine the block size is 65536 bytes and you want to read block number 2. You'd first calculate the block offset, which is (2 x 65536) + 8 = 131080. Then you would read the block.

If you were making an HTTP request to S3 this would be done using the HTTP Range header as follows (the end of the range is inclusive)

```
Range: bytes=131080-196615
```

If you have downloaded the index you can simply seek to position 131080 in the file and then read the next 65536 bytes.

###Jumping directly to the start of the data block

Any block number that is greater than or equal to the index block count in the header is a data block. You can skip over all the index blocks to the very first data block by following the instructions in Read a block and using the block count as the block number.

###Searching the Index

To find a URL in the index, or to find all the URLs that start with a common prefix (we'll call this the "target"), start by reading block 0, also known as the **root block**.

1. Read through each prefix found in the index block until you find the first prefix that is lexicographically greater than or equal to the given target.

2. The next block number you need to read is the number that was found immediately to the left of the prefix that was greater than or equal to your target. If your target is bigger than every prefix in the block, use the last block number in the block.

3. Repeat steps 1 and 2 until the block number is greater than or equal to the block count that is stored in the header. You've now found the first data block that could possibly contain your target.
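
The offset arithmetic and the descent through the index blocks can be sketched as below. This is an illustration under the layout described in this README, not the code in `lib/pbtree.py`; `read_block` is a hypothetical callable that returns the raw bytes of a block (for example via an HTTP Range request or a local seek and read).

```python
import struct

HEADER_SIZE = 8  # the header is exactly 8 bytes long


def block_offset(block_number, block_size):
    return block_number * block_size + HEADER_SIZE


def range_header(block_number, block_size):
    # HTTP ranges are inclusive, so the last byte is start + block_size - 1.
    start = block_offset(block_number, block_size)
    return "bytes={}-{}".format(start, start + block_size - 1)


def next_block(index_block, target):
    """Apply steps 1 and 2 to a single index block."""
    (child,) = struct.unpack_from("<I", index_block, 0)
    pos = 4
    # A null right after a block number marks the start of the padding.
    while pos < len(index_block) and index_block[pos:pos + 1] != b"\x00":
        end = index_block.find(b"\x00", pos)
        if end == -1 or end + 5 > len(index_block):
            break  # truncated triplet
        if index_block[pos:end] >= target:
            return child  # block number to the left of this prefix
        (child,) = struct.unpack_from("<I", index_block, end + 1)
        pos = end + 5
    return child  # target is bigger than every prefix


def find_data_block(read_block, block_count, target):
    """Follow index blocks from the root until a data block is reached."""
    number = 0
    while number < block_count:
        number = next_block(read_block(number), target)
    return number
```

With a block size of 65536, `range_header(2, 65536)` gives the byte range for block number 2, and `find_data_block` returns the number of the first data block that could contain `target` (passed as bytes).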


Once you've found the first data block:

1. Retrieve the block with the given block number
2. Read all characters up to the first `<null>` byte
3. Read the 32 byte location pointer
4. Repeat steps 2-3 as long as the URL is less than the target and you have not reached the end of the block. If you reach the end of the block without finding the target, it is not in the index.
5. Now read each item out of the data block until an item's URL no longer starts with the target string.


###Retrieving a page

The location pointer represents 5 numbers:

* segment id
* file date
* partition
* file offset
* compressed size

Using the first 3 numbers you can construct the URL of the ARC file that contains the page you are interested in.

```
s3://aws-publicdatasets/common-crawl/parse-output/segment/[segment id]/[file date]_[partition].arc.gz
```

The file offset and compressed size can be used to fetch the compressed chunk from the ARC file, without
downloading the entire ARC file.

Here is an example using the boto library in Python to retrieve and decompress the chunk.



```python
from gzip import GzipFile
from io import BytesIO


def arc_file(s3, bucket, info):
    # s3 is a boto S3 connection; info holds the fields of a location pointer
    bucket = s3.lookup(bucket)
    keyname = "/common-crawl/parse-output/segment/{arcSourceSegmentId}/{arcFileDate}_{arcFileParition}.arc.gz".format(**info)
    key = bucket.lookup(keyname)

    # Request only the bytes of the compressed record (the range is inclusive)
    start = info['arcFileOffset']
    end = start + info['compressedSize'] - 1

    headers = {'Range': 'bytes={}-{}'.format(start, end)}

    chunk = BytesIO(
        key.get_contents_as_string(headers=headers)
    )

    return GzipFile(fileobj=chunk).read()
```

### Using the remote_copy utility script

bin/remote_copy is a utility script which takes a domain or comma-separated list of domains,
uses the index to find the webpages in the crawl belonging to the given domains,
downloads them (fetching just the bytes corresponding to the webpages from the 100 MB segment files),
and reuploads them to a specified S3 location. It uses Python multiprocessing to make several
requests in parallel. As this still takes a lot of bandwidth, it is recommended that you run it
on an m1.xlarge EC2 instance for a fast connection to S3.

The "check" command will give stats such as number of webpages and total file size for a list of domains.
The "copy" command will download the webpages from aws-publicdatasets and reupload them to a specified S3 location.

Example usage:

    chmod +x bin/remote_copy
    export AWS_ACCESS_KEY=
    export AWS_SECRET_KEY=
    bin/remote_copy check "com.nytimes.blogs.fivethirtyeight, com.nytimes.blogs.thecaucus"
    bin/remote_copy copy "com.nytimes.blogs.fivethirtyeight, com.nytimes.blogs.thecaucus" --bucket your-output-bucket --key common_crawl/blogs_crawl --parallel 4

--------------------------------------------------------------------------------
/bin/index_lookup_local:
--------------------------------------------------------------------------------
#!/usr/bin/env python

# Copyright [2012] [Triv.io, Scott Robertson]
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sys
import mmap

from os.path import join, dirname
sys.path.append(join(dirname(__file__), '..'))

from lib.pbtree import PBTreeDictReader


def main():

    stream = open(sys.argv[1], 'r+')
    stream = mmap.mmap(stream.fileno(),0)
    reader = PBTreeDictReader(
        stream,
        value_format="
        ',
        # '',
    )

    reader = PBTreeDictReader(
        mmap,
        value_format="

[Diagram source: OmniGraffle document (com.omnigroup.OmniGrafflePro 139.16.0.171715) created 2012-11-13 by Scott Robertson; drawing data not reproduced.]
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1310 | 1311 | \f0\b\fs24 \cf0 prefix 2} 1312 | 1313 | 1314 | 1315 | Bounds 1316 | {{249.5, 457.57662963867188}, {29, 35}} 1317 | Class 1318 | ShapedGraphic 1319 | ID 1320 | 37801 1321 | Shape 1322 | Rectangle 1323 | Text 1324 | 1325 | Text 1326 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1327 | {\fonttbl\f0\fnil\fcharset222 Ayuthaya;} 1328 | {\colortbl;\red255\green255\blue255;} 1329 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1330 | 1331 | \f0\fs28 \cf0 0} 1332 | 1333 | 1334 | 1335 | Bounds 1336 | {{278.5, 457.57662963867188}, {29, 35}} 1337 | Class 1338 | ShapedGraphic 1339 | ID 1340 | 37802 1341 | Shape 1342 | Rectangle 1343 | Text 1344 | 1345 | Text 1346 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1347 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1348 | {\colortbl;\red255\green255\blue255;} 1349 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1350 | 1351 | \f0\b\fs24 \cf0 3} 1352 | 1353 | 1354 | 1355 | Bounds 1356 | {{307.5, 457.57662963867188}, {72, 35}} 1357 | Class 1358 | ShapedGraphic 1359 | ID 1360 | 37803 1361 | Shape 1362 | Rectangle 1363 | Text 1364 | 1365 | Text 1366 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1367 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1368 | {\colortbl;\red255\green255\blue255;} 1369 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1370 | 1371 | \f0\b\fs24 \cf0 prefix 3} 1372 | 1373 | 1374 | 1375 | Bounds 1376 | {{379.5, 457.57662963867188}, {29, 35}} 1377 | Class 1378 | ShapedGraphic 1379 | ID 1380 | 37804 1381 | Shape 1382 | Rectangle 1383 | Text 1384 | 1385 | Text 1386 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1387 | {\fonttbl\f0\fnil\fcharset222 Ayuthaya;} 1388 | 
{\colortbl;\red255\green255\blue255;} 1389 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1390 | 1391 | \f0\fs28 \cf0 0} 1392 | 1393 | 1394 | 1395 | Bounds 1396 | {{408.5, 457.57662963867188}, {29, 35}} 1397 | Class 1398 | ShapedGraphic 1399 | ID 1400 | 37805 1401 | Shape 1402 | Rectangle 1403 | Text 1404 | 1405 | Text 1406 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1407 | {\fonttbl\f0\fnil\fcharset222 Ayuthaya;} 1408 | {\colortbl;\red255\green255\blue255;} 1409 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1410 | 1411 | \f0\fs28 \cf0 4} 1412 | 1413 | 1414 | 1415 | GridV 1416 | 1417 | 37796 1418 | 37795 1419 | 37797 1420 | 37799 1421 | 37800 1422 | 37801 1423 | 37802 1424 | 37803 1425 | 37804 1426 | 37805 1427 | 1428 | 1429 | ID 1430 | 37794 1431 | 1432 | 1433 | Bounds 1434 | {{19, 447}, {543, 58}} 1435 | Class 1436 | ShapedGraphic 1437 | ID 1438 | 37798 1439 | Shape 1440 | Rectangle 1441 | Style 1442 | 1443 | stroke 1444 | 1445 | Pattern 1446 | 4 1447 | 1448 | 1449 | Text 1450 | 1451 | Align 1452 | 2 1453 | Text 1454 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1455 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1456 | {\colortbl;\red255\green255\blue255;} 1457 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qr 1458 | 1459 | \f0\b\fs24 \cf0 index block} 1460 | 1461 | TextPlacement 1462 | 2 1463 | 1464 | 1465 | Bounds 1466 | {{339, 215.94891357421875}, {72, 15}} 1467 | Class 1468 | ShapedGraphic 1469 | FitText 1470 | YES 1471 | Flow 1472 | Resize 1473 | FontInfo 1474 | 1475 | Color 1476 | 1477 | w 1478 | 0 1479 | 1480 | Font 1481 | LucidaGrande-Bold 1482 | Size 1483 | 12 1484 | 1485 | ID 1486 | 37791 1487 | Shape 1488 | Rectangle 1489 | Style 1490 | 1491 | fill 1492 | 1493 | Draws 1494 | NO 1495 | 1496 | shadow 1497 | 1498 | Draws 1499 | NO 1500 | 1501 | stroke 1502 | 
1503 | Draws 1504 | NO 1505 | 1506 | 1507 | Text 1508 | 1509 | Align 1510 | 2 1511 | Pad 1512 | 0 1513 | Text 1514 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1515 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1516 | {\colortbl;\red255\green255\blue255;} 1517 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qr 1518 | 1519 | \f0\b\fs24 \cf0 data blocks} 1520 | VerticalPad 1521 | 0 1522 | 1523 | Wrap 1524 | NO 1525 | 1526 | 1527 | Bounds 1528 | {{133.25, 76.423347473144531}, {79, 15}} 1529 | Class 1530 | ShapedGraphic 1531 | FitText 1532 | YES 1533 | Flow 1534 | Resize 1535 | FontInfo 1536 | 1537 | Color 1538 | 1539 | w 1540 | 0 1541 | 1542 | Font 1543 | LucidaGrande-Bold 1544 | Size 1545 | 12 1546 | 1547 | ID 1548 | 146 1549 | Shape 1550 | Rectangle 1551 | Style 1552 | 1553 | fill 1554 | 1555 | Draws 1556 | NO 1557 | 1558 | shadow 1559 | 1560 | Draws 1561 | NO 1562 | 1563 | stroke 1564 | 1565 | Draws 1566 | NO 1567 | 1568 | 1569 | Text 1570 | 1571 | Align 1572 | 2 1573 | Pad 1574 | 0 1575 | Text 1576 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1577 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1578 | {\colortbl;\red255\green255\blue255;} 1579 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qr 1580 | 1581 | \f0\b\fs24 \cf0 index blocks} 1582 | VerticalPad 1583 | 0 1584 | 1585 | Wrap 1586 | NO 1587 | 1588 | 1589 | Bounds 1590 | {{106.5, 99.948900000000009}, {132.5, 22}} 1591 | Class 1592 | ShapedGraphic 1593 | ID 1594 | 37789 1595 | ImageID 1596 | 1 1597 | Shape 1598 | Rectangle 1599 | Style 1600 | 1601 | fill 1602 | 1603 | Draws 1604 | NO 1605 | 1606 | shadow 1607 | 1608 | Draws 1609 | NO 1610 | 1611 | stroke 1612 | 1613 | Draws 1614 | NO 1615 | 1616 | 1617 | VFlip 1618 | YES 1619 | 1620 | 1621 | Class 1622 | TableGroup 1623 | Graphics 1624 | 1625 | 1626 | Bounds 1627 | {{30, 130.47445678710938}, {58, 35}} 1628 | Class 1629 | 
ShapedGraphic 1630 | ID 1631 | 37783 1632 | Shape 1633 | Rectangle 1634 | Text 1635 | 1636 | Text 1637 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1638 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1639 | {\colortbl;\red255\green255\blue255;} 1640 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1641 | 1642 | \f0\b\fs24 \cf0 header} 1643 | 1644 | 1645 | 1646 | Bounds 1647 | {{88, 130.47445678710938}, {79, 35}} 1648 | Class 1649 | ShapedGraphic 1650 | ID 1651 | 37784 1652 | Shape 1653 | Rectangle 1654 | Text 1655 | 1656 | Text 1657 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1658 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1659 | {\colortbl;\red255\green255\blue255;} 1660 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1661 | 1662 | \f0\b\fs24 \cf0 block 1} 1663 | 1664 | 1665 | 1666 | Bounds 1667 | {{167, 130.47445678710938}, {79, 35}} 1668 | Class 1669 | ShapedGraphic 1670 | ID 1671 | 37785 1672 | Shape 1673 | Rectangle 1674 | Text 1675 | 1676 | Text 1677 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1678 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1679 | {\colortbl;\red255\green255\blue255;} 1680 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1681 | 1682 | \f0\b\fs24 \cf0 block 2} 1683 | 1684 | 1685 | 1686 | Bounds 1687 | {{246, 130.47445678710938}, {79, 35}} 1688 | Class 1689 | ShapedGraphic 1690 | ID 1691 | 37786 1692 | Shape 1693 | Rectangle 1694 | Text 1695 | 1696 | Text 1697 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1698 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1699 | {\colortbl;\red255\green255\blue255;} 1700 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1701 | 1702 | \f0\b\fs24 \cf0 block 3} 1703 | 1704 | 1705 | 1706 | Bounds 1707 | {{325, 130.47445678710938}, {79, 35}} 1708 
| Class 1709 | ShapedGraphic 1710 | ID 1711 | 37787 1712 | Shape 1713 | Rectangle 1714 | Text 1715 | 1716 | Text 1717 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1718 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1719 | {\colortbl;\red255\green255\blue255;} 1720 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1721 | 1722 | \f0\b\fs24 \cf0 ...} 1723 | 1724 | 1725 | 1726 | Bounds 1727 | {{404, 130.47445678710938}, {79, 35}} 1728 | Class 1729 | ShapedGraphic 1730 | ID 1731 | 37788 1732 | Shape 1733 | Rectangle 1734 | Text 1735 | 1736 | Text 1737 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1738 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1739 | {\colortbl;\red255\green255\blue255;} 1740 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1741 | 1742 | \f0\b\fs24 \cf0 block n} 1743 | 1744 | 1745 | 1746 | GridV 1747 | 1748 | 37783 1749 | 37784 1750 | 37785 1751 | 37786 1752 | 37787 1753 | 37788 1754 | 1755 | 1756 | ID 1757 | 37781 1758 | 1759 | 1760 | Bounds 1761 | {{262, 174.00001335144043}, {216, 41.948900000000002}} 1762 | Class 1763 | ShapedGraphic 1764 | ID 1765 | 37780 1766 | ImageID 1767 | 1 1768 | Shape 1769 | Rectangle 1770 | Style 1771 | 1772 | fill 1773 | 1774 | Draws 1775 | NO 1776 | 1777 | shadow 1778 | 1779 | Draws 1780 | NO 1781 | 1782 | stroke 1783 | 1784 | Draws 1785 | NO 1786 | 1787 | 1788 | 1789 | 1790 | Class 1791 | TableGroup 1792 | Graphics 1793 | 1794 | 1795 | Bounds 1796 | {{31.5, 314}, {109, 35}} 1797 | Class 1798 | ShapedGraphic 1799 | ID 1800 | 211 1801 | Shape 1802 | Rectangle 1803 | Text 1804 | 1805 | Text 1806 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1807 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1808 | {\colortbl;\red255\green255\blue255;} 1809 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1810 | 1811 | \f0\b\fs24 \cf0 block size} 
1812 | 1813 | 1814 | 1815 | Bounds 1816 | {{140.5, 314}, {109, 35}} 1817 | Class 1818 | ShapedGraphic 1819 | ID 1820 | 222 1821 | Shape 1822 | Rectangle 1823 | Text 1824 | 1825 | Text 1826 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1827 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1828 | {\colortbl;\red255\green255\blue255;} 1829 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qc 1830 | 1831 | \f0\b\fs24 \cf0 block count} 1832 | 1833 | 1834 | 1835 | GridV 1836 | 1837 | 211 1838 | 222 1839 | 1840 | 1841 | ID 1842 | 209 1843 | 1844 | 1845 | Bounds 1846 | {{25, 309}, {231, 68}} 1847 | Class 1848 | ShapedGraphic 1849 | ID 1850 | 37790 1851 | Shape 1852 | Rectangle 1853 | Style 1854 | 1855 | stroke 1856 | 1857 | Pattern 1858 | 4 1859 | 1860 | 1861 | Text 1862 | 1863 | Align 1864 | 2 1865 | Text 1866 | {\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf510 1867 | {\fonttbl\f0\fnil\fcharset0 LucidaGrande;} 1868 | {\colortbl;\red255\green255\blue255;} 1869 | \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\qr 1870 | 1871 | \f0\b\fs24 \cf0 header} 1872 | 1873 | TextPlacement 1874 | 2 1875 | 1876 | 1877 | GridInfo 1878 | 1879 | GuidesLocked 1880 | NO 1881 | GuidesVisible 1882 | YES 1883 | HPages 1884 | 1 1885 | ImageCounter 1886 | 2 1887 | ImageLinkBack 1888 | 1889 | 1890 | ApplicationURL 1891 | http://www.omnigroup.com/applications/OmniGraffle 1892 | appData 1893 | 1894 | Color 1895 | 1896 | w 1897 | 1 1898 | 1899 | DocumentSettings 1900 | 1901 | ApplicationVersion 1902 | 1903 | com.omnigroup.OmniGrafflePro 1904 | 137.11.0.108132 1905 | 1906 | FileName 1907 | Base Wireframe Kit.gstencil 1908 | GraphDocumentVersion 1909 | 6 1910 | ModelCount 1911 | 1 1912 | ModelIndex 1913 | 0 1914 | ModificationDate 1915 | 2009-03-06 10:18:30 -0500 1916 | Modifier 1917 | Michael Angeles 1918 | SheetTitle 1919 | Canvas 1 1920 | 1921 | GraphicsList 1922 | 1923 | 1924 | Class 
1925 | LineGraphic 1926 | ControlPoints 1927 | 1928 | {0.331787, 0.0552979} 1929 | {-80, -1} 1930 | {31, 0} 1931 | {0, 0} 1932 | {0, 0} 1933 | {-30, 1} 1934 | {79, 0} 1935 | {-0.231934, 0.712708} 1936 | 1937 | ID 1938 | 37779 1939 | LayerIndex 1940 | 0 1941 | Points 1942 | 1943 | {301.717, 414.589} 1944 | {381.717, 435.589} 1945 | {421.717, 453.589} 1946 | {463.717, 434.589} 1947 | {542.717, 414.589} 1948 | 1949 | Style 1950 | 1951 | stroke 1952 | 1953 | Bezier 1954 | 1955 | Color 1956 | 1957 | b 1958 | 0.998718 1959 | g 1960 | 0.781033 1961 | r 1962 | 0.424323 1963 | 1964 | HeadArrow 1965 | 0 1966 | LineType 1967 | 1 1968 | TailArrow 1969 | 0 1970 | Width 1971 | 2 1972 | 1973 | 1974 | 1975 | 1976 | Layers 1977 | 1978 | 1979 | Lock 1980 | NO 1981 | Name 1982 | Layer 1 1983 | Print 1984 | YES 1985 | View 1986 | YES 1987 | 1988 | 1989 | ZoomLevel 1990 | 1 1991 | 1992 | bundleId 1993 | com.omnigroup.OmniGrafflePro 1994 | refresh 1995 | 0.0 1996 | serverAppName 1997 | OmniGraffle 1998 | serverName 1999 | OmniGraffle 2000 | version 2001 | A 2002 | 2003 | 2004 | ImageList 2005 | 2006 | image1.pdf 2007 | 2008 | KeepToScale 2009 | 2010 | Layers 2011 | 2012 | 2013 | Lock 2014 | NO 2015 | Name 2016 | Layer 1 2017 | Print 2018 | YES 2019 | View 2020 | YES 2021 | 2022 | 2023 | LayoutInfo 2024 | 2025 | Animate 2026 | NO 2027 | circoMinDist 2028 | 18 2029 | circoSeparation 2030 | 0.0 2031 | layoutEngine 2032 | dot 2033 | neatoSeparation 2034 | 0.0 2035 | twopiSeparation 2036 | 0.0 2037 | 2038 | LinksVisible 2039 | NO 2040 | MagnetsVisible 2041 | NO 2042 | MasterSheets 2043 | 2044 | ModificationDate 2045 | 2012-11-13 18:45:51 +0000 2046 | Modifier 2047 | Scott Robertson 2048 | NotesVisible 2049 | NO 2050 | Orientation 2051 | 2 2052 | OriginVisible 2053 | NO 2054 | PageBreaks 2055 | YES 2056 | PrintInfo 2057 | 2058 | NSBottomMargin 2059 | 2060 | float 2061 | 41 2062 | 2063 | NSHorizonalPagination 2064 | 2065 | coded 2066 | 
BAtzdHJlYW10eXBlZIHoA4QBQISEhAhOU051bWJlcgCEhAdOU1ZhbHVlAISECE5TT2JqZWN0AIWEASqEhAFxlwCG 2067 | 2068 | NSLeftMargin 2069 | 2070 | float 2071 | 18 2072 | 2073 | NSPaperSize 2074 | 2075 | size 2076 | {612, 792} 2077 | 2078 | NSPrintReverseOrientation 2079 | 2080 | int 2081 | 0 2082 | 2083 | NSRightMargin 2084 | 2085 | float 2086 | 18 2087 | 2088 | NSTopMargin 2089 | 2090 | float 2091 | 18 2092 | 2093 | 2094 | PrintOnePage 2095 | 2096 | ReadOnly 2097 | NO 2098 | RowAlign 2099 | 1 2100 | RowSpacing 2101 | 36 2102 | SheetTitle 2103 | Canvas 1 2104 | SmartAlignmentGuidesActive 2105 | YES 2106 | SmartDistanceGuidesActive 2107 | YES 2108 | UniqueID 2109 | 1 2110 | UseEntirePage 2111 | 2112 | VPages 2113 | 2 2114 | WindowInfo 2115 | 2116 | CurrentSheet 2117 | 0 2118 | ExpandedCanvases 2119 | 2120 | 2121 | name 2122 | Canvas 1 2123 | 2124 | 2125 | Frame 2126 | {{241, 0}, {919, 874}} 2127 | ListView 2128 | 2129 | OutlineWidth 2130 | 142 2131 | RightSidebar 2132 | 2133 | ShowRuler 2134 | 2135 | Sidebar 2136 | 2137 | SidebarWidth 2138 | 120 2139 | VisibleRegion 2140 | {{-104, 421}, {784, 719}} 2141 | Zoom 2142 | 1 2143 | ZoomValues 2144 | 2145 | 2146 | Canvas 1 2147 | 1 2148 | 1 2149 | 2150 | 2151 | 2152 | 2153 | 2154 | -------------------------------------------------------------------------------- /docs/file_format.graffle/image1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trivio/common_crawl_index/235ba4dd81474c741908e8a0cc29604cbcada545/docs/file_format.graffle/image1.pdf -------------------------------------------------------------------------------- /docs/file_overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trivio/common_crawl_index/235ba4dd81474c741908e8a0cc29604cbcada545/docs/file_overview.png -------------------------------------------------------------------------------- /docs/header.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/trivio/common_crawl_index/235ba4dd81474c741908e8a0cc29604cbcada545/docs/header.png -------------------------------------------------------------------------------- /docs/index_block.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trivio/common_crawl_index/235ba4dd81474c741908e8a0cc29604cbcada545/docs/index_block.png -------------------------------------------------------------------------------- /docs/project_header.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trivio/common_crawl_index/235ba4dd81474c741908e8a0cc29604cbcada545/docs/project_header.png -------------------------------------------------------------------------------- /docs/tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trivio/common_crawl_index/235ba4dd81474c741908e8a0cc29604cbcada545/docs/tree.png -------------------------------------------------------------------------------- /lib/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright [2012] [Triv.io, Scott Robertson] 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # 15 | 16 | import urlparse 17 | 18 | def reversehost(url): 19 | # reverse netlocation http://www.example.com/foo -> com.example.www/foo:http 20 | url = urlparse.urlsplit(str(url)) 21 | 22 | netloc = url.netloc.split(':') 23 | host = netloc[0] 24 | if len(netloc) == 2: 25 | port = ':' + netloc[1] 26 | else: 27 | port = '' 28 | 29 | 30 | # reverse the host 31 | host = '.'.join(reversed(host.split('.'))) 32 | 33 | return ( 34 | host + 35 | url.path + 36 | (('?' + url.query) if url.query else '' ) + 37 | port + 38 | ':' + url.scheme 39 | ) 40 | 41 | -------------------------------------------------------------------------------- /lib/adaptor.py: -------------------------------------------------------------------------------- 1 | # Copyright [2012] [Triv.io, Scott Robertson] 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # 15 | 16 | import re 17 | import struct 18 | 19 | 20 | from triv.io import datasources 21 | from .pbtree import PBTreeDictWriter, DataWriter, IndexWriter, DataBlockReader 22 | 23 | BLOCK_SIZE = 2**16 24 | VALUE_FORMAT = " self.block_size: 346 | self.delegate.on_item_exceeds_block_size(key,value) 347 | return 348 | 349 | if size > self.remaining: 350 | self.write_buffer.extend(self.terminator * self.remaining) 351 | self.stream.write(self.write_buffer) 352 | del self.write_buffer[:] 353 | self.remaining = self.block_size 354 | self.delegate.on_new_block(key) 355 | 356 | self.write_buffer.extend(key) 357 | self.write_buffer.extend(self.terminator) 358 | self.write_buffer.extend(packet) 359 | 360 | self.remaining -= size 361 | 362 | def close(self): 363 | if not self.finalized: 364 | self.finish() 365 | self.stream.close() 366 | 367 | def finish(self): 368 | if self.write_buffer: 369 | self.write_buffer.extend(self.terminator * self.remaining) 370 | self.stream.write(self.write_buffer) 371 | 372 | del self.write_buffer 373 | self.delegate = None 374 | self.stream.seek(0) 375 | self.finalized = True 376 | 377 | def read(self, bytes = None): 378 | """ 379 | Returns bytes written to the stream. 
380 | 381 | It is an error to call this method prior to calling DataWriter.finish() 382 | """ 383 | return self.stream.read(bytes) 384 | 385 | class IndexWriter(object): 386 | def __init__(self, stream, block_size, terminator, pointer_format=' remaining: 407 | # pad the rest with null bytes 408 | stream.write(self.terminator * remaining) 409 | assert stream.tell() % self.block_size == 0 410 | 411 | # start the block off with the offset 412 | stream.write(struct.pack(self.pointer_format, pointers)) 413 | 414 | next_level = level + 1 415 | if next_level > len(self.indexes)-1: 416 | self.push_index() 417 | self.add(next_level, key) 418 | remaining = self.block_size - self.pointer_size 419 | 420 | pointers += 1 421 | stream.write(key) 422 | stream.write(self.terminator) 423 | stream.write(struct.pack(self.pointer_format, pointers)) 424 | 425 | remaining = remaining - size 426 | self.indexes[level] = stream, pointers, remaining 427 | 428 | def push_index(self): 429 | stream = SpooledTemporaryFile(max_size = 20*MB) 430 | 431 | pointers = 0 432 | stream.write(struct.pack(OFFSET_FMT,pointers)) 433 | 434 | self.indexes.append([ 435 | stream, pointers, self.block_size-self.pointer_size 436 | ]) 437 | 438 | def finish(self): 439 | out = self.stream 440 | blocks_written = 0 441 | 442 | out.write(struct.pack(OFFSET_FMT, self.block_size)) 443 | 444 | # blocks in the index 445 | out.write(struct.pack(OFFSET_FMT, 0)) 446 | 447 | 448 | for stream, pointers, remaining in reversed(self.indexes): 449 | 450 | # pad the stream 451 | stream.write(self.terminator * remaining) 452 | level_length = stream.tell() 453 | 454 | assert level_length % self.block_size == 0 455 | 456 | blocks_to_write = (level_length / self.block_size) 457 | 458 | stream.seek(0) 459 | 460 | # loop through each pointer and key writing 461 | for o, key in PBTreeReader.parse(stream, self.block_size): 462 | out.write(struct.pack(self.pointer_format, o+blocks_written+blocks_to_write)) 463 | # note: the last key in the 
block returned by the reader will be all null bytes 464 | # pads 465 | out.write(key) 466 | 467 | blocks_written += blocks_to_write 468 | stream.close() 469 | 470 | out.seek(OFFSET_SIZE) 471 | out.write(struct.pack(OFFSET_FMT, blocks_written)) 472 | out.seek(0,2) # move to the end of the file 473 | 474 | self.blocks_written = blocks_written 475 | 476 | 477 | def close(self): 478 | self.finish() 479 | 480 | class DataBlockReader(object): 481 | def __init__(self, bytes, value_size, terminator='\0'): 482 | self.bytes = bytes 483 | self.terminator = terminator 484 | self.value_size = value_size 485 | 486 | 487 | def __iter__(self): 488 | block = self.bytes 489 | start = 0 490 | while True: 491 | pos = block.find(self.terminator, start) 492 | if pos == -1: 493 | break 494 | key = block[start:pos] 495 | start=pos+1 496 | 497 | if key == '': 498 | return 499 | else: 500 | value = block[start:start+self.value_size] 501 | start+=self.value_size 502 | 503 | yield key, value 504 | 505 | 506 | 507 | class IndexBlockReader(object): 508 | def __init__(self, data): 509 | self.data = data 510 | 511 | 512 | def __iter__(self): 513 | end_of_block = False 514 | buffer = StringIO(self.data) 515 | 516 | while not end_of_block: 517 | offset = self.read_offset(buffer) 518 | 519 | key = self.read_key(buffer) 520 | # near the end of the block, everything from here on should be null bytes 521 | if key == '': 522 | # ended on block boundary 523 | end_of_block = True 524 | elif key == '\0': 525 | # nothing but pad bytes left 526 | end_of_block = True 527 | key += self.read_rest_of_block(buffer) 528 | 529 | yield offset, key 530 | 531 | def find(self, key): 532 | pointers = [] 533 | prefixes = [] 534 | for pointer, prefix in self: 535 | pointers.append(pointer) 536 | prefixes.append(prefix) 537 | 538 | # discard padding 539 | prefixes.pop() 540 | 541 | index = bisect.bisect(prefixes, key) 542 | return pointers[index] 543 | 544 | 545 | def read_offset(self, buffer): 546 | 547 | bytes = 
buffer.read(OFFSET_SIZE) 548 | # bytes should never be empty when reading an offset; if it is, the file 549 | # is corrupt 550 | return struct.unpack(OFFSET_FMT, bytes)[0] 551 | 552 | 553 | def read_key(self, buffer): 554 | buff = bytearray() 555 | while True: 556 | c = buffer.read(1) 557 | if c == '': 558 | if len(buff): 559 | raise RuntimeError('EOF found when string was expected') 560 | else: 561 | break 562 | elif c == '\0': 563 | buff.append(c) 564 | break 565 | else: 566 | buff.append(c) 567 | return str(buff) 568 | 569 | def read_rest_of_block(self, buffer): 570 | return buffer.read() 571 | 572 | 573 | 574 | 575 | 576 | 577 | -------------------------------------------------------------------------------- /lib/prefix.py: -------------------------------------------------------------------------------- 1 | # Copyright 2012 Triv.io, Scott Robertson 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # 15 | 16 | from itertools import izip_longest, dropwhile 17 | 18 | def commonlen(s1,s2): 19 | """ 20 | Returns the length of the common prefix 21 | """ 22 | 23 | # given "hi", "hip" 24 | # izip_longest("hi", "hip") -> ('h','h'), ('i','i'), (None, 'p') 25 | # enumerate -> (0,('h','h')), (1,('i','i')), (2,(None, 'p')) 26 | # dropwhile(lambda (i,(x,y)): x == y) -> (2,(None,'p')) ... 
27 | 28 | try: 29 | return dropwhile(lambda (i,(x,y)): x == y,enumerate(izip_longest(s1, s2))).next()[0] 30 | except StopIteration: 31 | # strings are identical; return the length of either one 32 | return len(s1) 33 | 34 | def common(s1,s2): 35 | """ 36 | Returns the common prefix 37 | """ 38 | cl = commonlen(s1,s2) 39 | return s2[:cl] 40 | 41 | def signifigant(s1,s2): 42 | """ 43 | Given two strings s1 and s2, and assuming s2 > s1, returns the shortest 44 | prefix of s2 that makes s2 greater. 45 | """ 46 | cl = commonlen(s1,s2) 47 | return s2[:cl+1] 48 | -------------------------------------------------------------------------------- /lib/test.py: -------------------------------------------------------------------------------- 1 | # Copyright [2012] [Triv.io, Scott Robertson] 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # 15 | 16 | from unittest import TestCase 17 | from tempfile import NamedTemporaryFile 18 | from functools import partial 19 | from nose.tools import eq_ 20 | import mmap 21 | 22 | 23 | class TestIndex(TestCase): 24 | def test_btree_index(self): 25 | from .pbtree import PBTreeWriter, PBTreeReader 26 | 27 | def data(): 28 | """Returns an iterator of (url, linepos)""" 29 | return ((url.strip(), pos) for pos, url in enumerate(open('sorted_urls'))) 30 | 31 | self.validate( 32 | PBTreeWriter, 33 | PBTreeReader, 34 | data(), 35 | prefix = 'http://natebeaty.com/', 36 | known_keys = [ 37 | 'http://natebeaty.com/illustration/4452349850', 38 | 'http://natebeaty.com/illustration/4573016166', 39 | 'http://natebeaty.com/illustration/4747271212', 40 | 'http://natebeaty.com/illustration/4752986875', 41 | ], 42 | known_values =[ 43 | 1891, 44 | 1892, 45 | 1893, 46 | 1894 47 | ] 48 | ) 49 | 50 | def test_btree_dict_index(self): 51 | from .pbtree import PBTreeDictWriter, PBTreeDictReader 52 | 53 | writer = partial(PBTreeDictWriter, item_keys=("key1", "key2"), value_format=" :: = ( | ( {} )) 2 | # ::= 3 | 4 | # 5 | # 6 | 7 | from unittest import TestCase 8 | from .prefix import signifigant 9 | 10 | 11 | class TestIndexMapCase(TestCase): 12 | def test(self): 13 | class P(): 14 | pass 15 | params = P() 16 | 17 | results =[] 18 | for partition_number, input in enumerate(inputs): 19 | params.last_key = None 20 | params.partition_number = partition_number 21 | for block in input: 22 | for item in map_block(iter(block), params): 23 | results.append(item) 24 | 25 | self.assertSequenceEqual(results, final) 26 | 27 | 28 | 29 | file1 = [ 30 | [ 31 | 'key01', 32 | 'key02', 33 | 'key03a', 34 | 'key03ac', 35 | ], 36 | [ 37 | 'key03bc', 38 | 'key06', 39 | 'key07', 40 | 'key08z', 41 | ], 42 | [ 43 | 'key08zafz', 44 | 'key10', 45 | 'key11', 46 | 'key12', 47 | ], 48 | ] 49 | 50 | file2= [ 51 | [ 52 | 'key13feee', 53 | 'key14', 54 | 'key16', 55 | 'key16a', 56 | ], 57 | [ 58 | 'key16b', 59 | 'key18', 
        'key19',
        'key20',
    ]
]


final = (
    (0, "key01"),
    (0, "key03b"),
    (0, "key08za"),
    (1, "key13feee"),
    (1, "key16b")
)

inputs = [file1, file2]

def map_block(block, params):
    # Yield the block's first key (shortened to its significant prefix when a
    # previous block exists), then remember the block's last key.

    first_key = next(block)
    assert first_key.find('\0') == -1

    if params.last_key is None:
        yield params.partition_number, first_key
    else:
        yield params.partition_number, signifigant(params.last_key, first_key)

    second_to_last = None
    for key in block:
        if not key.startswith('\0'):
            second_to_last = key

    if second_to_last is not None:
        params.last_key = second_to_last
    else:
        params.last_key = first_key

--------------------------------------------------------------------------------
/lib/test_pbtree.py:
--------------------------------------------------------------------------------
# Copyright 2012 Triv.io, Scott Robertson
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from unittest import TestCase

from nose.tools import eq_

from .pbtree import PBTreeWriter, PBTreeReader, IndexWriter
from tempfile import TemporaryFile

class TestPBTree(TestCase):
    def test_btree_index(self):
        t = TemporaryFile()
        pbtree = PBTreeWriter(t)
        pbtree.add("blah", 1)
        #pbtree.commit()

        #t.seek(0)
        packet = pbtree.data_segment.write_buffer #t.read()

        eq_(packet, 'blah\x00\x01\x00\x00\x00\x00\x00\x00\x00')

    def test_one_key_per_block_writer(self):
        # Two 4-byte pointers plus a 1-byte, null-terminated string = 10 bytes
        stream = TemporaryFile()

        i = IndexWriter(stream, block_size=10, terminator='\0')
        i.add(0, 'b')
        eq_(len(i.indexes), 1)

        i.add(0, 'c')
        eq_(len(i.indexes), 2)
        i.finish()

        stream.seek(0)
        packet = stream.read()
        eq_(len(packet), 30)

        root_block = packet[:10]
        eq_(root_block, '\x01\x00\x00\x00c\x00\x02\x00\x00\x00')

        block_1 = packet[10:20]
        eq_(block_1, '\x03\x00\x00\x00b\x00\x04\x00\x00\x00')

        block_2 = packet[20:]
        eq_(block_2, '\x04\x00\x00\x00c\x00\x05\x00\x00\x00')
--------------------------------------------------------------------------------
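The byte strings asserted in `test_pbtree.py` can be easier to read once decoded. The following is a minimal sketch, not part of the repository: `decode_index_block` and `decode_data_entry` are hypothetical helpers, and the layouts they assume (a little-endian 4-byte pointer, a null-terminated key, then another little-endian 4-byte pointer for index blocks; a null-terminated key followed by a little-endian 8-byte value for data entries) are inferred only from the expected bytes in the tests, not from `pbtree.py` itself.

```python
import struct


def decode_index_block(block, terminator=b"\0"):
    """Split one index block into (left_pointer, key, right_pointer).

    Assumed layout, inferred from the test's expected bytes:
    uint32 (little-endian) | key bytes | terminator | uint32 (little-endian)
    """
    (left,) = struct.unpack_from("<I", block, 0)
    # Search for the terminator after the first pointer, since the pointer
    # bytes themselves may contain zero bytes.
    end = block.index(terminator, 4)
    key = block[4:end]
    (right,) = struct.unpack_from("<I", block, end + len(terminator))
    return left, key, right


def decode_data_entry(entry, terminator=b"\0"):
    """Split one data entry into (key, value).

    Assumed layout: key bytes | terminator | uint64 (little-endian)
    """
    end = entry.index(terminator)
    (value,) = struct.unpack_from("<Q", entry, end + len(terminator))
    return entry[:end], value


# The three 10-byte blocks asserted in test_one_key_per_block_writer.
root_block = decode_index_block(b"\x01\x00\x00\x00c\x00\x02\x00\x00\x00")
block_1 = decode_index_block(b"\x03\x00\x00\x00b\x00\x04\x00\x00\x00")
block_2 = decode_index_block(b"\x04\x00\x00\x00c\x00\x05\x00\x00\x00")

# The data-segment buffer asserted in test_btree_index.
data_entry = decode_data_entry(b"blah\x00\x01\x00\x00\x00\x00\x00\x00\x00")
```

Under these assumptions the root block decodes to pointers 1 and 2 around the key `c`, and the data entry decodes to the key `blah` with the value 1 that was passed to `pbtree.add`.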