├── 2012 ESS crosswalk of Russian Regions.csv ├── LICENSE.md ├── README.md ├── how to map the american community survey.R ├── how to map the consumer expenditure survey.R ├── how to map the current population survey.R ├── how to map the demographic and health surveys.R ├── how to map the european social survey.R ├── how to map the new york city housing and vacancy survey.R └── how to map the pesquisa nacional por amostra de domicilios.R /2012 ESS crosswalk of Russian Regions.csv: -------------------------------------------------------------------------------- 1 | region,NAME_1,ID_1 2 | RU17,Altay,2513 3 | RU18,Amur,2514 4 | RU12,Arkhangel'sk,2515 5 | RU14,Astrakhan',2516 6 | RU11,Belgorod,2518 7 | RU11,Bryansk,2519 8 | RU15,Chechnya,2521 9 | RU16,Chelyabinsk,2522 10 | RU18,Chukot,2524 11 | RU13,Chuvash,2525 12 | RU18,Evenk,2528 13 | RU17,Irkutsk,2531 14 | RU11,Ivanovo,2532 15 | RU15,Kabardin-Balkar,2533 16 | RU12,Kaliningrad,2534 17 | RU11,Kaluga,2536 18 | RU18,Kamchatka,2537 19 | RU18,Koryak,2547 20 | RU15,Karachay-Cherkess,2538 21 | RU17,Kemerovo,2540 22 | RU18,Khabarovsk,2541 23 | RU13,Kirov,2544 24 | RU11,Kostroma,2548 25 | RU14,Krasnodar,2549 26 | RU17,Krasnoyarsk,2550 27 | RU16,Kurgan,2551 28 | RU11,Kursk,2552 29 | RU12,Leningrad,2553 30 | RU11,Lipetsk,2554 31 | RU18,Maga Buryatdan,2555 32 | RU11,Moskva,2558 33 | RU12,Murmansk,2559 34 | RU13,Nizhegorod,2561 35 | RU12,Novgorod,2563 36 | RU17,Novosibirsk,2564 37 | RU17,Omsk,2565 38 | RU13,Orenburg,2567 39 | RU11,Orel,2566 40 | RU13,Penza,2568 41 | RU13,Perm',2569 42 | RU13,Komi-Permyak,2546 43 | RU18,Primor'ye,2570 44 | RU12,Pskov,2571 45 | RU14,Adygey,2511 46 | RU17,Gorno-Altay,2529 47 | RU13,Bashkortostan,2517 48 | RU17,Buryat,2520 49 | RU15,Dagestan,2527 50 | RU15,Ingush,2530 51 | RU14,Kalmyk,2535 52 | RU12,Karelia,2539 53 | RU17,Khakass,2542 54 | RU12,Komi,2545 55 | RU13,Mariy-El,2556 56 | RU13,Mordovia,2557 57 | RU18,Sakha,2574 58 | RU13,Tatarstan,2582 59 | RU17,Tuva,2586 60 | RU14,Rostov,2572 61 | RU11,Ryazan',2573 62 | RU18,Sakhalin,2575 63 | RU13,Samara,2576 64 | RU13,Saratov,2577 65 | RU15,North Ossetia,2562 66 | RU11,Smolensk,2578 67 | RU12,City of St. Petersburg,2526 68 | RU15,Stavropol',2579 69 | RU16,Sverdlovsk,2580 70 | RU11,Tambov,2581 71 | RU17,Tomsk,2584 72 | RU11,Tula,2585 73 | RU11,Tver',2587 74 | RU16,Tyumen',2588 75 | RU13,Udmurt,2589 76 | RU13,Ul'yanovsk,2590 77 | RU11,Vladimir,2592 78 | RU14,Volgograd,2593 79 | RU12,Vologda,2594 80 | RU11,Voronezh,2595 81 | RU11,Yaroslavl',2597 82 | RU17,Chita,2523 83 | RU17,Aga Buryat,2512 84 | RU17,Khanty-Mansiy,2543 85 | RU17,Yamal-Nenets,2596 86 | RU17,Taymyr,2583 87 | RU17,Ust-Orda Buryat,2591 88 | RU18,Yevrey,2598 89 | RU12,Nenets,2560 90 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 [Free Software Foundation, Inc.](http://fsf.org/) 5 | 6 | Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 7 | 8 | ## Preamble 9 | The GNU General Public License is a free, copyleft license for software and other kinds of works. 10 | 11 | The licenses for most software and other practical works are designed to take away your freedom to share and change the works. 
By contrast, the GNU General Public License is intended to guarantee your freedom to 12 | share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. 13 | 14 | When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for 15 | them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. 16 | 17 | To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if 18 | you modify it: responsibilities to respect the freedom of others. 19 | 20 | For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. 21 | 22 | Developers that use the GNU GPL protect your rights with two steps: 23 | 24 | 1. assert copyright on the software, and 25 | 2. offer you this License giving you legal permission to copy, distribute and/or modify it. 26 | 27 | For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as 28 | changed, so that their problems will not be attributed erroneously to authors of previous versions. 29 | 30 | Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. 31 | 32 | Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. 33 | 34 | The precise terms and conditions for copying, distribution and modification follow. 35 | 36 | ## TERMS AND CONDITIONS 37 | 38 | ### 0. Definitions. 39 | *This License* refers to version 3 of the GNU General Public License. 40 | 41 | *Copyright* also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. 42 | 43 | *The Program* refers to any copyrightable work licensed under this License. Each licensee is addressed as *you*. 
*Licensees* and *recipients* may be individuals or organizations. 44 | 45 | To *modify* a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a *modified version* of the earlier work or a work *based on* the earlier work. 46 | 47 | A *covered work* means either the unmodified Program or a work based on the Program. 48 | 49 | To *propagate* a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. 50 | 51 | To *convey* a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. 52 | 53 | An interactive user interface displays *Appropriate Legal Notices* to the extent that it includes a convenient and prominently visible feature that 54 | 55 | 1. displays an appropriate copyright notice, and 56 | 2. tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. 57 | 58 | If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 59 | 60 | ### 1. Source Code. 61 | The *source code* for a work means the preferred form of the work for making modifications to it. *Object code* means any non-source form of a work. 62 | 63 | A *Standard Interface* means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. 64 | 65 | The *System Libraries* of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A *Major Component*, in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. 66 | 67 | The *Corresponding Source* for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. 
For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. 68 | 69 | The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. 70 | 71 | The Corresponding Source for a work in source code form is that same work. 72 | 73 | ### 2. Basic Permissions. 74 | All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. 75 | 76 | You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do 77 | not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of 78 | your copyrighted material outside their relationship with you. 79 | 80 | Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 81 | 82 | ### 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 83 | No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. 84 | 85 | When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 86 | 87 | ### 4. Conveying Verbatim Copies. 88 | You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. 89 | 90 | You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 91 | 92 | ### 5. Conveying Modified Source Versions. 
93 | You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: 94 | 95 | - a) The work must carry prominent notices stating that you modified it, and giving a relevant date. 96 | - b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to *keep intact all notices*. 97 | - c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. 98 | - d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. 99 | 100 | A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an *aggregate* if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 101 | 102 | ### 6. Conveying Non-Source Forms. 103 | You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: 104 | 105 | - a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. 106 | - b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either 107 | 1. a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or 108 | 2. access to copy the Corresponding Source from a network server at no charge. 109 | 110 | - c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. 111 | - d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. 
You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. 112 | - e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. 113 | 114 | A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. 115 | 116 | A *User Product* is either 117 | 118 | 1. a *consumer product*, which means any tangible personal property which is normally used for personal, family, or household purposes, or 119 | 2. anything designed or sold for incorporation into a dwelling. 120 | 121 | In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, *normally used* refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. 122 | 123 | *Installation Information* for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. 124 | 125 | If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). 126 | 127 | The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and 128 | protocols for communication across the network.
129 | 130 | Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 131 | 132 | ### 7. Additional Terms. 133 | *Additional permissions* are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. 134 | 135 | When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 136 | 137 | Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: 138 | 139 | a. Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or 140 | b. Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or 141 | c. Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or 142 | d. Limiting the use for publicity purposes of names of licensors or authors of the material; or 143 | e. Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or 144 | f. Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. 145 | 146 | All other non-permissive additional terms are considered *further restrictions* within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. 147 | 148 | If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. 
149 | 150 | Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 151 | 152 | ### 8. Termination. 153 | You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). 154 | 155 | However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated 156 | 157 | a. provisionally, unless and until the copyright holder explicitly and finally terminates your license, and 158 | b. permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. 159 | 160 | Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. 161 | 162 | Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 163 | 164 | ### 9. Acceptance Not Required for Having Copies. 165 | You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 166 | 167 | ### 10. Automatic Licensing of Downstream Recipients. 168 | Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. 169 | 170 | An *entity transaction* is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. 171 | 172 | You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. 
For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 173 | 174 | ### 11. Patents. 175 | A *contributor* is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's *contributor version*. 176 | 177 | A contributor's *essential patent claims* are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, *control* includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. 178 | 179 | Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. 180 | 181 | In the following three paragraphs, a *patent license* is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to 182 | sue for patent infringement). To *grant* such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. 183 | 184 | If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either 185 | 186 | 1. cause the Corresponding Source to be so available, or 187 | 2. arrange to deprive yourself of the benefit of the patent license for this particular work, or 188 | 3. arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. 189 | 190 | *Knowingly relying* means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. 191 | 192 | If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. 193 | 194 | A patent license is *discriminatory* if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. 
You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license 195 | 196 | a. in connection with copies of the covered work conveyed by you (or copies made from those copies), or 197 | b. primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. 198 | 199 | Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 200 | 201 | ### 12. No Surrender of Others' Freedom. 202 | If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 203 | 204 | ### 13. Use with the GNU Affero General Public License. 205 | Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 206 | 207 | ### 14. Revised Versions of this License. 208 | The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. 209 | 210 | Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License *or any later version* applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. 211 | 212 | If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. 213 | 214 | Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 215 | 216 | ### 15. Disclaimer of Warranty. 
217 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM *AS IS* WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 218 | 219 | ### 16. Limitation of Liability. 220 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 221 | 222 | ### 17. Interpretation of Sections 15 and 16. 223 | If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. 224 | 225 | ## END OF TERMS AND CONDITIONS 226 | ### How to Apply These Terms to Your New Programs 227 | If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. 228 | 229 | To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the *copyright* line and a pointer to where the full notice is found. 230 | 231 | <one line to give the program's name and a brief idea of what it does.> 232 | Copyright (C) <year> <name of author> 233 | 234 | This program is free software: you can redistribute it and/or modify 235 | it under the terms of the GNU General Public License as published by 236 | the Free Software Foundation, either version 3 of the License, or 237 | (at your option) any later version. 238 | 239 | This program is distributed in the hope that it will be useful, 240 | but WITHOUT ANY WARRANTY; without even the implied warranty of 241 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 242 | GNU General Public License for more details. 243 | 244 | You should have received a copy of the GNU General Public License 245 | along with this program. If not, see <http://www.gnu.org/licenses/>. 246 | 247 | Also add information on how to contact you by electronic and paper mail. 248 | 249 | If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: 250 | 251 | <program> Copyright (C) <year> <name of author> 252 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 253 | This is free software, and you are welcome to redistribute it 254 | under certain conditions; type `show c' for details. 255 | 256 | The hypothetical commands `show w` and `show c` should show the appropriate parts of the General Public License. 
Of course, your program's commands might be different; for a GUI interface, you would use an *about box*. 257 | 258 | You should also get your employer (if you work as a programmer) or school, if any, to sign a *copyright disclaimer* for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see [http://www.gnu.org/licenses/](http://www.gnu.org/licenses/). 259 | 260 | The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read [http://www.gnu.org/philosophy/why-not-lgpl.html](http://www.gnu.org/philosophy/why-not-lgpl.html). -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | cartography with complex survey data -------------------------------------------------------------------------------- /how to map the american community survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # american community survey 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # displays a non-ordinal categorical variable 20 | # crosses the international date line 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # state, public use microdata areas 28 | 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/american%20community%20survey%20%28acs%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/American%20Community%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # disproportionate shares of veterans of foreign wars (categorical) 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # map presents shares that are *disproportionately* higher than the statewide average. 56 | # in absolute numbers, gulf war veterans outnumber other veteran categories in four of the five pumas. 
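# an aside before the real work starts, to pin down the statistic being mapped.
# the numbers below are made up for illustration only; they are not survey estimates.
# if 45% of one puma's veterans served during the gulf era while 38% of veterans
# statewide did, that puma's disproportionate gulf share is simply the difference:
0.45 - 0.38
# positive means the puma is over-represented relative to the statewide average,
# negative means under-represented.  since the three era categories exhaust all veterans,
# each puma's three disproportionate shares sum to approximately zero -- a fact that
# gets re-used as a sanity check on the predictions in step 7 below.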
57 | 58 | 59 | # # # # # # # # # # # # # # # # # # # # # 60 | # # step 1: load the survey microdata # # 61 | 62 | # remove the # in order to run this install.packages line only once 63 | # install.packages( "MonetDBLite" ) 64 | 65 | library(downloader) 66 | 67 | # download the 2013 american community survey microdata onto the local disk 68 | # path.to.7z <- "7za" # # only macintosh and *nix users need this line 69 | single.year.datasets.to.download <- 2013 70 | three.year.datasets.to.download <- NULL 71 | five.year.datasets.to.download <- NULL 72 | source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/American%20Community%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 73 | 74 | gc() 75 | 76 | # # end of step 1 # # 77 | # # # # # # # # # # # 78 | 79 | 80 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 81 | # # step 2: conduct your analysis of interest at the smallest geography allowed # # 82 | 83 | library(survey) 84 | library(DBI) # load the DBI package (implements the R-database coding) 85 | library(MonetDBLite) 86 | library(scales) 87 | 88 | # connect to the database 89 | dbfolder <- paste0( getwd() , "/MonetDB" ) 90 | 91 | db <- dbConnect( MonetDBLite::MonetDBLite() , dbfolder ) 92 | 93 | # # # # run your analysis commands # # # # 94 | 95 | # subset the design to only alaska before actually constructing the design 96 | acs.alaska <- dbGetQuery( db , 'select * from acs2013_1yr_m where st = 2' ) 97 | # note: this is not allowed for taylor-series linearized designs 98 | 99 | # disconnect from the current monet database 100 | dbDisconnect( db ) 101 | 102 | # construct a svrepdesign object 103 | alaska.design <- 104 | svrepdesign( 105 | weight = ~pwgtp , 106 | repweights = 'pwgtp[1-9]' , 107 | scale = 4 / 80 , 108 | rscales = rep( 1 , 80 ) , 109 | mse = TRUE , 110 | data = acs.alaska 111 | ) 112 | 113 | # create a denominator variable indicating any period of service 114 | alaska.design <- update( alaska.design , vet = as.numeric( vps > 0 ) ) 115 | 116 | # create a categorical variable indicating era of service 117 | alaska.design <- 118 | update( 119 | alaska.design , 120 | gulf = as.numeric( vps %in% 1:5 ) , 121 | vietnam = as.numeric( vps %in% 6:8 ) , 122 | other = as.numeric( vps %in% 9:15 ) 123 | ) 124 | 125 | 126 | 127 | # statewide era of service shares 128 | sw <- svyratio( ~ gulf + vietnam + other , ~ vet , alaska.design , na.rm = TRUE ) 129 | 130 | # puma-specific era of service shares 131 | ps <- svyby( ~ gulf + vietnam + other , denominator = ~ vet , by = ~ puma , alaska.design , svyratio , na.rm = TRUE ) 132 | 133 | # find the disproportionate shares 134 | ds <- ps[ , 2:4 ] - matrix( coef( sw ) , 5 , 3 , byrow = T ) 135 | 136 | # so look at this table. 137 | ds 138 | 139 | # pumas 101 and 300 have veterans that disproportionately served during the gulf wars (up to the present) 140 | 141 | # pumas 102 and 200 have veterans that disproportionately served during the vietnam war 142 | 143 | # puma 400 has veterans that disproportionately served during another era 144 | 145 | # hold on to these disproportionate shares and the standard errors of the original ratios. 146 | alaska.pumas <- cbind( ps[ 1 ] , ds , ps[ , 5:7 ] ) 147 | 148 | # note that the standard error of the ratio statistic 149 | # and the standard error of the ratio of 150 | # the difference between statewide and puma-level statistics 151 | # are probably not the same. well, i'm sure of it. 
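# (one rougher but arguably more defensible option -- my own suggestion, not part of the
#  original workflow: combine the puma-level and statewide standard errors as if the two
#  estimates were independent.  they are not independent, since every puma contributes to
#  the statewide ratio, so this approximation will tend to overstate the uncertainty.
#  it also assumes the survey package's SE() extractor accepts the svyratio output in `sw`.)
# approx.se.diff <- sqrt( ps[ , 5:7 ]^2 + matrix( SE( sw )^2 , 5 , 3 , byrow = TRUE ) )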
152 | # if you're an academic statistician, you might be mad at me 153 | # for making this half-assed calculation right here. 154 | 155 | # github makes it easy to patch and edit and update other people's code. 156 | 157 | # go for it ;) 158 | 159 | # remove those slashvets from the column names 160 | names( alaska.pumas ) <- gsub( "/vet" , "" , names( alaska.pumas ) ) 161 | 162 | # these are the small area statistics to be mapped 163 | print( alaska.pumas ) 164 | # the standard errors are a measure of precision, 165 | # their inverse will serve as the mapping weights 166 | 167 | # make this object easier to type 168 | sas <- alaska.pumas 169 | 170 | # # end of step 2 # # 171 | # # # # # # # # # # # 172 | 173 | 174 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 175 | # # step 3: download and import necessary geographic crosswalks # # 176 | 177 | library(downloader) 178 | library(maptools) 179 | 180 | # load the download_cached and related functions 181 | # to prevent re-downloading of files once they've been downloaded. 182 | source_url( 183 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 184 | prompt = FALSE , 185 | echo = FALSE 186 | ) 187 | 188 | 189 | # create a temporary file containing the census bureau's 190 | # 2010 census tract to 2010 puma crosswalk 191 | # then download the file. 192 | ctpxw.tf <- tempfile() 193 | 194 | download_cached( 195 | "http://www2.census.gov/geo/docs/maps-data/data/rel/2010_Census_Tract_to_2010_PUMA.txt" , 196 | ctpxw.tf , 197 | mode = 'wb' 198 | ) 199 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 200 | 201 | # import this csv file into an R data.frame object 202 | ctpxw <- read.csv( ctpxw.tf ) 203 | 204 | # match the column names of sf1 and of the `sas` output 205 | names( ctpxw ) <- c( 'state' , 'county' , 'tract' , 'puma' ) 206 | 207 | # immediately limit this to alaskan census tracts 208 | ak.ctpxw <- subset( ctpxw , state == 2 ) 209 | 210 | # clear up RAM 211 | rm( ctpxw ) ; gc() 212 | 213 | 214 | # create a temporary file containing the census bureau's 215 | # 2010 census summary file #1 for alaska 216 | # then download the file. 217 | sf1ak.tf <- tempfile() 218 | 219 | download_cached( 220 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/Alaska/ak2010.sf1.zip" , 221 | sf1ak.tf , 222 | mode = 'wb' 223 | ) 224 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 225 | 226 | # unzip the summary file #1 files 227 | sf1ak.uz <- unzip( sf1ak.tf , exdir = tempdir() ) 228 | 229 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 230 | sf1ak <- read.fwf( sf1ak.uz[ grep( "akgeo2010" , sf1ak.uz ) ] , c( -8 , 3 , -16 , 2 , 3 , -22 , 6 , 1 , 4 , -253 , 9 , -9 , 11 , 12 ) ) 231 | 232 | # add columns names matching the census bureau, so it's easy to read 233 | names( sf1ak ) <- c( "sumlev" , "state" , "county" , "tract" , "blkgrp" , "block" , "pop100" , "intptlat" , "intptlon" ) 234 | 235 | # summary level 101 has census tracts and census blocks 236 | sf1ak.101 <- subset( sf1ak , sumlev == "101" ) 237 | 238 | # merge these files together 239 | sf1ak.101 <- merge( sf1ak.101 , ak.ctpxw ) 240 | # the number of records and population sums serve 241 | # as a check to confirm that this merge worked 242 | 243 | # one record per census block in alaska. see? same number. 244 | nrow( sf1ak.101 ) 245 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 246 | 247 | # and guess what? 
the total alaska population matches as well. 248 | sum( sf1ak.101$pop100 ) 249 | # http://quickfacts.census.gov/qfd/states/02000.html 250 | 251 | # clear up RAM 252 | rm( sf1ak ) ; gc() 253 | 254 | 255 | # so now we have a data.frame object with 256 | # one record per census block, 257 | # and also with the geography (puma) 258 | # that matches the american community survey 259 | head( sf1ak.101 ) 260 | 261 | # and guess what? 262 | # we've now got the census 2010 weighted populations (field pop100) 263 | # and also each census block's centroid latitude & longitude (fields intptlat + intptlon) 264 | 265 | # # end of step 3 # # 266 | # # # # # # # # # # # 267 | 268 | 269 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 270 | # # step 4: merge the results of your survey analysis with the small-area geography # # 271 | 272 | # confirm that we've created all possible geographies correctly. 273 | 274 | # the number of records in our small area statistics.. 275 | sas.row <- nrow( sas ) 276 | 277 | # ..should equal the number of unique-match-merged records.. 278 | mrow <- nrow( merge( unique( sf1ak.101[ "puma" ] ) , sas ) ) 279 | 280 | # ..and it does/they do. 281 | stopifnot( sas.row == mrow ) 282 | 283 | # now the census block-level alaska census data *could* merge if you wanted it to. 284 | 285 | 286 | # but you don't. yet. 287 | 288 | 289 | # the standard errors (the `se.` fields) are measures of precision. 290 | print( sas ) 291 | # the smaller the standard error, the more confident you should be 292 | # that the estimate at a particular geography is correct. 293 | 294 | 295 | # so invert them. you heard me. invert them. 296 | sas$invse.gulf <- 1 / sas$se.gulf 297 | sas$invse.vietnam <- 1 / sas$se.vietnam 298 | sas$invse.other <- 1 / sas$se.other 299 | # a smaller standard error indicates more precision. 300 | 301 | # for our purposes, precision can be considered weight! # 302 | 303 | # now we've got the weight that we should give each of our estimates # 304 | 305 | # distribute that weight across all census blocks # 306 | 307 | 308 | # aggregate the 2010 census block populations to the geographies that you have. 309 | popsum <- aggregate( sf1ak.101$pop100 , by = ( sf1ak.101[ "puma" ] ) , sum ) 310 | 311 | # make the column name meaningful 312 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 313 | 314 | # merge the popsum onto the `sas` file 315 | sas <- merge( sas , popsum ) 316 | 317 | # now. 
merge 318 | # the disproportionate veteran era in each puma (the variable of interest) 319 | # the inverted standard errors (the total weight of the broad geography) 320 | # the population sum (the total population of all census blocks that are part of that geography) 321 | 322 | x <- merge( sf1ak.101 , sas ) 323 | 324 | # confirm no record loss 325 | stopifnot( nrow( x ) == nrow( sf1ak.101 ) ) 326 | 327 | 328 | # (this is the fun part) 329 | # calculate the weight at each census block 330 | x$weight.gulf <- x$invse.gulf * ( x$pop100 / x$popsum ) 331 | x$weight.vietnam <- x$invse.vietnam * ( x$pop100 / x$popsum ) 332 | x$weight.other <- x$invse.other * ( x$pop100 / x$popsum ) 333 | 334 | # note that the weight of all census blocks put together 335 | # sums to the `invse` on the original analysis file 336 | stopifnot( sum( x$weight.gulf ) == sum( sas$invse.gulf ) ) 337 | stopifnot( sum( x$weight.vietnam ) == sum( sas$invse.vietnam ) ) 338 | stopifnot( sum( x$weight.other ) == sum( sas$invse.other ) ) 339 | 340 | # remove records with zero population across all three measures 341 | x <- subset( x , weight.gulf > 0 | weight.vietnam > 0 | weight.other > 0 ) 342 | 343 | # scale all weights so that they average to one 344 | x$weight.gulf <- x$weight.gulf / mean( x$weight.gulf ) 345 | x$weight.vietnam <- x$weight.vietnam / mean( x$weight.vietnam ) 346 | x$weight.other <- x$weight.other / mean( x$weight.other ) 347 | 348 | 349 | # you're done preparing your data. 350 | # keep only the columns you need. 351 | x <- x[ , c( 'gulf', 'vietnam' , 'other' , 'weight.gulf' , 'weight.vietnam' , 'weight.other' , 'intptlat' , 'intptlon' ) ] 352 | 353 | # pop quiz: which states are the furthest north, east, south, west? 354 | # if you guessed alaska, maine, hawaii, alaska, you are wrong! 355 | # the answer is alaska, alaska, hawaii, alaska. 356 | 357 | # a few of the aleutians cross the international date line. 358 | 359 | # do you want to keep the edges of the aleutian islands in your map? 360 | # of course you do! here's an ultra-simple recode to keep them gridded together. 
361 | x <- transform( x , intptlon = ifelse( intptlon > 0 , intptlon - 360 , intptlon ) ) 362 | 363 | 364 | # # end of step 4 # # 365 | # # # # # # # # # # # 366 | 367 | 368 | # # # # # # # # # # # # 369 | # # step 5: outline # # 370 | 371 | library(maptools) 372 | library(raster) 373 | library(rgeos) 374 | library(rgdal) 375 | library(ggplot2) 376 | 377 | # make a character vector containing the shapefiles to download 378 | shftd <- 379 | c( 380 | # download the clipped alaska public use microdata area map, described 381 | # https://www.census.gov/geo/maps-data/maps/2010puma/st02_ak.html 382 | 'http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_02_puma10_500k.zip' , 383 | 384 | # download the clipped nationwide state outlines 385 | 'http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_state_500k.zip' , 386 | 387 | # download the roads in alaska 388 | 'http://www2.census.gov/geo/tiger/TIGER2013/PRISECROADS/tl_2013_02_prisecroads.zip' 389 | ) 390 | 391 | # initiate a function to download and import all census bureau shapefiles 392 | daiacbsf <- 393 | function( fn , myproj = "+init=epsg:2163" ){ 394 | tf <- tempfile() 395 | 396 | # # note: to re-download a file from scratch, add the parameter usecache = FALSE # # 397 | download_cached( fn , tf , mode = 'wb' ) 398 | 399 | # unzip the downloaded file to a temporary directory 400 | shp.uz <- unzip( tf , exdir = tempdir() ) 401 | 402 | # figure out which filename ends with "shp" 403 | sfname <- grep( 'shp$' , shp.uz , value = TRUE ) 404 | 405 | # read in the shapefile, using the correct layer 406 | sf <- readOGR( sfname , layer = gsub( "\\.shp" , "" , basename( sfname ) ) ) 407 | 408 | # project this shapefile immediately 409 | # this projection (and a few others) keeps 410 | # the aleutian islands that cross the 411 | # international date line easy to work with. 412 | spTransform( sf , CRS( myproj ) ) 413 | } 414 | 415 | # run all downloads at once, store the result in a list. 416 | asf <- sapply( shftd , daiacbsf ) 417 | 418 | # pull out the clipped state borders of alaska only 419 | alaska.borders <- subset( asf[[2]] , STATEFP == '02' ) 420 | 421 | # plot as-is. see how the aleutians screw up the map? 422 | plot( alaska.borders ) 423 | 424 | # add puma boundaries 425 | plot( asf[[1]] , add = TRUE ) 426 | 427 | # refresh the map with state borders only 428 | plot( alaska.borders ) 429 | 430 | # add roads 431 | plot( asf[[3]] , add = TRUE , col = 'red' ) 432 | 433 | # draw a rectangle 15% bigger than the original state 434 | ak.shp.blank <- as( 1.3 * extent( alaska.borders ) , "SpatialPolygons" ) 435 | 436 | # calculate the difference between the rectangle and the actual shape 437 | ak.shp.diff <- gDifference( ak.shp.blank , alaska.borders ) 438 | # this will be used to cover up points outside of alaska's state borders 439 | 440 | # this box will later blank out the surrounding area 441 | plot( ak.shp.diff ) 442 | 443 | # # end of step 5 # # 444 | # # # # # # # # # # # 445 | 446 | 447 | # # # # # # # # # # # # # # # # # # 448 | # # step 6: tie knots and krige # # 449 | 450 | library(sqldf) 451 | 452 | # # warning warning # # # # warning warning # # 453 | # alaska has a vast geography and highly skewed population centers 454 | # kriging functions might not converge. that's why there are other options ;) 455 | # # warning warning # # # # warning warning # # 456 | 457 | 458 | # how many knots should you make? 
# 459 | 460 | # knots are the computationally-intensive part of this process, 461 | # choose as many as your computer and your patience can handle. 462 | 463 | # you should aim for between 100 - 999 knots, 464 | # but numbers closer to 1,000 will overload smaller computers 465 | 466 | # you could let the `fields` package attempt to guess knots for you, 467 | # xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 100 )$design 468 | # but with census microdata, you've already got easy access to a relevant geographic grouping 469 | 470 | # the sqldf() function doesn't like `.` in data.frame object names 471 | sf1s <- sf1ak.101 472 | 473 | # exactamundo same transform operation as you saw previously on `x` 474 | sf1s <- transform( sf1s , intptlon = ifelse( intptlon > 0 , intptlon - 360 , intptlon ) ) 475 | 476 | # within each county x census tract 477 | # calculate the population-weighted mean of the coordinates 478 | ct.knots <- 479 | sqldf( 480 | "select 481 | county , tract , 482 | sum( pop100 ) as pop100 , 483 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 484 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 485 | from sf1s 486 | group by 487 | county , tract" 488 | ) 489 | # note: this screws up coordinates that cross the international date line 490 | # or the equator. in the united states, only alaska's aleutian islands do this 491 | # and we're mapping alaska, aren't we? good thing we fixed it, huh? 492 | 493 | gc() 494 | 495 | # interpolation option one # 496 | library(fields) 497 | 498 | krig.fit.gulf <- 499 | Krig( 500 | cbind( x$intptlon , x$intptlat ) , 501 | Y = x$gulf , 502 | weights = x$weight.gulf , 503 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 504 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 505 | # knots = xknots 506 | ) 507 | 508 | gc() 509 | 510 | krig.fit.vietnam <- 511 | Krig( 512 | cbind( x$intptlon , x$intptlat ) , 513 | x$vietnam , 514 | weights = x$weight.vietnam , 515 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 516 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 517 | # knots = xknots 518 | ) 519 | 520 | gc() 521 | 522 | krig.fit.other <- 523 | Krig( 524 | cbind( x$intptlon , x$intptlat ) , 525 | x$other , 526 | weights = x$weight.other , 527 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 528 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 529 | # knots = xknots 530 | ) 531 | 532 | gc() 533 | 534 | # that is: what is the (weighted) relationship between 535 | # your variable of interest (veteran service eras) and 536 | # the x/y points on a grid? 537 | 538 | # check this out! 539 | surface( krig.fit.gulf ) 540 | surface( krig.fit.vietnam ) 541 | surface( krig.fit.other ) 542 | # you're almost there! 
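# (optional sanity check -- a rough sketch of my own, not part of the original script:
#  feed the block-level points back into one of the kriging fits and see how closely the
#  smoothed surface tracks the puma-level inputs it was fit to.  `predict` on a Krig object
#  accepts a two-column longitude/latitude matrix, the same way it is used on the grid in
#  step 7 below.  `in.sample.gulf` is just an illustrative name.)
# in.sample.gulf <- predict( krig.fit.gulf , cbind( x$intptlon , x$intptlat ) )
# cor( in.sample.gulf , x$gulf )
# plot( x$gulf , in.sample.gulf )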
543 | 
544 | 
545 | # interpolation option two #
546 | library(mgcv)
547 | 
548 | 
549 | gam.gulf <-
550 | 	gam(
551 | 		gulf ~ s( intptlon , intptlat ) ,
552 | 		weights = weight.gulf ,
553 | 		data = x
554 | 	)
555 | 
556 | gam.vietnam <-
557 | 	gam(
558 | 		vietnam ~ s( intptlon , intptlat ) ,
559 | 		weights = weight.vietnam ,
560 | 		data = x
561 | 	)
562 | 
563 | gam.other <-
564 | 	gam(
565 | 		other ~ s( intptlon , intptlat ) ,
566 | 		weights = weight.other ,
567 | 		data = x
568 | 	)
569 | 
570 | # # end of step 6 # #
571 | # # # # # # # # # # #
572 | 
573 | 
574 | # # # # # # # # # # # # # # # # # # # #
575 | # # step 7: make a grid and predict # #
576 | 
577 | # use as fine of a grid as your computer can handle
578 | grid.length <- 750
579 | # # note: smaller grids will render faster
580 | # # (so they're better if you're just playing around)
581 | # # but larger grids will prevent your final plot from
582 | # # being too pixelated, even when zooming in
583 | 
584 | 
585 | x.range <- c( min( x$intptlon ) , max( x$intptlon ) )
586 | y.range <- c( min( x$intptlat ) , max( x$intptlat ) )
587 | 
588 | # add twenty percent on each side
589 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.2
590 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.2
591 | 
592 | x.range[ 1 ] <- x.range[ 1 ] - x.diff
593 | x.range[ 2 ] <- x.range[ 2 ] + x.diff
594 | y.range[ 1 ] <- y.range[ 1 ] - y.diff
595 | y.range[ 2 ] <- y.range[ 2 ] + y.diff
596 | 
597 | 
598 | grd <- krig.grd <- gam.grd <-
599 | 	expand.grid(
600 | 		intptlon = seq( x.range[ 1 ] , x.range[ 2 ] , length = grid.length ) ,
601 | 		intptlat = seq( y.range[ 1 ] , y.range[ 2 ] , length = grid.length )
602 | 	)
603 | 
604 | 
605 | # along your rectangular grid,
606 | # what are the predicted values of
607 | # each veteran era category
608 | krig.grd$gulf <- predict( krig.fit.gulf , krig.grd[ , 1:2 ] )
609 | 
610 | krig.grd$vietnam <- predict( krig.fit.vietnam , krig.grd[ , 1:2 ] )
611 | 
612 | krig.grd$other <- predict( krig.fit.other , krig.grd[ , 1:2 ] )
613 | 
614 | gam.grd$gulf <- predict( gam.gulf , gam.grd[ , 1:2 ] )
615 | 
616 | gam.grd$vietnam <- predict( gam.vietnam , gam.grd[ , 1:2 ] )
617 | 
618 | gam.grd$other <- predict( gam.other , gam.grd[ , 1:2 ] )
619 | 
620 | 
621 | 
622 | # remember that these values have been re-scaled
623 | # to show how disproportionate they are from the state-wide averages.
624 | # therefore, negative values are possible.
625 | sapply( krig.grd , summary )
626 | sapply( gam.grd , summary )
627 | 
628 | # what we're really hoping for is that
629 | # the overall mean averages out to zero
630 | sum( sapply( gam.grd , summary )[ 4 , 3:5 ] )
631 | 
632 | # in general, these predictions at each point should approximately sum to zero
633 | summary( rowSums( krig.grd[ , 3:5 ] ) )
634 | summary( rowSums( gam.grd[ , 3:5 ] ) )
635 | 
636 | # # end of step 7 # #
637 | # # # # # # # # # # #
638 | 
639 | 
640 | # # # # # # # # # # # # # # # # # # # # # #
641 | # # step 8: limit information and color # #
642 | 
643 | 
644 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
645 | # # # warning # # # warning # # # # # # warning # # # # # # warning # # # # # # warning # #
646 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
647 | # if your data are not binomial, then by mapping with a single image, you lose clarity    #
648 | # if you have three levels of information and you generate two maps, you can get an idea #
649 | # about the entire distribution of the variable. 
if you attempt encoding three levels or # 650 | # more into a single map, you will explode. just kidding rofl lmao but you will lose info # 651 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 652 | # # # warning # # # warning # # # # # # warning # # # # # # warning # # # # # # warning # # 653 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 654 | 655 | 656 | library(scales) 657 | 658 | # from among the three categories, find the maximum disproportionate share 659 | krig.grd$svccat <- c( 'gulf' , 'vietnam' , 'other' )[ apply( krig.grd[ , 3:5 ] , 1 , which.max ) ] 660 | 661 | # save only that max 662 | krig.grd$statistic <- apply( krig.grd[ , 3:5 ] , 1 , max ) 663 | 664 | # it's important to note that i've thrown out a lot of information here 665 | krig.grd <- krig.grd[ , c( 'intptlon' , 'intptlat' , 'statistic' , 'svccat' ) ] 666 | 667 | # do any points not make sense? 668 | summary( krig.grd$statistic ) 669 | 670 | # yup, the minimum is below zero. 671 | krig.grd$statistic <- pmax( 0 , krig.grd$statistic ) 672 | 673 | 674 | # from among the three categories, find the maximum disproportionate share 675 | gam.grd$svccat <- c( 'gulf' , 'vietnam' , 'other' )[ apply( gam.grd[ , 3:5 ] , 1 , which.max ) ] 676 | 677 | # save only that max 678 | gam.grd$statistic <- apply( gam.grd[ , 3:5 ] , 1 , max ) 679 | 680 | # it's important to note that i've thrown out a lot of information here 681 | gam.grd <- gam.grd[ , c( 'intptlon' , 'intptlat' , 'statistic' , 'svccat' ) ] 682 | 683 | # again, do any points not make sense? 684 | summary( gam.grd$statistic ) 685 | 686 | # another point below zero. 687 | gam.grd$statistic <- pmax( 0 , gam.grd$statistic ) 688 | 689 | # our complex sample survey-computed statistics rely on categories, 690 | # but the final map only shows the _highest_ disproportionate item 691 | # from each puma. for example, 692 | 693 | # puma 300 is slightly disproportionately more gulf veterans 694 | # and it's also near evenly-split between being 695 | # slightly disproportionately less vietnam vets and 696 | # slightly disproportionately less other era vets 697 | 698 | # puma 101 is disproportionately more gulf vets too, 699 | # but it has very heavily disproportionately fewer vietnam vets 700 | # and has close to state-average veterans from other eras. 701 | sas 702 | 703 | # only the "disproportionately more" share variable gets retained 704 | # in these predictions. all other information gets thrown away. 705 | # this is the nature of mapping categorical variables 706 | 707 | # if you are intent on showing a multi-color gradient with all information, 708 | # you can use the rgb() function, but fair warning: 709 | # the values mush together quickly and your map will probably look like ass. 710 | # i tried building color gradients to map multi-dimensional categorical values 711 | # like the multi-category values in 712 | ps 713 | # but on the color gradient, the red/green/blue values on the palette tend to mush together. 714 | # for example, on this plot right here.. 715 | plot( 1:5 , rep( 1 , 5 ) , cex = 3 , pch = 16 , col = mapply( rgb , ps[ , 2 ] , ps[ , 3 ] , ps[ , 4 ] ) ) 716 | # red is gulf veterans, green is vietnam veterans, blue is other veterans. 717 | # the colors end up just looking drab. 718 | 719 | # even when re-scaled.. 
720 | rsps <- apply( ps[ , 2:4 ] , 2 , rescale ) 721 | # ..the points with high relative rates in two categories (because they're the lowest in the third) 722 | # have a lot of mixture (puma 300) and are therefore indecipherable. what is the color brown here? 723 | plot( 1:5 , rep( 1 , 5 ) , cex = 3 , pch = 16 , col = mapply( rgb , rsps[ , 1 ] , rsps[ , 2 ] , rsps[ , 3 ] ) ) 724 | # high vietnam era and also high gulf era service. 725 | text( 1:5 , rep( 1.2 , 5 ) , ps[ , 1 ] ) 726 | # multi-dimensional categorical variable coloring is a nightmare. 727 | 728 | # you have to simplify it. 729 | # simplifying it means throwing out information. 730 | 731 | # now where were we? 732 | 733 | 734 | library(RColorBrewer) 735 | 736 | # draw three gradients 737 | tg <- 738 | lapply( 739 | brewer.pal( 3 , 'Set1' ) , 740 | function( z ) colorRampPalette( c( 'white' , z ) )( 101 ) 741 | ) 742 | 743 | # check out each of these three colors, mapped from opaque to intense. 744 | plot( rep( 0:100 , 3 ) , rep( c( -1 , 0 , 1 ) , each = 101 ) , col = unlist( tg ) , pch = 16 , cex = 3 ) 745 | 746 | # draw an alternate three gradients 747 | # that start at ~20% ( that is: 25 / 125 ) 748 | # and also use a different palette from colorbrewer2.org 749 | tag <- 750 | lapply( 751 | brewer.pal( 3 , 'Dark2' ) , 752 | function( z ) colorRampPalette( c( 'white' , z ) )( 125 )[ 25:125 ] 753 | ) 754 | 755 | # check out each of these three colors, mapped from opaque to intense. 756 | plot( rep( 0:100 , 3 ) , rep( c( -1 , 0 , 1 ) , each = 101 ) , col = unlist( tag ) , pch = 16 , cex = 3 ) 757 | 758 | 759 | # # rescale both of the interpolated grids 760 | krig.grd$statistic <- krig.grd$statistic * ( 1 / max( krig.grd$statistic ) ) 761 | gam.grd$statistic <- gam.grd$statistic * ( 1 / max( gam.grd$statistic ) ) 762 | # note that the re-scaling gets done across all categories, 763 | # and not individually within each category. 764 | 765 | # add the hex color identifier 766 | krig.grd$color.value <- 767 | ifelse( krig.grd$svccat == 'gulf' , tg[[1]][ round( krig.grd$statistic * 100 ) ] , 768 | ifelse( krig.grd$svccat == 'vietnam' , tg[[2]][ round( krig.grd$statistic * 100) ] , 769 | ifelse( krig.grd$svccat == 'other' , tg[[3]][ round( krig.grd$statistic * 100 ) ] , 770 | NA ) ) ) 771 | 772 | # awwwwwwww yeah, something's happening now. 773 | plot( krig.grd$intptlon , krig.grd$intptlat , col = krig.grd$color.value , pch = 16 , cex = 3 ) 774 | 775 | # add the alternate hex color identifier 776 | krig.grd$alt.color <- 777 | ifelse( krig.grd$svccat == 'gulf' , tag[[1]][ round( krig.grd$statistic * 100 ) ] , 778 | ifelse( krig.grd$svccat == 'vietnam' , tag[[2]][ round( krig.grd$statistic * 100) ] , 779 | ifelse( krig.grd$svccat == 'other' , tag[[3]][ round( krig.grd$statistic * 100 ) ] , 780 | NA ) ) ) 781 | 782 | # that looks a bit better to me 783 | plot( krig.grd$intptlon , krig.grd$intptlat , col = krig.grd$alt.color , pch = 16 , cex = 3 ) 784 | 785 | 786 | # lower-bound the alternate color to remove the white lines 787 | krig.grd$bound.color <- 788 | ifelse( krig.grd$svccat == 'gulf' , tag[[1]][ pmax( 5 , round( krig.grd$statistic * 100 ) ) ] , 789 | ifelse( krig.grd$svccat == 'vietnam' , tag[[2]][ pmax( 5 , round( krig.grd$statistic * 100) ) ] , 790 | ifelse( krig.grd$svccat == 'other' , tag[[3]][ pmax( 5 , round( krig.grd$statistic * 100 ) ) ] , 791 | NA ) ) ) 792 | 793 | # that's smoothing by hand for you. 
794 | plot( krig.grd$intptlon , krig.grd$intptlat , col = krig.grd$bound.color , pch = 16 , cex = 3 ) 795 | 796 | 797 | # put that color band on the `gam.grd` data.frame as well 798 | gam.grd$bound.color <- 799 | ifelse( gam.grd$svccat == 'gulf' , tag[[1]][ pmax( 5 , round( gam.grd$statistic * 100 ) ) ] , 800 | ifelse( gam.grd$svccat == 'vietnam' , tag[[2]][ pmax( 5 , round( gam.grd$statistic * 100) ) ] , 801 | ifelse( gam.grd$svccat == 'other' , tag[[3]][ pmax( 5 , round( gam.grd$statistic * 100 ) ) ] , 802 | NA ) ) ) 803 | 804 | 805 | # # end of step 8 # # 806 | # # # # # # # # # # # 807 | 808 | 809 | # # # # # # # # # # # # # # # # # # # # # 810 | # # step 9: ggplot and choose options # # 811 | 812 | library(ggplot2) 813 | library(mapproj) 814 | library(scales) 815 | 816 | 817 | # initiate the krige-based plot 818 | krig.grd$color.column <- as.factor( krig.grd$bound.color ) 819 | 820 | krg.plot <- 821 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) + 822 | geom_point( shape = 15 , colour = krig.grd$color.column ) + 823 | scale_fill_manual( values = unique( krig.grd$bound.color ) ) 824 | 825 | 826 | # initiate the gam-based plot 827 | gam.grd$color.column <- as.factor( gam.grd$bound.color ) 828 | 829 | gam.plot <- 830 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) + 831 | geom_point( shape = 15 , colour = gam.grd$color.column ) + 832 | scale_fill_manual( values = unique( gam.grd$bound.color ) ) 833 | 834 | # view both grids! 835 | krg.plot 836 | gam.plot 837 | 838 | 839 | # initiate the entire plot 840 | the.plot <- 841 | 842 | # choose only one of the two interpolation grids 843 | krg.plot + 844 | # gam.plot + 845 | 846 | # blank out the legend and axis labels 847 | theme( 848 | legend.position = "none" , 849 | axis.title.x = element_blank() , 850 | axis.title.y = element_blank() 851 | ) + 852 | 853 | xlab( "" ) + ylab( "" ) + 854 | 855 | # force the x and y axis limits at the shape of the city and don't do anything special for off-map values 856 | scale_x_continuous( limits = c( -191 , -127 ) , breaks = NULL , oob = squish ) + 857 | # since we're going to add lots of surrounding-area detail! 858 | scale_y_continuous( limits = c( 50 , 73 ) , breaks = NULL , oob = squish ) + 859 | 860 | theme( 861 | panel.grid.major = element_blank(), 862 | panel.grid.minor = element_blank(), 863 | panel.background = element_blank(), 864 | panel.border = element_blank(), 865 | axis.ticks = element_blank() 866 | ) 867 | 868 | # print the plot to the screen 869 | the.plot 870 | # this is the bottom layer. 
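# quick illustration of the oob = squish choice used above
# (an aside: scales::squish clamps out-of-bounds values to the limits
# instead of converting them to NA, so grid points just past the axis
# limits still get drawn at the plot edge rather than disappearing)
library(scales)
squish( c( -200 , -150 , -100 ) , range = c( -191 , -127 ) )
# returns -191 -150 -127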
871 | 872 | 873 | # initiate an aleutian islands-focused wrap-around function 874 | s360 <- function( z ){ z[ z$long > 0 , 'long' ] <- z[ z$long > 0 , 'long' ] - 360 ; z } 875 | 876 | 877 | # # alaskan state borders # # 878 | 879 | # convert the alaskan borders to longlat, 880 | # prepare for ggplot2 with `fortify` 881 | # wrap edge points around 882 | ab <- s360( fortify( spTransform( alaska.borders , CRS( "+proj=longlat" ) ) ) ) 883 | 884 | # store this information in a layer 885 | state.border.layer <- geom_path( data = ab , aes( x = long , y = lat , group = group ) , colour = 'darkgrey' ) 886 | 887 | # plot the result 888 | the.plot + state.border.layer 889 | 890 | 891 | # # alaskan main roads # # 892 | 893 | # convert the alaskan borders to longlat, 894 | # prepare for ggplot2 with `fortify` 895 | # wrap edge points around 896 | akr <- s360( fortify( spTransform( asf[[3]] , CRS( "+proj=longlat" ) ) ) ) 897 | 898 | # store this information in a layer 899 | state.roads.layer <- geom_path( data = akr , aes( x = long , y = lat , group=group ) , colour = 'darkred' ) 900 | 901 | # plot the result 902 | the.plot + state.border.layer + state.roads.layer 903 | 904 | # # end of step 9 # # 905 | # # # # # # # # # # # 906 | 907 | 908 | # # # # # # # # # # # # # # # # # # # # # 909 | # # step 10: project, blank, and save # # 910 | 911 | library(ggplot2) 912 | library(scales) 913 | library(raster) 914 | library(plyr) 915 | library(rgeos) 916 | 917 | 918 | # exclude outer alaska if you hate the wilderness or something 919 | the.plot + state.border.layer + coord_cartesian( xlim = c( -155 , max( x$intptlon ) ) , ylim = c( min( x$intptlat ) , 70 ) ) 920 | 921 | # distort the map with simple latitude/longitude scaling 922 | the.plot + state.border.layer + coord_fixed( 2.5 ) 923 | 924 | # this looks crappy, who knows what it is 925 | the.plot + state.border.layer + coord_equal() 926 | 927 | # check out a bunch of other options # 928 | the.plot + state.border.layer + coord_map( project = "cylequalarea" , mean( x$intptlat ) ) 929 | 930 | # here's the one that makes the most sense for alaska 931 | the.plot + state.border.layer + coord_map( project = "conic" , mean( x$intptlat ) , orientation = c( 90 , 0 , -141 ) ) 932 | 933 | # see ?mapproject and the ?coord_* functions for a zillion alternatives 934 | 935 | # store this projection, but not the state border 936 | the.plot <- the.plot + coord_map( project = "conic" , mean( x$intptlat ) , orientation = c( 90 , 0 , -141 ) ) 937 | # into `the.plot` 938 | 939 | 940 | # force the difference shapefile's projection 941 | proj4string( ak.shp.diff ) <- "+init=epsg:2163" 942 | 943 | # initiate the outside blanking layer 944 | outside <- s360( fortify( spTransform( ak.shp.diff , CRS( "+proj=longlat" ) ) ) ) 945 | 946 | # fix islands piecing together 947 | outside2 <- ddply( outside , .( piece ) , function( x ) rbind( x , outside[ 1 , ] ) ) 948 | 949 | # convert this fortified object to a ggplot layer 950 | outside.layer <- geom_polygon( data = outside2 , aes( x = long , y = lat , group = id ) , fill = 'white' ) 951 | 952 | # plot this -- the layer doesn't work, does it? 953 | the.plot + outside.layer 954 | 955 | # five points need to change so we have a real bounding box. 
956 | subset( outside , lat < 45 | lat > 75 | long < -190 | long > -125 ) 957 | 958 | # move all of them counter-clockwise by hand 959 | outside[ outside$order %in% c( 1 , 5 ) , 'long' ] <- -116.6568 960 | # outside[ outside$order %in% c( 1 , 5 ) , 'lat' ] <- 20 961 | 962 | # outside[ outside$order %in% 4 , 'long' ] <- -220 963 | outside[ outside$order %in% 4 , 'lat' ] <- 37.56767 964 | 965 | outside[ outside$order %in% 3 , 'long' ] <- -195.4295 966 | # outside[ outside$order %in% 3 , 'lat' ] <- 100 967 | 968 | # outside[ outside$order %in% 2 , 'long' ] <- -100 969 | outside[ outside$order %in% 2 , 'lat' ] <- 79.36447 970 | 971 | 972 | # fix islands piecing together 973 | outside2 <- ddply( outside , .( piece ) , function( x ) rbind( x , outside[ 1 , ] ) ) 974 | 975 | # convert this fortified object to a ggplot layer 976 | outside.layer <- geom_polygon( data = outside2 , aes( x = long , y = lat , group = id ) , fill = 'white' ) 977 | 978 | # plot this. 979 | the.plot + outside.layer 980 | # that's not so bad, i guess. 981 | 982 | # i don't care for the state border layer, 983 | # but if you want the state border layer, 984 | # use this save line: 985 | final.plot <- the.plot + outside.layer + state.border.layer 986 | # otherwise use this save line: 987 | # final.plot <- the.plot + outside.layer 988 | # you can airbrush the outside blue border 989 | # in microsoft paint or something 990 | # if you want, right? like a boss. 991 | 992 | 993 | # use cairo-png as your bitmap type 994 | options( bitmapType = "cairo" ) 995 | 996 | # uncomment this block to save the file to your current working directory 997 | 998 | # ggsave( 999 | # "2013 alaskan veteran service eras.png" , 1000 | # plot = final.plot , 1001 | # scale = 3 1002 | # ) 1003 | 1004 | # happy? 1005 | 1006 | # # end of step ten # # 1007 | # # # # # # # # # # # # 1008 | -------------------------------------------------------------------------------- /how to map the consumer expenditure survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # consumer expenditure survey 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # this is a good starting point, probably 20 | # the simplest map you can make with survey data 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # state, some metropolitan statistical areas 28 | 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/consumer%20expenditure%20survey%20%28ce%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/Consumer%20Expenditure%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # transportation expenditure as a share of 
total expenditure 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # major downloads required. let this run overnight. 56 | 57 | 58 | # # # # # # # # # # # # # # # # # # # # # 59 | # # step 1: load the survey microdata # # 60 | 61 | library(downloader) 62 | 63 | # download the consumer expenditure survey microdata onto the local disk 64 | source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/Consumer%20Expenditure%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 65 | 66 | # # end of step 1 # # 67 | # # # # # # # # # # # 68 | 69 | 70 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 71 | # # step 2: download and import necessary geographic crosswalks # # 72 | 73 | 74 | # options( encoding = "windows-1252" ) # # only macintosh and *nix users need this line 75 | 76 | 77 | library(downloader) 78 | library(sqldf) 79 | 80 | # load the download_cached and related functions 81 | # to prevent re-downloading of files once they've been downloaded. 82 | source_url( 83 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 84 | prompt = FALSE , 85 | echo = FALSE 86 | ) 87 | 88 | 89 | # create a temporary file containing the census bureau's 90 | # 2010 census summary file #1, then download the file. 91 | sf1.tf <- tempfile() 92 | 93 | # initiate an empty data.frame object 94 | # to store all summary file blocks 95 | sf <- NULL 96 | 97 | # download the census bureau's 2010 summary file one # 98 | 99 | # this takes a long time. 100 | # depending on your internet speed, 101 | # you might need to let this loop run overnight. 102 | 103 | # but this code will cache the downloads onto your local disk 104 | # when you download these files once, 105 | # they'll never need to be downloaded again. 
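# for context on the loop below (a brief aside): base r ships with the
# `state.name` and `state.abb` vectors covering the fifty states,
# which is what builds each state's ftp path, and the district of columbia
# gets special-cased as the fifty-first iteration
head( cbind( state.name , state.abb ) )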
106 | 107 | # loop through every state and dc 108 | for ( state.number in 1:51 ){ 109 | 110 | # after downloading all fifty states, get dc as well 111 | if( state.number == 51 ){ 112 | sn <- "District_of_Columbia" 113 | sa <- "dc" 114 | } else { 115 | sn <- gsub( " " , "_" , state.name[ state.number ] ) 116 | sa <- tolower( state.abb[ state.number ] ) 117 | } 118 | 119 | # create a single-element character string containing the ftp path 120 | ftp.loc <- 121 | paste0( 122 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/" , 123 | sn , 124 | "/" , 125 | sa , 126 | "2010.sf1.zip" 127 | ) 128 | 129 | # download the current state's summary file 130 | download_cached( ftp.loc , sf1.tf , mode = 'wb' ) 131 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 132 | 133 | # unzip the summary file #1 files 134 | sf1.uz <- unzip( sf1.tf , exdir = tempdir() ) 135 | 136 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 137 | sf1 <- read.fwf( sf1.uz[ grep( "geo2010" , sf1.uz ) ] , c( -8 , 3 , -14 , 1 , -1 , 2 , 3 , -22 , 6 , 1 , 4 , -47 , 5 , 2 , -5 , 3 , -191 , 9 , -9 , 11 , 12 ) ) 138 | 139 | # add columns names matching the census bureau, so it's easy to read 140 | names( sf1 ) <- c( "sumlev" , "region" , "state" , "county" , "tract" , "blkgrp" , "block" , "cbsa" , "cbsasc" , "csa" , "pop100" , "intptlat" , "intptlon" ) 141 | 142 | # summary level 101 has metro areas, urban/rural status, and census blocks 143 | sf101 <- subset( sf1 , sumlev == "101" ) 144 | 145 | # within each census tract x cbsa/csa combo, 146 | # calculate the population-weighted mean of the coordinates 147 | sfs <- 148 | sqldf( 149 | "select 150 | region , state , county , tract , cbsa , cbsasc , csa , 151 | count(*) as census_blocks , 152 | sum( pop100 ) as pop100 , 153 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 154 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 155 | from sf101 156 | group by 157 | region , state , county , tract , cbsa , cbsasc , csa" ) 158 | # note: this screws up coordinates that cross the international date line 159 | # or the equator. in the united states, only alaska's aleutian islands do this 160 | # and those geographies will be thrown out later. so it doesn't matter. 161 | 162 | # the above consolidation step isn't necessary if you have a huge computer 163 | # and a lot of time.. but it makes all of the kriging and rendering computations 164 | # work much faster, and mapping at the census tract- versus census block-level 165 | # really doesn't make much of a damn difference. 166 | 167 | # stack these blocks in with all the other states 168 | sf <- rbind( sf , sfs ) 169 | 170 | # remove the single-state data.frame objects and clear up RAM 171 | rm( sf101 , sf1 , sfs ) ; gc() 172 | 173 | # remove the unzipped files from your local disk 174 | file.remove( sf1.uz , sf1.tf ) 175 | 176 | } 177 | 178 | # one record per census block in every state. see? same number. 179 | tapply( sf$census_blocks , sf$state , sum ) 180 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 181 | 182 | # and guess what? the population by state matches as well. 
183 | tapply( sf$pop100 , sf$state , sum ) 184 | # http://en.wikipedia.org/wiki/2010_United_States_Census#State_rankings 185 | 186 | # remove columns you actually don't need 187 | sf <- sf[ , !( names( sf ) %in% c( 'sumlev' , 'block' , 'county' , 'region' , 'tract' , 'blkgrp' , 'cbsasc' ) ) ] 188 | 189 | # remove records with zero population 190 | sf <- subset( sf , pop100 > 0 ) 191 | 192 | # clear up RAM 193 | gc() 194 | 195 | 196 | # so now we have a data.frame object with 197 | # one record per census tract, 198 | # and also with each of the geography-levels 199 | # that match the consumer expenditure survey 200 | head( sf ) 201 | 202 | # and guess what? 203 | # we've now got the census 2010 weighted populations (field pop100) 204 | # and also each census tract's centroid latitude & longitude (fields intptlat + intptlon) 205 | 206 | # add the consumer expenditure survey results' 207 | # geographic identifiers to the census tract data.frame 208 | 209 | # # align psu variables # # 210 | 211 | # note: the primary sampling units available in the ces microdata 212 | # http://www.bls.gov/cex/2013/csxintvwdata.pdf#page=9 213 | # do not perfectly map to combined statistical areas or 214 | # core-based statistical areas, so 215 | # (1) match the geographies that might be matchable 216 | # (2) combine los angeles in both data.frame objects 217 | # (3) make all other records 9999 218 | sf <- 219 | transform( 220 | sf , 221 | 222 | psu = 223 | # new york, ny 224 | ifelse( state %in% 36 & csa %in% 408 , 1109 , 225 | # new york, ct 226 | ifelse( state %in% 9 & csa %in% 408 , 1110 , 227 | # new york, nj 228 | ifelse( state %in% 34 & csa %in% 408 , 1111 , 229 | # philadelphia 230 | ifelse( csa %in% 428 , 1102 , 231 | # boston 232 | ifelse( csa %in% 148 , 1103 , 233 | # chicago 234 | ifelse( csa %in% 176 , 1207 , 235 | # detroit 236 | ifelse( csa %in% 220 , 1208 , 237 | # cleveland 238 | ifelse( csa %in% 184 , 1210 , 239 | # minneapolis 240 | ifelse( csa %in% 378 , 1211 , 241 | # washington dc 242 | ifelse( cbsa %in% 47900 , 1312 , 243 | # baltimore 244 | ifelse( cbsa %in% 12580 , 1313 , 245 | # dallas 246 | ifelse( csa %in% 206 , 1316 , 247 | # houston 248 | ifelse( csa %in% 288 , 1318 , 249 | # atlanta 250 | ifelse( csa %in% 122 , 1319 , 251 | # miami 252 | ifelse( csa %in% 370 , 1320 , 253 | # los angeles - orange 254 | ifelse( csa %in% 348 , 1000 , 255 | # los angeles suburbs 256 | ifelse( csa %in% 348 , 1000 , 257 | # san francisco 258 | ifelse( csa %in% 488 , 1422 , 259 | # seattle 260 | ifelse( csa %in% 500 , 1423 , 261 | # san diego 262 | ifelse( cbsa %in% 41740 , 1424 , 263 | # phoenix 264 | ifelse( cbsa %in% 38060 , 1429 , 265 | 266 | 9999 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 267 | 268 | ) 269 | 270 | # # align metro status variables # # 271 | sf <- transform( sf , smsastat = ifelse( cbsa %in% 99999 , 2 , 1 ) ) 272 | 273 | # which geographies are available amongst all census blocks 274 | sf.available.geographies <- unique( sf[ , c( 'state' , 'psu' , 'smsastat' ) ] ) 275 | # note: this really should be the universe, but it's not because 276 | # the consumer expenditure survey's psus don't appear to perfectly map to census areas 277 | 278 | # # end of step 2 # # 279 | # # # # # # # # # # # 280 | 281 | 282 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 283 | # # step 3: prepare your analysis of interest at the smallest geography allowed # # 284 | 285 | library(survey) 286 | library(plyr) 287 | library(stringr) 288 | 289 | # following the analysis 
examples in the r code repository -- 290 | # # https://github.com/ajdamico/asdfree/blob/master/Consumer%20Expenditure%20Survey/2011%20fmly%20intrvw%20-%20analysis%20examples.R 291 | # -- calculate the transportation share of total expenditure at the smallest available geographic area 292 | 293 | # load all five quarters of 2013 microdata 294 | load( "./2013/intrvw/fmli131x.rda" ) 295 | load( "./2013/intrvw/fmli132.rda" ) 296 | load( "./2013/intrvw/fmli133.rda" ) 297 | load( "./2013/intrvw/fmli134.rda" ) 298 | load( "./2013/intrvw/fmli141.rda" ) 299 | 300 | # load all five quarters of 2012 microdata 301 | load( "./2012/intrvw/fmli121x.rda" ) 302 | load( "./2012/intrvw/fmli122.rda" ) 303 | load( "./2012/intrvw/fmli123.rda" ) 304 | load( "./2012/intrvw/fmli124.rda" ) 305 | load( "./2012/intrvw/fmli131.rda" ) 306 | 307 | # stack all ten quarters 308 | fmly <- rbind.fill( fmli121x , fmli122 , fmli123 , fmli124 , fmli131 , fmli131x , fmli132 , fmli133 , fmli134 , fmli141 ) 309 | 310 | # before anything else, the los angeles suburban split isn't possible on the census side 311 | fmly <- 312 | transform( 313 | fmly , 314 | psu = 315 | # align with los angeles area in the `sf` object 316 | ifelse( psu %in% 1419:1420 , 1000 , 317 | ifelse( psu == '' , 9999 , 318 | as.numeric( psu ) ) ) 319 | ) 320 | 321 | # coerce states to match `sf` as well 322 | fmly$state <- as.numeric( fmly$state ) 323 | 324 | # set blanks to 99s 325 | fmly[ is.na( fmly$state ) , 'state' ] <- 99 326 | 327 | # extract available geographies 328 | fmly.available.geographies <- unique( fmly[ , c( 'state' , 'psu' , 'smsastat' ) ] ) 329 | 330 | # merge this with the sf file's available geographies 331 | ag <- merge( sf.available.geographies , fmly.available.geographies ) 332 | 333 | # create a flag with geographies to keep 334 | ag$keep <- 1 335 | 336 | # merge the available geographies back on 337 | fmly <- merge( fmly , ag , all = TRUE ) 338 | 339 | # anyone with a missing flag needs their geography blanked out 340 | fmly[ is.na( fmly$keep ) , 'state' ] <- 99 341 | fmly[ is.na( fmly$keep ) , 'psu' ] <- 9999 342 | 343 | # remove the flag 344 | fmly$keep <- NULL 345 | 346 | # create a character vector containing 45 variable names (wtrep01, wtrep02, ... 
wtrep44 and finlwt21) 347 | wtrep <- c( paste0( "wtrep" , str_pad( 1:44 , 2 , pad = "0" ) ) , "finlwt21" ) 348 | 349 | # immediately loop through each weight column (stored in the wtrep vector) 350 | for ( i in wtrep ){ 351 | # overwrite all missing values (NA) with zeroes 352 | fmly[ is.na( fmly[ , i ] ) , i ] <- 0 353 | 354 | # since we've pooled two years, divide all weights by two 355 | fmly[ , i ] <- fmly[ , i ] / 2 356 | } 357 | 358 | # create a new variable in the fmly data table called 'totalexp' 359 | # that contains the sum of the total expenditure from the current and previous quarters 360 | fmly$totalexp <- rowSums( fmly[ , c( "totexppq" , "totexpcq" ) ] , na.rm = TRUE ) 361 | 362 | # immediately convert missing values (NA) to zeroes 363 | fmly[ is.na( fmly$totalexp ) , "totalexp" ] <- 0 364 | 365 | # same for transportation 366 | fmly$transexp <- rowSums( fmly[ , c( "transpq" , "transcq" ) ] , na.rm = TRUE ) 367 | fmly[ is.na( fmly$transexp ) , "transexp" ] <- 0 368 | 369 | 370 | # turn on replicate-weighted mean squared errors 371 | options( survey.replicates.mse = TRUE ) 372 | # this matches the official census bureau published methods 373 | 374 | # construct a replicate-weighted survey design object 375 | fmly.design <- 376 | svrepdesign( 377 | repweights = "wtrep[0-9]+" , 378 | weights = ~finlwt21 , 379 | data = fmly 380 | ) 381 | 382 | 383 | # the family tables are no longer necessary 384 | rm( fmly , fmli121x , fmli122 , fmli123 , fmli124 , fmli131 , fmli131x , fmli132 , fmli133 , fmli134 , fmli141 ) ; gc() 385 | # remove them and clear up RAM 386 | 387 | 388 | # calculate the 2012-2013 nationwide ratio of transportation spending as a share of total spending 389 | svyratio( ~ transexp , ~ totalexp , fmly.design ) 390 | 391 | # note: this is almost the same number as the bls-published 2011 share: 392 | # http://www.bls.gov/cex/2011/share/cusize.pdf 393 | 394 | # the smallest geography reasonably extracted 395 | # from this survey microdata set will be 396 | # state + psu + metro status all combined 397 | full.table <- data.frame( svytable( ~ state + psu + smsastat , fmly.design ) ) 398 | # this crosstabulation includes too many zeroes, 399 | # so store the result in a data.frame object 400 | # and only print the non-zero records 401 | subset( full.table , Freq > 0 ) 402 | 403 | # simply use both of those geographies in the by= argument 404 | # of the `svyby` command, and re-calculate the 405 | # transportation expenditure shares 406 | smallest.area.statistics <- 407 | svyby( 408 | ~ transexp , 409 | denominator = ~ totalexp , 410 | by = ~ state + psu + smsastat , 411 | fmly.design , 412 | svyratio 413 | ) 414 | # this is the same command as the nationwide calculation above, 415 | # except these results have been broken into smaller areas. 416 | 417 | # these are the statistics to be mapped 418 | print( smallest.area.statistics ) 419 | # the standard errors are a measure of precision, 420 | # their inverse will serve as the mapping weights 421 | 422 | # make this object easier to type.. 423 | sas <- smallest.area.statistics 424 | 425 | # ..and also easier to read 426 | names( sas )[ names( sas ) == 'transexp/totalexp' ] <- 'share' 427 | names( sas )[ names( sas ) == 'se.transexp/totalexp' ] <- 'se' 428 | 429 | # remove objects you no longer need.. 
430 | rm( fmly.design ) ; gc() 431 | # ..and clear up RAM 432 | 433 | # # end of step 3 # # 434 | # # # # # # # # # # # 435 | 436 | 437 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 438 | # # step 4: merge the results of your survey analysis with the small-area geography # # 439 | 440 | # save a continental states copy of the `sf` table for later 441 | sfsave <- sf[ !( sf$state %in% c( 2 , 15 ) ), c( 'state' , 'csa' , 'cbsa' , 'pop100' , 'intptlat' , 'intptlon' ) ] 442 | # trust me, you'll need this. 443 | 444 | # merge the available geographies on to the census tract file too 445 | sf <- merge( sf , ag , all = TRUE ) 446 | 447 | # note that alaska and hawaii need to be manually removed, 448 | # but some of the geographies in these areas will be collapsed 449 | # into the 99/9999 categories. so pre-collapse, identify them. 450 | sf$akhi <- sf$state %in% c( 2 , 15 ) 451 | 452 | # anyone with a missing flag needs their geography blanked out 453 | sf[ is.na( sf$keep ) , 'state' ] <- 99 454 | sf[ is.na( sf$keep ) , 'psu' ] <- 9999 455 | 456 | # remove the flag 457 | sf$keep <- NULL 458 | 459 | # integers are overflowing 460 | sf$pop100 <- as.numeric( sf$pop100 ) 461 | 462 | # continue being as sparse as possible. remove columns you no longer need. 463 | sf <- sf[ , ( names( sf ) %in% c( 'state' , 'psu' , 'smsastat' , 'pop100' , 'intptlat' , 'intptlon' , 'akhi' ) ) ] 464 | 465 | # clear up RAM 466 | gc() 467 | 468 | # confirm that we've created all possible geographies correctly. 469 | 470 | # the number of records in our small area statistics.. 471 | sas.row <- nrow( sas ) 472 | 473 | # ..should equal the number of unique-match-merged records.. 474 | mrow <- nrow( merge( unique( sf[ , c( 'state' , 'psu' , 'smsastat' ) ] ) , sas ) ) 475 | 476 | # ..and it does/they do. 477 | stopifnot( sas.row == mrow ) 478 | 479 | # now the census tract-level nationwide census data *could* merge if you wanted it to. 480 | 481 | 482 | # but you don't. yet. 483 | 484 | 485 | # the standard error (the `se` field) is a measure of precision. 486 | print( sas ) 487 | # the smaller the standard error, the more confident you should be 488 | # that the estimate at a particular geography is correct. 489 | 490 | 491 | # so invert it. you heard me. invert it. 492 | sas$invse <- 1 / sas$se 493 | # a smaller standard error indicates more precision. 494 | 495 | # for our purposes, precision can be considered weight! # 496 | 497 | # now we've got the weight that we should give each of our estimates # 498 | 499 | # distribute that weight across all census tracts # 500 | 501 | 502 | # aggregate the 2010 census block populations to the geographies that you have. 503 | popsum <- aggregate( sf$pop100 , by = ( sf[ , c( 'state' , 'psu' , 'smsastat' ) ] ) , sum ) 504 | 505 | # make the column name meaningful 506 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 507 | 508 | # merge the popsum onto the sasfile 509 | sas <- merge( sas , popsum ) 510 | 511 | # now. 
merge 512 | # the transportation share of total expenditure (the variable of interest) 513 | # the inverted standard error (the total weight of the broad geography) 514 | # the population sum (the total population of all census tracts that are part of that geography) 515 | 516 | x <- merge( sf , sas ) 517 | 518 | # confirm no record loss 519 | stopifnot( nrow( x ) == nrow( sf ) ) 520 | 521 | # clear up RAM 522 | rm( sf ) ; gc() 523 | 524 | # (this is the fun part) 525 | # calculate the weight of each census tract 526 | x$weight <- x$invse * ( x$pop100 / x$popsum ) 527 | 528 | # note that weight of all census tracts put together 529 | # sums to the `invse` on the original analysis file 530 | stopifnot( all.equal( sum( x$weight ) , sum( sas$invse ) ) ) 531 | 532 | # scale all weights so that they average to one 533 | x$weight <- x$weight / mean( x$weight ) 534 | 535 | 536 | # now that all weights have been computed, 537 | # remove alaska and hawaii 538 | x <- subset( x , !( akhi ) ) 539 | # note that those states need to be included up until this point 540 | # otherwise their populations won't scoop up their 541 | # respective shares of any multi-state statistics 542 | 543 | 544 | # you're done preparing your data. 545 | # keep only the columns you need. 546 | x <- x[ , c( 'share' , 'weight' , 'intptlat' , 'intptlon' ) ] 547 | 548 | # # end of step 4 # # 549 | # # # # # # # # # # # 550 | 551 | 552 | # # # # # # # # # # # # # # # # # # # # # # # 553 | # # step 5: decide on your map parameters # # 554 | 555 | library(ggplot2) 556 | library(scales) 557 | library(mapproj) 558 | library(RColorBrewer) 559 | 560 | # before you ever touch surface smoothing or kriging, 561 | # make some decisions about how you generally want 562 | # your map to look: the projection and coloring 563 | 564 | # the options below simply use hadley wickham's ggplot2 565 | # with the census tract-level transportation spending shares and centroids 566 | 567 | 568 | # initiate the simple map 569 | us.map <- 570 | qplot( 571 | intptlon , 572 | intptlat , 573 | data = x , 574 | colour = share , 575 | xlab = NULL , 576 | ylab = NULL 577 | ) 578 | 579 | 580 | # remove all map crap. 581 | us.map <- 582 | us.map + 583 | 584 | scale_x_continuous( breaks = NULL ) + 585 | 586 | scale_y_continuous( breaks = NULL ) + 587 | 588 | theme( 589 | legend.position = "none" , 590 | panel.grid.major = element_blank(), 591 | panel.grid.minor = element_blank(), 592 | panel.background = element_blank(), 593 | panel.border = element_blank(), 594 | axis.ticks = element_blank() 595 | ) 596 | 597 | 598 | # print the map without any projection 599 | us.map 600 | 601 | # print the map with an albers projection. 602 | us.map + coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 603 | # see ?mapproject for a zillion alternatives 604 | 605 | # if you like that projection, store it in the map object. 606 | us.map <- 607 | us.map + 608 | coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 609 | 610 | 611 | # check out some purty colors. 
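# (aside: if you'd like to browse what's available before committing to a palette,
# RColorBrewer ships a lookup table of every palette plus a viewer function)
library(RColorBrewer)
head( brewer.pal.info )
display.brewer.all()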
612 | 613 | # from http://colorbrewer2.org/ 614 | 615 | # three sequential color schemes 616 | Greys.3.p <- colorRampPalette( rev( brewer.pal( 3 , "Greys" ) ) ) 617 | YlGnBu.3.p <- colorRampPalette( rev( brewer.pal( 3 , "YlGnBu" ) ) ) 618 | YlGnBu.9.p <- colorRampPalette( rev( brewer.pal( 9 , "YlGnBu" ) ) ) 619 | 620 | # three diverging color schemes 621 | PRGn.11.p <- colorRampPalette( rev( brewer.pal( 11 , "PRGn" ) ) ) 622 | RdYlBu.3.p <- colorRampPalette( rev( brewer.pal( 3 , "RdYlBu" ) ) ) 623 | RdYlBu.11.p <- colorRampPalette( rev( brewer.pal( 11 , "RdYlBu" ) ) ) 624 | 625 | # print all six 626 | us.map + scale_colour_gradientn( colours = Greys.3.p( 100 ) ) 627 | us.map + scale_colour_gradientn( colours = YlGnBu.3.p( 100 ) ) 628 | us.map + scale_colour_gradientn( colours = YlGnBu.9.p( 100 ) ) 629 | 630 | us.map + scale_colour_gradientn( colours = PRGn.11.p( 100 ) ) 631 | us.map + scale_colour_gradientn( colours = RdYlBu.3.p( 100 ) ) 632 | us.map + scale_colour_gradientn( colours = RdYlBu.11.p( 100 ) ) 633 | 634 | # clear up RAM 635 | rm( us.map ) ; gc() 636 | 637 | # # end of step 5 # # 638 | # # # # # # # # # # # 639 | 640 | 641 | # # # # # # # # # # # # 642 | # # step 6: outline # # 643 | 644 | library(maptools) 645 | library(raster) 646 | library(rgeos) 647 | library(stringr) 648 | library(plyr) 649 | library(ggplot2) 650 | 651 | shpus.tf <- tempfile() 652 | 653 | # use the census bureau's cartographic boundary files 654 | # instead of the regular tiger shapefiles 655 | # unless you want to display transportation shares in the ocean. 656 | 657 | download_cached( 658 | "http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_state_500k.zip" , 659 | shpus.tf , 660 | mode = 'wb' 661 | ) 662 | 663 | shpus.uz <- unzip( shpus.tf , exdir = tempdir() ) 664 | 665 | us.shp <- readShapePoly( shpus.uz[ grep( 'shp$' , shpus.uz ) ] ) 666 | 667 | # remove alaska, hawaii, and all territories 668 | us.shp <- subset( us.shp , !( STUSPS %in% c( 'AK' , 'HI' , 'PR' , 'AS' , 'VI' , 'GU' , 'MP' ) ) ) 669 | 670 | # draw a rectangle 5% bigger than the original 48-state shape 671 | us.shp.out <- as( 1.1 * extent( us.shp ), "SpatialPolygons" ) 672 | 673 | # # end of step 6 # # 674 | # # # # # # # # # # # 675 | 676 | 677 | # # # # # # # # # # # # # # # # # # 678 | # # step 7: tie knots and krige # # 679 | 680 | library(sqldf) 681 | 682 | # how many knots should you make? # 683 | 684 | # knots are the computationally-intensive part of this process, 685 | # choose as many as your computer and your patience can handle. 686 | 687 | # you should aim for between 100 - 999 knots, 688 | # but numbers closer to 1,000 will overload smaller computers 689 | 690 | # you could let the `fields` package attempt to guess knots for you, 691 | # xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 100 )$design 692 | # but with census microdata, you've already got easy access to a relevant geographic grouping 693 | 694 | # the continental united states contains 695 | length( unique( sfsave$cbsa ) ) 696 | # unique core-based statistical areas and 697 | length( unique( sfsave$csa ) ) 698 | # unique combined statistical areas, including `99999` values. 699 | 700 | # but you should probably distinguish the `99999` values across states. 701 | # if you have a powerful computer, you could try creating knots table 702 | # that crosses states by cbsas, but for smaller computers (and quicker processing) 703 | nrow( unique( sfsave[ , c( 'state' , 'csa' ) ] ) ) 704 | # here are 207 beeeeautiful knots just for you. 
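# the sqldf query below collapses the tract-level table down to one knot per
# state x csa using population-weighted mean coordinates. for reference only,
# a base r sketch of the same idea (using stats::weighted.mean) might look
# something like this commented-out block -- the sqldf version below is what actually runs
# us.knots.alt <-
# 	do.call(
# 		rbind ,
# 		lapply(
# 			split( sfsave , list( sfsave$state , sfsave$csa ) , drop = TRUE ) ,
# 			function( d )
# 				data.frame(
# 					state = d$state[ 1 ] ,
# 					csa = d$csa[ 1 ] ,
# 					pop100 = sum( d$pop100 ) ,
# 					intptlon = weighted.mean( d$intptlon , d$pop100 ) ,
# 					intptlat = weighted.mean( d$intptlat , d$pop100 )
# 				)
# 		)
# 	)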
705 | 706 | 707 | # within each state x csa, 708 | # calculate the population-weighted mean of the coordinates 709 | # and (for smoothing) the weighted share at each state-csa centroid 710 | us.knots <- 711 | sqldf( 712 | "select 713 | state , csa , 714 | sum( pop100 ) as pop100 , 715 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 716 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 717 | from sfsave 718 | group by 719 | state , csa" 720 | ) 721 | # note: this screws up coordinates that cross the international date line 722 | # or the equator. in the united states, only alaska's aleutian islands do this 723 | # and those geographies will be thrown out later. so it doesn't matter. 724 | 725 | # how many knots have you gots? 726 | nrow( us.knots ) 727 | 728 | # you can look at the weighted centroids of those csas by state 729 | plot( us.knots$intptlon , us.knots$intptlat ) 730 | 731 | # clear up RAM 732 | rm( sfsave ) ; gc() 733 | 734 | # interpolation option one # 735 | library(fields) 736 | 737 | krig.fit <- 738 | Krig( 739 | cbind( x$intptlon , x$intptlat ) , 740 | x$share , 741 | weights = x$weight , 742 | knots = cbind( us.knots$intptlon , us.knots$intptlat ) 743 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 744 | # knots = xknots 745 | ) 746 | 747 | # that is: what is the (weighted) relationship between 748 | # your variable of interest (transportation share of total expenditure) and 749 | # the x/y points on a grid? 750 | 751 | # check this out! 752 | surface( krig.fit ) 753 | # you're almost there! 754 | 755 | 756 | # interpolation option two # 757 | library(mgcv) 758 | 759 | gam.fit <- 760 | gam( 761 | share ~ s(intptlon , intptlat ) , 762 | weights = weight , 763 | data = x 764 | ) 765 | 766 | 767 | # for the third alternative, keep reading. 768 | 769 | 770 | # # end of step 7 # # 771 | # # # # # # # # # # # 772 | 773 | 774 | # # # # # # # # # # # # # # # # # # # # 775 | # # step 8: make a grid and predict # # 776 | 777 | library(raster) 778 | 779 | x.range <- bbox( us.shp )[ 1 , ] 780 | y.range <- bbox( us.shp )[ 2 , ] 781 | 782 | # add one percent on each side 783 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.01 784 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.01 785 | 786 | x.range[ 1 ] <- x.range[ 1 ] - x.diff 787 | x.range[ 2 ] <- x.range[ 2 ] + x.diff 788 | y.range[ 1 ] <- y.range[ 1 ] - y.diff 789 | y.range[ 2 ] <- y.range[ 2 ] + y.diff 790 | 791 | # choose the number of ticks (in each direction) on your grid 792 | grid.length <- 500 793 | # grid.length <- 700 794 | # # note: smaller grids will render much much faster 795 | # # (so they're better if you're just playing around) 796 | # # but larger grids will prevent your final plot from 797 | # # being too pixelated, even when zooming in 798 | 799 | 800 | # create some grid data.frame objects, one for each interpolation type 801 | grd <- gam.grd <- krig.grd <- 802 | expand.grid( 803 | intptlon = seq( from = x.range[1] , to = x.range[2] , length = grid.length ) , 804 | intptlat = seq( from = y.range[1] , to = y.range[2] , length = grid.length ) 805 | ) 806 | 807 | 808 | # along your rectangular grid, 809 | # what are the predicted values of 810 | # the transportation spending share? 
811 | for ( i in split( seq( nrow( grd ) ) , ceiling( seq( nrow( grd ) ) / 20000 ) ) ){ 812 | krig.grd[ i , 'kout' ] <- as.numeric( predict( krig.fit , krig.grd[ i , c( 'intptlon' , 'intptlat' ) ] ) ) 813 | gc() 814 | } 815 | 816 | # alternate grid using gam.fit 817 | for ( i in split( seq( nrow( grd ) ) , ceiling( seq( nrow( grd ) ) / 20000 ) ) ){ 818 | gam.grd[ i , 'gamout' ] <- as.numeric( predict( gam.fit , gam.grd[ i , c( 'intptlon' , 'intptlat' ) ] ) ) 819 | gc() 820 | } 821 | 822 | # interpolation option three # 823 | library(spatstat) 824 | 825 | smoout <- 826 | Smooth( 827 | ppp( 828 | x$intptlon , 829 | x$intptlat , 830 | x.range , 831 | y.range , 832 | marks = x$share 833 | ) , 834 | # here's a good starting point for sigma, but screw around with this value. 835 | sigma = 2 , 836 | weights = x$weight 837 | ) 838 | 839 | smoo.grd <- 840 | expand.grid( 841 | intptlon = seq( from = smoout$xrange[1] , to = smoout$xrange[2] , length = smoout$dim[1] ) , 842 | intptlat = seq( from = smoout$yrange[1] , to = smoout$yrange[2] , length = smoout$dim[2] ) 843 | ) 844 | 845 | smoo.grd$smoout <- as.numeric( t( smoout$v ) ) 846 | 847 | # # end of step 8 # # 848 | # # # # # # # # # # # 849 | 850 | 851 | # # # # # # # # # # # # # # # # # # # # # 852 | # # step 9: ggplot and choose options # # 853 | 854 | library(ggplot2) 855 | library(mapproj) 856 | 857 | 858 | # # # psa # # # 859 | # capping your outliers might drastically change your map. 860 | # if you find the 25th percentile and 75th percentile with 861 | # summary( krig.grd$kout ) 862 | # and then replace all `kout` values below the 25th or above the 75th 863 | # with those capped percentile endpoints, i promise promise promise 864 | # your maps will appear quite different. you could cap at the 25th and 75th with.. 865 | # grd.sum <- summary( krig.grd$kout ) 866 | # krig.grd[ krig.grd$kout > grd.sum[ 5 ] , 'kout' ] <- grd.sum[ 5 ] 867 | # krig.grd[ krig.grd$kout < grd.sum[ 2 ] , 'kout' ] <- grd.sum[ 2 ] 868 | # # # end # # # 869 | 870 | # you don't want to cap at the 25th and 75th? 871 | # well consider one other idea: at least cap at the 5th and 95th of the nation 872 | # this will also increase the visible gradient ultimately plotted. 873 | 874 | # for example, the lowest krigged value is negative. 875 | summary( krig.grd$kout ) 876 | # that's obviously not right. 877 | 878 | # if a numeric vector has values below the 5th percentile or above the 95th percentile, cap 'em 879 | minnmax.at.0595 <- 880 | function( z ){ 881 | q0595 <- quantile( z , c( 0.05 , 0.95 ) ) 882 | z[ z < q0595[ 1 ] ] <- q0595[ 1 ] 883 | z[ z > q0595[ 2 ] ] <- q0595[ 2 ] 884 | z 885 | } 886 | 887 | # min and max all numeric values. this makes the gradient much more visible. 888 | krig.grd$kout <- minnmax.at.0595( krig.grd$kout ) 889 | gam.grd$gamout <- minnmax.at.0595( gam.grd$gamout ) 890 | smoo.grd$smoout <- minnmax.at.0595( smoo.grd$smoout ) 891 | 892 | 893 | # initiate the krige-based plot 894 | krg.plot <- 895 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) + 896 | geom_tile( data = krig.grd , aes( fill = kout ) ) 897 | 898 | # initiate the gam-based plot 899 | gam.plot <- 900 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) + 901 | geom_tile( data = gam.grd , aes( fill = gamout ) ) 902 | 903 | # initiate the smooth-based plot 904 | smooth.plot <- 905 | ggplot( data = smoo.grd , aes( x = intptlon , y = intptlat ) ) + 906 | geom_tile( data = smoo.grd , aes( fill = smoout ) ) 907 | 908 | # view all three grids! 
909 | krg.plot 910 | gam.plot 911 | smooth.plot 912 | 913 | # choose a projection. here's one using albers on the continental united states borders 914 | co <- coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 915 | # but save this puppy for laytur 916 | # because printing the projected plot takes much more time than printing the unprojected one 917 | 918 | # initiate the entire plot 919 | the.plot <- 920 | 921 | # choose only one of the three interpolation grids 922 | krg.plot + 923 | # gam.plot + 924 | # smooth.plot + 925 | 926 | # blank out the legend and axis labels 927 | theme( 928 | legend.position = "none" , 929 | axis.title.x = element_blank() , 930 | axis.title.y = element_blank() 931 | ) + 932 | 933 | # blank out other plot elements 934 | 935 | scale_x_continuous( limits = x.range , breaks = NULL , oob = squish ) + 936 | 937 | scale_y_continuous( limits = y.range , breaks = NULL , oob = squish ) + 938 | 939 | theme( 940 | panel.grid.major = element_blank(), 941 | panel.grid.minor = element_blank(), 942 | panel.background = element_blank(), 943 | panel.border = element_blank(), 944 | axis.ticks = element_blank() 945 | ) 946 | 947 | # print the plot to the screen 948 | the.plot 949 | 950 | # # end of step 9 # # 951 | # # # # # # # # # # # 952 | 953 | 954 | # # # # # # # # # # # # # # # # # # 955 | # # step 10: blank, color, save # # 956 | 957 | library(ggplot2) 958 | library(scales) 959 | library(raster) 960 | library(plyr) 961 | library(RColorBrewer) 962 | library(rgeos) 963 | 964 | 965 | # draw a rectangle 15% bigger than the original state 966 | us.shp.blank <- as( 1.3 * extent( us.shp ), "SpatialPolygons" ) 967 | 968 | # compute the difference between connecticut and the rectangle 15% beyond the borders 969 | us.shp.diff <- gDifference( us.shp.blank , us.shp ) 970 | 971 | # prepare the difference layer for ggplot2 972 | outside <- fortify( us.shp.diff ) 973 | 974 | # fix the islands 975 | outside2 <- ddply( outside , .(piece) , function(x) rbind( x , outside[ 1 , ] ) ) 976 | 977 | # blank out coastal areas 978 | blank.layer <- 979 | geom_polygon( 980 | data = outside2 , 981 | aes( x = long , y = lat , group = id ) , 982 | fill = 'white' 983 | ) 984 | 985 | # closer, eh? 986 | the.plot + blank.layer 987 | 988 | # store this plot 989 | the.plot <- the.plot + blank.layer 990 | 991 | # location of all state borders 992 | sbo <- fortify( us.shp ) 993 | 994 | # here's a layer with the continental united states borders 995 | sbo.layer <- geom_path( data = sbo , aes( x = long , y = lat , group = group ) , colour = 'lightgray' ) 996 | 997 | # print that result to the screen. 998 | the.plot + sbo.layer 999 | # okay if we stick with this one? 1000 | the.plot <- the.plot + sbo.layer 1001 | # good. now it's part of the plot. 1002 | 1003 | # print all six 1004 | the.plot + scale_fill_gradientn( colours = Greys.3.p( 100 ) ) 1005 | the.plot + scale_fill_gradientn( colours = YlGnBu.3.p( 100 ) ) 1006 | the.plot + scale_fill_gradientn( colours = YlGnBu.9.p( 100 ) ) 1007 | 1008 | the.plot + scale_fill_gradientn( colours = PRGn.11.p( 100 ) ) 1009 | the.plot + scale_fill_gradientn( colours = RdYlBu.3.p( 100 ) ) 1010 | the.plot + scale_fill_gradientn( colours = RdYlBu.11.p( 100 ) ) 1011 | 1012 | # ooh i like that one mom, can we keep it can we keep it? 1013 | final.plot <- the.plot + scale_fill_gradientn( colours = RdYlBu.11.p( 100 ) ) 1014 | 1015 | final.plot 1016 | 1017 | # would you like to save this game? 
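# (one quick aside before saving: the cairo-png device used below is only
# available when your build of r includes cairo support -- confirm with)
capabilities( "cairo" )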
1018 | 1019 | # use cairo-png as your bitmap type 1020 | options( bitmapType = "cairo" ) 1021 | 1022 | # uncomment this block to save the file to your current working directory 1023 | 1024 | # ggsave( 1025 | # "2012-2013 transportation spending as a share of total spending - unprojected.png" , 1026 | # plot = final.plot , 1027 | # type = "cairo-png" 1028 | # ) 1029 | 1030 | # add the projection 1031 | projected.plot <- final.plot + co 1032 | 1033 | # # # # # # # # # # # # # # 1034 | # warning warning warning # 1035 | 1036 | # this next save-line takes a few hours. 1037 | # leave it running overnight 1038 | 1039 | # warning warning warning # 1040 | # # # # # # # # # # # # # # 1041 | 1042 | 1043 | # uncomment this block to save the file to your current working directory 1044 | 1045 | # ggsave( 1046 | # "2012-2013 transportation spending as a share of total spending - projected.png" , 1047 | # plot = projected.plot , 1048 | # type = "cairo-png" 1049 | # ) 1050 | 1051 | # # end of step ten # # 1052 | # # # # # # # # # # # # 1053 | -------------------------------------------------------------------------------- /how to map the current population survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # current population survey, march supplement 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # allows values outside of the displayed geography to influence the coloring 20 | # overlays multiple water layers (no longer shapefiles) to blank out water areas 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # state, core-based statistical areas 28 | # (for connecticut, massachusetts, and rhode island: new england city and town areas) 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/current%20population%20survey%20%28cps%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/Current%20Population%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # poverty rate (warning: the final map created by this script is of no value. fix it by pooling years or choosing larger geographies in step #2) 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # this map displays nothing but survey noise. run it on a different year, you'll see completely different results. not enough sample size. 56 | # a legitimate map with the cps would pool years or aggregate small areas into larger ones. when possible, compare findings to other estimates. 
57 | # for this particular statistic (poverty rate), compare your results to the county-level maps you can create on http://www.census.gov/did/www/saipe/ 58 | 59 | 60 | # # # # # # # # # # # # # # # # # # # # # 61 | # # step 1: load the survey microdata # # 62 | 63 | # remove the # in order to run this install.packages line only once 64 | # install.packages( "MonetDBLite" ) 65 | 66 | library(downloader) 67 | 68 | # download the 2013 current population survey microdata onto the local disk 69 | cps.years.to.download <- 2013 70 | source_url( "https://raw.github.com/ajdamico/asdfree/master/Current%20Population%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 71 | 72 | # # end of step 1 # # 73 | # # # # # # # # # # # 74 | 75 | 76 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 77 | # # step 2: conduct your analysis of interest at the smallest geography allowed # # 78 | 79 | library(survey) 80 | library(DBI) # load the DBI package (implements the R-database coding) 81 | library(MonetDBLite) 82 | 83 | # following the analysis examples in the r code repository -- 84 | # # https://github.com/ajdamico/asdfree/blob/master/Current%20Population%20Survey/2012%20asec%20-%20analysis%20examples.R 85 | # -- calculate the poverty rate at the smallest available geographic area 86 | # within the state of connecticut 87 | 88 | # turn on replicate-weighted mean squared errors 89 | options( survey.replicates.mse = TRUE ) 90 | # this matches the official census bureau published methods 91 | 92 | # name the database files in the "MonetDB" folder of the current working directory 93 | dbfolder <- paste0( getwd() , "/MonetDB" ) 94 | 95 | # construct a replicate-weighted, database-backed survey design object 96 | cps.design <- 97 | svrepdesign( 98 | weights = ~marsupwt, 99 | repweights = "pwwgt[1-9]", 100 | type = "Fay", 101 | rho = ( 1 - 1 / sqrt( 4 ) ), 102 | data = "asec13" , 103 | combined.weights = TRUE , 104 | dbtype = "MonetDBLite" , 105 | dbname = dbfolder 106 | ) 107 | 108 | # restrict the survey object to connecticut *plus adjacent state* records only 109 | cpas.design <- 110 | subset( 111 | cps.design , 112 | gestfips %in% c( 113 | 9 , # connecticut 114 | 25 , # massachusetts 115 | 36 , # new york 116 | 44 # rhode island 117 | ) 118 | ) 119 | 120 | # the original national survey object is no longer necessary 121 | rm( cps.design ) ; gc() 122 | # remove it and clear up RAM 123 | 124 | 125 | # calculate the 2012 connecticut state-wide poverty rate 126 | svymean( ~ as.numeric( povll %in% 1:3 ) , subset( cpas.design , gestfips == 9 & pov_univ == 1 ) ) 127 | # using only records in the poverty universe 128 | 129 | # note: this estimate and standard error precisely matches 130 | # the census bureau's table 19 (historical poverty by state) 131 | # https://www.census.gov/hhes/www/poverty/data/historical/hstpov19.xls 132 | 133 | 134 | # # examine which geographies are available # # 135 | 136 | # the current population survey identifies records 137 | # from six different core-based statistical areas (cbsa) 138 | svytable( ~ gtcbsa , cpas.design ) 139 | 140 | # the current population survey identifies records 141 | # from both metro and non-metro respondents 142 | svytable( ~ gtmetsta , cpas.design ) 143 | 144 | # the current population survey identifies records 145 | # from every state in the nation, 146 | # but we've restricted this design to 147 | # connecticut + adjacent states 148 | svytable( ~ gestfips , cpas.design ) 149 | 150 | # the smallest 
geography reasonably extracted 151 | # from this survey microdata set will be 152 | # cbsa + metro status combined 153 | svytable( ~ gtcbsa + gtmetsta + gestfips , cpas.design ) 154 | 155 | # simply use both of those geographies in the by= argument 156 | # of the `svyby` command, and re-calculate the poverty rates 157 | smallest.area.statistics <- 158 | svyby( 159 | ~ as.numeric( povll %in% 1:3 ) , 160 | ~ gtcbsa + gtmetsta + gestfips , 161 | subset( cpas.design , pov_univ == 1 ) , 162 | svymean 163 | ) 164 | # this is the same command as the statewide calculation above, 165 | # except these results have been broken into smaller areas. 166 | 167 | # these are the statistics to be mapped 168 | print( smallest.area.statistics ) 169 | # the standard errors are a measure of precision, 170 | # their inverse will serve as the mapping weights 171 | 172 | # make this object easier to type.. 173 | sas <- smallest.area.statistics 174 | 175 | # ..and also easier to read 176 | names( sas )[ names( sas ) == 'as.numeric(povll %in% 1:3)' ] <- 'povrate' 177 | 178 | # # end of step 2 # # 179 | # # # # # # # # # # # 180 | 181 | 182 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 183 | # # step 3: download and import necessary geographic crosswalks # # 184 | 185 | library(downloader) 186 | 187 | # load the download_cached and related functions 188 | # to prevent re-downloading of files once they've been downloaded. 189 | source_url( 190 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 191 | prompt = FALSE , 192 | echo = FALSE 193 | ) 194 | 195 | # create a temporary file containing the census bureau's 196 | # 2010 census summary file #1 for all four states 197 | # then download the file. 198 | sf1.tf <- tempfile() 199 | 200 | # create two vectors with the names and abbreviations of all four states to download 201 | sn <- c( "Connecticut" , "Massachusetts" , "New_York" , "Rhode_Island" ) 202 | sa <- c( "ct" , "ma" , "ny" , "ri" ) 203 | 204 | # create an empty data.frame 205 | sf1.stack <- NULL 206 | 207 | # loop through all four connecticut-adjacent states 208 | for ( i in 1:4 ){ 209 | 210 | # create a single-element character string containing the ftp path 211 | ftp.loc <- 212 | paste0( 213 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/" , 214 | sn[ i ] , 215 | "/" , 216 | sa[ i ] , 217 | "2010.sf1.zip" 218 | ) 219 | 220 | # download the current state's summary file 221 | download_cached( ftp.loc , sf1.tf , mode = 'wb' ) 222 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 223 | 224 | # unzip the summary file #1 files to the current working directory 225 | sf1.uz <- unzip( sf1.tf , exdir = tempdir() ) 226 | 227 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 228 | sf1 <- 229 | read.fwf( 230 | sf1.uz[ grep( "geo2010" , sf1.uz ) ] , 231 | c( -8 , 3 , -16 , 2 , 3 , -4 , 5 , -4 , 5 , -4 , 6 , 1 , 4 , -47 , 5 , -10 , 5 , -186 , 9 , -9 , 11 , 12 , -116 , 1 , 1 ) 232 | ) 233 | 234 | # add columns names matching the census bureau, so it's easy to read 235 | names( sf1 ) <- c( "sumlev" , "state" , "county" , "cousub" , "place" , "tract" , "blkgrp" , "block" , "cbsa" , "necta" , "pop100" , "intptlat" , "intptlon" , "memi" , "nmemi" ) 236 | 237 | # summary level 101 has NECTA and census blocks 238 | sf1.101 <- subset( sf1 , sumlev == "101" ) 239 | 240 | # stack all four states into one object 241 | sf1.stack <- rbind( sf1.stack , sf1.101 ) 242 | 243 | # remove some data.frames and clear up 
RAM 244 | rm( sf1.101 , sf1 ) ; gc() 245 | 246 | } 247 | 248 | # just as a check, limit the summary file #1 to connecticut. 249 | sf1ct <- subset( sf1.stack , state == 9 ) 250 | 251 | # hold on to all unique counties in connecticut 252 | ascc <- unique( sf1ct[ , c( 'state' , 'county' ) ] ) 253 | 254 | # one record per census block in connecticut. see? same number. 255 | nrow( sf1ct ) 256 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 257 | 258 | # and guess what? the total connecticut population matches as well. 259 | sum( sf1ct$pop100 ) 260 | # http://quickfacts.census.gov/qfd/states/09000.html 261 | 262 | 263 | # so now we have a data.frame object with 264 | # one record per census block, 265 | # and also with the two geography-levels 266 | # that match the current population survey 267 | head( sf1.stack ) 268 | # in connecticut, rhode island, and massachusetts 269 | # necta is the cbsa and 270 | # nmemi indicates metropolitan status 271 | 272 | # and guess what? 273 | # we've now got the census 2010 weighted populations (field pop100) 274 | # and also each census block's centroid latitude & longitude (fields intptlat + intptlon) 275 | 276 | # # end of step 3 # # 277 | # # # # # # # # # # # 278 | 279 | 280 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 281 | # # step 4: merge the results of your survey analysis with the small-area geography # # 282 | 283 | # add the current population survey results' 284 | # geographic identifiers to the connecticut census block data.frame 285 | sf1.merge <- 286 | 287 | transform( 288 | 289 | sf1.stack , 290 | 291 | gestfips = state , 292 | 293 | gtcbsa = 294 | # if the record is in new york state, 295 | # check if the cbsa is in the current population survey, and if it is, keep it. 296 | ifelse( ( state %in% 36 ) & cbsa %in% unique( sas$gtcbsa ) , cbsa , 297 | 298 | # if the record is in connecticut, rhode island, or massachusetts, 299 | # check if the new england city and town area (necta) is in the cps result, 300 | # and if it is, keep it. otherwise zero-it. 301 | ifelse( !( state %in% 36 ) & necta %in% unique( sas$gtcbsa ) , necta , 0 ) ) , 302 | 303 | gtmetsta = 304 | # check whether to use the new england variable or not.. 305 | ifelse( state %in% 36 , 306 | 307 | # if the census block is metro, `gtmetsta` is a one. otherwise it's a two. 308 | ifelse( memi == 1 , 1 , 2 ) , 309 | 310 | ifelse( nmemi == 1 , 1 , 2 ) 311 | ) 312 | 313 | ) 314 | 315 | # confirm that we've created all possible geographies correctly. 316 | merge( unique( sf1.merge[ , c( 'gtcbsa' , 'gtmetsta' , 'gestfips' ) ] ) , sas , all = TRUE ) 317 | # nope. there are missings. 318 | 319 | # recode anyone in new york state not in a cbsa to be non-metro 320 | sf1.merge[ 321 | sf1.merge$gestfips == 36 & 322 | sf1.merge$gtmetsta == 1 & 323 | sf1.merge$gtcbsa == 0 , 324 | 'gtmetsta' ] <- 2 325 | 326 | # recode 100% of rhode island to be a part of the providence cbsa 327 | sf1.merge[ sf1.merge$gestfips == 44 , 'gtcbsa' ] <- 77200 328 | sf1.merge[ sf1.merge$gestfips == 44 , 'gtmetsta' ] <- 1 329 | 330 | # recode connecticut residents in the 331 | # worcester, ma or springfield, ma cbsas to unavailable cbsas 332 | sf1.merge[ sf1.merge$gestfips == 9 & sf1.merge$gtcbsa %in% c( 78100 , 79600 ) , 'gtcbsa' ] <- 0 333 | 334 | 335 | # the number of records in our small area statistics.. 336 | sas.row <- nrow( sas ) 337 | 338 | # ..should equal the number of unique-match-merged records.. 
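# here's a tiny toy version of that check, with made-up geography codes:
toy.blocks <- data.frame( geo = c( 'a' , 'b' , 'c' ) )
toy.sas <- data.frame( geo = c( 'a' , 'b' ) , est = c( 0.1 , 0.2 ) )
nrow( merge( unique( toy.blocks ) , toy.sas ) )
# only two rows come back, so if the census-side recodes had produced a geography
# that the survey results lack (or the other way around), the counts would not line up.
# and here's that check on the real objects: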
339 | mrow <- nrow( merge( unique( sf1.merge[ , c( 'gtcbsa' , 'gtmetsta' , 'gestfips' ) ] ) , sas ) ) 340 | 341 | # ..and it does/they do. 342 | stopifnot( sas.row == mrow ) 343 | 344 | # now the census block-level connecticut+adjacent state census data 345 | # *could* merge if you wanted it to. 346 | 347 | 348 | # but you don't. yet. 349 | 350 | 351 | # the standard error (the `se` field) is a measure of precision. 352 | print( sas ) 353 | # the smaller the standard error, the more confident you should be 354 | # that the estimate at a particular geography is correct. 355 | 356 | 357 | # so invert it. you heard me. invert it. 358 | sas$invse <- 1 / sas$se 359 | # a smaller standard error indicates more precision. 360 | 361 | # for our purposes, precision can be considered weight! # 362 | 363 | # now we've got the weight that we should give each of our estimates # 364 | 365 | # distribute that weight across all census blocks # 366 | 367 | 368 | # aggregate the 2010 census block populations to the geographies that you have. 369 | popsum <- 370 | aggregate( 371 | sf1.merge$pop100 , 372 | by = ( sf1.merge[ , c( 'gestfips' , 'gtcbsa' , 'gtmetsta' ) ] ) , 373 | sum 374 | ) 375 | 376 | # make the column name meaningful 377 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 378 | 379 | # merge the popsum onto the sasfile 380 | sas <- merge( sas , popsum ) 381 | 382 | # now. merge 383 | # the poverty rate (the variable of interest) 384 | # the inverted standard error (the total weight of the broad geography) 385 | # the population sum (the total population of all census blocks that are part of that geography) 386 | 387 | x <- merge( sf1.merge , sas ) 388 | 389 | # confirm no record loss 390 | stopifnot( nrow( x ) == nrow( sf1.merge ) ) 391 | 392 | 393 | # (this is the fun part) 394 | # calculate the weight of each census block 395 | x$weight <- x$invse * ( x$pop100 / x$popsum ) 396 | 397 | # note that weight of all census blocks put together 398 | # sums to the `invse` on the original analysis file 399 | stopifnot( all.equal( sum( x$weight ) , sum( sas$invse ) ) ) 400 | 401 | # remove records with zero population 402 | x <- subset( x , weight > 0 ) 403 | 404 | # scale all weights so that they average to one 405 | x$weight <- x$weight / mean( x$weight ) 406 | 407 | # you're done preparing your data. 408 | # keep only the columns you need. 409 | x <- x[ , c( 'povrate' , 'weight' , 'intptlat' , 'intptlon' , 'gestfips' ) ] 410 | # be sure to save the state identifier for easy subsets 411 | 412 | # # end of step 4 # # 413 | # # # # # # # # # # # 414 | 415 | 416 | # # # # # # # # # # # # # # # # # # # # # # # 417 | # # step 5: decide on your map parameters # # 418 | 419 | library(ggplot2) 420 | library(scales) 421 | library(mapproj) 422 | 423 | 424 | # before you ever touch surface smoothing or kriging, 425 | # make some decisions about how you generally want 426 | # your map to look: the projection and coloring 427 | 428 | # the options below simply use hadley wickham's ggplot2 429 | # with the census block-level poverty rates and centroids 430 | 431 | 432 | # initiate the simple map 433 | ct.map <- 434 | qplot( 435 | intptlon , 436 | intptlat , 437 | data = subset( x , gestfips == 9 ) , 438 | colour = povrate , 439 | xlab = NULL , 440 | ylab = NULL 441 | ) 442 | 443 | 444 | # remove all map crap. 
445 | ct.map <- 446 | ct.map + 447 | 448 | scale_x_continuous( breaks = NULL ) + 449 | 450 | scale_y_continuous( breaks = NULL ) + 451 | 452 | theme( 453 | legend.position = "none" , 454 | panel.grid.major = element_blank(), 455 | panel.grid.minor = element_blank(), 456 | panel.background = element_blank(), 457 | panel.border = element_blank(), 458 | axis.ticks = element_blank() 459 | ) 460 | 461 | 462 | # print the map without any projection 463 | ct.map 464 | 465 | # print the map with an albers projection. 466 | ct.map + coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 467 | # see ?mapproject for a zillion alternatives 468 | 469 | # if you like that projection, store it in the map object. 470 | ct.map <- 471 | ct.map + 472 | coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 473 | 474 | 475 | # check out some purty colors. 476 | ct.map + scale_colour_gradient( low = 'green' , high = 'red' ) 477 | 478 | ct.map + scale_colour_gradient( low = 'white' , high = 'blue' ) 479 | 480 | ct.map + scale_colour_gradient( low = muted( 'blue' ) , high = muted( 'red' ) ) 481 | 482 | # notice how the dotted delineations match the census bureau's 2006 necta definitions 483 | # http://www2.census.gov/geo/maps/metroarea/us_wall/Dec2006/necta_1206_large.gif 484 | 485 | # clear up RAM 486 | rm( ct.map ) ; gc() 487 | 488 | # # end of step 5 # # 489 | # # # # # # # # # # # 490 | 491 | 492 | # # # # # # # # # # # # 493 | # # step 6: outline # # 494 | 495 | library(maptools) 496 | library(raster) 497 | library(rgeos) 498 | library(stringr) 499 | library(plyr) 500 | library(ggplot2) 501 | 502 | shpstate.tf <- tempfile() 503 | 504 | # connnecticut borders the ocean, 505 | # so use the census bureau's cartographic boundary files 506 | # instead of the regular tiger shapefiles 507 | # unless you want to display poverty rates in the ocean. 508 | 509 | download_cached( 510 | "http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_state_500k.zip" , 511 | shpstate.tf , 512 | mode = 'wb' 513 | ) 514 | 515 | shpstate.uz <- unzip( shpstate.tf , exdir = tempdir() ) 516 | 517 | state.shp <- readShapePoly( shpstate.uz[ grep( 'shp$' , shpstate.uz ) ] ) 518 | 519 | ct.shp <- subset( state.shp , STATEFP == '09' ) 520 | 521 | # draw a rectangle 5% bigger than the original state 522 | ct.shp.out <- as( 1.1 * extent( ct.shp ), "SpatialPolygons" ) 523 | 524 | # clear up RAM 525 | rm( state.shp ) ; gc() 526 | 527 | # # end of step 6 # # 528 | # # # # # # # # # # # 529 | 530 | 531 | # # # # # # # # # # # # # # # # # # 532 | # # step 7: tie knots and krige # # 533 | 534 | library(sqldf) 535 | 536 | # how many knots should you make? # 537 | 538 | # knots are the computationally-intensive part of this process, 539 | # choose as many as your computer and your patience can handle. 540 | 541 | # you should aim for between 100 - 999 knots, 542 | # but numbers closer to 1,000 will overload smaller computers 543 | 544 | # you could let the `fields` package attempt to guess knots for you, 545 | # xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 100 )$design 546 | # but with census microdata, you've already got easy access to a relevant geographic grouping 547 | 548 | # the state of connecticut has 833 census tracts. 
549 | # https://www.census.gov/geo/maps-data/data/tallies/tractblock.html 550 | # if you have a powerful computer, you could group based on weighted tracts 551 | # however, to keep the processing requirements lower, 552 | # i'll use county subdivisions 553 | 554 | # for the knotting, note that adjacent states are no longer necessary, 555 | # so subset the census summary file #1 to only connecticut and nearby census blocks. 556 | 557 | # the sqldf() function doesn't like `.` in data.frame object names 558 | sf1s <- sf1.stack 559 | 560 | # within each county x county subdivision, 561 | # calculate the population-weighted mean of the coordinates 562 | ct.knots <- 563 | sqldf( 564 | "select 565 | state , county , cousub , 566 | sum( pop100 ) as pop100 , 567 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 568 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 569 | from sf1s 570 | group by 571 | state , county , cousub" 572 | ) 573 | # note: this screws up coordinates that cross the international date line 574 | # or the equator. in the united states, only alaska's aleutian islands do this 575 | # and those geographies will be thrown out later. so it doesn't matter. 576 | 577 | # how many knots have you gots? 578 | nrow( ct.knots ) 579 | # too many, because you've included 580 | # all of the adjacent states 581 | 582 | # retain only knots within the bounding box 583 | ct.knots <- 584 | subset( 585 | ct.knots , 586 | ( bbox( ct.shp.out )[ 1 , 1 ] < intptlon ) & 587 | ( bbox( ct.shp.out )[ 2 , 1 ] < intptlat ) & 588 | ( bbox( ct.shp.out )[ 1 , 2 ] > intptlon ) & 589 | ( bbox( ct.shp.out )[ 2 , 2 ] > intptlat ) 590 | ) 591 | 592 | # retain only points within the bounding box 593 | x <- 594 | subset( 595 | x , 596 | ( bbox( ct.shp.out )[ 1 , 1 ] < intptlon ) & 597 | ( bbox( ct.shp.out )[ 2 , 1 ] < intptlat ) & 598 | ( bbox( ct.shp.out )[ 1 , 2 ] > intptlon ) & 599 | ( bbox( ct.shp.out )[ 2 , 2 ] > intptlat ) 600 | ) 601 | 602 | # count again 603 | nrow( ct.knots ) 604 | # that's more like it. 605 | 606 | # you can look at the weighted centroids of those county subdivisions 607 | plot( ct.knots$intptlon , ct.knots$intptlat ) 608 | # and look at that, bits of long island will be influencing our results 609 | # since it's within a 5% range of the state of connecticut box 610 | 611 | # clear up RAM 612 | rm( sf1.stack , sf1s ) ; gc() 613 | 614 | # interpolation option one # 615 | library(fields) 616 | 617 | krig.fit <- 618 | Krig( 619 | cbind( x$intptlon , x$intptlat ) , 620 | x$povrate , 621 | weights = x$weight , 622 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 623 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 624 | # knots = xknots 625 | ) 626 | 627 | # that is: what is the (weighted) relationship between 628 | # your variable of interest (poverty rate) and 629 | # the x/y points on a grid? 630 | 631 | # check this out! 632 | surface( krig.fit ) 633 | # you're almost there! 634 | 635 | 636 | # interpolation option two # 637 | library(mgcv) 638 | 639 | gam.fit <- 640 | gam( 641 | povrate ~ s(intptlon , intptlat ) , 642 | weights = weight , 643 | data = x 644 | ) 645 | 646 | 647 | # for the third alternative, keep reading. 
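# by the way, that sql statement is nothing fancier than a population-weighted mean
# of the block centroids. a minimal base-r sketch of the same idea,
# using a made-up three-block data.frame:
toy <- data.frame( grp = c( 1 , 1 , 2 ) , pop100 = c( 100 , 300 , 50 ) , intptlon = c( -72.1 , -72.5 , -73.0 ) )
sapply( split( toy , toy$grp ) , function( d ) weighted.mean( d$intptlon , d$pop100 ) )
# group 1 lands at -72.4 (pulled toward the heavier block) and group 2 stays at -73.0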
648 | 649 | 650 | # # end of step 7 # # 651 | # # # # # # # # # # # 652 | 653 | 654 | # # # # # # # # # # # # # # # # # # # # 655 | # # step 8: make a grid and predict # # 656 | 657 | library(raster) 658 | 659 | x.range <- bbox( ct.shp.out )[ 1 , ] 660 | y.range <- bbox( ct.shp.out )[ 2 , ] 661 | 662 | # add five percent on each side 663 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.05 664 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.05 665 | 666 | x.range[ 1 ] <- x.range[ 1 ] - x.diff 667 | x.range[ 2 ] <- x.range[ 2 ] + x.diff 668 | y.range[ 1 ] <- y.range[ 1 ] - y.diff 669 | y.range[ 2 ] <- y.range[ 2 ] + y.diff 670 | 671 | # choose the number of ticks (in each direction) on your grid 672 | grid.length <- 500 673 | # # note: smaller grids will render faster 674 | # # (so they're better if you're just playing around) 675 | # # but larger grids will prevent your final plot from 676 | # # being too pixelated, even when zooming in 677 | 678 | 679 | # create some grid data.frame objects, one for each interpolation type 680 | grd <- gam.grd <- krig.grd <- 681 | expand.grid( 682 | intptlon = seq( from = x.range[1] , to = x.range[2] , length = grid.length ) , 683 | intptlat = seq( from = y.range[1] , to = y.range[2] , length = grid.length ) 684 | ) 685 | 686 | 687 | # along your rectangular grid, 688 | # what are the predicted values of 689 | # the poverty rate? 690 | krig.grd$kout <- predict( krig.fit , krig.grd ) 691 | 692 | # alternate grid using gam.fit 693 | gam.grd$gamout <- predict( gam.fit , gam.grd ) 694 | 695 | # interpolation option three # 696 | library(spatstat) 697 | 698 | smoout <- 699 | Smooth( 700 | ppp( 701 | x$intptlon , 702 | x$intptlat , 703 | x.range , 704 | y.range , 705 | marks = x$povrate 706 | ) , 707 | # here's a good starting point for sigma, but screw around with this value. 708 | sigma = ( max( x$povrate ) - min( x$povrate ) ) / 2 , 709 | weights = x$weight 710 | ) 711 | 712 | smoo.grd <- 713 | expand.grid( 714 | intptlon = seq( from = smoout$xrange[1] , to = smoout$xrange[2] , length = smoout$dim[1] ) , 715 | intptlat = seq( from = smoout$yrange[1] , to = smoout$yrange[2] , length = smoout$dim[2] ) 716 | ) 717 | 718 | smoo.grd$smoout <- as.numeric( t( smoout$v ) ) 719 | 720 | # # end of step 8 # # 721 | # # # # # # # # # # # 722 | 723 | 724 | # # # # # # # # # # # # # # # # # # # # # 725 | # # step 9: ggplot and choose options # # 726 | 727 | library(ggplot2) 728 | library(mapproj) 729 | 730 | 731 | # # # psa # # # 732 | # capping your outliers might drastically change your map. 733 | # if you find the 25th percentile and 75th percentile with 734 | # summary( krig.grd$kout ) 735 | # and then replace all `kout` values below the 25th or above the 75th 736 | # with those capped percentile endpoints, i promise promise promise 737 | # your maps will appear quite different. you could cap at the 25th and 75th with.. 738 | # grd.sum <- summary( krig.grd$kout ) 739 | # krig.grd[ krig.grd$kout > grd.sum[ 5 ] , 'kout' ] <- grd.sum[ 5 ] 740 | # krig.grd[ krig.grd$kout < grd.sum[ 2 ] , 'kout' ] <- grd.sum[ 2 ] 741 | # # # end # # # 742 | 743 | 744 | # you don't want to cap at the 25th and 75th? 745 | # well consider one other idea: cap within the minimum and maximum of the state 746 | # this will also increase the visible gradient ultimately plotted. 
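# if you'd rather not write a helper function for that, base r's pmax() and pmin()
# do the same capping. a minimal sketch with made-up rates and made-up caps:
z <- c( 0.02 , 0.10 , 0.45 )
pmin( pmax( z , 0.05 ) , 0.30 )
# returns 0.05 0.10 0.30 -- everything outside [ 0.05 , 0.30 ] gets pulled to the boundary,
# which is exactly what the state-minimum/state-maximum capping below does to the grids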
747 | 748 | # subset `x` to connecticut only 749 | ctx <- subset( x , gestfips == 9 ) 750 | 751 | # capture the connecticut statewide minimum and maximum known poverty rates 752 | state.sum <- summary( ctx$povrate ) 753 | 754 | # if a numeric vector has values below the minimum or above the maximum, cap 'em 755 | minnmax.at.statesum <- 756 | function( z , ss ){ 757 | z[ z < ss[ 1 ] ] <- ss[ 1 ] 758 | z[ z > ss[ 6 ] ] <- ss[ 6 ] 759 | z 760 | } 761 | 762 | # min and max all numeric values. this makes the gradient much more visible. 763 | krig.grd$kout <- minnmax.at.statesum( krig.grd$kout , state.sum ) 764 | gam.grd$gamout <- minnmax.at.statesum( gam.grd$gamout , state.sum ) 765 | smoo.grd$smoout <- minnmax.at.statesum( smoo.grd$smoout , state.sum ) 766 | # this moderates the effect of adjacent-state high-poverty areas. 767 | # you're not discounting the poverty rates in connecticut's border states, 768 | # but you're forcing the richest and poorest spots on the map to be within state borders 769 | 770 | 771 | # initiate the krige-based plot 772 | krg.plot <- 773 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) + 774 | geom_tile( data = krig.grd , aes( fill = kout ) ) 775 | 776 | # initiate the gam-based plot 777 | gam.plot <- 778 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) + 779 | geom_tile( data = gam.grd , aes( fill = gamout ) ) 780 | 781 | # initiate the smooth-based plot 782 | smooth.plot <- 783 | ggplot( data = smoo.grd , aes( x = intptlon , y = intptlat ) ) + 784 | geom_tile( data = smoo.grd , aes( fill = smoout ) ) 785 | 786 | # view all three grids! 787 | krg.plot 788 | gam.plot 789 | smooth.plot 790 | 791 | # choose a projection. here's one using albers on the connecticut borders 792 | co <- coord_map( project = "albers" , lat0 = min( ctx$intptlat ) , lat1 = max( ctx$intptlat ) ) 793 | 794 | # force this projection to work on all object types 795 | co2 <- co 796 | class(co2) <- c( "hoge" , class( co2 ) ) 797 | is.linear.hoge <- function(coord) TRUE 798 | 799 | # initiate the entire plot 800 | the.plot <- 801 | 802 | # choose only one of the three interpolation grids 803 | krg.plot + 804 | # gam.plot + 805 | # smooth.plot + 806 | 807 | # include the projection requirements 808 | co2 + 809 | coord_fixed() + 810 | 811 | # blank out the legend and axis labels 812 | theme( 813 | legend.position = "none" , 814 | axis.title.x = element_blank() , 815 | axis.title.y = element_blank() 816 | ) + 817 | 818 | # blank out other plot elements 819 | scale_x_continuous( limits = x.range , breaks = NULL , oob = squish ) + 820 | scale_y_continuous( limits = y.range , breaks = NULL , oob = squish ) + 821 | theme( 822 | panel.grid.major = element_blank(), 823 | panel.grid.minor = element_blank(), 824 | panel.background = element_blank(), 825 | panel.border = element_blank(), 826 | axis.ticks = element_blank() 827 | ) 828 | 829 | # print the plot to the screen 830 | the.plot 831 | 832 | # # end of step 9 # # 833 | # # # # # # # # # # # 834 | 835 | 836 | # # # # # # # # # # # # # # # # # # 837 | # # step 10: blank, color, save # # 838 | 839 | library(ggplot2) 840 | library(scales) 841 | library(raster) 842 | library(plyr) 843 | library(rgeos) 844 | 845 | # draw a rectangle 15% bigger than the original state 846 | ct.shp.blank <- as( 1.3 * extent( ct.shp ), "SpatialPolygons" ) 847 | 848 | # compute the difference between connecticut and the rectangle 15% beyond the borders 849 | ct.shp.diff <- gDifference( ct.shp.blank , ct.shp ) 850 | 851 | # prepare the difference 
layer for ggplot2 852 | outside <- fortify( ct.shp.diff ) 853 | 854 | # fix the islands 855 | outside2 <- ddply( outside , .(piece) , function(x) rbind( x , outside[ 1 , ] ) ) 856 | 857 | # blank out coastal areas 858 | blank.layer <- 859 | geom_polygon( 860 | data = outside2 , 861 | aes( x = long , y = lat , group = id ) , 862 | fill = 'white' 863 | ) 864 | 865 | # closer, eh? 866 | the.plot + blank.layer 867 | 868 | # store this plot 869 | the.plot <- the.plot + blank.layer 870 | 871 | # location of all water files within the state of connecticut 872 | water.files <- 873 | paste0( 874 | "ftp://ftp2.census.gov/geo/tiger/TIGER2013/AREAWATER/tl_2013_" , 875 | str_pad( ascc[ , 1 ] , 2 , pad = '0' ) , 876 | str_pad( ascc[ , 2 ] , 3 , pad = '0' ) , 877 | "_areawater.zip" 878 | ) 879 | 880 | # initiate another temporary file 881 | watemp <- tempfile() 882 | 883 | # start with all missing water 884 | all.water <- NULL 885 | 886 | # loop through each water.file 887 | for ( fn in water.files ){ 888 | 889 | # download the shapefile to the local disk 890 | download_cached( fn , watemp ) 891 | 892 | # unzip it into a temporary directory 893 | z <- unzip( watemp , exdir = tempdir() ) 894 | 895 | # read it in.. 896 | w <- readShapePoly( z[ grep( 'shp$' , z ) ] ) 897 | 898 | # ..prepare it for ggplot2.. 899 | wo <- fortify( w ) 900 | 901 | # ..fix any islands or weird outlying shapes 902 | w2 <- ddply( wo , .( piece ) , function( x ) rbind( x , wo[ 1 , ] ) ) 903 | 904 | # create a white-filled layer 905 | wl <- geom_polygon( data = w2 , aes( x = long , y = lat , group = group ) , fill = 'white' ) 906 | 907 | # add it to the current plot 908 | the.plot <- the.plot + wl 909 | 910 | } 911 | 912 | # print with all water blanked out. 913 | the.plot 914 | 915 | # print with the same purty colors 916 | the.plot + scale_fill_gradient( low = muted( 'blue' ) , high = muted( 'red' ) ) 917 | 918 | the.plot + scale_fill_gradient( low = 'white' , high = 'red' ) 919 | 920 | the.plot + scale_fill_gradient( low = 'green' , high = 'red' ) 921 | 922 | # ooh i like that one mom, can we keep it can we keep it? 923 | final.plot <- the.plot + scale_fill_gradient( low = 'green' , high = 'red' ) 924 | 925 | final.plot 926 | 927 | # would you like to save this game? 
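# just an aside before you save: if you'd rather control the output size directly
# than rely on scale = , ggsave() also accepts width , height , and dpi arguments.
# the values below are only an example, not a recommendation:

# ggsave(
# 	"2012 connecticut poverty rate.png" ,
# 	plot = final.plot ,
# 	width = 10 ,
# 	height = 8 ,
# 	dpi = 300
# )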
928 | 929 | # save the file to your current working directory 930 | ggsave( 931 | "2012 connecticut poverty rate.pdf" , 932 | plot = final.plot , 933 | scale = 2 934 | ) 935 | 936 | # # end of step ten # # 937 | # # # # # # # # # # # # 938 | -------------------------------------------------------------------------------- /how to map the demographic and health surveys.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # demographic and health surveys 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # uses the prevR package, which automates the interpolation 20 | # kernel density estimation with adaptive bandwidth, rather than kriging 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # exact interview locations within the country 28 | 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/demographic%20and%20health%20surveys%20%28dhs%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/Demographic%20and%20Health%20Surveys 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # infant mortality 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # egypt's nile-dwelling population can distort estimates 56 | # requires microdata with gps coordinates, often not available 57 | 58 | 59 | # # # # # # # # # # # # # # # # # # # # # 60 | # # step 1: load the survey microdata # # 61 | 62 | library(downloader) 63 | 64 | # download the 2008 egyptian demographic and health survey microdata onto the local disk 65 | # note that this requires (free) registration before the download will work 66 | # http://dhsprogram.com/data/Access-Instructions.cfm 67 | your.username <- "username" 68 | your.password <- "password" 69 | your.project <- "project" 70 | source_url( "https://raw.github.com/ajdamico/asdfree/master/Demographic%20and%20Health%20Surveys/download%20and%20import.R" , prompt = FALSE , echo = TRUE ) 71 | 72 | 73 | # # end of step 1 # # 74 | # # # # # # # # # # # 75 | 76 | 77 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 78 | # # step 2: calculate interview cluster-specific values # # 79 | 80 | library(foreign) 81 | library(sqldf) 82 | 83 | # load the children's recode file as a data.frame `ch` 84 | ch <- read.dta( "./Egypt/Standard DHS 2008/Children's Recode/stata/EGKR5AFL.DTA" ) 85 | 86 | # retain only the columns necessary for the analysis 87 | # v001 cluster number 88 | # v005 Sample weight 89 | # b2 year of birth 90 | # b5 child is alive 91 | # b6 age at death (1xx in days, 2xx in months, 3xx in years) 92 | ch <- ch[ , c( 'v001' , 'v005' , 'b2' , 'b5' , 'b6' ) ] 93 | 94 | # limit the sample to only children 
born between 2004 and 2008
95 | # with a non-missing age at death
96 | ch <- subset( ch , b2 %in% 2004:2008 & !( b6 %in% 997:999 ) )
97 | # that is, births during roughly the five years preceding the survey.
98 | 
99 | # create a binary infant mortality variable
100 | ch$im <- as.numeric( ch$b5 %in% 'no' & ch$b6 %in% 100:301 )
101 | 
102 | # create a weight variable
103 | ch$w <- ch$v005 / 1000000
104 | 
105 | # note that this is very close to (but not exactly the same as)
106 | # the nationwide egyptian infant mortality rate given on their report
107 | # table 10.1 http://dhsprogram.com/pubs/pdf/FR220/FR220.pdf#page=148
108 | weighted.mean( ch$im , ch$w ) * 1000
109 | # their report says 24.5 per 1,000.
110 | # the current microdata shows 24.2 per 1,000. big whoop.
111 | 
112 | # calculate four statistics, grouped by survey cluster
113 | # count, weighted count, infant deaths, weighted infant deaths
114 | cl <- sqldf( "select v001 as dhsclust , count(*) as denom , sum( im ) as numer , sum( w ) as wdenom , sum( im * w ) as wnumer from ch group by v001" )
115 | 
116 | # that was easy, huh? want to look at your resultant cluster-level information?
117 | head( cl )
118 | 
119 | tail( cl )
120 | 
121 | # # end of step 2 # #
122 | # # # # # # # # # # #
123 | 
124 | 
125 | # # # # # # # # # # # # # # # # # # # # #
126 | # # step 3: prepare your prevR object # #
127 | 
128 | library(prevR)
129 | library(foreign)
130 | 
131 | # what country are you mapping? egypt?
132 | # oh, okay. well then let's grab egypt's boundaries.
133 | bounds <- create.boundary( "Egypt" )
134 | # that was it? that was too easy.
135 | 
136 | # import the longitude and latitude data from the
137 | # geographical information system (gis) files (this is a special request)
138 | longlat <- read.dbf( "./Egypt/Standard DHS 2008/Supplemental/flat/EGGE5DFL.dbf" )
139 | 
140 | # convert all column names to lowercase
141 | names( longlat ) <- tolower( names( longlat ) )
142 | 
143 | # merge this cluster information onto the cluster-level results data.frame
144 | x <- merge( cl , longlat[ , c( 'dhsclust' , 'longnum' , 'latnum' , 'source' ) ] )
145 | 
146 | # confirm that every cluster that you have infant mortality information for
147 | # also has a longitude & latitude variable now
148 | stopifnot( nrow( x ) == nrow( cl ) )
149 | 
150 | # check how many clusters are missing coordinates
151 | miss.coord <- nrow( subset( x , source == 'MIS' ) )
152 | 
153 | # discard records with missing longitudes & latitudes
154 | x <- subset( x , source != 'MIS' )
155 | 
156 | # confirm you've tossed the correct number of records
157 | stopifnot( nrow( x ) + miss.coord == nrow( cl ) )
158 | 
159 | # identify which columns are integer types
160 | ic <- sapply( x , is.integer )
161 | 
162 | # coerce every integer column to numeric
163 | x[ , ic ] <- sapply( x[ , ic ] , as.numeric )
164 | 
165 | # create a prevR object like a professional.
166 | pro <- 
167 | 	as.prevR(
168 | 		x , 
169 | 		c( id = "dhsclust" , x = "longnum" , y = "latnum" , n = "denom" , wn = "wdenom" , pos = "numer" , wpos = "wnumer" ) ,
170 | 		bounds
171 | 	)
172 | 
173 | # want to take a first glance at the sampling clusters
174 | # of the 2008 egypt demographic and health surveys?
175 | plot( pro )
176 | # woah.
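# one more note on the `ch$w <- ch$v005 / 1000000` line back in step 2:
# dhs recode files store the sample weight (v005) as an integer scaled by one million,
# so dividing by 1e6 simply recovers the actual weight. a toy check with made-up values:
toy.v005 <- c( 1234567 , 876543 )
toy.v005 / 1000000
# 1.234567 and 0.876543 -- weights that average out to about one across the full sample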
177 | 178 | # # end of step 3 # # 179 | # # # # # # # # # # # 180 | 181 | 182 | # # # # # # # # # # # # # # # # # 183 | # # step 4: make a simple map # # 184 | 185 | # compute bandwidths 186 | pro <- rings( pro , N = 1000 ) 187 | 188 | # compute surfaces 189 | pro.map <- kde( pro , N = 1000 , nb.cells = 250 ) 190 | 191 | # plot a simple map comparing 192 | # weighted and unweighted surfaces 193 | spplot( pro.map ) 194 | 195 | # re-create a simple weighted surface, 196 | # but with a prevR palette 197 | spplot( 198 | pro.map, 199 | 'k.wprev.N1000.RInf' , 200 | cuts = 100 , 201 | col.regions = prevR.colors.red( 101 ) , 202 | main = "regional trends of infant mortality" 203 | ) 204 | 205 | # # end of step 4 # # 206 | # # # # # # # # # # # 207 | 208 | 209 | # # # # # # # # # # # # 210 | # # step 5: outline # # 211 | 212 | library(rgeos) 213 | library(raster) 214 | library(rgdal) 215 | library(ggplot2) 216 | library(downloader) 217 | 218 | # load the download_cached and related functions 219 | # to prevent re-downloading of files once they've been downloaded. 220 | source_url( 221 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 222 | prompt = FALSE , 223 | echo = FALSE 224 | ) 225 | 226 | # # note on outline geography selection 227 | # # i am intentionally using sites that host 228 | # # data from many/most/every country worldwide 229 | # # so that this script can be easily extended 230 | # # to whatever country you're working on ;) 231 | 232 | # # # map of the world # # # 233 | 234 | # initiate a temporary file 235 | tf <- tempfile() 236 | 237 | # use eurostat's map of the world 238 | world.fn <- "http://epp.eurostat.ec.europa.eu/cache/GISCO/geodatafiles/CNTR_2014_03M_SH.zip" 239 | 240 | # store it to the local disk 241 | download_cached( world.fn , tf ) 242 | 243 | # unzip it 244 | world.uz <- unzip( tf , exdir = tempdir() ) 245 | 246 | # identify the shapefile 247 | world.sfn <- grep( 'CNTR_RG(.*)shp$' , world.uz , value = TRUE ) 248 | 249 | # read it in 250 | world.shp <- readOGR( world.sfn , layer = gsub( "\\.shp" , "" , basename( world.sfn ) ) ) 251 | world.shp <- spTransform( world.shp , CRS( "+proj=longlat" ) ) 252 | 253 | # here's the outline of every country in the world 254 | plot( world.shp , fill = 'gray' ) 255 | 256 | # # # map of the nile # # # 257 | 258 | # use ucdavis's map of egypt's waterways 259 | ucd.fn <- "http://biogeo.ucdavis.edu/data/diva/wat/EGY_wat.zip" 260 | 261 | # store it to the local disk 262 | download_cached( ucd.fn , tf ) 263 | 264 | # unzip it 265 | ucd.uz <- unzip( tf , exdir = tempdir() ) 266 | 267 | # this file contains lots of information. 
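# (an aside: grep( ... , value = TRUE ) returns the matching text itself rather than
#  its position, which is how the shapefile gets picked out of each unzipped archive.
#  a toy with invented file names:
#  grep( "water_areas(.*)shp$" , c( "EGY_water_areas_dcw.shp" , "EGY_water_areas_dcw.dbf" ) , value = TRUE )
#  returns only the first element.)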
268 | ucd.uz 269 | 270 | # identify the waterways shapefile 271 | waterways.sfn <- grep( "water_areas(.*)shp$" , ucd.uz , value = TRUE ) 272 | 273 | # read it in 274 | waterways.shp <- readOGR( waterways.sfn , layer = gsub( "\\.shp" , "" , basename( waterways.sfn ) ) ) 275 | waterways.shp <- spTransform( waterways.shp , CRS( "+proj=longlat" ) ) 276 | 277 | # here's the outline of all water in egypt 278 | plot( waterways.shp ) 279 | 280 | # here's the outline of only the nile + lakes + suez canal 281 | plot( subset( waterways.shp , grepl( "LAKE|CANAL|NILE" , NAME ) ) ) 282 | 283 | # # # keep going into sudan # # # 284 | 285 | # use ucdavis's map of sudan's waterways 286 | sucd.fn <- "http://biogeo.ucdavis.edu/data/diva/wat/SDN_wat.zip" 287 | 288 | # store it to the local disk 289 | download_cached( sucd.fn , tf ) 290 | 291 | # unzip it 292 | sucd.uz <- unzip( tf , exdir = tempdir() ) 293 | 294 | # this file contains lots of information. 295 | sucd.uz 296 | 297 | # identify the waterways shapefile 298 | sudanwater.sfn <- grep( "water_areas(.*)shp$" , sucd.uz , value = TRUE ) 299 | 300 | # read it in 301 | sudanwater.shp <- readOGR( sudanwater.sfn , layer = gsub( "\\.shp" , "" , basename( sudanwater.sfn ) ) ) 302 | sudanwater.shp <- spTransform( sudanwater.shp , CRS( "+proj=longlat" ) ) 303 | 304 | # here's the outline of all water in sudan 305 | plot( sudanwater.shp ) 306 | 307 | # here's the outline of the nile into sudan 308 | plot( subset( sudanwater.shp , grepl( "NILE|LAKE" , NAME ) ) ) 309 | 310 | 311 | # # # map of egyptian states # # # 312 | 313 | # use uc davis's administrative regions 314 | admin.fn <- "http://biogeo.ucdavis.edu/data/diva/adm/EGY_adm.zip" 315 | 316 | # store it to the local disk 317 | download_cached( admin.fn , tf ) 318 | 319 | # unzip it 320 | admin.uz <- unzip( tf , exdir = tempdir() ) 321 | 322 | # this file contains a few different levels 323 | # of administrative borders. 324 | admin.uz 325 | 326 | # identify the national and state border shapefiles 327 | nation.sfn <- grep( "adm0(.*)shp$" , admin.uz , value = TRUE ) 328 | states.sfn <- grep( "adm1(.*)shp$" , admin.uz , value = TRUE ) 329 | 330 | # read in both 331 | nation.shp <- readOGR( nation.sfn , layer = gsub( "\\.shp" , "" , basename( nation.sfn ) ) ) 332 | states.shp <- readOGR( states.sfn , layer = gsub( "\\.shp" , "" , basename( states.sfn ) ) ) 333 | 334 | nation.shp <- spTransform( nation.shp , CRS( "+proj=longlat" ) ) 335 | states.shp <- spTransform( states.shp , CRS( "+proj=longlat" ) ) 336 | 337 | # # ready to stack all four maps? 338 | 339 | # calculate the bounding box of egypt, 10% bigger than the country 340 | bb <- bbox( as( 1.05 * extent( states.shp ) , "SpatialPolygons" ) ) 341 | 342 | # initiate the lowest layer (the world) 343 | plot( world.shp , xlim = bb[ 1 , ] , ylim = bb[ 2 , ] , col = 'gray' , fill = TRUE , border = 'white' ) 344 | 345 | # turn gray off for egypt only, then add state boundaries in gray 346 | plot( states.shp , add = TRUE , col = 'white' , fill = TRUE , border = 'gray' ) 347 | 348 | # add the nile 349 | plot( subset( waterways.shp , grepl( "NILE" , NAME ) ) , add = TRUE , col = 'lightblue' , border = 'lightblue' ) 350 | 351 | # you can add the national border in black if you want 352 | plot( nation.shp , add = TRUE , border = 'black' ) 353 | # but i don't think it's necessary 354 | 355 | # # not bad for a start, huh? 
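# a quick self-contained illustration of what all of those spTransform() calls are doing
# (one made-up point near cairo, nothing read from disk):
library(sp)
toy.point <- SpatialPoints( cbind( 31.24 , 30.04 ) , proj4string = CRS( "+proj=longlat" ) )
proj4string( toy.point )
# every layer above gets pushed onto this same "+proj=longlat" system with spTransform(),
# so the world, the states, the nile, and the ocean can all be drawn on one set of axes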
356 | 
357 | # draw a rectangle 25% bigger than the original state
358 | eg.shp.blank <- as( 1.5 * extent( states.shp ) , "SpatialPolygons" )
359 | 
360 | # draw a rectangle 0% bigger than the original state
361 | eg.box <- as( extent( states.shp ) , "SpatialPolygons" )
362 | 
363 | 
364 | # # worldwide coastlines # #
365 | 
366 | coast.fn <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/physical/ne_10m_ocean.zip"
367 | 
368 | # store it to the local disk
369 | download_cached( coast.fn , tf )
370 | 
371 | # unzip it
372 | coast.uz <- unzip( tf , exdir = tempdir() )
373 | 
374 | # this file contains every coastline everywhere
375 | coast.uz
376 | 
377 | # identify the worldwide ocean
378 | coast.sfn <- grep( "ocean(.*)shp$" , coast.uz , value = TRUE )
379 | 
380 | # read in the ocean
381 | coast.shp <- readOGR( coast.sfn , layer = gsub( "\\.shp" , "" , basename( coast.sfn ) ) )
382 | 
383 | # put the ocean in longlat format
384 | coast.shp <- spTransform( coast.shp , CRS( "+proj=longlat" ) )
385 | 
386 | # limit the egyptian coast shapefile
387 | # to only points within the larger bounding box
388 | eg.coast <- gIntersection( eg.shp.blank , coast.shp )
389 | # so you're not carrying around the whole world's ocean
390 | 
391 | # one more note: `XK` is a disputed region. see the difference?
392 | plot( subset( world.shp , CNTR_ID %in% 'EG' ) )
393 | plot( subset( world.shp , ( CNTR_ID %in% c( 'EG' , 'XK' ) ) ) )
394 | 
395 | # these are huge objects, so in order to conserve ram
396 | # fortify what you need for ggplot2 and toss all other objects.
397 | fcoast <- fortify( eg.coast ) ; rm( eg.coast )
398 | # keep everything but the disputed region plus egypt proper
399 | wshape <- fortify( subset( world.shp , !( CNTR_ID %in% c( 'EG' , 'XK' ) ) ) ) ; rm( world.shp )
400 | fnation <- fortify( nation.shp )
401 | fstate <- fortify( states.shp ) ; rm( states.shp )
402 | fnile <- fortify( subset( waterways.shp , grepl( "NILE|LAKE|CANAL" , NAME ) ) ) ; rm( waterways.shp )
403 | snile <- fortify( subset( sudanwater.shp , grepl( "NILE|LAKE" , NAME ) ) ) ; rm( sudanwater.shp )
404 | # got 'em all, i think. clear up RAM
405 | gc()
406 | 
407 | # # end of step 5 # #
408 | # # # # # # # # # # #
409 | 
410 | 
411 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
412 | # # step 6: re-create the prevR object on a rectangle # #
413 | 
414 | library(prevR)
415 | 
416 | # re-create a prevR object like a professional.
417 | bbpro <- 
418 | 	as.prevR(
419 | 		x , 
420 | 		c( id = "dhsclust" , x = "longnum" , y = "latnum" , n = "denom" , wn = "wdenom" , pos = "numer" , wpos = "wnumer" ) ,
421 | 		as( eg.box , "SpatialPolygons" )
422 | 	)
423 | # note the use of `eg.box` -- the plain rectangle drawn from the state borders above --
424 | # rather than the boundary obtained from the create.boundary function,
425 | # so the smoothed surface fills the entire rectangle
426 | # and can be masked by the border and water layers later on
427 | 
428 | your.N <- 1000
429 | 
430 | # re-compute bandwidths
431 | bbpro <- rings( bbpro , N = your.N )
432 | 
433 | # re-compute surfaces
434 | bbpro.map <- kde( bbpro , N = your.N , nb.cells = 250 )
435 | 
436 | # coerce this result to a data.frame object
437 | map.df <- na.omit( as.data.frame( bbpro.map ) )
438 | 
439 | # name your variable something less mathy
440 | map.df$im <- map.df[ , paste0( "k.wprev.N" , your.N , ".RInf" ) ]
441 | 
442 | # sort and move on.
443 | map.df <- map.df[ order( map.df$x , map.df$y ) , ] 444 | 445 | # # end of step 6 # # 446 | # # # # # # # # # # # 447 | 448 | 449 | # # # # # # # # # # # # # # # # # # # # # # # 450 | # # step 7: decide on your map parameters # # 451 | 452 | library(ggplot2) 453 | library(scales) 454 | library(mapproj) 455 | 456 | 457 | # initiate the simple map 458 | first.map <- 459 | qplot( 460 | x , 461 | y , 462 | data = map.df , 463 | colour = im , 464 | xlab = NULL , 465 | ylab = NULL 466 | ) 467 | 468 | # look at that.. not bad not bad 469 | first.map 470 | 471 | # give this map a geom_tile layer 472 | eg.map <- 473 | ggplot( 474 | map.df , 475 | aes( x = x , y = y ) 476 | ) + geom_tile( aes( fill = im ) ) 477 | 478 | # set the bounding box limits that 479 | # you and i agreed to earlier in the script 480 | # oh, also, remove all map crap 481 | eg.map <- 482 | eg.map + 483 | 484 | xlab( "" ) + ylab( "" ) + 485 | 486 | scale_x_continuous( limits = bb[ 1 , ] , breaks = NULL , oob = squish ) + 487 | 488 | scale_y_continuous( limits = bb[ 2 , ] , breaks = NULL , oob = squish ) + 489 | 490 | theme( 491 | legend.position = "none" , 492 | panel.grid.major = element_blank(), 493 | panel.grid.minor = element_blank(), 494 | panel.background = element_blank(), 495 | panel.border = element_blank(), 496 | axis.ticks = element_blank() 497 | ) 498 | 499 | # cleaner still. 500 | eg.map 501 | 502 | # # end of step 7 # # 503 | # # # # # # # # # # # 504 | 505 | 506 | # # # # # # # # # # # 507 | # # step 8: color # # 508 | 509 | library(ggplot2) 510 | 511 | eg.map + scale_fill_gradient( low = 'green' , high = 'red' ) 512 | 513 | eg.map + scale_fill_gradient( low = 'white' , high = 'blue' ) 514 | 515 | eg.map + scale_fill_gradient( low = muted( 'blue' ) , high = muted( 'red' ) ) 516 | 517 | # and some prevR-custom colors. 
518 | eg.map + scale_fill_gradientn( colours = prevR.colors.gray( 20 ) ) 519 | 520 | eg.map + scale_fill_gradientn( colours = prevR.colors.blue( 20 ) ) 521 | 522 | eg.map + scale_fill_gradientn( colours = prevR.colors.red( 20 ) ) 523 | 524 | # let's save that default 525 | eg.map <- eg.map + scale_fill_gradientn( colours = prevR.colors.red( 20 ) ) 526 | 527 | # # end of step 8 # # 528 | # # # # # # # # # # # 529 | 530 | 531 | # # # # # # # # # # # # # # # # # # # # 532 | # # step 9: draw borders and waters # # 533 | 534 | library(plyr) 535 | library(ggplot2) 536 | 537 | # store this information in a layer 538 | national.border.layer <- geom_path( data = fnation , aes( x = long , y = lat , group = group ) , colour = 'lightgray' ) 539 | 540 | # plot the result if you like 541 | eg.map + national.border.layer 542 | 543 | # store this information in a layer 544 | state.border.layer <- geom_path( data = fstate , aes( x = long , y = lat , group = group ) , colour = 'lightgray' ) 545 | 546 | # plot the result if you like 547 | eg.map + state.border.layer 548 | 549 | # # international borders # # 550 | 551 | # store this information in a layer 552 | international.border.layer <- geom_polygon( data = wshape , aes( x = long , y = lat , group = group ) , fill = 'lightgray' , color = 'white' ) 553 | 554 | # plot the result if you like 555 | eg.map + international.border.layer 556 | 557 | # # coastal areas to blank # # 558 | 559 | # fix islands piecing together 560 | fcoast2 <- ddply( fcoast , .( piece ) , function( x ) rbind( x , fcoast[ 1 , ] ) ) 561 | 562 | # convert this fortified object to a ggplot layer 563 | ocean.layer <- geom_polygon( data = fcoast2 , aes( x = long , y = lat , group = id ) , fill = 'lightblue' ) 564 | 565 | eg.map + ocean.layer 566 | 567 | # construct the nile layer 568 | nile.layer <- 569 | geom_polygon( 570 | data = fnile , 571 | aes( x = long , y = lat , group = group ) , 572 | color = 'lightblue' , fill = 'lightblue' 573 | ) 574 | 575 | # everything you do with the egyptian nile, 576 | # also do with the sudanese nile.. 577 | # to keep the river flowing off of the map 578 | snile.layer <- 579 | geom_polygon( 580 | data = snile , 581 | aes( x = long , y = lat , group = group ) , 582 | color = 'lightblue' , fill = 'lightblue' 583 | ) 584 | 585 | 586 | # closer, eh? 587 | eg.map + nile.layer + snile.layer 588 | 589 | 590 | # # external rectangle to blank # # 591 | 592 | # initiate an external rectangle at the edges of the bounding box 593 | # to blank out the furthest extent of the grid 594 | orect <- 595 | geom_rect( 596 | xmin = bb[ 1 , 1 ] , 597 | xmax = bb[ 1 , 2 ] , 598 | ymin = bb[ 2 , 1 ] , 599 | ymax = bb[ 2 , 2 ] , 600 | color = 'white' , 601 | fill = NA , 602 | size = 4 603 | ) 604 | 605 | eg.map + orect 606 | 607 | # # end of step 9 # # 608 | # # # # # # # # # # # 609 | 610 | 611 | # # # # # # # # # # # # # # # # # 612 | # # step 10: project and save # # 613 | 614 | library(ggplot2) 615 | library(mapproj) 616 | 617 | # save each of the layers, 618 | # in order from bottom to top 619 | final.map <- 620 | eg.map + 621 | international.border.layer + 622 | nile.layer + snile.layer + 623 | ocean.layer + 624 | orect 625 | 626 | # here's the final plot 627 | final.map 628 | 629 | # save the file to your current working directory 630 | ggsave( 631 | "2004-2008 infant mortality rate - unprojected.png" , 632 | plot = final.map 633 | ) 634 | # but that's unprojected. you might prefer a projected map. 
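# before picking a projection below, a minimal sketch of the difference between
# coord_fixed() and coord_map() on two made-up points:
library(ggplot2)
library(mapproj)
toy.df <- data.frame( long = c( 25 , 35 ) , lat = c( 22 , 32 ) )
toy.plot <- ggplot( toy.df , aes( long , lat ) ) + geom_point()
# coord_fixed() keeps the raw longitude/latitude values and only forces the aspect ratio
toy.plot + coord_fixed()
# coord_map() actually re-projects every coordinate through the mapproj package,
# which is also why the projected save takes so much longer to render
toy.plot + coord_map( project = "albers" , lat0 = 22 , lat1 = 32 )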
635 | 636 | # # # pick your projection # # # 637 | 638 | # here are lots of choices. choose wisely. 639 | # final.map + coord_map( project = "albers" , lat0 = bb[ 2 , 1 ] , lat1 = bb[ 2 , 2 ] ) 640 | # final.map + coord_fixed() 641 | # final.map + coord_cartesian() 642 | # final.map + coord_map( "gilbert" ) 643 | # final.map + coord_map( "lagrange" ) 644 | # final.map + coord_map( "stereographic" ) 645 | 646 | 647 | # project this pup. i prefer albers for egypt. 648 | projected.map <- final.map + coord_map( project = "albers" , lat0 = bb[ 2 , 1 ] , lat1 = bb[ 2 , 2 ] ) 649 | 650 | # would you like to save this game? 651 | 652 | # # # fair warning # # # 653 | # the projected map takes hours to render. 654 | # choose carefully from the shapes above, 655 | # then leave this save command running overnight. 656 | 657 | # save the projected plot, which takes longer doesn't it. 658 | ggsave( 659 | "2004-2008 infant mortality rate - projected.png" , 660 | plot = projected.map 661 | ) 662 | 663 | # # end of step ten # # 664 | # # # # # # # # # # # # 665 | -------------------------------------------------------------------------------- /how to map the new york city housing and vacancy survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # new york city housing and vacancy survey 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # incorporates shapefiles with more than one starting projection 20 | # binds multiple shapefiles (not layers) to blank out water areas 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # city boro and subboro areas 28 | # (subboro areas are neighborhoods that map to collections of census tracts) 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/new%20york%20city%20housing%20and%20vacancy%20survey%20%28nychvs%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # ratio of persons per room 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # the standard errors calculated from this data set are incorrect, so 56 | # the weights used here are even less scientific than in the other maps. 
57 | 58 | 59 | # # # # # # # # # # # # # # # # # # # # # 60 | # # step 1: load the survey microdata # # 61 | 62 | library(downloader) 63 | 64 | # download the 2011 new york city housing and vacancy survey microdata onto the local disk 65 | years.to.download <- 2011 66 | source_url( "https://raw.github.com/ajdamico/asdfree/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 67 | 68 | # # end of step 1 # # 69 | # # # # # # # # # # # 70 | 71 | 72 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 73 | # # step 2: conduct your analysis of interest at the smallest geography allowed # # 74 | 75 | library(survey) 76 | 77 | # following the analysis examples in the r code repository -- 78 | # https://github.com/ajdamico/asdfree/blob/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey/2011%20analysis%20examples.R 79 | # -- calculate the persons per room rate at the smallest available geographic area 80 | 81 | 82 | # note the large cautionary text that the standard errors for this survey are garbage # 83 | 84 | 85 | # set R to produce conservative standard errors instead of crashing 86 | # http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html 87 | options( survey.lonely.psu = "adjust" ) 88 | # this setting matches the MISSUNIT option in SUDAAN 89 | 90 | # load the 2011 data files 91 | load( 'nychvs11.rda' ) 92 | 93 | # create a fake survey design object with just the `hhweight` 94 | # (household weight) variable 95 | occ.d <- svydesign( ~1 , data = occ , weights = ~hhweight ) 96 | # this svydesign() call is incorrect because nychvs 97 | # does not release its sampling clusters 98 | 99 | # this script is useful as an example of how to map city-wide survey data 100 | # but this actual new york city microdata has incorrect standard errors 101 | 102 | 103 | # the persons per room variable has two decimals 104 | occ.d <- update( occ.d , pproom = as.numeric( crowd100 ) / 100 ) 105 | 106 | # calculate the 2011 persons per room rate 107 | svymean( ~ pproom , occ.d ) 108 | 109 | # # examine which geographies are available # # 110 | 111 | # the new york city housing and vacancy survey identifies records 112 | # from subboro (neighborhood) areas within all five boros 113 | svytable( ~ borough + subboro , occ.d ) 114 | 115 | # simply use both of those geographies in the by= argument 116 | # of the `svyby` command, and re-calculate the poverty rates 117 | smallest.area.statistics <- 118 | svyby( 119 | ~ pproom , 120 | ~ borough + subboro , 121 | occ.d , 122 | svymean 123 | ) 124 | # this is the same command as the city-wide calculation above, 125 | # except these results have been broken into smaller areas. 126 | 127 | # these are the statistics to be mapped 128 | print( smallest.area.statistics ) 129 | # the standard errors are a measure of precision, 130 | # their inverse will serve as the mapping weights 131 | 132 | # again! don't forget that 133 | # the new york city housing and vacancy survey's 134 | # standard errors are not computable. 
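# if the design syntax above looks mysterious, here is a minimal sketch of the same
# svydesign() / svymean() pattern on three made-up households (not nychvs microdata):
library(survey)
toy <- data.frame( pproom = c( 0.50 , 1.25 , 0.75 ) , hhweight = c( 100 , 250 , 150 ) )
toy.design <- svydesign( ids = ~ 1 , data = toy , weights = ~ hhweight )
svymean( ~ pproom , toy.design )
# ids = ~ 1 treats every record as its own sampling unit -- fine for a toy,
# but for the real nychvs it's exactly why the standard errors can't be trusted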
135 | 136 | # make this object easier to type 137 | sas <- smallest.area.statistics 138 | 139 | # # end of step 2 # # 140 | # # # # # # # # # # # 141 | 142 | 143 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 144 | # # step 3: download and import necessary geographic crosswalks # # 145 | 146 | library(downloader) 147 | 148 | # load the download_cached and related functions 149 | # to prevent re-downloading of files once they've been downloaded. 150 | source_url( 151 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 152 | prompt = FALSE , 153 | echo = FALSE 154 | ) 155 | 156 | 157 | # create a temporary file containing the census bureau's 158 | # 2010 census summary file #1 for new york state 159 | # then download the file. 160 | sf1ny.tf <- tempfile() 161 | 162 | 163 | download_cached( 164 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/New_York/ny2010.sf1.zip" , 165 | sf1ny.tf , 166 | mode = 'wb' 167 | ) 168 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 169 | 170 | 171 | # unzip the summary file #1 files 172 | sf1ny.uz <- unzip( sf1ny.tf , exdir = tempdir() ) 173 | 174 | 175 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 176 | sf1ny <- 177 | read.fwf( 178 | sf1ny.uz[ grep( "nygeo2010" , sf1ny.uz ) ] , 179 | c( -8 , 3 , -16 , 2 , 3 , -4 , 5 , -4 , 5 , -4 , 6 , 1 , 4 , -106 , 5 , -142 , 9 , -9 , 11 , 12 ) 180 | ) 181 | 182 | # add columns names matching the census bureau, so it's easy to read 183 | names( sf1ny ) <- 184 | c( "sumlev" , "state" , "county" , "cousub" , "place" , "tract" , "blkgrp" , "block" , "zcta5" , "pop100" , "intptlat" , "intptlon" ) 185 | 186 | # summary level 101 has census tracts and census blocks 187 | sf1ny.101 <- subset( sf1ny , sumlev == "101" ) 188 | 189 | # one record per census block in new york state. see? same number. 190 | nrow( sf1ny.101 ) 191 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 192 | 193 | # and guess what? the total new york population matches as well. 194 | sum( sf1ny.101$pop100 ) 195 | # http://quickfacts.census.gov/qfd/states/36000.html 196 | 197 | 198 | # separately, read in the crosswalk between new york census tracts and nychvs subboro areas # 199 | # http://www.census.gov/housing/nychvs/data/2011/11subcom1.pdf 200 | # http://www.census.gov/housing/nychvs/data/2011/11subcom2.pdf 201 | nycsb.tf <- tempfile() 202 | 203 | download( "https://raw.githubusercontent.com/ajdamico/asdfree/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey/boro%20and%20subboro%20to%20census%20tract%20crosswalk.csv" , nycsb.tf ) 204 | 205 | nychvs.subboro <- read.csv( nycsb.tf ) 206 | 207 | # also define nyc counties 208 | nyc.counties <- c( '005' , '047' , '061' , '081' , '085' ) 209 | 210 | # ahh, we also need a county fips code to new york city borough match. 
211 | boro.to.county.fips <- 212 | data.frame( 213 | boroname = c( 'Bronx' , 'Brooklyn' , 'Manhattan' , 'Queens' , 'Staten Island' ) , 214 | county = as.numeric( nyc.counties ) 215 | ) 216 | 217 | # merge on the borough county fips codes 218 | bo <- merge( nychvs.subboro , boro.to.county.fips ) 219 | 220 | # confirm no record loss from the previous merge 221 | stopifnot( nrow( bo ) == nrow( nychvs.subboro ) ) 222 | 223 | # rename the `bo` data.frame's census tract column to match `sf1ny.101` 224 | names( bo )[ names( bo ) == 'ctract' ] <- 'tract' 225 | 226 | # rename the `bo` data.frame's boro column to match `sas` 227 | names( bo )[ names( bo ) == 'boro' ] <- 'borough' 228 | 229 | # merge this with the new york state summary file #1.. 230 | sf1.bo <- merge( sf1ny.101 , bo ) 231 | 232 | # ..and guess what? now we have a perfect match with new york city's 2010 population. 233 | sum( sf1.bo$pop100 ) 234 | # http://quickfacts.census.gov/qfd/states/36/3651000.html 235 | 236 | # so now we have a data.frame object with 237 | # one record per census block, 238 | # and also with the two geography-levels 239 | # that match the new york city housing and vacancy survey 240 | head( sf1.bo ) 241 | 242 | # and guess what? 243 | # we've now got the census 2010 weighted populations (field pop100) 244 | # and also each census block's centroid latitude & longitude (fields intptlat + intptlon) 245 | 246 | # # end of step 3 # # 247 | # # # # # # # # # # # 248 | 249 | 250 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 251 | # # step 4: merge the results of your survey analysis with the small-area geography # # 252 | 253 | # confirm that we've created all possible geographies correctly. 254 | 255 | # the number of records in our small area statistics.. 256 | sas.row <- nrow( sas ) 257 | 258 | # ..should equal the number of unique-match-merged records.. 259 | mrow <- nrow( merge( unique( sf1.bo[ , c( 'borough' , 'subboro' ) ] ) , sas ) ) 260 | 261 | # ..and it does/they do. 262 | stopifnot( sas.row == mrow ) 263 | 264 | # now the census block-level new york city census data *could* merge if you wanted it to. 265 | 266 | 267 | # but you don't. yet. 268 | 269 | 270 | # the standard error (the `se` field) is a measure of precision. 271 | print( sas ) 272 | # the smaller the standard error, the more confident you should be 273 | # that the estimate at a particular geography is correct. 274 | 275 | 276 | # so invert it. you heard me. invert it. 277 | sas$invse <- 1 / sas$se 278 | # a smaller standard error indicates more precision. 279 | 280 | # for our purposes, precision can be considered weight! # 281 | 282 | # now we've got the weight that we should give each of our estimates # 283 | 284 | # distribute that weight across all census blocks # 285 | 286 | # aggregate the 2010 census block populations to the geographies that you have. 287 | popsum <- aggregate( sf1.bo$pop100 , by = ( sf1.bo[ , c( 'borough' , 'subboro' ) ] ) , sum ) 288 | 289 | # make the column name meaningful 290 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 291 | 292 | # merge the popsum onto the sasfile 293 | sas <- merge( sas , popsum ) 294 | 295 | # now. 
merge
296 | # the persons per room rate (the variable of interest)
297 | # the inverted standard error (the total weight of the broad geography)
298 | # the population sum (the total population of all census blocks that are part of that geography)
299 | 
300 | x <- merge( sf1.bo , sas )
301 | 
302 | # confirm no record loss
303 | stopifnot( nrow( x ) == nrow( sf1.bo ) )
304 | 
305 | 
306 | # (this is the fun part)
307 | # calculate the weight of each census block
308 | x$weight <- x$invse * ( x$pop100 / x$popsum )
309 | 
310 | # note that weight of all census blocks put together
311 | # sums to the `invse` on the original analysis file
312 | stopifnot( all.equal( sum( x$weight ) , sum( sas$invse ) ) )
313 | 
314 | # remove records with zero population
315 | x <- subset( x , weight > 0 )
316 | 
317 | # scale all weights so that they average to one
318 | x$weight <- x$weight / mean( x$weight )
319 | 
320 | # you're done preparing your data.
321 | # keep only the columns you need.
322 | x <- x[ , c( 'pproom' , 'weight' , 'intptlat' , 'intptlon' ) ]
323 | 
324 | # # end of step 4 # #
325 | # # # # # # # # # # #
326 | 
327 | 
328 | # # # # # # # # # # # # # # # # # # # # # # #
329 | # # step 5: decide on your map parameters # #
330 | 
331 | library(maptools)
332 | library(ggplot2)
333 | library(rgdal)
334 | library(RColorBrewer)
335 | library(scales)
336 | 
337 | # before you ever touch surface smoothing or kriging,
338 | # make some decisions about how you generally want
339 | # your map to look: the projection and coloring
340 | 
341 | # the options below simply use hadley wickham's ggplot2
342 | # with the census block-level persons per room rates and centroids
343 | 
344 | 
345 | # initiate a simple map
346 | nyc.map <-
347 | ggplot( data = x , aes( x = intptlon , y = intptlat ) ) +
348 | geom_point( data = x , aes( colour = pproom ) )
349 | 
350 | # remove all map crap.
351 | nyc.map <-
352 | nyc.map +
353 | 
354 | xlab( "" ) + ylab( "" ) +
355 | 
356 | # force the x and y axis limits to the shape of the city and don't do anything special for off-map values
357 | scale_x_continuous( limits = c( min( x$intptlon ) , max( x$intptlon ) ) , breaks = NULL , oob = squish ) +
358 | # since we're going to add lots of surrounding-area detail!
359 | scale_y_continuous( limits = c( min( x$intptlat ) , max( x$intptlat ) ) , breaks = NULL , oob = squish ) +
360 | 
361 | theme(
362 | legend.position = "none" ,
363 | panel.grid.major = element_blank(),
364 | panel.grid.minor = element_blank(),
365 | panel.background = element_blank(),
366 | panel.border = element_blank(),
367 | axis.ticks = element_blank()
368 | )
369 | 
370 | # print the map without any projection
371 | nyc.map
372 | 
373 | 
374 | # check out some purty colors.
375 | 
376 | # from http://colorbrewer2.org/
377 | 
378 | # three sequential color schemes
379 | YlOrBr.3.p <- colorRampPalette( brewer.pal( 3 , "YlOrBr" ) )
380 | YlOrBr.9.p <- colorRampPalette( brewer.pal( 9 , "YlOrBr" ) )
381 | Purples.9.p <- colorRampPalette( brewer.pal( 9 , "Purples" ) )
382 | 
383 | # one diverging color scheme
384 | RdGy.11.p <- colorRampPalette( rev( brewer.pal( 11 , "RdGy" ) ) )
385 | 
386 | # here's what the map looks like in black n white.
387 | nyc.map + scale_colour_gradient( low = 'white' , high = 'black' ) 388 | 389 | # or one of the colorbrewer schemes 390 | nyc.map + scale_colour_gradientn( colours = YlOrBr.3.p( 100 ) ) 391 | nyc.map + scale_colour_gradientn( colours = YlOrBr.9.p( 100 ) ) 392 | nyc.map + scale_colour_gradientn( colours = RdGy.11.p( 100 ) ) 393 | nyc.map + scale_colour_gradientn( colours = Purples.9.p( 100 ) ) 394 | 395 | # download and read-in the new york city parks shapefile 396 | tf <- tempfile() 397 | 398 | download_cached( "http://www.nyc.gov/html/dpr/nycbigapps/DPR_Parks_001.zip" , tf ) 399 | 400 | z <- unzip( tf , exdir = tempdir() ) 401 | 402 | sfname <- z[ grep( 'shp$' , z ) ] 403 | 404 | # new york city's shapefiles are not in longlat projection format, so 405 | # use `readOGR` instead of `readShapePoly` here to capture the map projection 406 | parks.shp <- readOGR( sfname , layer = gsub( "\\.shp" , "" , basename( sfname ) ) ) 407 | 408 | # convert the shapefile to longlat (which matches the us census bureau's summary file #1) 409 | parks.shp <- spTransform( parks.shp , CRS( "+proj=longlat" ) ) 410 | # now `parks.shp` will overlay properly with us census bureau files. 411 | 412 | # prepare the parks shapefile for ggplot2 413 | parks <- fortify( parks.shp ) 414 | 415 | # the `parks.shp` has a weird circle in the upper-left. 416 | parks <- subset( parks , long > -75 ) 417 | # hack it off by removing all points to the west of -75 longitude 418 | 419 | # create a dark green layer of parks in new york city 420 | park.layer <- 421 | geom_polygon( 422 | data = parks , 423 | aes( x = long , y = lat , group = group ) , 424 | fill = '#abdda4' 425 | ) 426 | 427 | # that's what the map looks like so far.. 428 | nyc.map + park.layer 429 | 430 | # ..save it if you like. 431 | nyc.map <- nyc.map + park.layer 432 | 433 | # oh sorry, if you'd like black & white again, add this segment. 434 | nyc.map + scale_colour_gradient( low = 'white' , high = 'black' ) 435 | 436 | # download and read-in the new york city clipped-to-shoreline borough map 437 | download_cached( "http://www.nyc.gov/html/dcp/download/bytes/nybb_14c.zip" , tf ) 438 | # from http://www.nyc.gov/html/dcp/html/bytes/districts_download_metadata.shtml 439 | 440 | z <- unzip( tf , exdir = tempdir() ) 441 | 442 | sfname <- z[ grep( 'shp$' , z ) ] 443 | 444 | # once again, new york city's shapefiles are not in longlat projection format, so 445 | # use `readOGR` instead of `readShapePoly` here to capture the map projection 446 | boro.shp <- readOGR( sfname , layer = gsub( "\\.shp" , "" , basename( sfname ) ) ) 447 | 448 | # convert the shapefile to longlat (which matches the us census bureau's summary file #1) 449 | boro.shp <- spTransform( boro.shp , CRS( "+proj=longlat" ) ) 450 | # now `boro.shp` will overlay properly with us census bureau files. 451 | 452 | # prepare the boro shapefile for ggplot2 453 | boro <- fortify( boro.shp ) 454 | 455 | # create a borough-border around the city 456 | boro.layer <- 457 | geom_path( 458 | data = boro , 459 | aes( x = long , y = lat , group = group ) 460 | ) 461 | 462 | # and would you look at that? 463 | nyc.map + boro.layer 464 | # some of the city parks are in the ocean. we'll have to snip off those later. 465 | 466 | # let's save all of those attributes, including the coloring 467 | nyc.map <- nyc.map + boro.layer + scale_colour_gradient( low = 'white' , high = 'black' ) 468 | 469 | # and finally, try some projections! 470 | 471 | # which of these do you prefer? 
472 | nyc.map + coord_map( "newyorker" , r = 0 ) 473 | nyc.map + coord_map( "newyorker" , r = 10 ) 474 | nyc.map + coord_map( "newyorker" , r = 20 ) 475 | nyc.map + coord_map( "newyorker" , r = 30 ) 476 | nyc.map + coord_map( "newyorker" , r = 40 ) 477 | 478 | # meh, i like the map better without any. 479 | nyc.map 480 | 481 | # # end of step 5 # # 482 | # # # # # # # # # # # 483 | 484 | 485 | # # # # # # # # # # # # 486 | # # step 6: outline # # 487 | 488 | library(raster) 489 | library(rgdal) 490 | library(RCurl) 491 | 492 | # new york city has lots of water areas within the city limits 493 | # instead of the regular census bureau tiger shapefiles, 494 | # the city government provides the cleanest clipped shapefile. 495 | 496 | # draw a rectangle 5% bigger than the city limits 497 | boro.shp.out <- as( 1.1 * extent( boro.shp ), "SpatialPolygons" ) 498 | 499 | # get the directory listing of the folder with all water layers from census 500 | water.ftp <- 'ftp://ftp2.census.gov/geo/tiger/TIGER2013/AREAWATER/' 501 | 502 | water.files <- 503 | paste0( 504 | water.ftp , 505 | strsplit( 506 | getURL( 507 | water.ftp , 508 | ftp.use.epsv = FALSE , 509 | ftplistonly = TRUE 510 | ) , 511 | '\\s+' )[[ 1 ]] 512 | ) 513 | 514 | # limit these files to new york, new jersey, and connecticut 515 | # you can get more counties than you need, the program will just run a bit slower. 516 | nynjct.water <- 517 | water.files[ 518 | # bergen, hudson, essex, passaic, morris, somerset, union, middlesex, monmouth in new jersey, or 519 | substr( water.files , 61 , 65 ) %in% c( '34003' , '34017' , '34013' , '34031' , '34027' , '34035' , '34039' , '34023' , '34025' ) | 520 | 521 | # rockland, westchester, nassau, suffolk, orange, putnam in new york state 522 | substr( water.files , 61 , 65 ) %in% c( '36087' , '36119' , '36059' , '36103' , '36071' , '36079' ) | 523 | 524 | # fairfield in connecticut 525 | substr( water.files , 61 , 65 ) %in% c( '09001' ) 526 | ] 527 | 528 | # download and extract to a temporary directory 529 | invisible( sapply( nynjct.water , function( x ) { 530 | path <- file.path( tempdir() , basename( x ) ) 531 | download_cached( x , destfile = path , mode = 'wb' ) 532 | unzip( path , exdir = file.path( tempdir() , 'watershps' ) ) 533 | } ) ) 534 | 535 | # read in all shps, and prepend shapefile name to identifiers 536 | shps <- lapply( sub( '\\.zip' , '' , basename( nynjct.water ) ) , function( x ) { 537 | shp <- readOGR( file.path( tempdir() , 'watershps' ) , x ) 538 | shp <- spChFIDs( shp , paste0( x , '_' , sapply( slot( shp , "polygons" ) , slot , "ID" ) ) ) 539 | shp 540 | }) 541 | # this step above removes potential duplicate identifiers 542 | 543 | # rbind to a single object 544 | water.shp <- do.call( rbind , as.list( shps ) ) 545 | 546 | # want to see all of the little waters of the region that can now be blanked out? 547 | plot( water.shp ) 548 | # these water layers do not come with the city's shapefile, so construct them independently. 549 | 550 | # prepare the surrounding region shapefile for ggplot2 551 | water <- fortify( water.shp ) 552 | 553 | 554 | lightblue.water.layer <- 555 | geom_polygon( 556 | data = water , 557 | aes( x = long , y = lat , group = group ) , 558 | fill = 'lightblue' 559 | ) 560 | 561 | # here's the surrounding area water in light blue so you can see it.. 
562 | nyc.map + lightblue.water.layer 563 | 564 | # ..but we actually want to blank this on the final map, 565 | # so make it white on the layer that gets saved 566 | water.layer <- 567 | geom_polygon( 568 | data = water , 569 | aes( x = long , y = lat , group = group ) , 570 | fill = 'white' 571 | ) 572 | 573 | 574 | # outline the surrounding landmasses as well 575 | ccbf.tf <- tempfile() 576 | 577 | # new york city borders the ocean, 578 | # so use the census bureau's cartographic boundary files 579 | # instead of the regular tiger shapefiles 580 | 581 | download_cached( 582 | "http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_county_500k.zip" , 583 | ccbf.tf , 584 | mode = 'wb' 585 | ) 586 | 587 | ccbf.uz <- unzip( ccbf.tf , exdir = tempdir() ) 588 | 589 | ccbf.shp <- readShapePoly( ccbf.uz[ grep( 'shp$' , ccbf.uz ) ] ) 590 | 591 | nynjct.shp <- 592 | subset( 593 | ccbf.shp , 594 | GEOID %in% 595 | c( '34003' , '34017' , '34013' , '34031' , '34027' , '34035' , '34039' , '34023' , '34025' , 596 | '36087' , '36119' , '36059' , '36103' , '36071' , '36079' , 597 | '09001' 598 | ) 599 | ) 600 | 601 | # prepare the surrounding region shapefile for ggplot2 602 | nynjct <- fortify( nynjct.shp ) 603 | 604 | # gray out surrounding landmasses 605 | nynjct.layer <- 606 | geom_polygon( 607 | data = nynjct , 608 | aes( x = long , y = lat , group = group ) , 609 | fill = 'lightgray' 610 | ) 611 | 612 | # and would you look at that? 613 | nyc.map + nynjct.layer 614 | # no? we will re-order the layers later. 615 | # but now you've got a blank-gray of the surrounding areas. 616 | 617 | # put it all together, what have you got? 618 | nyc.map + nynjct.layer + lightblue.water.layer 619 | # hey okay. that looks like crap but we can work with it. 620 | 621 | 622 | # # end of step 6 # # 623 | # # # # # # # # # # # 624 | 625 | 626 | # # # # # # # # # # # # # # # # # # 627 | # # step 7: tie knots and krige # # 628 | 629 | library(sqldf) 630 | 631 | # how many knots should you make? # 632 | 633 | # knots are the computationally-intensive part of this process, 634 | # choose as many as your computer and your patience can handle. 635 | 636 | # you should aim for between 100 - 999 knots, 637 | # but numbers closer to 1,000 will overload smaller computers 638 | 639 | # with census microdata, you've often already got easy access to a relevant geographic grouping 640 | # however for new york city, most of the groupings are either too big to be useful as knots, 641 | # or too small that they'll overload smaller computers. 642 | 643 | 644 | # the city of new york contains 645 | nrow( unique( sf1.bo[ , c( 'boroname' , 'neighborhood' ) ] ) ) 646 | # subboro areas, which is too small of a number of knots. 647 | 648 | # the city also has 649 | nrow( unique( sf1.bo[ , c( 'boroname' , 'tract' ) ] ) ) 650 | # census tracts, which will overload most computers 651 | 652 | # county subdivision and place codes do not exist within the city limits. 653 | nrow( unique( sf1.bo[ , c( 'boroname' , 'cousub' , 'place' ) ] ) ) 654 | 655 | # so be creative. take another look at the summary file #1 documentation 656 | # http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 657 | # oh hayyyy how about zip code tabulation areas? that could work. 658 | nrow( unique( sf1.bo[ , c( 'boroname' , 'zcta5' ) ] ) ) 659 | # nope, too many for a computer with 3gb of ram. 
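# (hedged aside, not part of the original workflow -- the object name and knot count below are illustrative)
# if a candidate grouping gives you more knots than your computer can handle,
# one crude workaround is to randomly thin that grouping down to a manageable count:
# set.seed( 2011 )
# knot.candidates <- unique( sf1.bo[ , c( 'boroname' , 'zcta5' ) ] )
# knot.candidates <- knot.candidates[ sample( nrow( knot.candidates ) , min( 200 , nrow( knot.candidates ) ) ) , ]
# nrow( knot.candidates )   # now a knot count within the 100 - 999 guideline above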
660 | 
661 | # if you have a powerful computer, you might try tying knots at the census tract-level
662 | # otherwise, the neighborhoods defined by the new york city housing and vacancy survey work.
663 | 
664 | # `sqldf` does not like periods in the data.frame name
665 | # # sf1_bo <- sf1.bo
666 | 
667 | # within each borough x zcta5,
668 | # calculate the population-weighted mean of the coordinates
669 | # and (for smoothing) the weighted share at each borough-zcta5 centroid
670 | # # nyc.knots <-
671 | # # sqldf(
672 | # # "select
673 | # # boroname , zcta5 ,
674 | # # sum( pop100 ) as pop100 ,
675 | # # sum( pop100 * intptlon ) / sum( pop100 ) as intptlon ,
676 | # # sum( pop100 * intptlat ) / sum( pop100 ) as intptlat
677 | # # from sf1_bo
678 | # # group by
679 | # # boroname , zcta5"
680 | # # )
681 | # note: this screws up coordinates that cross the international date line
682 | # or the equator. in the united states, only alaska's aleutian islands do this
683 | # and those geographies will be thrown out later. so it doesn't matter.
684 | 
685 | # clear up RAM
686 | rm( sf1.bo ) ; gc()
687 | 
688 | 
689 | # interpolation option one #
690 | library(fields)
691 | 
692 | # instead, we can let the `fields` package attempt to guess knots for you,
693 | xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 200 )$design
694 | # # # # note: tying 200 knots instead of 100 knots takes about an hour longer
695 | # # # # but the final map looks a bit better. if you're just passing through, use 100.
696 | 
697 | # you can look at the estimated knots
698 | plot( xknots )
699 | 
700 | krig.fit <-
701 | Krig(
702 | cbind( x$intptlon , x$intptlat ) ,
703 | x$pproom ,
704 | weights = x$weight ,
705 | knots = xknots
706 | # if you computed the knots yourself, you'll need this knots= line instead:
707 | # knots = cbind( nyc.knots$intptlon , nyc.knots$intptlat )
708 | )
709 | 
710 | # that is: what is the (weighted) relationship between
711 | # your variable of interest (persons per room) and
712 | # the x/y points on a grid?
713 | 
714 | # check this out!
715 | surface( krig.fit )
716 | # you're almost there!
717 | 
718 | 
719 | # interpolation option two #
720 | library(mgcv)
721 | 
722 | gam.fit <-
723 | gam(
724 | pproom ~ s(intptlon , intptlat ) ,
725 | weights = weight ,
726 | data = x
727 | )
728 | 
729 | 
730 | # for the third alternative, keep reading.
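# (optional aside, a hedged sketch only)
# before predicting onto a grid in the next step, you can sanity-check the gam fit
# the same way surface() gave a quick look at the krige fit above:
# summary( gam.fit )             # significance and effective degrees of freedom of the 2-d smooth
# plot( gam.fit , scheme = 2 )   # mgcv's filled-contour view of s( intptlon , intptlat )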
731 | 732 | 733 | # # end of step 7 # # 734 | # # # # # # # # # # # 735 | 736 | 737 | # # # # # # # # # # # # # # # # # # # # 738 | # # step 8: make a grid and predict # # 739 | 740 | library(raster) 741 | 742 | x.range <- bbox( boro.shp.out )[ 1 , ] 743 | y.range <- bbox( boro.shp.out )[ 2 , ] 744 | 745 | # add five percent on each side 746 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.05 747 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.05 748 | 749 | x.range[ 1 ] <- x.range[ 1 ] - x.diff 750 | x.range[ 2 ] <- x.range[ 2 ] + x.diff 751 | y.range[ 1 ] <- y.range[ 1 ] - y.diff 752 | y.range[ 2 ] <- y.range[ 2 ] + y.diff 753 | 754 | # choose the number of ticks (in each direction) on your grid 755 | grid.length <- 500 756 | # # note: smaller grids will render faster 757 | # # (so they're better if you're just playing around) 758 | # # but larger grids will prevent your final plot from 759 | # # being too pixelated, even when zooming in 760 | 761 | 762 | # create some grid data.frame objects, one for each interpolation type 763 | grd <- gam.grd <- krig.grd <- 764 | expand.grid( 765 | intptlon = seq( from = x.range[1] , to = x.range[2] , length = grid.length ) , 766 | intptlat = seq( from = y.range[1] , to = y.range[2] , length = grid.length ) 767 | ) 768 | 769 | 770 | # along your rectangular grid, 771 | # what are the predicted values of 772 | # the number of persons per room? 773 | krig.grd$kout <- predict( krig.fit , krig.grd ) 774 | 775 | # alternate grid using gam.fit 776 | gam.grd$gamout <- predict( gam.fit , gam.grd ) 777 | 778 | # interpolation option three # 779 | library(spatstat) 780 | 781 | smoout <- 782 | Smooth( 783 | ppp( 784 | x$intptlon , 785 | x$intptlat , 786 | x.range , 787 | y.range , 788 | marks = x$pproom 789 | ) , 790 | # here's a good starting point for sigma, but screw around with this value. 791 | sigma = 0.05 , 792 | weights = x$weight 793 | ) 794 | 795 | smoo.grd <- 796 | expand.grid( 797 | intptlon = seq( from = smoout$xrange[1] , to = smoout$xrange[2] , length = smoout$dim[1] ) , 798 | intptlat = seq( from = smoout$yrange[1] , to = smoout$yrange[2] , length = smoout$dim[2] ) 799 | ) 800 | 801 | smoo.grd$smoout <- as.numeric( t( smoout$v ) ) 802 | 803 | # # end of step 8 # # 804 | # # # # # # # # # # # 805 | 806 | 807 | # # # # # # # # # # # # # # # # # # # # # 808 | # # step 9: ggplot and choose options # # 809 | 810 | library(ggplot2) 811 | library(mapproj) 812 | library(scales) 813 | 814 | 815 | # # # psa # # # 816 | # capping your outliers might drastically change your map. 817 | # if you find the 25th percentile and 75th percentile with 818 | # summary( krig.grd$kout ) 819 | # and then replace all `kout` values below the 25th or above the 75th 820 | # with those capped percentile endpoints, i promise promise promise 821 | # your maps will appear quite different. you could cap at the 25th and 75th with.. 822 | # grd.sum <- summary( krig.grd$kout ) 823 | # krig.grd[ krig.grd$kout > grd.sum[ 5 ] , 'kout' ] <- grd.sum[ 5 ] 824 | # krig.grd[ krig.grd$kout < grd.sum[ 2 ] , 'kout' ] <- grd.sum[ 2 ] 825 | # # # end # # # 826 | 827 | 828 | # you don't want to cap at the 25th and 75th? 829 | # well consider one other idea: at least cap at the 5th and 95th of the nation 830 | # this will also increase the visible gradient ultimately plotted. 831 | 832 | # for example, the lowest krigged value is negative. 833 | summary( krig.grd$kout ) 834 | # that's obviously not right. 
835 | 
836 | # if a numeric vector has values below the 5th percentile or above the 95th percentile, cap 'em
837 | minnmax.at.0595 <-
838 | function( z ){
839 | q0595 <- quantile( z , c( 0.05 , 0.95 ) )
840 | z[ z < q0595[ 1 ] ] <- q0595[ 1 ]
841 | z[ z > q0595[ 2 ] ] <- q0595[ 2 ]
842 | z
843 | }
844 | 
845 | # cap all numeric values at the 5th and 95th percentiles.
846 | # krig.grd$kout <- minnmax.at.0595( krig.grd$kout )
847 | # gam.grd$gamout <- minnmax.at.0595( gam.grd$gamout )
848 | # smoo.grd$smoout <- minnmax.at.0595( smoo.grd$smoout )
849 | # sometimes this makes the gradient much more visible.
850 | # but for this statistic, it doesn't do much.
851 | 
852 | # initiate the krige-based plot
853 | krg.plot <-
854 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) +
855 | geom_tile( data = krig.grd , aes( fill = kout ) )
856 | 
857 | # initiate the gam-based plot
858 | gam.plot <-
859 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) +
860 | geom_tile( data = gam.grd , aes( fill = gamout ) )
861 | 
862 | # initiate the smooth-based plot
863 | smooth.plot <-
864 | ggplot( data = smoo.grd , aes( x = intptlon , y = intptlat ) ) +
865 | geom_tile( data = smoo.grd , aes( fill = smoout ) )
866 | 
867 | # view all three grids!
868 | krg.plot
869 | gam.plot
870 | smooth.plot
871 | 
872 | 
873 | # initiate the entire plot
874 | the.plot <-
875 | 
876 | # choose only one of the three interpolation grids
877 | krg.plot +
878 | # gam.plot +
879 | # smooth.plot +
880 | 
881 | # blank out the legend and axis labels
882 | theme(
883 | legend.position = "none" ,
884 | axis.title.x = element_blank() ,
885 | axis.title.y = element_blank()
886 | ) +
887 | 
888 | xlab( "" ) + ylab( "" ) +
889 | 
890 | # force the x and y axis limits to the shape of the city and don't do anything special for off-map values
891 | scale_x_continuous( limits = c( min( grd$intptlon ) , max( grd$intptlon ) ) , breaks = NULL , oob = squish ) +
892 | # since we're going to add lots of surrounding-area detail!
893 | scale_y_continuous( limits = c( min( grd$intptlat ) , max( grd$intptlat ) ) , breaks = NULL , oob = squish ) +
894 | 
895 | theme(
896 | panel.grid.major = element_blank(),
897 | panel.grid.minor = element_blank(),
898 | panel.background = element_blank(),
899 | panel.border = element_blank(),
900 | axis.ticks = element_blank()
901 | )
902 | 
903 | # print the plot to the screen
904 | the.plot
905 | # this is the bottom layer.
906 | 
907 | # next the park layer, don't forget about the park layer!
908 | the.plot + park.layer
909 | 
910 | # are you alright with saving that? save it.
911 | the.plot <- the.plot + park.layer 912 | 913 | # # end of step 9 # # 914 | # # # # # # # # # # # 915 | 916 | 917 | # # # # # # # # # # # # # # # # # # 918 | # # step 10: blank, color, save # # 919 | 920 | library(ggplot2) 921 | library(scales) 922 | library(raster) 923 | library(plyr) 924 | library(rgeos) 925 | 926 | # draw a rectangle 15% bigger than the original city 927 | boro.shp.blank <- as( 1.3 * extent( boro.shp ), "SpatialPolygons" ) 928 | 929 | # compute the difference between new york city and the rectangle 15% beyond the borders 930 | boro.shp.diff <- gDifference( boro.shp.blank , boro.shp ) 931 | 932 | # prepare the difference layer for ggplot2 933 | outside <- fortify( boro.shp.diff ) 934 | 935 | # fix any weird island polygons 936 | outside2 <- ddply( outside , .(piece) , function(x) rbind( x , outside[ 1 , ] ) ) 937 | 938 | # blank out waterways and coastal areas 939 | blank.layer <- 940 | geom_polygon( 941 | data = outside2 , 942 | aes( x = long , y = lat , group = id ) , 943 | fill = 'white' 944 | ) 945 | 946 | # closer, eh? 947 | the.plot + blank.layer 948 | # that blanks out absolutely everything outside of the city limits 949 | 950 | # store this plot 951 | the.plot <- the.plot + blank.layer 952 | 953 | # but actually, we should add back in the landmasses. like this. 954 | the.plot + nynjct.layer + water.layer 955 | 956 | # store this plot 957 | the.plot <- the.plot + nynjct.layer + water.layer 958 | 959 | # plus the borough outlines. do you want to outline the boroughs? 960 | the.plot + boro.layer 961 | 962 | # store this layer on top of everything. 963 | the.plot <- the.plot + boro.layer 964 | 965 | # print with the same purty colors 966 | the.plot + scale_fill_gradient( low = 'white' , high = 'black' ) 967 | the.plot + scale_fill_gradientn( colours = RdGy.11.p( 100 ) ) 968 | the.plot + scale_fill_gradientn( colours = Purples.9.p( 100 ) ) 969 | the.plot + scale_fill_gradientn( colours = YlOrBr.3.p( 100 ) ) 970 | the.plot + scale_fill_gradientn( colours = YlOrBr.9.p( 100 ) ) 971 | 972 | # ooh i like that one mom, can we keep it can we keep it? 973 | final.plot <- the.plot + scale_fill_gradientn( colours = YlOrBr.9.p( 100 ) ) 974 | 975 | # here's the final plot 976 | final.plot 977 | 978 | # would you like to save this game? 979 | 980 | # use cairo-png as your bitmap type 981 | options( bitmapType = "cairo" ) 982 | 983 | # save the file to your current working directory 984 | ggsave( 985 | "2011 new york city number of persons per room.png" , 986 | plot = final.plot , 987 | scale = 2 , 988 | type = "cairo-png" 989 | ) 990 | # happy? 991 | 992 | # save a silly globular-projected file to your current working directory, 993 | ggsave( 994 | "2011 new york city number of persons per room - globular.png" , 995 | plot = final.plot + coord_map( "globular" , orientation = c( 40.55 , -74.13 , -30 ) ) , 996 | scale = 2 , 997 | type = "cairo-png" 998 | ) 999 | # in case you're into that kinda stuff. 1000 | 1001 | # # end of step ten # # 1002 | # # # # # # # # # # # # 1003 | --------------------------------------------------------------------------------