├── 2012 ESS crosswalk of Russian Regions.csv ├── LICENSE.md ├── README.md ├── how to map the american community survey.R ├── how to map the consumer expenditure survey.R ├── how to map the current population survey.R ├── how to map the demographic and health surveys.R ├── how to map the european social survey.R ├── how to map the new york city housing and vacancy survey.R └── how to map the pesquisa nacional por amostra de domicilios.R /2012 ESS crosswalk of Russian Regions.csv: -------------------------------------------------------------------------------- 1 | region,NAME_1,ID_1 2 | RU17,Altay,2513 3 | RU18,Amur,2514 4 | RU12,Arkhangel'sk,2515 5 | RU14,Astrakhan',2516 6 | RU11,Belgorod,2518 7 | RU11,Bryansk,2519 8 | RU15,Chechnya,2521 9 | RU16,Chelyabinsk,2522 10 | RU18,Chukot,2524 11 | RU13,Chuvash,2525 12 | RU18,Evenk,2528 13 | RU17,Irkutsk,2531 14 | RU11,Ivanovo,2532 15 | RU15,Kabardin-Balkar,2533 16 | RU12,Kaliningrad,2534 17 | RU11,Kaluga,2536 18 | RU18,Kamchatka,2537 19 | RU18,Koryak,2547 20 | RU15,Karachay-Cherkess,2538 21 | RU17,Kemerovo,2540 22 | RU18,Khabarovsk,2541 23 | RU13,Kirov,2544 24 | RU11,Kostroma,2548 25 | RU14,Krasnodar,2549 26 | RU17,Krasnoyarsk,2550 27 | RU16,Kurgan,2551 28 | RU11,Kursk,2552 29 | RU12,Leningrad,2553 30 | RU11,Lipetsk,2554 31 | RU18,Maga Buryatdan,2555 32 | RU11,Moskva,2558 33 | RU12,Murmansk,2559 34 | RU13,Nizhegorod,2561 35 | RU12,Novgorod,2563 36 | RU17,Novosibirsk,2564 37 | RU17,Omsk,2565 38 | RU13,Orenburg,2567 39 | RU11,Orel,2566 40 | RU13,Penza,2568 41 | RU13,Perm',2569 42 | RU13,Komi-Permyak,2546 43 | RU18,Primor'ye,2570 44 | RU12,Pskov,2571 45 | RU14,Adygey,2511 46 | RU17,Gorno-Altay,2529 47 | RU13,Bashkortostan,2517 48 | RU17,Buryat,2520 49 | RU15,Dagestan,2527 50 | RU15,Ingush,2530 51 | RU14,Kalmyk,2535 52 | RU12,Karelia,2539 53 | RU17,Khakass,2542 54 | RU12,Komi,2545 55 | RU13,Mariy-El,2556 56 | RU13,Mordovia,2557 57 | RU18,Sakha,2574 58 | RU13,Tatarstan,2582 59 | RU17,Tuva,2586 60 | RU14,Rostov,2572 61 | RU11,Ryazan',2573 62 | RU18,Sakhalin,2575 63 | RU13,Samara,2576 64 | RU13,Saratov,2577 65 | RU15,North Ossetia,2562 66 | RU11,Smolensk,2578 67 | RU12,City of St. Petersburg,2526 68 | RU15,Stavropol',2579 69 | RU16,Sverdlovsk,2580 70 | RU11,Tambov,2581 71 | RU17,Tomsk,2584 72 | RU11,Tula,2585 73 | RU11,Tver',2587 74 | RU16,Tyumen',2588 75 | RU13,Udmurt,2589 76 | RU13,Ul'yanovsk,2590 77 | RU11,Vladimir,2592 78 | RU14,Volgograd,2593 79 | RU12,Vologda,2594 80 | RU11,Voronezh,2595 81 | RU11,Yaroslavl',2597 82 | RU17,Chita,2523 83 | RU17,Aga Buryat,2512 84 | RU17,Khanty-Mansiy,2543 85 | RU17,Yamal-Nenets,2596 86 | RU17,Taymyr,2583 87 | RU17,Ust-Orda Buryat,2591 88 | RU18,Yevrey,2598 89 | RU12,Nenets,2560 90 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 [Free Software Foundation, Inc.](http://fsf.org/) 5 | 6 | Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 7 | 8 | ## Preamble 9 | The GNU General Public License is a free, copyleft license for software and other kinds of works. 10 | 11 | The licenses for most software and other practical works are designed to take away your freedom to share and change the works. 
By contrast, the GNU General Public License is intended to guarantee your freedom to 12 | share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. 13 | 14 | When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for 15 | them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. 16 | 17 | To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if 18 | you modify it: responsibilities to respect the freedom of others. 19 | 20 | For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. 21 | 22 | Developers that use the GNU GPL protect your rights with two steps: 23 | 24 | 1. assert copyright on the software, and 25 | 2. offer you this License giving you legal permission to copy, distribute and/or modify it. 26 | 27 | For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as 28 | changed, so that their problems will not be attributed erroneously to authors of previous versions. 29 | 30 | Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. 31 | 32 | Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. 33 | 34 | The precise terms and conditions for copying, distribution and modification follow. 35 | 36 | ## TERMS AND CONDITIONS 37 | 38 | ### 0. Definitions. 39 | *This License* refers to version 3 of the GNU General Public License. 40 | 41 | *Copyright* also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. 42 | 43 | *The Program* refers to any copyrightable work licensed under this License. Each licensee is addressed as *you*. 
*Licensees* and *recipients* may be individuals or organizations. 44 | 45 | To *modify* a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a *modified version* of the earlier work or a work *based on* the earlier work. 46 | 47 | A *covered work* means either the unmodified Program or a work based on the Program. 48 | 49 | To *propagate* a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. 50 | 51 | To *convey* a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. 52 | 53 | An interactive user interface displays *Appropriate Legal Notices* to the extent that it includes a convenient and prominently visible feature that 54 | 55 | 1. displays an appropriate copyright notice, and 56 | 2. tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. 57 | 58 | If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 59 | 60 | ### 1. Source Code. 61 | The *source code* for a work means the preferred form of the work for making modifications to it. *Object code* means any non-source form of a work. 62 | 63 | A *Standard Interface* means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. 64 | 65 | The *System Libraries* of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A *Major Component*, in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. 66 | 67 | The *Corresponding Source* for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. 
For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. 68 | 69 | The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. 70 | 71 | The Corresponding Source for a work in source code form is that same work. 72 | 73 | ### 2. Basic Permissions. 74 | All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. 75 | 76 | You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do 77 | not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of 78 | your copyrighted material outside their relationship with you. 79 | 80 | Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 81 | 82 | ### 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 83 | No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. 84 | 85 | When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 86 | 87 | ### 4. Conveying Verbatim Copies. 88 | You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. 89 | 90 | You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 91 | 92 | ### 5. Conveying Modified Source Versions. 
93 | You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: 94 | 95 | - a) The work must carry prominent notices stating that you modified it, and giving a relevant date. 96 | - b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to *keep intact all notices*. 97 | - c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. 98 | - d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. 99 | 100 | A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an *aggregate* if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 101 | 102 | ### 6. Conveying Non-Source Forms. 103 | You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: 104 | 105 | - a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. 106 | - b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either 107 | 1. a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or 108 | 2. access to copy the Corresponding Source from a network server at no charge. 109 | 110 | - c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. 111 | - d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. 
You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. 112 | - e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. 113 | 114 | A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. 115 | 116 | A *User Product* is either 117 | 118 | 1. a *consumer product*, which means any tangible personal property which is normally used for personal, family, or household purposes, or 119 | 2. anything designed or sold for incorporation into a dwelling. 120 | 121 | In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, *normally used* refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. 122 | 123 | *Installation Information* for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. 124 | 125 | If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). 126 | 127 | The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and 128 | protocols for communication across the network.
129 | 130 | Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 131 | 132 | ### 7. Additional Terms. 133 | *Additional permissions* are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. 134 | 135 | When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 136 | 137 | Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: 138 | 139 | a. Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or 140 | b. Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or 141 | c. Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or 142 | d. Limiting the use for publicity purposes of names of licensors or authors of the material; or 143 | e. Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or 144 | f. Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. 145 | 146 | All other non-permissive additional terms are considered *further restrictions* within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. 147 | 148 | If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. 
149 | 150 | Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 151 | 152 | ### 8. Termination. 153 | You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). 154 | 155 | However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated 156 | 157 | a. provisionally, unless and until the copyright holder explicitly and finally terminates your license, and 158 | b. permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. 159 | 160 | Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. 161 | 162 | Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 163 | 164 | ### 9. Acceptance Not Required for Having Copies. 165 | You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 166 | 167 | ### 10. Automatic Licensing of Downstream Recipients. 168 | Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. 169 | 170 | An *entity transaction* is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. 171 | 172 | You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. 
For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 173 | 174 | ### 11. Patents. 175 | A *contributor* is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's *contributor version*. 176 | 177 | A contributor's *essential patent claims* are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, *control* includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. 178 | 179 | Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. 180 | 181 | In the following three paragraphs, a *patent license* is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to 182 | sue for patent infringement). To *grant* such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. 183 | 184 | If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either 185 | 186 | 1. cause the Corresponding Source to be so available, or 187 | 2. arrange to deprive yourself of the benefit of the patent license for this particular work, or 188 | 3. arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. 189 | 190 | *Knowingly relying* means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. 191 | 192 | If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. 193 | 194 | A patent license is *discriminatory* if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. 
You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license 195 | 196 | a. in connection with copies of the covered work conveyed by you (or copies made from those copies), or 197 | b. primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. 198 | 199 | Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 200 | 201 | ### 12. No Surrender of Others' Freedom. 202 | If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 203 | 204 | ### 13. Use with the GNU Affero General Public License. 205 | Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 206 | 207 | ### 14. Revised Versions of this License. 208 | The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. 209 | 210 | Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License *or any later version* applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. 211 | 212 | If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. 213 | 214 | Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 215 | 216 | ### 15. Disclaimer of Warranty. 
217 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM *AS IS* WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 218 | 219 | ### 16. Limitation of Liability. 220 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 221 | 222 | ### 17. Interpretation of Sections 15 and 16. 223 | If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. 224 | 225 | ## END OF TERMS AND CONDITIONS 226 | ### How to Apply These Terms to Your New Programs 227 | If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. 228 | 229 | To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the *copyright* line and a pointer to where the full notice is found. 230 | 231 | <one line to give the program's name and a brief idea of what it does.> 232 | Copyright (C) <year> <name of author> 233 | 234 | This program is free software: you can redistribute it and/or modify 235 | it under the terms of the GNU General Public License as published by 236 | the Free Software Foundation, either version 3 of the License, or 237 | (at your option) any later version. 238 | 239 | This program is distributed in the hope that it will be useful, 240 | but WITHOUT ANY WARRANTY; without even the implied warranty of 241 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 242 | GNU General Public License for more details. 243 | 244 | You should have received a copy of the GNU General Public License 245 | along with this program. If not, see <http://www.gnu.org/licenses/>. 246 | 247 | Also add information on how to contact you by electronic and paper mail. 248 | 249 | If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: 250 | 251 | <program> Copyright (C) <year> <name of author> 252 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 253 | This is free software, and you are welcome to redistribute it 254 | under certain conditions; type `show c' for details. 255 | 256 | The hypothetical commands `show w` and `show c` should show the appropriate parts of the General Public License. 
Of course, your program's commands might be different; for a GUI interface, you would use an *about box*. 257 | 258 | You should also get your employer (if you work as a programmer) or school, if any, to sign a *copyright disclaimer* for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see [http://www.gnu.org/licenses/](http://www.gnu.org/licenses/). 259 | 260 | The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read [http://www.gnu.org/philosophy/why-not-lgpl.html](http://www.gnu.org/philosophy/why-not-lgpl.html). -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | cartography with complex survey data -------------------------------------------------------------------------------- /how to map the american community survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # american community survey 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # displays a non-ordinal categorical variable 20 | # crosses the international date line 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # state, public use microdata areas 28 | 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/american%20community%20survey%20%28acs%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/American%20Community%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # disproportionate shares of veterans of foreign wars (categorical) 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # map presents shares that are *disproportionately* higher than the statewide average. 56 | # in absolute numbers, gulf war veterans outnumber other veteran categories in four of the five pumas. 
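# an aside before the real work starts, to pin down the statistic being mapped.
# the numbers below are made up for illustration only; they are not survey estimates.
# if 45% of one puma's veterans served during the gulf era while 38% of veterans
# statewide did, that puma's disproportionate gulf share is simply the difference:
0.45 - 0.38
# positive means the puma is over-represented relative to the statewide average,
# negative means under-represented.  since the three era categories exhaust all veterans,
# each puma's three disproportionate shares sum to approximately zero -- a fact that
# gets re-used as a sanity check on the predictions in step 7 below.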
57 | 58 | 59 | # # # # # # # # # # # # # # # # # # # # # 60 | # # step 1: load the survey microdata # # 61 | 62 | # remove the # in order to run this install.packages line only once 63 | # install.packages( "MonetDBLite" ) 64 | 65 | library(downloader) 66 | 67 | # download the 2013 american community survey microdata onto the local disk 68 | # path.to.7z <- "7za" # # only macintosh and *nix users need this line 69 | single.year.datasets.to.download <- 2013 70 | three.year.datasets.to.download <- NULL 71 | five.year.datasets.to.download <- NULL 72 | source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/American%20Community%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 73 | 74 | gc() 75 | 76 | # # end of step 1 # # 77 | # # # # # # # # # # # 78 | 79 | 80 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 81 | # # step 2: conduct your analysis of interest at the smallest geography allowed # # 82 | 83 | library(survey) 84 | library(DBI) # load the DBI package (implements the R-database coding) 85 | library(MonetDBLite) 86 | library(scales) 87 | 88 | # connect to the database 89 | dbfolder <- paste0( getwd() , "/MonetDB" ) 90 | 91 | db <- dbConnect( MonetDBLite::MonetDBLite() , dbfolder ) 92 | 93 | # # # # run your analysis commands # # # # 94 | 95 | # subset the design to only alaska before actually constructing the design 96 | acs.alaska <- dbGetQuery( db , 'select * from acs2013_1yr_m where st = 2' ) 97 | # note: this is not allowed for taylor-series linearized designs 98 | 99 | # disconnect from the current monet database 100 | dbDisconnect( db ) 101 | 102 | # construct a svrepdesign object 103 | alaska.design <- 104 | svrepdesign( 105 | weight = ~pwgtp , 106 | repweights = 'pwgtp[1-9]' , 107 | scale = 4 / 80 , 108 | rscales = rep( 1 , 80 ) , 109 | mse = TRUE , 110 | data = acs.alaska 111 | ) 112 | 113 | # create a denominator variable indicating any period of service 114 | alaska.design <- update( alaska.design , vet = as.numeric( vps > 0 ) ) 115 | 116 | # create a categorical variable indicating era of service 117 | alaska.design <- 118 | update( 119 | alaska.design , 120 | gulf = as.numeric( vps %in% 1:5 ) , 121 | vietnam = as.numeric( vps %in% 6:8 ) , 122 | other = as.numeric( vps %in% 9:15 ) 123 | ) 124 | 125 | 126 | 127 | # statewide era of service shares 128 | sw <- svyratio( ~ gulf + vietnam + other , ~ vet , alaska.design , na.rm = TRUE ) 129 | 130 | # puma-specific era of service shares 131 | ps <- svyby( ~ gulf + vietnam + other , denominator = ~ vet , by = ~ puma , alaska.design , svyratio , na.rm = TRUE ) 132 | 133 | # find the disproportionate shares 134 | ds <- ps[ , 2:4 ] - matrix( coef( sw ) , 5 , 3 , byrow = T ) 135 | 136 | # so look at this table. 137 | ds 138 | 139 | # pumas 101 and 300 have veterans that disproportionately served during the gulf wars (up to the present) 140 | 141 | # pumas 102 and 200 have veterans that disproportionately served during the vietnam war 142 | 143 | # puma 400 has veterans that disproportionately served during another era 144 | 145 | # hold on to these disproportionate shares and the standard errors of the original ratios. 146 | alaska.pumas <- cbind( ps[ 1 ] , ds , ps[ , 5:7 ] ) 147 | 148 | # note that the standard error of the ratio statistic 149 | # and the standard error of the ratio of 150 | # the difference between statewide and puma-level statistics 151 | # are probably not the same. well, i'm sure of it. 
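# (one rougher but arguably more defensible option -- my own suggestion, not part of the
#  original workflow: combine the puma-level and statewide standard errors as if the two
#  estimates were independent.  they are not independent, since every puma contributes to
#  the statewide ratio, so this approximation will tend to overstate the uncertainty.
#  it also assumes the survey package's SE() extractor accepts the svyratio output in `sw`.)
# approx.se.diff <- sqrt( ps[ , 5:7 ]^2 + matrix( SE( sw )^2 , 5 , 3 , byrow = TRUE ) )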
152 | # if you're an academic statistician, you might be mad at me 153 | # for making this half-assed calculation right here. 154 | 155 | # github makes it easy to patch and edit and update other people's code. 156 | 157 | # go for it ;) 158 | 159 | # remove those slashvets from the column names 160 | names( alaska.pumas ) <- gsub( "/vet" , "" , names( alaska.pumas ) ) 161 | 162 | # these are the small area statistics to be mapped 163 | print( alaska.pumas ) 164 | # the standard errors are a measure of precision, 165 | # their inverse will serve as the mapping weights 166 | 167 | # make this object easier to type 168 | sas <- alaska.pumas 169 | 170 | # # end of step 2 # # 171 | # # # # # # # # # # # 172 | 173 | 174 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 175 | # # step 3: download and import necessary geographic crosswalks # # 176 | 177 | library(downloader) 178 | library(maptools) 179 | 180 | # load the download_cached and related functions 181 | # to prevent re-downloading of files once they've been downloaded. 182 | source_url( 183 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 184 | prompt = FALSE , 185 | echo = FALSE 186 | ) 187 | 188 | 189 | # create a temporary file containing the census bureau's 190 | # 2010 census tract to 2010 puma crosswalk 191 | # then download the file. 192 | ctpxw.tf <- tempfile() 193 | 194 | download_cached( 195 | "http://www2.census.gov/geo/docs/maps-data/data/rel/2010_Census_Tract_to_2010_PUMA.txt" , 196 | ctpxw.tf , 197 | mode = 'wb' 198 | ) 199 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 200 | 201 | # import this csv file into an R data.frame object 202 | ctpxw <- read.csv( ctpxw.tf ) 203 | 204 | # match the column names of sf1 and of the `sas` output 205 | names( ctpxw ) <- c( 'state' , 'county' , 'tract' , 'puma' ) 206 | 207 | # immediately limit this to alaskan census tracts 208 | ak.ctpxw <- subset( ctpxw , state == 2 ) 209 | 210 | # clear up RAM 211 | rm( ctpxw ) ; gc() 212 | 213 | 214 | # create a temporary file containing the census bureau's 215 | # 2010 census summary file #1 for alaska 216 | # then download the file. 217 | sf1ak.tf <- tempfile() 218 | 219 | download_cached( 220 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/Alaska/ak2010.sf1.zip" , 221 | sf1ak.tf , 222 | mode = 'wb' 223 | ) 224 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 225 | 226 | # unzip the summary file #1 files 227 | sf1ak.uz <- unzip( sf1ak.tf , exdir = tempdir() ) 228 | 229 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 230 | sf1ak <- read.fwf( sf1ak.uz[ grep( "akgeo2010" , sf1ak.uz ) ] , c( -8 , 3 , -16 , 2 , 3 , -22 , 6 , 1 , 4 , -253 , 9 , -9 , 11 , 12 ) ) 231 | 232 | # add columns names matching the census bureau, so it's easy to read 233 | names( sf1ak ) <- c( "sumlev" , "state" , "county" , "tract" , "blkgrp" , "block" , "pop100" , "intptlat" , "intptlon" ) 234 | 235 | # summary level 101 has census tracts and census blocks 236 | sf1ak.101 <- subset( sf1ak , sumlev == "101" ) 237 | 238 | # merge these files together 239 | sf1ak.101 <- merge( sf1ak.101 , ak.ctpxw ) 240 | # the number of records and population sums serve 241 | # as a check to confirm that this merge worked 242 | 243 | # one record per census block in alaska. see? same number. 244 | nrow( sf1ak.101 ) 245 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 246 | 247 | # and guess what? 
the total alaska population matches as well. 248 | sum( sf1ak.101$pop100 ) 249 | # http://quickfacts.census.gov/qfd/states/02000.html 250 | 251 | # clear up RAM 252 | rm( sf1ak ) ; gc() 253 | 254 | 255 | # so now we have a data.frame object with 256 | # one record per census block, 257 | # and also with the geography (puma) 258 | # that matches the american community survey 259 | head( sf1ak.101 ) 260 | 261 | # and guess what? 262 | # we've now got the census 2010 weighted populations (field pop100) 263 | # and also each census block's centroid latitude & longitude (fields intptlat + intptlon) 264 | 265 | # # end of step 3 # # 266 | # # # # # # # # # # # 267 | 268 | 269 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 270 | # # step 4: merge the results of your survey analysis with the small-area geography # # 271 | 272 | # confirm that we've created all possible geographies correctly. 273 | 274 | # the number of records in our small area statistics.. 275 | sas.row <- nrow( sas ) 276 | 277 | # ..should equal the number of unique-match-merged records.. 278 | mrow <- nrow( merge( unique( sf1ak.101[ "puma" ] ) , sas ) ) 279 | 280 | # ..and it does/they do. 281 | stopifnot( sas.row == mrow ) 282 | 283 | # now the census block-level alaska census data *could* merge if you wanted it to. 284 | 285 | 286 | # but you don't. yet. 287 | 288 | 289 | # the standard errors (the `se.` fields) are measures of precision. 290 | print( sas ) 291 | # the smaller the standard error, the more confident you should be 292 | # that the estimate at a particular geography is correct. 293 | 294 | 295 | # so invert them. you heard me. invert them. 296 | sas$invse.gulf <- 1 / sas$se.gulf 297 | sas$invse.vietnam <- 1 / sas$se.vietnam 298 | sas$invse.other <- 1 / sas$se.other 299 | # a smaller standard error indicates more precision. 300 | 301 | # for our purposes, precision can be considered weight! # 302 | 303 | # now we've got the weight that we should give each of our estimates # 304 | 305 | # distribute that weight across all census blocks # 306 | 307 | 308 | # aggregate the 2010 census block populations to the geographies that you have. 309 | popsum <- aggregate( sf1ak.101$pop100 , by = ( sf1ak.101[ "puma" ] ) , sum ) 310 | 311 | # make the column name meaningful 312 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 313 | 314 | # merge the popsum onto the `sas` file 315 | sas <- merge( sas , popsum ) 316 | 317 | # now. 
merge 318 | # the disproportionate veteran era in each puma (the variable of interest) 319 | # the inverted standard errors (the total weight of the broad geography) 320 | # the population sum (the total population of all census blocks that are part of that geography) 321 | 322 | x <- merge( sf1ak.101 , sas ) 323 | 324 | # confirm no record loss 325 | stopifnot( nrow( x ) == nrow( sf1ak.101 ) ) 326 | 327 | 328 | # (this is the fun part) 329 | # calculate the weight at each census block 330 | x$weight.gulf <- x$invse.gulf * ( x$pop100 / x$popsum ) 331 | x$weight.vietnam <- x$invse.vietnam * ( x$pop100 / x$popsum ) 332 | x$weight.other <- x$invse.other * ( x$pop100 / x$popsum ) 333 | 334 | # note that the weight of all census blocks put together 335 | # sums to the `invse` on the original analysis file 336 | stopifnot( sum( x$weight.gulf ) == sum( sas$invse.gulf ) ) 337 | stopifnot( sum( x$weight.vietnam ) == sum( sas$invse.vietnam ) ) 338 | stopifnot( sum( x$weight.other ) == sum( sas$invse.other ) ) 339 | 340 | # remove records with zero population across all three measures 341 | x <- subset( x , weight.gulf > 0 | weight.vietnam > 0 | weight.other > 0 ) 342 | 343 | # scale all weights so that they average to one 344 | x$weight.gulf <- x$weight.gulf / mean( x$weight.gulf ) 345 | x$weight.vietnam <- x$weight.vietnam / mean( x$weight.vietnam ) 346 | x$weight.other <- x$weight.other / mean( x$weight.other ) 347 | 348 | 349 | # you're done preparing your data. 350 | # keep only the columns you need. 351 | x <- x[ , c( 'gulf', 'vietnam' , 'other' , 'weight.gulf' , 'weight.vietnam' , 'weight.other' , 'intptlat' , 'intptlon' ) ] 352 | 353 | # pop quiz: which states are the furthest north, east, south, west? 354 | # if you guessed alaska, maine, hawaii, alaska, you are wrong! 355 | # the answer is alaska, alaska, hawaii, alaska. 356 | 357 | # a few of the aleutians cross the international date line. 358 | 359 | # do you want to keep the edges of the aleutian islands in your map? 360 | # of course you do! here's an ultra-simple recode to keep them gridded together. 
361 | x <- transform( x , intptlon = ifelse( intptlon > 0 , intptlon - 360 , intptlon ) ) 362 | 363 | 364 | # # end of step 4 # # 365 | # # # # # # # # # # # 366 | 367 | 368 | # # # # # # # # # # # # 369 | # # step 5: outline # # 370 | 371 | library(maptools) 372 | library(raster) 373 | library(rgeos) 374 | library(rgdal) 375 | library(ggplot2) 376 | 377 | # make a character vector containing the shapefiles to download 378 | shftd <- 379 | c( 380 | # download the clipped alaska public use microdata area map, described 381 | # https://www.census.gov/geo/maps-data/maps/2010puma/st02_ak.html 382 | 'http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_02_puma10_500k.zip' , 383 | 384 | # download the clipped nationwide state outlines 385 | 'http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_state_500k.zip' , 386 | 387 | # download the roads in alaska 388 | 'http://www2.census.gov/geo/tiger/TIGER2013/PRISECROADS/tl_2013_02_prisecroads.zip' 389 | ) 390 | 391 | # initiate a function to download and import all census bureau shapefiles 392 | daiacbsf <- 393 | function( fn , myproj = "+init=epsg:2163" ){ 394 | tf <- tempfile() 395 | 396 | # # note: to re-download a file from scratch, add the parameter usecache = FALSE # # 397 | download_cached( fn , tf , mode = 'wb' ) 398 | 399 | # unzip the downloaded file to a temporary directory 400 | shp.uz <- unzip( tf , exdir = tempdir() ) 401 | 402 | # figure out which filename ends with "shp" 403 | sfname <- grep( 'shp$' , shp.uz , value = TRUE ) 404 | 405 | # read in the shapefile, using the correct layer 406 | sf <- readOGR( sfname , layer = gsub( "\\.shp" , "" , basename( sfname ) ) ) 407 | 408 | # project this shapefile immediately 409 | # this projection (and a few others) keeps 410 | # the aleutian islands that cross the 411 | # international date line easy to work with. 412 | spTransform( sf , CRS( myproj ) ) 413 | } 414 | 415 | # run all downloads at once, store the result in a list. 416 | asf <- sapply( shftd , daiacbsf ) 417 | 418 | # pull out the clipped state borders of alaska only 419 | alaska.borders <- subset( asf[[2]] , STATEFP == '02' ) 420 | 421 | # plot as-is. see how the aleutians screw up the map? 422 | plot( alaska.borders ) 423 | 424 | # add puma boundaries 425 | plot( asf[[1]] , add = TRUE ) 426 | 427 | # refresh the map with state borders only 428 | plot( alaska.borders ) 429 | 430 | # add roads 431 | plot( asf[[3]] , add = TRUE , col = 'red' ) 432 | 433 | # draw a rectangle 15% bigger than the original state 434 | ak.shp.blank <- as( 1.3 * extent( alaska.borders ) , "SpatialPolygons" ) 435 | 436 | # calculate the difference between the rectangle and the actual shape 437 | ak.shp.diff <- gDifference( ak.shp.blank , alaska.borders ) 438 | # this will be used to cover up points outside of alaska's state borders 439 | 440 | # this box will later blank out the surrounding area 441 | plot( ak.shp.diff ) 442 | 443 | # # end of step 5 # # 444 | # # # # # # # # # # # 445 | 446 | 447 | # # # # # # # # # # # # # # # # # # 448 | # # step 6: tie knots and krige # # 449 | 450 | library(sqldf) 451 | 452 | # # warning warning # # # # warning warning # # 453 | # alaska has a vast geography and highly skewed population centers 454 | # kriging functions might not converge. that's why there are other options ;) 455 | # # warning warning # # # # warning warning # # 456 | 457 | 458 | # how many knots should you make? 
# 459 | 460 | # knots are the computationally-intensive part of this process, 461 | # choose as many as your computer and your patience can handle. 462 | 463 | # you should aim for between 100 - 999 knots, 464 | # but numbers closer to 1,000 will overload smaller computers 465 | 466 | # you could let the `fields` package attempt to guess knots for you, 467 | # xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 100 )$design 468 | # but with census microdata, you've already got easy access to a relevant geographic grouping 469 | 470 | # the sqldf() function doesn't like `.` in data.frame object names 471 | sf1s <- sf1ak.101 472 | 473 | # exactamundo same transform operation as you saw previously on `x` 474 | sf1s <- transform( sf1s , intptlon = ifelse( intptlon > 0 , intptlon - 360 , intptlon ) ) 475 | 476 | # within each county x census tract 477 | # calculate the population-weighted mean of the coordinates 478 | ct.knots <- 479 | sqldf( 480 | "select 481 | county , tract , 482 | sum( pop100 ) as pop100 , 483 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 484 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 485 | from sf1s 486 | group by 487 | county , tract" 488 | ) 489 | # note: this screws up coordinates that cross the international date line 490 | # or the equator. in the united states, only alaska's aleutian islands do this 491 | # and we're mapping alaska, aren't we? good thing we fixed it, huh? 492 | 493 | gc() 494 | 495 | # interpolation option one # 496 | library(fields) 497 | 498 | krig.fit.gulf <- 499 | Krig( 500 | cbind( x$intptlon , x$intptlat ) , 501 | Y = x$gulf , 502 | weights = x$weight.gulf , 503 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 504 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 505 | # knots = xknots 506 | ) 507 | 508 | gc() 509 | 510 | krig.fit.vietnam <- 511 | Krig( 512 | cbind( x$intptlon , x$intptlat ) , 513 | x$vietnam , 514 | weights = x$weight.vietnam , 515 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 516 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 517 | # knots = xknots 518 | ) 519 | 520 | gc() 521 | 522 | krig.fit.other <- 523 | Krig( 524 | cbind( x$intptlon , x$intptlat ) , 525 | x$other , 526 | weights = x$weight.other , 527 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 528 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 529 | # knots = xknots 530 | ) 531 | 532 | gc() 533 | 534 | # that is: what is the (weighted) relationship between 535 | # your variable of interest (veteran service eras) and 536 | # the x/y points on a grid? 537 | 538 | # check this out! 539 | surface( krig.fit.gulf ) 540 | surface( krig.fit.vietnam ) 541 | surface( krig.fit.other ) 542 | # you're almost there! 
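# (optional sanity check -- a rough sketch of my own, not part of the original script:
#  feed the block-level points back into one of the kriging fits and see how closely the
#  smoothed surface tracks the puma-level inputs it was fit to.  `predict` on a Krig object
#  accepts a two-column longitude/latitude matrix, the same way it is used on the grid in
#  step 7 below.  `in.sample.gulf` is just an illustrative name.)
# in.sample.gulf <- predict( krig.fit.gulf , cbind( x$intptlon , x$intptlat ) )
# cor( in.sample.gulf , x$gulf )
# plot( x$gulf , in.sample.gulf )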
543 | 
544 | 
545 | # interpolation option two #
546 | library(mgcv)
547 | 
548 | 
549 | gam.gulf <-
550 | 	gam(
551 | 		gulf ~ s( intptlon , intptlat ) ,
552 | 		weights = weight.gulf ,
553 | 		data = x
554 | 	)
555 | 
556 | gam.vietnam <-
557 | 	gam(
558 | 		vietnam ~ s( intptlon , intptlat ) ,
559 | 		weights = weight.vietnam ,
560 | 		data = x
561 | 	)
562 | 
563 | gam.other <-
564 | 	gam(
565 | 		other ~ s( intptlon , intptlat ) ,
566 | 		weights = weight.other ,
567 | 		data = x
568 | 	)
569 | 
570 | # # end of step 6 # #
571 | # # # # # # # # # # #
572 | 
573 | 
574 | # # # # # # # # # # # # # # # # # # # #
575 | # # step 7: make a grid and predict # #
576 | 
577 | # use as fine of a grid as your computer can handle
578 | grid.length <- 750
579 | # # note: smaller grids will render faster
580 | # # (so they're better if you're just playing around)
581 | # # but larger grids will prevent your final plot from
582 | # # being too pixelated, even when zooming in
583 | 
584 | 
585 | x.range <- c( min( x$intptlon ) , max( x$intptlon ) )
586 | y.range <- c( min( x$intptlat ) , max( x$intptlat ) )
587 | 
588 | # add twenty percent on each side
589 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.2
590 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.2
591 | 
592 | x.range[ 1 ] <- x.range[ 1 ] - x.diff
593 | x.range[ 2 ] <- x.range[ 2 ] + x.diff
594 | y.range[ 1 ] <- y.range[ 1 ] - y.diff
595 | y.range[ 2 ] <- y.range[ 2 ] + y.diff
596 | 
597 | 
598 | grd <- krig.grd <- gam.grd <-
599 | 	expand.grid(
600 | 		intptlon = seq( x.range[ 1 ] , x.range[ 2 ] , length = grid.length ) ,
601 | 		intptlat = seq( y.range[ 1 ] , y.range[ 2 ] , length = grid.length )
602 | 	)
603 | 
604 | 
605 | # along your rectangular grid,
606 | # what are the predicted values of
607 | # each veteran era category
608 | krig.grd$gulf <- predict( krig.fit.gulf , krig.grd[ , 1:2 ] )
609 | 
610 | krig.grd$vietnam <- predict( krig.fit.vietnam , krig.grd[ , 1:2 ] )
611 | 
612 | krig.grd$other <- predict( krig.fit.other , krig.grd[ , 1:2 ] )
613 | 
614 | gam.grd$gulf <- predict( gam.gulf , gam.grd[ , 1:2 ] )
615 | 
616 | gam.grd$vietnam <- predict( gam.vietnam , gam.grd[ , 1:2 ] )
617 | 
618 | gam.grd$other <- predict( gam.other , gam.grd[ , 1:2 ] )
619 | 
620 | 
621 | 
622 | # remember that these values have been re-scaled
623 | # to show how disproportionate they are from the state-wide averages.
624 | # therefore, negative values are possible.
625 | sapply( krig.grd , summary )
626 | sapply( gam.grd , summary )
627 | 
628 | # what we're really hoping for is that
629 | # the overall mean averages out to zero
630 | sum( sapply( gam.grd , summary )[ 4 , 3:5 ] )
631 | 
632 | # in general, these predictions at each point should approximately sum to zero
633 | summary( rowSums( krig.grd[ , 3:5 ] ) )
634 | summary( rowSums( gam.grd[ , 3:5 ] ) )
635 | 
636 | # # end of step 7 # #
637 | # # # # # # # # # # #
638 | 
639 | 
640 | # # # # # # # # # # # # # # # # # # # # # #
641 | # # step 8: limit information and color # #
642 | 
643 | 
644 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
645 | # # # warning # # # warning # # # # # # warning # # # # # # warning # # # # # # warning # #
646 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
647 | # if your data are not binomial, then by mapping with a single image, you lose clarity    #
648 | # if you have three levels of information and you generate two maps, you can get an idea #
649 | # about the entire distribution of the variable. 
if you attempt encoding three levels or # 650 | # more into a single map, you will explode. just kidding rofl lmao but you will lose info # 651 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 652 | # # # warning # # # warning # # # # # # warning # # # # # # warning # # # # # # warning # # 653 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 654 | 655 | 656 | library(scales) 657 | 658 | # from among the three categories, find the maximum disproportionate share 659 | krig.grd$svccat <- c( 'gulf' , 'vietnam' , 'other' )[ apply( krig.grd[ , 3:5 ] , 1 , which.max ) ] 660 | 661 | # save only that max 662 | krig.grd$statistic <- apply( krig.grd[ , 3:5 ] , 1 , max ) 663 | 664 | # it's important to note that i've thrown out a lot of information here 665 | krig.grd <- krig.grd[ , c( 'intptlon' , 'intptlat' , 'statistic' , 'svccat' ) ] 666 | 667 | # do any points not make sense? 668 | summary( krig.grd$statistic ) 669 | 670 | # yup, the minimum is below zero. 671 | krig.grd$statistic <- pmax( 0 , krig.grd$statistic ) 672 | 673 | 674 | # from among the three categories, find the maximum disproportionate share 675 | gam.grd$svccat <- c( 'gulf' , 'vietnam' , 'other' )[ apply( gam.grd[ , 3:5 ] , 1 , which.max ) ] 676 | 677 | # save only that max 678 | gam.grd$statistic <- apply( gam.grd[ , 3:5 ] , 1 , max ) 679 | 680 | # it's important to note that i've thrown out a lot of information here 681 | gam.grd <- gam.grd[ , c( 'intptlon' , 'intptlat' , 'statistic' , 'svccat' ) ] 682 | 683 | # again, do any points not make sense? 684 | summary( gam.grd$statistic ) 685 | 686 | # another point below zero. 687 | gam.grd$statistic <- pmax( 0 , gam.grd$statistic ) 688 | 689 | # our complex sample survey-computed statistics rely on categories, 690 | # but the final map only shows the _highest_ disproportionate item 691 | # from each puma. for example, 692 | 693 | # puma 300 is slightly disproportionately more gulf veterans 694 | # and it's also near evenly-split between being 695 | # slightly disproportionately less vietnam vets and 696 | # slightly disproportionately less other era vets 697 | 698 | # puma 101 is disproportionately more gulf vets too, 699 | # but it has very heavily disproportionately fewer vietnam vets 700 | # and has close to state-average veterans from other eras. 701 | sas 702 | 703 | # only the "disproportionately more" share variable gets retained 704 | # in these predictions. all other information gets thrown away. 705 | # this is the nature of mapping categorical variables 706 | 707 | # if you are intent on showing a multi-color gradient with all information, 708 | # you can use the rgb() function, but fair warning: 709 | # the values mush together quickly and your map will probably look like ass. 710 | # i tried building color gradients to map multi-dimensional categorical values 711 | # like the multi-category values in 712 | ps 713 | # but on the color gradient, the red/green/blue values on the palette tend to mush together. 714 | # for example, on this plot right here.. 715 | plot( 1:5 , rep( 1 , 5 ) , cex = 3 , pch = 16 , col = mapply( rgb , ps[ , 2 ] , ps[ , 3 ] , ps[ , 4 ] ) ) 716 | # red is gulf veterans, green is vietnam veterans, blue is other veterans. 717 | # the colors end up just looking drab. 718 | 719 | # even when re-scaled.. 
720 | rsps <- apply( ps[ , 2:4 ] , 2 , rescale ) 721 | # ..the points with high relative rates in two categories (because they're the lowest in the third) 722 | # have a lot of mixture (puma 300) and are therefore indecipherable. what is the color brown here? 723 | plot( 1:5 , rep( 1 , 5 ) , cex = 3 , pch = 16 , col = mapply( rgb , rsps[ , 1 ] , rsps[ , 2 ] , rsps[ , 3 ] ) ) 724 | # high vietnam era and also high gulf era service. 725 | text( 1:5 , rep( 1.2 , 5 ) , ps[ , 1 ] ) 726 | # multi-dimensional categorical variable coloring is a nightmare. 727 | 728 | # you have to simplify it. 729 | # simplifying it means throwing out information. 730 | 731 | # now where were we? 732 | 733 | 734 | library(RColorBrewer) 735 | 736 | # draw three gradients 737 | tg <- 738 | lapply( 739 | brewer.pal( 3 , 'Set1' ) , 740 | function( z ) colorRampPalette( c( 'white' , z ) )( 101 ) 741 | ) 742 | 743 | # check out each of these three colors, mapped from opaque to intense. 744 | plot( rep( 0:100 , 3 ) , rep( c( -1 , 0 , 1 ) , each = 101 ) , col = unlist( tg ) , pch = 16 , cex = 3 ) 745 | 746 | # draw an alternate three gradients 747 | # that start at ~20% ( that is: 25 / 125 ) 748 | # and also use a different palette from colorbrewer2.org 749 | tag <- 750 | lapply( 751 | brewer.pal( 3 , 'Dark2' ) , 752 | function( z ) colorRampPalette( c( 'white' , z ) )( 125 )[ 25:125 ] 753 | ) 754 | 755 | # check out each of these three colors, mapped from opaque to intense. 756 | plot( rep( 0:100 , 3 ) , rep( c( -1 , 0 , 1 ) , each = 101 ) , col = unlist( tag ) , pch = 16 , cex = 3 ) 757 | 758 | 759 | # # rescale both of the interpolated grids 760 | krig.grd$statistic <- krig.grd$statistic * ( 1 / max( krig.grd$statistic ) ) 761 | gam.grd$statistic <- gam.grd$statistic * ( 1 / max( gam.grd$statistic ) ) 762 | # note that the re-scaling gets done across all categories, 763 | # and not individually within each category. 764 | 765 | # add the hex color identifier 766 | krig.grd$color.value <- 767 | ifelse( krig.grd$svccat == 'gulf' , tg[[1]][ round( krig.grd$statistic * 100 ) ] , 768 | ifelse( krig.grd$svccat == 'vietnam' , tg[[2]][ round( krig.grd$statistic * 100) ] , 769 | ifelse( krig.grd$svccat == 'other' , tg[[3]][ round( krig.grd$statistic * 100 ) ] , 770 | NA ) ) ) 771 | 772 | # awwwwwwww yeah, something's happening now. 773 | plot( krig.grd$intptlon , krig.grd$intptlat , col = krig.grd$color.value , pch = 16 , cex = 3 ) 774 | 775 | # add the alternate hex color identifier 776 | krig.grd$alt.color <- 777 | ifelse( krig.grd$svccat == 'gulf' , tag[[1]][ round( krig.grd$statistic * 100 ) ] , 778 | ifelse( krig.grd$svccat == 'vietnam' , tag[[2]][ round( krig.grd$statistic * 100) ] , 779 | ifelse( krig.grd$svccat == 'other' , tag[[3]][ round( krig.grd$statistic * 100 ) ] , 780 | NA ) ) ) 781 | 782 | # that looks a bit better to me 783 | plot( krig.grd$intptlon , krig.grd$intptlat , col = krig.grd$alt.color , pch = 16 , cex = 3 ) 784 | 785 | 786 | # lower-bound the alternate color to remove the white lines 787 | krig.grd$bound.color <- 788 | ifelse( krig.grd$svccat == 'gulf' , tag[[1]][ pmax( 5 , round( krig.grd$statistic * 100 ) ) ] , 789 | ifelse( krig.grd$svccat == 'vietnam' , tag[[2]][ pmax( 5 , round( krig.grd$statistic * 100) ) ] , 790 | ifelse( krig.grd$svccat == 'other' , tag[[3]][ pmax( 5 , round( krig.grd$statistic * 100 ) ) ] , 791 | NA ) ) ) 792 | 793 | # that's smoothing by hand for you. 
794 | plot( krig.grd$intptlon , krig.grd$intptlat , col = krig.grd$bound.color , pch = 16 , cex = 3 ) 795 | 796 | 797 | # put that color band on the `gam.grd` data.frame as well 798 | gam.grd$bound.color <- 799 | ifelse( gam.grd$svccat == 'gulf' , tag[[1]][ pmax( 5 , round( gam.grd$statistic * 100 ) ) ] , 800 | ifelse( gam.grd$svccat == 'vietnam' , tag[[2]][ pmax( 5 , round( gam.grd$statistic * 100) ) ] , 801 | ifelse( gam.grd$svccat == 'other' , tag[[3]][ pmax( 5 , round( gam.grd$statistic * 100 ) ) ] , 802 | NA ) ) ) 803 | 804 | 805 | # # end of step 8 # # 806 | # # # # # # # # # # # 807 | 808 | 809 | # # # # # # # # # # # # # # # # # # # # # 810 | # # step 9: ggplot and choose options # # 811 | 812 | library(ggplot2) 813 | library(mapproj) 814 | library(scales) 815 | 816 | 817 | # initiate the krige-based plot 818 | krig.grd$color.column <- as.factor( krig.grd$bound.color ) 819 | 820 | krg.plot <- 821 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) + 822 | geom_point( shape = 15 , colour = krig.grd$color.column ) + 823 | scale_fill_manual( values = unique( krig.grd$bound.color ) ) 824 | 825 | 826 | # initiate the gam-based plot 827 | gam.grd$color.column <- as.factor( gam.grd$bound.color ) 828 | 829 | gam.plot <- 830 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) + 831 | geom_point( shape = 15 , colour = gam.grd$color.column ) + 832 | scale_fill_manual( values = unique( gam.grd$bound.color ) ) 833 | 834 | # view both grids! 835 | krg.plot 836 | gam.plot 837 | 838 | 839 | # initiate the entire plot 840 | the.plot <- 841 | 842 | # choose only one of the two interpolation grids 843 | krg.plot + 844 | # gam.plot + 845 | 846 | # blank out the legend and axis labels 847 | theme( 848 | legend.position = "none" , 849 | axis.title.x = element_blank() , 850 | axis.title.y = element_blank() 851 | ) + 852 | 853 | xlab( "" ) + ylab( "" ) + 854 | 855 | # force the x and y axis limits at the shape of the city and don't do anything special for off-map values 856 | scale_x_continuous( limits = c( -191 , -127 ) , breaks = NULL , oob = squish ) + 857 | # since we're going to add lots of surrounding-area detail! 858 | scale_y_continuous( limits = c( 50 , 73 ) , breaks = NULL , oob = squish ) + 859 | 860 | theme( 861 | panel.grid.major = element_blank(), 862 | panel.grid.minor = element_blank(), 863 | panel.background = element_blank(), 864 | panel.border = element_blank(), 865 | axis.ticks = element_blank() 866 | ) 867 | 868 | # print the plot to the screen 869 | the.plot 870 | # this is the bottom layer. 
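# quick illustration of the oob = squish choice used above
# (an aside: scales::squish clamps out-of-bounds values to the limits
# instead of converting them to NA, so grid points just past the axis
# limits still get drawn at the plot edge rather than disappearing)
library(scales)
squish( c( -200 , -150 , -100 ) , range = c( -191 , -127 ) )
# returns -191 -150 -127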
871 | 872 | 873 | # initiate an aleutian islands-focused wrap-around function 874 | s360 <- function( z ){ z[ z$long > 0 , 'long' ] <- z[ z$long > 0 , 'long' ] - 360 ; z } 875 | 876 | 877 | # # alaskan state borders # # 878 | 879 | # convert the alaskan borders to longlat, 880 | # prepare for ggplot2 with `fortify` 881 | # wrap edge points around 882 | ab <- s360( fortify( spTransform( alaska.borders , CRS( "+proj=longlat" ) ) ) ) 883 | 884 | # store this information in a layer 885 | state.border.layer <- geom_path( data = ab , aes( x = long , y = lat , group = group ) , colour = 'darkgrey' ) 886 | 887 | # plot the result 888 | the.plot + state.border.layer 889 | 890 | 891 | # # alaskan main roads # # 892 | 893 | # convert the alaskan borders to longlat, 894 | # prepare for ggplot2 with `fortify` 895 | # wrap edge points around 896 | akr <- s360( fortify( spTransform( asf[[3]] , CRS( "+proj=longlat" ) ) ) ) 897 | 898 | # store this information in a layer 899 | state.roads.layer <- geom_path( data = akr , aes( x = long , y = lat , group=group ) , colour = 'darkred' ) 900 | 901 | # plot the result 902 | the.plot + state.border.layer + state.roads.layer 903 | 904 | # # end of step 9 # # 905 | # # # # # # # # # # # 906 | 907 | 908 | # # # # # # # # # # # # # # # # # # # # # 909 | # # step 10: project, blank, and save # # 910 | 911 | library(ggplot2) 912 | library(scales) 913 | library(raster) 914 | library(plyr) 915 | library(rgeos) 916 | 917 | 918 | # exclude outer alaska if you hate the wilderness or something 919 | the.plot + state.border.layer + coord_cartesian( xlim = c( -155 , max( x$intptlon ) ) , ylim = c( min( x$intptlat ) , 70 ) ) 920 | 921 | # distort the map with simple latitude/longitude scaling 922 | the.plot + state.border.layer + coord_fixed( 2.5 ) 923 | 924 | # this looks crappy, who knows what it is 925 | the.plot + state.border.layer + coord_equal() 926 | 927 | # check out a bunch of other options # 928 | the.plot + state.border.layer + coord_map( project = "cylequalarea" , mean( x$intptlat ) ) 929 | 930 | # here's the one that makes the most sense for alaska 931 | the.plot + state.border.layer + coord_map( project = "conic" , mean( x$intptlat ) , orientation = c( 90 , 0 , -141 ) ) 932 | 933 | # see ?mapproject and the ?coord_* functions for a zillion alternatives 934 | 935 | # store this projection, but not the state border 936 | the.plot <- the.plot + coord_map( project = "conic" , mean( x$intptlat ) , orientation = c( 90 , 0 , -141 ) ) 937 | # into `the.plot` 938 | 939 | 940 | # force the difference shapefile's projection 941 | proj4string( ak.shp.diff ) <- "+init=epsg:2163" 942 | 943 | # initiate the outside blanking layer 944 | outside <- s360( fortify( spTransform( ak.shp.diff , CRS( "+proj=longlat" ) ) ) ) 945 | 946 | # fix islands piecing together 947 | outside2 <- ddply( outside , .( piece ) , function( x ) rbind( x , outside[ 1 , ] ) ) 948 | 949 | # convert this fortified object to a ggplot layer 950 | outside.layer <- geom_polygon( data = outside2 , aes( x = long , y = lat , group = id ) , fill = 'white' ) 951 | 952 | # plot this -- the layer doesn't work, does it? 953 | the.plot + outside.layer 954 | 955 | # five points need to change so we have a real bounding box. 
956 | subset( outside , lat < 45 | lat > 75 | long < -190 | long > -125 ) 957 | 958 | # move all of them counter-clockwise by hand 959 | outside[ outside$order %in% c( 1 , 5 ) , 'long' ] <- -116.6568 960 | # outside[ outside$order %in% c( 1 , 5 ) , 'lat' ] <- 20 961 | 962 | # outside[ outside$order %in% 4 , 'long' ] <- -220 963 | outside[ outside$order %in% 4 , 'lat' ] <- 37.56767 964 | 965 | outside[ outside$order %in% 3 , 'long' ] <- -195.4295 966 | # outside[ outside$order %in% 3 , 'lat' ] <- 100 967 | 968 | # outside[ outside$order %in% 2 , 'long' ] <- -100 969 | outside[ outside$order %in% 2 , 'lat' ] <- 79.36447 970 | 971 | 972 | # fix islands piecing together 973 | outside2 <- ddply( outside , .( piece ) , function( x ) rbind( x , outside[ 1 , ] ) ) 974 | 975 | # convert this fortified object to a ggplot layer 976 | outside.layer <- geom_polygon( data = outside2 , aes( x = long , y = lat , group = id ) , fill = 'white' ) 977 | 978 | # plot this. 979 | the.plot + outside.layer 980 | # that's not so bad, i guess. 981 | 982 | # i don't care for the state border layer, 983 | # but if you want the state border layer, 984 | # use this save line: 985 | final.plot <- the.plot + outside.layer + state.border.layer 986 | # otherwise use this save line: 987 | # final.plot <- the.plot + outside.layer 988 | # you can airbrush the outside blue border 989 | # in microsoft paint or something 990 | # if you want, right? like a boss. 991 | 992 | 993 | # use cairo-png as your bitmap type 994 | options( bitmapType = "cairo" ) 995 | 996 | # uncomment this block to save the file to your current working directory 997 | 998 | # ggsave( 999 | # "2013 alaskan veteran service eras.png" , 1000 | # plot = final.plot , 1001 | # scale = 3 1002 | # ) 1003 | 1004 | # happy? 1005 | 1006 | # # end of step ten # # 1007 | # # # # # # # # # # # # 1008 | -------------------------------------------------------------------------------- /how to map the consumer expenditure survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # consumer expenditure survey 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # this is a good starting point, probably 20 | # the simplest map you can make with survey data 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # state, some metropolitan statistical areas 28 | 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/consumer%20expenditure%20survey%20%28ce%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/Consumer%20Expenditure%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # transportation expenditure as a share of 
total expenditure 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # major downloads required. let this run overnight. 56 | 57 | 58 | # # # # # # # # # # # # # # # # # # # # # 59 | # # step 1: load the survey microdata # # 60 | 61 | library(downloader) 62 | 63 | # download the consumer expenditure survey microdata onto the local disk 64 | source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/Consumer%20Expenditure%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 65 | 66 | # # end of step 1 # # 67 | # # # # # # # # # # # 68 | 69 | 70 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 71 | # # step 2: download and import necessary geographic crosswalks # # 72 | 73 | 74 | # options( encoding = "windows-1252" ) # # only macintosh and *nix users need this line 75 | 76 | 77 | library(downloader) 78 | library(sqldf) 79 | 80 | # load the download_cached and related functions 81 | # to prevent re-downloading of files once they've been downloaded. 82 | source_url( 83 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 84 | prompt = FALSE , 85 | echo = FALSE 86 | ) 87 | 88 | 89 | # create a temporary file containing the census bureau's 90 | # 2010 census summary file #1, then download the file. 91 | sf1.tf <- tempfile() 92 | 93 | # initiate an empty data.frame object 94 | # to store all summary file blocks 95 | sf <- NULL 96 | 97 | # download the census bureau's 2010 summary file one # 98 | 99 | # this takes a long time. 100 | # depending on your internet speed, 101 | # you might need to let this loop run overnight. 102 | 103 | # but this code will cache the downloads onto your local disk 104 | # when you download these files once, 105 | # they'll never need to be downloaded again. 
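# for context on the loop below (a brief aside): base r ships with the
# `state.name` and `state.abb` vectors covering the fifty states,
# which is what builds each state's ftp path, and the district of columbia
# gets special-cased as the fifty-first iteration
head( cbind( state.name , state.abb ) )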
106 | 107 | # loop through every state and dc 108 | for ( state.number in 1:51 ){ 109 | 110 | # after downloading all fifty states, get dc as well 111 | if( state.number == 51 ){ 112 | sn <- "District_of_Columbia" 113 | sa <- "dc" 114 | } else { 115 | sn <- gsub( " " , "_" , state.name[ state.number ] ) 116 | sa <- tolower( state.abb[ state.number ] ) 117 | } 118 | 119 | # create a single-element character string containing the ftp path 120 | ftp.loc <- 121 | paste0( 122 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/" , 123 | sn , 124 | "/" , 125 | sa , 126 | "2010.sf1.zip" 127 | ) 128 | 129 | # download the current state's summary file 130 | download_cached( ftp.loc , sf1.tf , mode = 'wb' ) 131 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 132 | 133 | # unzip the summary file #1 files 134 | sf1.uz <- unzip( sf1.tf , exdir = tempdir() ) 135 | 136 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 137 | sf1 <- read.fwf( sf1.uz[ grep( "geo2010" , sf1.uz ) ] , c( -8 , 3 , -14 , 1 , -1 , 2 , 3 , -22 , 6 , 1 , 4 , -47 , 5 , 2 , -5 , 3 , -191 , 9 , -9 , 11 , 12 ) ) 138 | 139 | # add columns names matching the census bureau, so it's easy to read 140 | names( sf1 ) <- c( "sumlev" , "region" , "state" , "county" , "tract" , "blkgrp" , "block" , "cbsa" , "cbsasc" , "csa" , "pop100" , "intptlat" , "intptlon" ) 141 | 142 | # summary level 101 has metro areas, urban/rural status, and census blocks 143 | sf101 <- subset( sf1 , sumlev == "101" ) 144 | 145 | # within each census tract x cbsa/csa combo, 146 | # calculate the population-weighted mean of the coordinates 147 | sfs <- 148 | sqldf( 149 | "select 150 | region , state , county , tract , cbsa , cbsasc , csa , 151 | count(*) as census_blocks , 152 | sum( pop100 ) as pop100 , 153 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 154 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 155 | from sf101 156 | group by 157 | region , state , county , tract , cbsa , cbsasc , csa" ) 158 | # note: this screws up coordinates that cross the international date line 159 | # or the equator. in the united states, only alaska's aleutian islands do this 160 | # and those geographies will be thrown out later. so it doesn't matter. 161 | 162 | # the above consolidation step isn't necessary if you have a huge computer 163 | # and a lot of time.. but it makes all of the kriging and rendering computations 164 | # work much faster, and mapping at the census tract- versus census block-level 165 | # really doesn't make much of a damn difference. 166 | 167 | # stack these blocks in with all the other states 168 | sf <- rbind( sf , sfs ) 169 | 170 | # remove the single-state data.frame objects and clear up RAM 171 | rm( sf101 , sf1 , sfs ) ; gc() 172 | 173 | # remove the unzipped files from your local disk 174 | file.remove( sf1.uz , sf1.tf ) 175 | 176 | } 177 | 178 | # one record per census block in every state. see? same number. 179 | tapply( sf$census_blocks , sf$state , sum ) 180 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 181 | 182 | # and guess what? the population by state matches as well. 
183 | tapply( sf$pop100 , sf$state , sum ) 184 | # http://en.wikipedia.org/wiki/2010_United_States_Census#State_rankings 185 | 186 | # remove columns you actually don't need 187 | sf <- sf[ , !( names( sf ) %in% c( 'sumlev' , 'block' , 'county' , 'region' , 'tract' , 'blkgrp' , 'cbsasc' ) ) ] 188 | 189 | # remove records with zero population 190 | sf <- subset( sf , pop100 > 0 ) 191 | 192 | # clear up RAM 193 | gc() 194 | 195 | 196 | # so now we have a data.frame object with 197 | # one record per census tract, 198 | # and also with each of the geography-levels 199 | # that match the consumer expenditure survey 200 | head( sf ) 201 | 202 | # and guess what? 203 | # we've now got the census 2010 weighted populations (field pop100) 204 | # and also each census tract's centroid latitude & longitude (fields intptlat + intptlon) 205 | 206 | # add the consumer expenditure survey results' 207 | # geographic identifiers to the census tract data.frame 208 | 209 | # # align psu variables # # 210 | 211 | # note: the primary sampling units available in the ces microdata 212 | # http://www.bls.gov/cex/2013/csxintvwdata.pdf#page=9 213 | # do not perfectly map to combined statistical areas or 214 | # core-based statistical areas, so 215 | # (1) match the geographies that might be matchable 216 | # (2) combine los angeles in both data.frame objects 217 | # (3) make all other records 9999 218 | sf <- 219 | transform( 220 | sf , 221 | 222 | psu = 223 | # new york, ny 224 | ifelse( state %in% 36 & csa %in% 408 , 1109 , 225 | # new york, ct 226 | ifelse( state %in% 9 & csa %in% 408 , 1110 , 227 | # new york, nj 228 | ifelse( state %in% 34 & csa %in% 408 , 1111 , 229 | # philadelphia 230 | ifelse( csa %in% 428 , 1102 , 231 | # boston 232 | ifelse( csa %in% 148 , 1103 , 233 | # chicago 234 | ifelse( csa %in% 176 , 1207 , 235 | # detroit 236 | ifelse( csa %in% 220 , 1208 , 237 | # cleveland 238 | ifelse( csa %in% 184 , 1210 , 239 | # minneapolis 240 | ifelse( csa %in% 378 , 1211 , 241 | # washington dc 242 | ifelse( cbsa %in% 47900 , 1312 , 243 | # baltimore 244 | ifelse( cbsa %in% 12580 , 1313 , 245 | # dallas 246 | ifelse( csa %in% 206 , 1316 , 247 | # houston 248 | ifelse( csa %in% 288 , 1318 , 249 | # atlanta 250 | ifelse( csa %in% 122 , 1319 , 251 | # miami 252 | ifelse( csa %in% 370 , 1320 , 253 | # los angeles - orange 254 | ifelse( csa %in% 348 , 1000 , 255 | # los angeles suburbs 256 | ifelse( csa %in% 348 , 1000 , 257 | # san francisco 258 | ifelse( csa %in% 488 , 1422 , 259 | # seattle 260 | ifelse( csa %in% 500 , 1423 , 261 | # san diego 262 | ifelse( cbsa %in% 41740 , 1424 , 263 | # phoenix 264 | ifelse( cbsa %in% 38060 , 1429 , 265 | 266 | 9999 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 267 | 268 | ) 269 | 270 | # # align metro status variables # # 271 | sf <- transform( sf , smsastat = ifelse( cbsa %in% 99999 , 2 , 1 ) ) 272 | 273 | # which geographies are available amongst all census blocks 274 | sf.available.geographies <- unique( sf[ , c( 'state' , 'psu' , 'smsastat' ) ] ) 275 | # note: this really should be the universe, but it's not because 276 | # the consumer expenditure survey's psus don't appear to perfectly map to census areas 277 | 278 | # # end of step 2 # # 279 | # # # # # # # # # # # 280 | 281 | 282 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 283 | # # step 3: prepare your analysis of interest at the smallest geography allowed # # 284 | 285 | library(survey) 286 | library(plyr) 287 | library(stringr) 288 | 289 | # following the analysis 
examples in the r code repository -- 290 | # # https://github.com/ajdamico/asdfree/blob/master/Consumer%20Expenditure%20Survey/2011%20fmly%20intrvw%20-%20analysis%20examples.R 291 | # -- calculate the transportation share of total expenditure at the smallest available geographic area 292 | 293 | # load all five quarters of 2013 microdata 294 | load( "./2013/intrvw/fmli131x.rda" ) 295 | load( "./2013/intrvw/fmli132.rda" ) 296 | load( "./2013/intrvw/fmli133.rda" ) 297 | load( "./2013/intrvw/fmli134.rda" ) 298 | load( "./2013/intrvw/fmli141.rda" ) 299 | 300 | # load all five quarters of 2012 microdata 301 | load( "./2012/intrvw/fmli121x.rda" ) 302 | load( "./2012/intrvw/fmli122.rda" ) 303 | load( "./2012/intrvw/fmli123.rda" ) 304 | load( "./2012/intrvw/fmli124.rda" ) 305 | load( "./2012/intrvw/fmli131.rda" ) 306 | 307 | # stack all ten quarters 308 | fmly <- rbind.fill( fmli121x , fmli122 , fmli123 , fmli124 , fmli131 , fmli131x , fmli132 , fmli133 , fmli134 , fmli141 ) 309 | 310 | # before anything else, the los angeles suburban split isn't possible on the census side 311 | fmly <- 312 | transform( 313 | fmly , 314 | psu = 315 | # align with los angeles area in the `sf` object 316 | ifelse( psu %in% 1419:1420 , 1000 , 317 | ifelse( psu == '' , 9999 , 318 | as.numeric( psu ) ) ) 319 | ) 320 | 321 | # coerce states to match `sf` as well 322 | fmly$state <- as.numeric( fmly$state ) 323 | 324 | # set blanks to 99s 325 | fmly[ is.na( fmly$state ) , 'state' ] <- 99 326 | 327 | # extract available geographies 328 | fmly.available.geographies <- unique( fmly[ , c( 'state' , 'psu' , 'smsastat' ) ] ) 329 | 330 | # merge this with the sf file's available geographies 331 | ag <- merge( sf.available.geographies , fmly.available.geographies ) 332 | 333 | # create a flag with geographies to keep 334 | ag$keep <- 1 335 | 336 | # merge the available geographies back on 337 | fmly <- merge( fmly , ag , all = TRUE ) 338 | 339 | # anyone with a missing flag needs their geography blanked out 340 | fmly[ is.na( fmly$keep ) , 'state' ] <- 99 341 | fmly[ is.na( fmly$keep ) , 'psu' ] <- 9999 342 | 343 | # remove the flag 344 | fmly$keep <- NULL 345 | 346 | # create a character vector containing 45 variable names (wtrep01, wtrep02, ... 
wtrep44 and finlwt21) 347 | wtrep <- c( paste0( "wtrep" , str_pad( 1:44 , 2 , pad = "0" ) ) , "finlwt21" ) 348 | 349 | # immediately loop through each weight column (stored in the wtrep vector) 350 | for ( i in wtrep ){ 351 | # overwrite all missing values (NA) with zeroes 352 | fmly[ is.na( fmly[ , i ] ) , i ] <- 0 353 | 354 | # since we've pooled two years, divide all weights by two 355 | fmly[ , i ] <- fmly[ , i ] / 2 356 | } 357 | 358 | # create a new variable in the fmly data table called 'totalexp' 359 | # that contains the sum of the total expenditure from the current and previous quarters 360 | fmly$totalexp <- rowSums( fmly[ , c( "totexppq" , "totexpcq" ) ] , na.rm = TRUE ) 361 | 362 | # immediately convert missing values (NA) to zeroes 363 | fmly[ is.na( fmly$totalexp ) , "totalexp" ] <- 0 364 | 365 | # same for transportation 366 | fmly$transexp <- rowSums( fmly[ , c( "transpq" , "transcq" ) ] , na.rm = TRUE ) 367 | fmly[ is.na( fmly$transexp ) , "transexp" ] <- 0 368 | 369 | 370 | # turn on replicate-weighted mean squared errors 371 | options( survey.replicates.mse = TRUE ) 372 | # this matches the official census bureau published methods 373 | 374 | # construct a replicate-weighted survey design object 375 | fmly.design <- 376 | svrepdesign( 377 | repweights = "wtrep[0-9]+" , 378 | weights = ~finlwt21 , 379 | data = fmly 380 | ) 381 | 382 | 383 | # the family tables are no longer necessary 384 | rm( fmly , fmli121x , fmli122 , fmli123 , fmli124 , fmli131 , fmli131x , fmli132 , fmli133 , fmli134 , fmli141 ) ; gc() 385 | # remove them and clear up RAM 386 | 387 | 388 | # calculate the 2012-2013 nationwide ratio of transportation spending as a share of total spending 389 | svyratio( ~ transexp , ~ totalexp , fmly.design ) 390 | 391 | # note: this is almost the same number as the bls-published 2011 share: 392 | # http://www.bls.gov/cex/2011/share/cusize.pdf 393 | 394 | # the smallest geography reasonably extracted 395 | # from this survey microdata set will be 396 | # state + psu + metro status all combined 397 | full.table <- data.frame( svytable( ~ state + psu + smsastat , fmly.design ) ) 398 | # this crosstabulation includes too many zeroes, 399 | # so store the result in a data.frame object 400 | # and only print the non-zero records 401 | subset( full.table , Freq > 0 ) 402 | 403 | # simply use both of those geographies in the by= argument 404 | # of the `svyby` command, and re-calculate the 405 | # transportation expenditure shares 406 | smallest.area.statistics <- 407 | svyby( 408 | ~ transexp , 409 | denominator = ~ totalexp , 410 | by = ~ state + psu + smsastat , 411 | fmly.design , 412 | svyratio 413 | ) 414 | # this is the same command as the nationwide calculation above, 415 | # except these results have been broken into smaller areas. 416 | 417 | # these are the statistics to be mapped 418 | print( smallest.area.statistics ) 419 | # the standard errors are a measure of precision, 420 | # their inverse will serve as the mapping weights 421 | 422 | # make this object easier to type.. 423 | sas <- smallest.area.statistics 424 | 425 | # ..and also easier to read 426 | names( sas )[ names( sas ) == 'transexp/totalexp' ] <- 'share' 427 | names( sas )[ names( sas ) == 'se.transexp/totalexp' ] <- 'se' 428 | 429 | # remove objects you no longer need.. 
430 | rm( fmly.design ) ; gc() 431 | # ..and clear up RAM 432 | 433 | # # end of step 3 # # 434 | # # # # # # # # # # # 435 | 436 | 437 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 438 | # # step 4: merge the results of your survey analysis with the small-area geography # # 439 | 440 | # save a continental states copy of the `sf` table for later 441 | sfsave <- sf[ !( sf$state %in% c( 2 , 15 ) ), c( 'state' , 'csa' , 'cbsa' , 'pop100' , 'intptlat' , 'intptlon' ) ] 442 | # trust me, you'll need this. 443 | 444 | # merge the available geographies on to the census tract file too 445 | sf <- merge( sf , ag , all = TRUE ) 446 | 447 | # note that alaska and hawaii need to be manually removed, 448 | # but some of the geographies in these areas will be collapsed 449 | # into the 99/9999 categories. so pre-collapse, identify them. 450 | sf$akhi <- sf$state %in% c( 2 , 15 ) 451 | 452 | # anyone with a missing flag needs their geography blanked out 453 | sf[ is.na( sf$keep ) , 'state' ] <- 99 454 | sf[ is.na( sf$keep ) , 'psu' ] <- 9999 455 | 456 | # remove the flag 457 | sf$keep <- NULL 458 | 459 | # integers are overflowing 460 | sf$pop100 <- as.numeric( sf$pop100 ) 461 | 462 | # continue being as sparse as possible. remove columns you no longer need. 463 | sf <- sf[ , ( names( sf ) %in% c( 'state' , 'psu' , 'smsastat' , 'pop100' , 'intptlat' , 'intptlon' , 'akhi' ) ) ] 464 | 465 | # clear up RAM 466 | gc() 467 | 468 | # confirm that we've created all possible geographies correctly. 469 | 470 | # the number of records in our small area statistics.. 471 | sas.row <- nrow( sas ) 472 | 473 | # ..should equal the number of unique-match-merged records.. 474 | mrow <- nrow( merge( unique( sf[ , c( 'state' , 'psu' , 'smsastat' ) ] ) , sas ) ) 475 | 476 | # ..and it does/they do. 477 | stopifnot( sas.row == mrow ) 478 | 479 | # now the census tract-level nationwide census data *could* merge if you wanted it to. 480 | 481 | 482 | # but you don't. yet. 483 | 484 | 485 | # the standard error (the `se` field) is a measure of precision. 486 | print( sas ) 487 | # the smaller the standard error, the more confident you should be 488 | # that the estimate at a particular geography is correct. 489 | 490 | 491 | # so invert it. you heard me. invert it. 492 | sas$invse <- 1 / sas$se 493 | # a smaller standard error indicates more precision. 494 | 495 | # for our purposes, precision can be considered weight! # 496 | 497 | # now we've got the weight that we should give each of our estimates # 498 | 499 | # distribute that weight across all census tracts # 500 | 501 | 502 | # aggregate the 2010 census block populations to the geographies that you have. 503 | popsum <- aggregate( sf$pop100 , by = ( sf[ , c( 'state' , 'psu' , 'smsastat' ) ] ) , sum ) 504 | 505 | # make the column name meaningful 506 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 507 | 508 | # merge the popsum onto the sasfile 509 | sas <- merge( sas , popsum ) 510 | 511 | # now. 
merge 512 | # the transportation share of total expenditure (the variable of interest) 513 | # the inverted standard error (the total weight of the broad geography) 514 | # the population sum (the total population of all census tracts that are part of that geography) 515 | 516 | x <- merge( sf , sas ) 517 | 518 | # confirm no record loss 519 | stopifnot( nrow( x ) == nrow( sf ) ) 520 | 521 | # clear up RAM 522 | rm( sf ) ; gc() 523 | 524 | # (this is the fun part) 525 | # calculate the weight of each census tract 526 | x$weight <- x$invse * ( x$pop100 / x$popsum ) 527 | 528 | # note that weight of all census tracts put together 529 | # sums to the `invse` on the original analysis file 530 | stopifnot( all.equal( sum( x$weight ) , sum( sas$invse ) ) ) 531 | 532 | # scale all weights so that they average to one 533 | x$weight <- x$weight / mean( x$weight ) 534 | 535 | 536 | # now that all weights have been computed, 537 | # remove alaska and hawaii 538 | x <- subset( x , !( akhi ) ) 539 | # note that those states need to be included up until this point 540 | # otherwise their populations won't scoop up their 541 | # respective shares of any multi-state statistics 542 | 543 | 544 | # you're done preparing your data. 545 | # keep only the columns you need. 546 | x <- x[ , c( 'share' , 'weight' , 'intptlat' , 'intptlon' ) ] 547 | 548 | # # end of step 4 # # 549 | # # # # # # # # # # # 550 | 551 | 552 | # # # # # # # # # # # # # # # # # # # # # # # 553 | # # step 5: decide on your map parameters # # 554 | 555 | library(ggplot2) 556 | library(scales) 557 | library(mapproj) 558 | library(RColorBrewer) 559 | 560 | # before you ever touch surface smoothing or kriging, 561 | # make some decisions about how you generally want 562 | # your map to look: the projection and coloring 563 | 564 | # the options below simply use hadley wickham's ggplot2 565 | # with the census tract-level transportation spending shares and centroids 566 | 567 | 568 | # initiate the simple map 569 | us.map <- 570 | qplot( 571 | intptlon , 572 | intptlat , 573 | data = x , 574 | colour = share , 575 | xlab = NULL , 576 | ylab = NULL 577 | ) 578 | 579 | 580 | # remove all map crap. 581 | us.map <- 582 | us.map + 583 | 584 | scale_x_continuous( breaks = NULL ) + 585 | 586 | scale_y_continuous( breaks = NULL ) + 587 | 588 | theme( 589 | legend.position = "none" , 590 | panel.grid.major = element_blank(), 591 | panel.grid.minor = element_blank(), 592 | panel.background = element_blank(), 593 | panel.border = element_blank(), 594 | axis.ticks = element_blank() 595 | ) 596 | 597 | 598 | # print the map without any projection 599 | us.map 600 | 601 | # print the map with an albers projection. 602 | us.map + coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 603 | # see ?mapproject for a zillion alternatives 604 | 605 | # if you like that projection, store it in the map object. 606 | us.map <- 607 | us.map + 608 | coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 609 | 610 | 611 | # check out some purty colors. 
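# (aside: if you'd like to browse what's available before committing to a palette,
# RColorBrewer ships a lookup table of every palette plus a viewer function)
library(RColorBrewer)
head( brewer.pal.info )
display.brewer.all()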
612 | 613 | # from http://colorbrewer2.org/ 614 | 615 | # three sequential color schemes 616 | Greys.3.p <- colorRampPalette( rev( brewer.pal( 3 , "Greys" ) ) ) 617 | YlGnBu.3.p <- colorRampPalette( rev( brewer.pal( 3 , "YlGnBu" ) ) ) 618 | YlGnBu.9.p <- colorRampPalette( rev( brewer.pal( 9 , "YlGnBu" ) ) ) 619 | 620 | # three diverging color schemes 621 | PRGn.11.p <- colorRampPalette( rev( brewer.pal( 11 , "PRGn" ) ) ) 622 | RdYlBu.3.p <- colorRampPalette( rev( brewer.pal( 3 , "RdYlBu" ) ) ) 623 | RdYlBu.11.p <- colorRampPalette( rev( brewer.pal( 11 , "RdYlBu" ) ) ) 624 | 625 | # print all six 626 | us.map + scale_colour_gradientn( colours = Greys.3.p( 100 ) ) 627 | us.map + scale_colour_gradientn( colours = YlGnBu.3.p( 100 ) ) 628 | us.map + scale_colour_gradientn( colours = YlGnBu.9.p( 100 ) ) 629 | 630 | us.map + scale_colour_gradientn( colours = PRGn.11.p( 100 ) ) 631 | us.map + scale_colour_gradientn( colours = RdYlBu.3.p( 100 ) ) 632 | us.map + scale_colour_gradientn( colours = RdYlBu.11.p( 100 ) ) 633 | 634 | # clear up RAM 635 | rm( us.map ) ; gc() 636 | 637 | # # end of step 5 # # 638 | # # # # # # # # # # # 639 | 640 | 641 | # # # # # # # # # # # # 642 | # # step 6: outline # # 643 | 644 | library(maptools) 645 | library(raster) 646 | library(rgeos) 647 | library(stringr) 648 | library(plyr) 649 | library(ggplot2) 650 | 651 | shpus.tf <- tempfile() 652 | 653 | # use the census bureau's cartographic boundary files 654 | # instead of the regular tiger shapefiles 655 | # unless you want to display transportation shares in the ocean. 656 | 657 | download_cached( 658 | "http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_state_500k.zip" , 659 | shpus.tf , 660 | mode = 'wb' 661 | ) 662 | 663 | shpus.uz <- unzip( shpus.tf , exdir = tempdir() ) 664 | 665 | us.shp <- readShapePoly( shpus.uz[ grep( 'shp$' , shpus.uz ) ] ) 666 | 667 | # remove alaska, hawaii, and all territories 668 | us.shp <- subset( us.shp , !( STUSPS %in% c( 'AK' , 'HI' , 'PR' , 'AS' , 'VI' , 'GU' , 'MP' ) ) ) 669 | 670 | # draw a rectangle 5% bigger than the original 48-state shape 671 | us.shp.out <- as( 1.1 * extent( us.shp ), "SpatialPolygons" ) 672 | 673 | # # end of step 6 # # 674 | # # # # # # # # # # # 675 | 676 | 677 | # # # # # # # # # # # # # # # # # # 678 | # # step 7: tie knots and krige # # 679 | 680 | library(sqldf) 681 | 682 | # how many knots should you make? # 683 | 684 | # knots are the computationally-intensive part of this process, 685 | # choose as many as your computer and your patience can handle. 686 | 687 | # you should aim for between 100 - 999 knots, 688 | # but numbers closer to 1,000 will overload smaller computers 689 | 690 | # you could let the `fields` package attempt to guess knots for you, 691 | # xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 100 )$design 692 | # but with census microdata, you've already got easy access to a relevant geographic grouping 693 | 694 | # the continental united states contains 695 | length( unique( sfsave$cbsa ) ) 696 | # unique core-based statistical areas and 697 | length( unique( sfsave$csa ) ) 698 | # unique combined statistical areas, including `99999` values. 699 | 700 | # but you should probably distinguish the `99999` values across states. 701 | # if you have a powerful computer, you could try creating knots table 702 | # that crosses states by cbsas, but for smaller computers (and quicker processing) 703 | nrow( unique( sfsave[ , c( 'state' , 'csa' ) ] ) ) 704 | # here are 207 beeeeautiful knots just for you. 
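# the sqldf query below collapses the tract-level table down to one knot per
# state x csa using population-weighted mean coordinates. for reference only,
# a base r sketch of the same idea (using stats::weighted.mean) might look
# something like this commented-out block -- the sqldf version below is what actually runs
# us.knots.alt <-
# 	do.call(
# 		rbind ,
# 		lapply(
# 			split( sfsave , list( sfsave$state , sfsave$csa ) , drop = TRUE ) ,
# 			function( d )
# 				data.frame(
# 					state = d$state[ 1 ] ,
# 					csa = d$csa[ 1 ] ,
# 					pop100 = sum( d$pop100 ) ,
# 					intptlon = weighted.mean( d$intptlon , d$pop100 ) ,
# 					intptlat = weighted.mean( d$intptlat , d$pop100 )
# 				)
# 		)
# 	)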
705 | 706 | 707 | # within each state x csa, 708 | # calculate the population-weighted mean of the coordinates 709 | # and (for smoothing) the weighted share at each state-csa centroid 710 | us.knots <- 711 | sqldf( 712 | "select 713 | state , csa , 714 | sum( pop100 ) as pop100 , 715 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 716 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 717 | from sfsave 718 | group by 719 | state , csa" 720 | ) 721 | # note: this screws up coordinates that cross the international date line 722 | # or the equator. in the united states, only alaska's aleutian islands do this 723 | # and those geographies will be thrown out later. so it doesn't matter. 724 | 725 | # how many knots have you gots? 726 | nrow( us.knots ) 727 | 728 | # you can look at the weighted centroids of those csas by state 729 | plot( us.knots$intptlon , us.knots$intptlat ) 730 | 731 | # clear up RAM 732 | rm( sfsave ) ; gc() 733 | 734 | # interpolation option one # 735 | library(fields) 736 | 737 | krig.fit <- 738 | Krig( 739 | cbind( x$intptlon , x$intptlat ) , 740 | x$share , 741 | weights = x$weight , 742 | knots = cbind( us.knots$intptlon , us.knots$intptlat ) 743 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 744 | # knots = xknots 745 | ) 746 | 747 | # that is: what is the (weighted) relationship between 748 | # your variable of interest (transportation share of total expenditure) and 749 | # the x/y points on a grid? 750 | 751 | # check this out! 752 | surface( krig.fit ) 753 | # you're almost there! 754 | 755 | 756 | # interpolation option two # 757 | library(mgcv) 758 | 759 | gam.fit <- 760 | gam( 761 | share ~ s(intptlon , intptlat ) , 762 | weights = weight , 763 | data = x 764 | ) 765 | 766 | 767 | # for the third alternative, keep reading. 768 | 769 | 770 | # # end of step 7 # # 771 | # # # # # # # # # # # 772 | 773 | 774 | # # # # # # # # # # # # # # # # # # # # 775 | # # step 8: make a grid and predict # # 776 | 777 | library(raster) 778 | 779 | x.range <- bbox( us.shp )[ 1 , ] 780 | y.range <- bbox( us.shp )[ 2 , ] 781 | 782 | # add one percent on each side 783 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.01 784 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.01 785 | 786 | x.range[ 1 ] <- x.range[ 1 ] - x.diff 787 | x.range[ 2 ] <- x.range[ 2 ] + x.diff 788 | y.range[ 1 ] <- y.range[ 1 ] - y.diff 789 | y.range[ 2 ] <- y.range[ 2 ] + y.diff 790 | 791 | # choose the number of ticks (in each direction) on your grid 792 | grid.length <- 500 793 | # grid.length <- 700 794 | # # note: smaller grids will render much much faster 795 | # # (so they're better if you're just playing around) 796 | # # but larger grids will prevent your final plot from 797 | # # being too pixelated, even when zooming in 798 | 799 | 800 | # create some grid data.frame objects, one for each interpolation type 801 | grd <- gam.grd <- krig.grd <- 802 | expand.grid( 803 | intptlon = seq( from = x.range[1] , to = x.range[2] , length = grid.length ) , 804 | intptlat = seq( from = y.range[1] , to = y.range[2] , length = grid.length ) 805 | ) 806 | 807 | 808 | # along your rectangular grid, 809 | # what are the predicted values of 810 | # the transportation spending share? 
811 | for ( i in split( seq( nrow( grd ) ) , ceiling( seq( nrow( grd ) ) / 20000 ) ) ){ 812 | krig.grd[ i , 'kout' ] <- as.numeric( predict( krig.fit , krig.grd[ i , c( 'intptlon' , 'intptlat' ) ] ) ) 813 | gc() 814 | } 815 | 816 | # alternate grid using gam.fit 817 | for ( i in split( seq( nrow( grd ) ) , ceiling( seq( nrow( grd ) ) / 20000 ) ) ){ 818 | gam.grd[ i , 'gamout' ] <- as.numeric( predict( gam.fit , gam.grd[ i , c( 'intptlon' , 'intptlat' ) ] ) ) 819 | gc() 820 | } 821 | 822 | # interpolation option three # 823 | library(spatstat) 824 | 825 | smoout <- 826 | Smooth( 827 | ppp( 828 | x$intptlon , 829 | x$intptlat , 830 | x.range , 831 | y.range , 832 | marks = x$share 833 | ) , 834 | # here's a good starting point for sigma, but screw around with this value. 835 | sigma = 2 , 836 | weights = x$weight 837 | ) 838 | 839 | smoo.grd <- 840 | expand.grid( 841 | intptlon = seq( from = smoout$xrange[1] , to = smoout$xrange[2] , length = smoout$dim[1] ) , 842 | intptlat = seq( from = smoout$yrange[1] , to = smoout$yrange[2] , length = smoout$dim[2] ) 843 | ) 844 | 845 | smoo.grd$smoout <- as.numeric( t( smoout$v ) ) 846 | 847 | # # end of step 8 # # 848 | # # # # # # # # # # # 849 | 850 | 851 | # # # # # # # # # # # # # # # # # # # # # 852 | # # step 9: ggplot and choose options # # 853 | 854 | library(ggplot2) 855 | library(mapproj) 856 | 857 | 858 | # # # psa # # # 859 | # capping your outliers might drastically change your map. 860 | # if you find the 25th percentile and 75th percentile with 861 | # summary( krig.grd$kout ) 862 | # and then replace all `kout` values below the 25th or above the 75th 863 | # with those capped percentile endpoints, i promise promise promise 864 | # your maps will appear quite different. you could cap at the 25th and 75th with.. 865 | # grd.sum <- summary( krig.grd$kout ) 866 | # krig.grd[ krig.grd$kout > grd.sum[ 5 ] , 'kout' ] <- grd.sum[ 5 ] 867 | # krig.grd[ krig.grd$kout < grd.sum[ 2 ] , 'kout' ] <- grd.sum[ 2 ] 868 | # # # end # # # 869 | 870 | # you don't want to cap at the 25th and 75th? 871 | # well consider one other idea: at least cap at the 5th and 95th of the nation 872 | # this will also increase the visible gradient ultimately plotted. 873 | 874 | # for example, the lowest krigged value is negative. 875 | summary( krig.grd$kout ) 876 | # that's obviously not right. 877 | 878 | # if a numeric vector has values below the 5th percentile or above the 95th percentile, cap 'em 879 | minnmax.at.0595 <- 880 | function( z ){ 881 | q0595 <- quantile( z , c( 0.05 , 0.95 ) ) 882 | z[ z < q0595[ 1 ] ] <- q0595[ 1 ] 883 | z[ z > q0595[ 2 ] ] <- q0595[ 2 ] 884 | z 885 | } 886 | 887 | # min and max all numeric values. this makes the gradient much more visible. 888 | krig.grd$kout <- minnmax.at.0595( krig.grd$kout ) 889 | gam.grd$gamout <- minnmax.at.0595( gam.grd$gamout ) 890 | smoo.grd$smoout <- minnmax.at.0595( smoo.grd$smoout ) 891 | 892 | 893 | # initiate the krige-based plot 894 | krg.plot <- 895 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) + 896 | geom_tile( data = krig.grd , aes( fill = kout ) ) 897 | 898 | # initiate the gam-based plot 899 | gam.plot <- 900 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) + 901 | geom_tile( data = gam.grd , aes( fill = gamout ) ) 902 | 903 | # initiate the smooth-based plot 904 | smooth.plot <- 905 | ggplot( data = smoo.grd , aes( x = intptlon , y = intptlat ) ) + 906 | geom_tile( data = smoo.grd , aes( fill = smoout ) ) 907 | 908 | # view all three grids! 
909 | krg.plot 910 | gam.plot 911 | smooth.plot 912 | 913 | # choose a projection. here's one using albers on the continental united states borders 914 | co <- coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 915 | # but save this puppy for laytur 916 | # because printing the projected plot takes much more time than printing the unprojected one 917 | 918 | # initiate the entire plot 919 | the.plot <- 920 | 921 | # choose only one of the three interpolation grids 922 | krg.plot + 923 | # gam.plot + 924 | # smooth.plot + 925 | 926 | # blank out the legend and axis labels 927 | theme( 928 | legend.position = "none" , 929 | axis.title.x = element_blank() , 930 | axis.title.y = element_blank() 931 | ) + 932 | 933 | # blank out other plot elements 934 | 935 | scale_x_continuous( limits = x.range , breaks = NULL , oob = squish ) + 936 | 937 | scale_y_continuous( limits = y.range , breaks = NULL , oob = squish ) + 938 | 939 | theme( 940 | panel.grid.major = element_blank(), 941 | panel.grid.minor = element_blank(), 942 | panel.background = element_blank(), 943 | panel.border = element_blank(), 944 | axis.ticks = element_blank() 945 | ) 946 | 947 | # print the plot to the screen 948 | the.plot 949 | 950 | # # end of step 9 # # 951 | # # # # # # # # # # # 952 | 953 | 954 | # # # # # # # # # # # # # # # # # # 955 | # # step 10: blank, color, save # # 956 | 957 | library(ggplot2) 958 | library(scales) 959 | library(raster) 960 | library(plyr) 961 | library(RColorBrewer) 962 | library(rgeos) 963 | 964 | 965 | # draw a rectangle 15% bigger than the original state 966 | us.shp.blank <- as( 1.3 * extent( us.shp ), "SpatialPolygons" ) 967 | 968 | # compute the difference between connecticut and the rectangle 15% beyond the borders 969 | us.shp.diff <- gDifference( us.shp.blank , us.shp ) 970 | 971 | # prepare the difference layer for ggplot2 972 | outside <- fortify( us.shp.diff ) 973 | 974 | # fix the islands 975 | outside2 <- ddply( outside , .(piece) , function(x) rbind( x , outside[ 1 , ] ) ) 976 | 977 | # blank out coastal areas 978 | blank.layer <- 979 | geom_polygon( 980 | data = outside2 , 981 | aes( x = long , y = lat , group = id ) , 982 | fill = 'white' 983 | ) 984 | 985 | # closer, eh? 986 | the.plot + blank.layer 987 | 988 | # store this plot 989 | the.plot <- the.plot + blank.layer 990 | 991 | # location of all state borders 992 | sbo <- fortify( us.shp ) 993 | 994 | # here's a layer with the continental united states borders 995 | sbo.layer <- geom_path( data = sbo , aes( x = long , y = lat , group = group ) , colour = 'lightgray' ) 996 | 997 | # print that result to the screen. 998 | the.plot + sbo.layer 999 | # okay if we stick with this one? 1000 | the.plot <- the.plot + sbo.layer 1001 | # good. now it's part of the plot. 1002 | 1003 | # print all six 1004 | the.plot + scale_fill_gradientn( colours = Greys.3.p( 100 ) ) 1005 | the.plot + scale_fill_gradientn( colours = YlGnBu.3.p( 100 ) ) 1006 | the.plot + scale_fill_gradientn( colours = YlGnBu.9.p( 100 ) ) 1007 | 1008 | the.plot + scale_fill_gradientn( colours = PRGn.11.p( 100 ) ) 1009 | the.plot + scale_fill_gradientn( colours = RdYlBu.3.p( 100 ) ) 1010 | the.plot + scale_fill_gradientn( colours = RdYlBu.11.p( 100 ) ) 1011 | 1012 | # ooh i like that one mom, can we keep it can we keep it? 1013 | final.plot <- the.plot + scale_fill_gradientn( colours = RdYlBu.11.p( 100 ) ) 1014 | 1015 | final.plot 1016 | 1017 | # would you like to save this game? 
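# (one quick aside before saving: the cairo-png device used below is only
# available when your build of r includes cairo support -- confirm with)
capabilities( "cairo" )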
1018 | 1019 | # use cairo-png as your bitmap type 1020 | options( bitmapType = "cairo" ) 1021 | 1022 | # uncomment this block to save the file to your current working directory 1023 | 1024 | # ggsave( 1025 | # "2012-2013 transportation spending as a share of total spending - unprojected.png" , 1026 | # plot = final.plot , 1027 | # type = "cairo-png" 1028 | # ) 1029 | 1030 | # add the projection 1031 | projected.plot <- final.plot + co 1032 | 1033 | # # # # # # # # # # # # # # 1034 | # warning warning warning # 1035 | 1036 | # this next save-line takes a few hours. 1037 | # leave it running overnight 1038 | 1039 | # warning warning warning # 1040 | # # # # # # # # # # # # # # 1041 | 1042 | 1043 | # uncomment this block to save the file to your current working directory 1044 | 1045 | # ggsave( 1046 | # "2012-2013 transportation spending as a share of total spending - projected.png" , 1047 | # plot = projected.plot , 1048 | # type = "cairo-png" 1049 | # ) 1050 | 1051 | # # end of step ten # # 1052 | # # # # # # # # # # # # 1053 | -------------------------------------------------------------------------------- /how to map the current population survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # current population survey, march supplement 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # allows values outside of the displayed geography to influence the coloring 20 | # overlays multiple water layers (no longer shapefiles) to blank out water areas 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # state, core-based statistical areas 28 | # (for connecticut, massachusetts, and rhode island: new england city and town areas) 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/current%20population%20survey%20%28cps%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/Current%20Population%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # poverty rate (warning: the final map created by this script is of no value. fix it by pooling years or choosing larger geographies in step #2) 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # this map displays nothing but survey noise. run it on a different year, you'll see completely different results. not enough sample size. 56 | # a legitimate map with the cps would pool years or aggregate small areas into larger ones. when possible, compare findings to other estimates. 
57 | # for this particular statistic (poverty rate), compare your results to the county-level maps you can create on http://www.census.gov/did/www/saipe/ 58 | 59 | 60 | # # # # # # # # # # # # # # # # # # # # # 61 | # # step 1: load the survey microdata # # 62 | 63 | # remove the # in order to run this install.packages line only once 64 | # install.packages( "MonetDBLite" ) 65 | 66 | library(downloader) 67 | 68 | # download the 2013 current population survey microdata onto the local disk 69 | cps.years.to.download <- 2013 70 | source_url( "https://raw.github.com/ajdamico/asdfree/master/Current%20Population%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 71 | 72 | # # end of step 1 # # 73 | # # # # # # # # # # # 74 | 75 | 76 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 77 | # # step 2: conduct your analysis of interest at the smallest geography allowed # # 78 | 79 | library(survey) 80 | library(DBI) # load the DBI package (implements the R-database coding) 81 | library(MonetDBLite) 82 | 83 | # following the analysis examples in the r code repository -- 84 | # # https://github.com/ajdamico/asdfree/blob/master/Current%20Population%20Survey/2012%20asec%20-%20analysis%20examples.R 85 | # -- calculate the poverty rate at the smallest available geographic area 86 | # within the state of connecticut 87 | 88 | # turn on replicate-weighted mean squared errors 89 | options( survey.replicates.mse = TRUE ) 90 | # this matches the official census bureau published methods 91 | 92 | # name the database files in the "MonetDB" folder of the current working directory 93 | dbfolder <- paste0( getwd() , "/MonetDB" ) 94 | 95 | # construct a replicate-weighted, database-backed survey design object 96 | cps.design <- 97 | svrepdesign( 98 | weights = ~marsupwt, 99 | repweights = "pwwgt[1-9]", 100 | type = "Fay", 101 | rho = ( 1 - 1 / sqrt( 4 ) ), 102 | data = "asec13" , 103 | combined.weights = TRUE , 104 | dbtype = "MonetDBLite" , 105 | dbname = dbfolder 106 | ) 107 | 108 | # restrict the survey object to connecticut *plus adjacent state* records only 109 | cpas.design <- 110 | subset( 111 | cps.design , 112 | gestfips %in% c( 113 | 9 , # connecticut 114 | 25 , # massachusetts 115 | 36 , # new york 116 | 44 # rhode island 117 | ) 118 | ) 119 | 120 | # the original national survey object is no longer necessary 121 | rm( cps.design ) ; gc() 122 | # remove it and clear up RAM 123 | 124 | 125 | # calculate the 2012 connecticut state-wide poverty rate 126 | svymean( ~ as.numeric( povll %in% 1:3 ) , subset( cpas.design , gestfips == 9 & pov_univ == 1 ) ) 127 | # using only records in the poverty universe 128 | 129 | # note: this estimate and standard error precisely matches 130 | # the census bureau's table 19 (historical poverty by state) 131 | # https://www.census.gov/hhes/www/poverty/data/historical/hstpov19.xls 132 | 133 | 134 | # # examine which geographies are available # # 135 | 136 | # the current population survey identifies records 137 | # from six different core-based statistical areas (cbsa) 138 | svytable( ~ gtcbsa , cpas.design ) 139 | 140 | # the current population survey identifies records 141 | # from both metro and non-metro respondents 142 | svytable( ~ gtmetsta , cpas.design ) 143 | 144 | # the current population survey identifies records 145 | # from every state in the nation, 146 | # but we've restricted this design to 147 | # connecticut + adjacent states 148 | svytable( ~ gestfips , cpas.design ) 149 | 150 | # the smallest 
geography reasonably extracted 151 | # from this survey microdata set will be 152 | # cbsa + metro status combined 153 | svytable( ~ gtcbsa + gtmetsta + gestfips , cpas.design ) 154 | 155 | # simply use both of those geographies in the by= argument 156 | # of the `svyby` command, and re-calculate the poverty rates 157 | smallest.area.statistics <- 158 | svyby( 159 | ~ as.numeric( povll %in% 1:3 ) , 160 | ~ gtcbsa + gtmetsta + gestfips , 161 | subset( cpas.design , pov_univ == 1 ) , 162 | svymean 163 | ) 164 | # this is the same command as the statewide calculation above, 165 | # except these results have been broken into smaller areas. 166 | 167 | # these are the statistics to be mapped 168 | print( smallest.area.statistics ) 169 | # the standard errors are a measure of precision, 170 | # their inverse will serve as the mapping weights 171 | 172 | # make this object easier to type.. 173 | sas <- smallest.area.statistics 174 | 175 | # ..and also easier to read 176 | names( sas )[ names( sas ) == 'as.numeric(povll %in% 1:3)' ] <- 'povrate' 177 | 178 | # # end of step 2 # # 179 | # # # # # # # # # # # 180 | 181 | 182 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 183 | # # step 3: download and import necessary geographic crosswalks # # 184 | 185 | library(downloader) 186 | 187 | # load the download_cached and related functions 188 | # to prevent re-downloading of files once they've been downloaded. 189 | source_url( 190 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 191 | prompt = FALSE , 192 | echo = FALSE 193 | ) 194 | 195 | # create a temporary file containing the census bureau's 196 | # 2010 census summary file #1 for all four states 197 | # then download the file. 198 | sf1.tf <- tempfile() 199 | 200 | # create two vectors with the names and abbreviations of all four states to download 201 | sn <- c( "Connecticut" , "Massachusetts" , "New_York" , "Rhode_Island" ) 202 | sa <- c( "ct" , "ma" , "ny" , "ri" ) 203 | 204 | # create an empty data.frame 205 | sf1.stack <- NULL 206 | 207 | # loop through all four connecticut-adjacent states 208 | for ( i in 1:4 ){ 209 | 210 | # create a single-element character string containing the ftp path 211 | ftp.loc <- 212 | paste0( 213 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/" , 214 | sn[ i ] , 215 | "/" , 216 | sa[ i ] , 217 | "2010.sf1.zip" 218 | ) 219 | 220 | # download the current state's summary file 221 | download_cached( ftp.loc , sf1.tf , mode = 'wb' ) 222 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 223 | 224 | # unzip the summary file #1 files to the current working directory 225 | sf1.uz <- unzip( sf1.tf , exdir = tempdir() ) 226 | 227 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 228 | sf1 <- 229 | read.fwf( 230 | sf1.uz[ grep( "geo2010" , sf1.uz ) ] , 231 | c( -8 , 3 , -16 , 2 , 3 , -4 , 5 , -4 , 5 , -4 , 6 , 1 , 4 , -47 , 5 , -10 , 5 , -186 , 9 , -9 , 11 , 12 , -116 , 1 , 1 ) 232 | ) 233 | 234 | # add columns names matching the census bureau, so it's easy to read 235 | names( sf1 ) <- c( "sumlev" , "state" , "county" , "cousub" , "place" , "tract" , "blkgrp" , "block" , "cbsa" , "necta" , "pop100" , "intptlat" , "intptlon" , "memi" , "nmemi" ) 236 | 237 | # summary level 101 has NECTA and census blocks 238 | sf1.101 <- subset( sf1 , sumlev == "101" ) 239 | 240 | # stack all four states into one object 241 | sf1.stack <- rbind( sf1.stack , sf1.101 ) 242 | 243 | # remove some data.frames and clear up 
RAM 244 | rm( sf1.101 , sf1 ) ; gc() 245 | 246 | } 247 | 248 | # just as a check, limit the summary file #1 to connecticut. 249 | sf1ct <- subset( sf1.stack , state == 9 ) 250 | 251 | # hold on to all unique counties in connecticut 252 | ascc <- unique( sf1ct[ , c( 'state' , 'county' ) ] ) 253 | 254 | # one record per census block in connecticut. see? same number. 255 | nrow( sf1ct ) 256 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 257 | 258 | # and guess what? the total connecticut population matches as well. 259 | sum( sf1ct$pop100 ) 260 | # http://quickfacts.census.gov/qfd/states/09000.html 261 | 262 | 263 | # so now we have a data.frame object with 264 | # one record per census block, 265 | # and also with the two geography-levels 266 | # that match the current population survey 267 | head( sf1.stack ) 268 | # in connecticut, rhode island, and massachusetts 269 | # necta is the cbsa and 270 | # nmemi indicates metropolitan status 271 | 272 | # and guess what? 273 | # we've now got the census 2010 weighted populations (field pop100) 274 | # and also each census block's centroid latitude & longitude (fields intptlat + intptlon) 275 | 276 | # # end of step 3 # # 277 | # # # # # # # # # # # 278 | 279 | 280 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 281 | # # step 4: merge the results of your survey analysis with the small-area geography # # 282 | 283 | # add the current population survey results' 284 | # geographic identifiers to the connecticut census block data.frame 285 | sf1.merge <- 286 | 287 | transform( 288 | 289 | sf1.stack , 290 | 291 | gestfips = state , 292 | 293 | gtcbsa = 294 | # if the record is in new york state, 295 | # check if the cbsa is in the current population survey, and if it is, keep it. 296 | ifelse( ( state %in% 36 ) & cbsa %in% unique( sas$gtcbsa ) , cbsa , 297 | 298 | # if the record is in connecticut, rhode island, or massachusetts, 299 | # check if the new england city and town area (necta) is in the cps result, 300 | # and if it is, keep it. otherwise zero-it. 301 | ifelse( !( state %in% 36 ) & necta %in% unique( sas$gtcbsa ) , necta , 0 ) ) , 302 | 303 | gtmetsta = 304 | # check whether to use the new england variable or not.. 305 | ifelse( state %in% 36 , 306 | 307 | # if the census block is metro, `gtmetsta` is a one. otherwise it's a two. 308 | ifelse( memi == 1 , 1 , 2 ) , 309 | 310 | ifelse( nmemi == 1 , 1 , 2 ) 311 | ) 312 | 313 | ) 314 | 315 | # confirm that we've created all possible geographies correctly. 316 | merge( unique( sf1.merge[ , c( 'gtcbsa' , 'gtmetsta' , 'gestfips' ) ] ) , sas , all = TRUE ) 317 | # nope. there are missings. 318 | 319 | # recode anyone in new york state not in a cbsa to be non-metro 320 | sf1.merge[ 321 | sf1.merge$gestfips == 36 & 322 | sf1.merge$gtmetsta == 1 & 323 | sf1.merge$gtcbsa == 0 , 324 | 'gtmetsta' ] <- 2 325 | 326 | # recode 100% of rhode island to be a part of the providence cbsa 327 | sf1.merge[ sf1.merge$gestfips == 44 , 'gtcbsa' ] <- 77200 328 | sf1.merge[ sf1.merge$gestfips == 44 , 'gtmetsta' ] <- 1 329 | 330 | # recode connecticut residents in the 331 | # worcester, ma or springfield, ma cbsas to unavailable cbsas 332 | sf1.merge[ sf1.merge$gestfips == 9 & sf1.merge$gtcbsa %in% c( 78100 , 79600 ) , 'gtcbsa' ] <- 0 333 | 334 | 335 | # the number of records in our small area statistics.. 336 | sas.row <- nrow( sas ) 337 | 338 | # ..should equal the number of unique-match-merged records.. 
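# here's a tiny toy version of that check, with made-up geography codes:
toy.blocks <- data.frame( geo = c( 'a' , 'b' , 'c' ) )
toy.sas <- data.frame( geo = c( 'a' , 'b' ) , est = c( 0.1 , 0.2 ) )
nrow( merge( unique( toy.blocks ) , toy.sas ) )
# only two rows come back, so if the census-side recodes had produced a geography
# that the survey results lack (or the other way around), the counts would not line up.
# and here's that check on the real objects: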
339 | mrow <- nrow( merge( unique( sf1.merge[ , c( 'gtcbsa' , 'gtmetsta' , 'gestfips' ) ] ) , sas ) ) 340 | 341 | # ..and it does/they do. 342 | stopifnot( sas.row == mrow ) 343 | 344 | # now the census block-level connecticut+adjacent state census data 345 | # *could* merge if you wanted it to. 346 | 347 | 348 | # but you don't. yet. 349 | 350 | 351 | # the standard error (the `se` field) is a measure of precision. 352 | print( sas ) 353 | # the smaller the standard error, the more confident you should be 354 | # that the estimate at a particular geography is correct. 355 | 356 | 357 | # so invert it. you heard me. invert it. 358 | sas$invse <- 1 / sas$se 359 | # a smaller standard error indicates more precision. 360 | 361 | # for our purposes, precision can be considered weight! # 362 | 363 | # now we've got the weight that we should give each of our estimates # 364 | 365 | # distribute that weight across all census blocks # 366 | 367 | 368 | # aggregate the 2010 census block populations to the geographies that you have. 369 | popsum <- 370 | aggregate( 371 | sf1.merge$pop100 , 372 | by = ( sf1.merge[ , c( 'gestfips' , 'gtcbsa' , 'gtmetsta' ) ] ) , 373 | sum 374 | ) 375 | 376 | # make the column name meaningful 377 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 378 | 379 | # merge the popsum onto the sasfile 380 | sas <- merge( sas , popsum ) 381 | 382 | # now. merge 383 | # the poverty rate (the variable of interest) 384 | # the inverted standard error (the total weight of the broad geography) 385 | # the population sum (the total population of all census blocks that are part of that geography) 386 | 387 | x <- merge( sf1.merge , sas ) 388 | 389 | # confirm no record loss 390 | stopifnot( nrow( x ) == nrow( sf1.merge ) ) 391 | 392 | 393 | # (this is the fun part) 394 | # calculate the weight of each census block 395 | x$weight <- x$invse * ( x$pop100 / x$popsum ) 396 | 397 | # note that weight of all census blocks put together 398 | # sums to the `invse` on the original analysis file 399 | stopifnot( all.equal( sum( x$weight ) , sum( sas$invse ) ) ) 400 | 401 | # remove records with zero population 402 | x <- subset( x , weight > 0 ) 403 | 404 | # scale all weights so that they average to one 405 | x$weight <- x$weight / mean( x$weight ) 406 | 407 | # you're done preparing your data. 408 | # keep only the columns you need. 409 | x <- x[ , c( 'povrate' , 'weight' , 'intptlat' , 'intptlon' , 'gestfips' ) ] 410 | # be sure to save the state identifier for easy subsets 411 | 412 | # # end of step 4 # # 413 | # # # # # # # # # # # 414 | 415 | 416 | # # # # # # # # # # # # # # # # # # # # # # # 417 | # # step 5: decide on your map parameters # # 418 | 419 | library(ggplot2) 420 | library(scales) 421 | library(mapproj) 422 | 423 | 424 | # before you ever touch surface smoothing or kriging, 425 | # make some decisions about how you generally want 426 | # your map to look: the projection and coloring 427 | 428 | # the options below simply use hadley wickham's ggplot2 429 | # with the census block-level poverty rates and centroids 430 | 431 | 432 | # initiate the simple map 433 | ct.map <- 434 | qplot( 435 | intptlon , 436 | intptlat , 437 | data = subset( x , gestfips == 9 ) , 438 | colour = povrate , 439 | xlab = NULL , 440 | ylab = NULL 441 | ) 442 | 443 | 444 | # remove all map crap. 
445 | ct.map <- 446 | ct.map + 447 | 448 | scale_x_continuous( breaks = NULL ) + 449 | 450 | scale_y_continuous( breaks = NULL ) + 451 | 452 | theme( 453 | legend.position = "none" , 454 | panel.grid.major = element_blank(), 455 | panel.grid.minor = element_blank(), 456 | panel.background = element_blank(), 457 | panel.border = element_blank(), 458 | axis.ticks = element_blank() 459 | ) 460 | 461 | 462 | # print the map without any projection 463 | ct.map 464 | 465 | # print the map with an albers projection. 466 | ct.map + coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 467 | # see ?mapproject for a zillion alternatives 468 | 469 | # if you like that projection, store it in the map object. 470 | ct.map <- 471 | ct.map + 472 | coord_map( project = "albers" , lat0 = min( x$intptlat ) , lat1 = max( x$intptlat ) ) 473 | 474 | 475 | # check out some purty colors. 476 | ct.map + scale_colour_gradient( low = 'green' , high = 'red' ) 477 | 478 | ct.map + scale_colour_gradient( low = 'white' , high = 'blue' ) 479 | 480 | ct.map + scale_colour_gradient( low = muted( 'blue' ) , high = muted( 'red' ) ) 481 | 482 | # notice how the dotted delineations match the census bureau's 2006 necta definitions 483 | # http://www2.census.gov/geo/maps/metroarea/us_wall/Dec2006/necta_1206_large.gif 484 | 485 | # clear up RAM 486 | rm( ct.map ) ; gc() 487 | 488 | # # end of step 5 # # 489 | # # # # # # # # # # # 490 | 491 | 492 | # # # # # # # # # # # # 493 | # # step 6: outline # # 494 | 495 | library(maptools) 496 | library(raster) 497 | library(rgeos) 498 | library(stringr) 499 | library(plyr) 500 | library(ggplot2) 501 | 502 | shpstate.tf <- tempfile() 503 | 504 | # connnecticut borders the ocean, 505 | # so use the census bureau's cartographic boundary files 506 | # instead of the regular tiger shapefiles 507 | # unless you want to display poverty rates in the ocean. 508 | 509 | download_cached( 510 | "http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_state_500k.zip" , 511 | shpstate.tf , 512 | mode = 'wb' 513 | ) 514 | 515 | shpstate.uz <- unzip( shpstate.tf , exdir = tempdir() ) 516 | 517 | state.shp <- readShapePoly( shpstate.uz[ grep( 'shp$' , shpstate.uz ) ] ) 518 | 519 | ct.shp <- subset( state.shp , STATEFP == '09' ) 520 | 521 | # draw a rectangle 5% bigger than the original state 522 | ct.shp.out <- as( 1.1 * extent( ct.shp ), "SpatialPolygons" ) 523 | 524 | # clear up RAM 525 | rm( state.shp ) ; gc() 526 | 527 | # # end of step 6 # # 528 | # # # # # # # # # # # 529 | 530 | 531 | # # # # # # # # # # # # # # # # # # 532 | # # step 7: tie knots and krige # # 533 | 534 | library(sqldf) 535 | 536 | # how many knots should you make? # 537 | 538 | # knots are the computationally-intensive part of this process, 539 | # choose as many as your computer and your patience can handle. 540 | 541 | # you should aim for between 100 - 999 knots, 542 | # but numbers closer to 1,000 will overload smaller computers 543 | 544 | # you could let the `fields` package attempt to guess knots for you, 545 | # xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 100 )$design 546 | # but with census microdata, you've already got easy access to a relevant geographic grouping 547 | 548 | # the state of connecticut has 833 census tracts. 
549 | # https://www.census.gov/geo/maps-data/data/tallies/tractblock.html 550 | # if you have a powerful computer, you could group based on weighted tracts 551 | # however, to keep the processing requirements lower, 552 | # i'll use county subdivisions 553 | 554 | # for the knotting, note that adjacent states are no longer necessary, 555 | # so subset the census summary file #1 to only connecticut and nearby census blocks. 556 | 557 | # the sqldf() function doesn't like `.` in data.frame object names 558 | sf1s <- sf1.stack 559 | 560 | # within each county x county subdivision, 561 | # calculate the population-weighted mean of the coordinates 562 | ct.knots <- 563 | sqldf( 564 | "select 565 | state , county , cousub , 566 | sum( pop100 ) as pop100 , 567 | sum( pop100 * intptlon ) / sum( pop100 ) as intptlon , 568 | sum( pop100 * intptlat ) / sum( pop100 ) as intptlat 569 | from sf1s 570 | group by 571 | state , county , cousub" 572 | ) 573 | # note: this screws up coordinates that cross the international date line 574 | # or the equator. in the united states, only alaska's aleutian islands do this 575 | # and those geographies will be thrown out later. so it doesn't matter. 576 | 577 | # how many knots have you gots? 578 | nrow( ct.knots ) 579 | # too many, because you've included 580 | # all of the adjacent states 581 | 582 | # retain only knots within the bounding box 583 | ct.knots <- 584 | subset( 585 | ct.knots , 586 | ( bbox( ct.shp.out )[ 1 , 1 ] < intptlon ) & 587 | ( bbox( ct.shp.out )[ 2 , 1 ] < intptlat ) & 588 | ( bbox( ct.shp.out )[ 1 , 2 ] > intptlon ) & 589 | ( bbox( ct.shp.out )[ 2 , 2 ] > intptlat ) 590 | ) 591 | 592 | # retain only points within the bounding box 593 | x <- 594 | subset( 595 | x , 596 | ( bbox( ct.shp.out )[ 1 , 1 ] < intptlon ) & 597 | ( bbox( ct.shp.out )[ 2 , 1 ] < intptlat ) & 598 | ( bbox( ct.shp.out )[ 1 , 2 ] > intptlon ) & 599 | ( bbox( ct.shp.out )[ 2 , 2 ] > intptlat ) 600 | ) 601 | 602 | # count again 603 | nrow( ct.knots ) 604 | # that's more like it. 605 | 606 | # you can look at the weighted centroids of those county subdivisions 607 | plot( ct.knots$intptlon , ct.knots$intptlat ) 608 | # and look at that, bits of long island will be influencing our results 609 | # since it's within a 5% range of the state of connecticut box 610 | 611 | # clear up RAM 612 | rm( sf1.stack , sf1s ) ; gc() 613 | 614 | # interpolation option one # 615 | library(fields) 616 | 617 | krig.fit <- 618 | Krig( 619 | cbind( x$intptlon , x$intptlat ) , 620 | x$povrate , 621 | weights = x$weight , 622 | knots = cbind( ct.knots$intptlon , ct.knots$intptlat ) 623 | # if you prefer to use cover.design, all you'd need is this knots= line instead: 624 | # knots = xknots 625 | ) 626 | 627 | # that is: what is the (weighted) relationship between 628 | # your variable of interest (poverty rate) and 629 | # the x/y points on a grid? 630 | 631 | # check this out! 632 | surface( krig.fit ) 633 | # you're almost there! 634 | 635 | 636 | # interpolation option two # 637 | library(mgcv) 638 | 639 | gam.fit <- 640 | gam( 641 | povrate ~ s(intptlon , intptlat ) , 642 | weights = weight , 643 | data = x 644 | ) 645 | 646 | 647 | # for the third alternative, keep reading. 
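# by the way, that sql statement is nothing fancier than a population-weighted mean
# of the block centroids. a minimal base-r sketch of the same idea,
# using a made-up three-block data.frame:
toy <- data.frame( grp = c( 1 , 1 , 2 ) , pop100 = c( 100 , 300 , 50 ) , intptlon = c( -72.1 , -72.5 , -73.0 ) )
sapply( split( toy , toy$grp ) , function( d ) weighted.mean( d$intptlon , d$pop100 ) )
# group 1 lands at -72.4 (pulled toward the heavier block) and group 2 stays at -73.0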
648 | 649 | 650 | # # end of step 7 # # 651 | # # # # # # # # # # # 652 | 653 | 654 | # # # # # # # # # # # # # # # # # # # # 655 | # # step 8: make a grid and predict # # 656 | 657 | library(raster) 658 | 659 | x.range <- bbox( ct.shp.out )[ 1 , ] 660 | y.range <- bbox( ct.shp.out )[ 2 , ] 661 | 662 | # add five percent on each side 663 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.05 664 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.05 665 | 666 | x.range[ 1 ] <- x.range[ 1 ] - x.diff 667 | x.range[ 2 ] <- x.range[ 2 ] + x.diff 668 | y.range[ 1 ] <- y.range[ 1 ] - y.diff 669 | y.range[ 2 ] <- y.range[ 2 ] + y.diff 670 | 671 | # choose the number of ticks (in each direction) on your grid 672 | grid.length <- 500 673 | # # note: smaller grids will render faster 674 | # # (so they're better if you're just playing around) 675 | # # but larger grids will prevent your final plot from 676 | # # being too pixelated, even when zooming in 677 | 678 | 679 | # create some grid data.frame objects, one for each interpolation type 680 | grd <- gam.grd <- krig.grd <- 681 | expand.grid( 682 | intptlon = seq( from = x.range[1] , to = x.range[2] , length = grid.length ) , 683 | intptlat = seq( from = y.range[1] , to = y.range[2] , length = grid.length ) 684 | ) 685 | 686 | 687 | # along your rectangular grid, 688 | # what are the predicted values of 689 | # the poverty rate? 690 | krig.grd$kout <- predict( krig.fit , krig.grd ) 691 | 692 | # alternate grid using gam.fit 693 | gam.grd$gamout <- predict( gam.fit , gam.grd ) 694 | 695 | # interpolation option three # 696 | library(spatstat) 697 | 698 | smoout <- 699 | Smooth( 700 | ppp( 701 | x$intptlon , 702 | x$intptlat , 703 | x.range , 704 | y.range , 705 | marks = x$povrate 706 | ) , 707 | # here's a good starting point for sigma, but screw around with this value. 708 | sigma = ( max( x$povrate ) - min( x$povrate ) ) / 2 , 709 | weights = x$weight 710 | ) 711 | 712 | smoo.grd <- 713 | expand.grid( 714 | intptlon = seq( from = smoout$xrange[1] , to = smoout$xrange[2] , length = smoout$dim[1] ) , 715 | intptlat = seq( from = smoout$yrange[1] , to = smoout$yrange[2] , length = smoout$dim[2] ) 716 | ) 717 | 718 | smoo.grd$smoout <- as.numeric( t( smoout$v ) ) 719 | 720 | # # end of step 8 # # 721 | # # # # # # # # # # # 722 | 723 | 724 | # # # # # # # # # # # # # # # # # # # # # 725 | # # step 9: ggplot and choose options # # 726 | 727 | library(ggplot2) 728 | library(mapproj) 729 | 730 | 731 | # # # psa # # # 732 | # capping your outliers might drastically change your map. 733 | # if you find the 25th percentile and 75th percentile with 734 | # summary( krig.grd$kout ) 735 | # and then replace all `kout` values below the 25th or above the 75th 736 | # with those capped percentile endpoints, i promise promise promise 737 | # your maps will appear quite different. you could cap at the 25th and 75th with.. 738 | # grd.sum <- summary( krig.grd$kout ) 739 | # krig.grd[ krig.grd$kout > grd.sum[ 5 ] , 'kout' ] <- grd.sum[ 5 ] 740 | # krig.grd[ krig.grd$kout < grd.sum[ 2 ] , 'kout' ] <- grd.sum[ 2 ] 741 | # # # end # # # 742 | 743 | 744 | # you don't want to cap at the 25th and 75th? 745 | # well consider one other idea: cap within the minimum and maximum of the state 746 | # this will also increase the visible gradient ultimately plotted. 
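# if you'd rather not write a helper function for that, base r's pmax() and pmin()
# do the same capping. a minimal sketch with made-up rates and made-up caps:
z <- c( 0.02 , 0.10 , 0.45 )
pmin( pmax( z , 0.05 ) , 0.30 )
# returns 0.05 0.10 0.30 -- everything outside [ 0.05 , 0.30 ] gets pulled to the boundary,
# which is exactly what the state-minimum/state-maximum capping below does to the grids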
747 | 748 | # subset `x` to connecticut only 749 | ctx <- subset( x , gestfips == 9 ) 750 | 751 | # capture the connecticut statewide minimum and maximum known poverty rates 752 | state.sum <- summary( ctx$povrate ) 753 | 754 | # if a numeric vector has values below the minimum or above the maximum, cap 'em 755 | minnmax.at.statesum <- 756 | function( z , ss ){ 757 | z[ z < ss[ 1 ] ] <- ss[ 1 ] 758 | z[ z > ss[ 6 ] ] <- ss[ 6 ] 759 | z 760 | } 761 | 762 | # min and max all numeric values. this makes the gradient much more visible. 763 | krig.grd$kout <- minnmax.at.statesum( krig.grd$kout , state.sum ) 764 | gam.grd$gamout <- minnmax.at.statesum( gam.grd$gamout , state.sum ) 765 | smoo.grd$smoout <- minnmax.at.statesum( smoo.grd$smoout , state.sum ) 766 | # this moderates the effect of adjacent-state high-poverty areas. 767 | # you're not discounting the poverty rates in connecticut's border states, 768 | # but you're forcing the richest and poorest spots on the map to be within state borders 769 | 770 | 771 | # initiate the krige-based plot 772 | krg.plot <- 773 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) + 774 | geom_tile( data = krig.grd , aes( fill = kout ) ) 775 | 776 | # initiate the gam-based plot 777 | gam.plot <- 778 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) + 779 | geom_tile( data = gam.grd , aes( fill = gamout ) ) 780 | 781 | # initiate the smooth-based plot 782 | smooth.plot <- 783 | ggplot( data = smoo.grd , aes( x = intptlon , y = intptlat ) ) + 784 | geom_tile( data = smoo.grd , aes( fill = smoout ) ) 785 | 786 | # view all three grids! 787 | krg.plot 788 | gam.plot 789 | smooth.plot 790 | 791 | # choose a projection. here's one using albers on the connecticut borders 792 | co <- coord_map( project = "albers" , lat0 = min( ctx$intptlat ) , lat1 = max( ctx$intptlat ) ) 793 | 794 | # force this projection to work on all object types 795 | co2 <- co 796 | class(co2) <- c( "hoge" , class( co2 ) ) 797 | is.linear.hoge <- function(coord) TRUE 798 | 799 | # initiate the entire plot 800 | the.plot <- 801 | 802 | # choose only one of the three interpolation grids 803 | krg.plot + 804 | # gam.plot + 805 | # smooth.plot + 806 | 807 | # include the projection requirements 808 | co2 + 809 | coord_fixed() + 810 | 811 | # blank out the legend and axis labels 812 | theme( 813 | legend.position = "none" , 814 | axis.title.x = element_blank() , 815 | axis.title.y = element_blank() 816 | ) + 817 | 818 | # blank out other plot elements 819 | scale_x_continuous( limits = x.range , breaks = NULL , oob = squish ) + 820 | scale_y_continuous( limits = y.range , breaks = NULL , oob = squish ) + 821 | theme( 822 | panel.grid.major = element_blank(), 823 | panel.grid.minor = element_blank(), 824 | panel.background = element_blank(), 825 | panel.border = element_blank(), 826 | axis.ticks = element_blank() 827 | ) 828 | 829 | # print the plot to the screen 830 | the.plot 831 | 832 | # # end of step 9 # # 833 | # # # # # # # # # # # 834 | 835 | 836 | # # # # # # # # # # # # # # # # # # 837 | # # step 10: blank, color, save # # 838 | 839 | library(ggplot2) 840 | library(scales) 841 | library(raster) 842 | library(plyr) 843 | library(rgeos) 844 | 845 | # draw a rectangle 15% bigger than the original state 846 | ct.shp.blank <- as( 1.3 * extent( ct.shp ), "SpatialPolygons" ) 847 | 848 | # compute the difference between connecticut and the rectangle 15% beyond the borders 849 | ct.shp.diff <- gDifference( ct.shp.blank , ct.shp ) 850 | 851 | # prepare the difference 
layer for ggplot2 852 | outside <- fortify( ct.shp.diff ) 853 | 854 | # fix the islands 855 | outside2 <- ddply( outside , .(piece) , function(x) rbind( x , outside[ 1 , ] ) ) 856 | 857 | # blank out coastal areas 858 | blank.layer <- 859 | geom_polygon( 860 | data = outside2 , 861 | aes( x = long , y = lat , group = id ) , 862 | fill = 'white' 863 | ) 864 | 865 | # closer, eh? 866 | the.plot + blank.layer 867 | 868 | # store this plot 869 | the.plot <- the.plot + blank.layer 870 | 871 | # location of all water files within the state of connecticut 872 | water.files <- 873 | paste0( 874 | "ftp://ftp2.census.gov/geo/tiger/TIGER2013/AREAWATER/tl_2013_" , 875 | str_pad( ascc[ , 1 ] , 2 , pad = '0' ) , 876 | str_pad( ascc[ , 2 ] , 3 , pad = '0' ) , 877 | "_areawater.zip" 878 | ) 879 | 880 | # initiate another temporary file 881 | watemp <- tempfile() 882 | 883 | # start with all missing water 884 | all.water <- NULL 885 | 886 | # loop through each water.file 887 | for ( fn in water.files ){ 888 | 889 | # download the shapefile to the local disk 890 | download_cached( fn , watemp ) 891 | 892 | # unzip it into a temporary directory 893 | z <- unzip( watemp , exdir = tempdir() ) 894 | 895 | # read it in.. 896 | w <- readShapePoly( z[ grep( 'shp$' , z ) ] ) 897 | 898 | # ..prepare it for ggplot2.. 899 | wo <- fortify( w ) 900 | 901 | # ..fix any islands or weird outlying shapes 902 | w2 <- ddply( wo , .( piece ) , function( x ) rbind( x , wo[ 1 , ] ) ) 903 | 904 | # create a white-filled layer 905 | wl <- geom_polygon( data = w2 , aes( x = long , y = lat , group = group ) , fill = 'white' ) 906 | 907 | # add it to the current plot 908 | the.plot <- the.plot + wl 909 | 910 | } 911 | 912 | # print with all water blanked out. 913 | the.plot 914 | 915 | # print with the same purty colors 916 | the.plot + scale_fill_gradient( low = muted( 'blue' ) , high = muted( 'red' ) ) 917 | 918 | the.plot + scale_fill_gradient( low = 'white' , high = 'red' ) 919 | 920 | the.plot + scale_fill_gradient( low = 'green' , high = 'red' ) 921 | 922 | # ooh i like that one mom, can we keep it can we keep it? 923 | final.plot <- the.plot + scale_fill_gradient( low = 'green' , high = 'red' ) 924 | 925 | final.plot 926 | 927 | # would you like to save this game? 
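# just an aside before you save: if you'd rather control the output size directly
# than rely on scale = , ggsave() also accepts width , height , and dpi arguments.
# the values below are only an example, not a recommendation:

# ggsave(
# 	"2012 connecticut poverty rate.png" ,
# 	plot = final.plot ,
# 	width = 10 ,
# 	height = 8 ,
# 	dpi = 300
# )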
928 | 929 | # save the file to your current working directory 930 | ggsave( 931 | "2012 connecticut poverty rate.pdf" , 932 | plot = final.plot , 933 | scale = 2 934 | ) 935 | 936 | # # end of step ten # # 937 | # # # # # # # # # # # # 938 | -------------------------------------------------------------------------------- /how to map the demographic and health surveys.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # demographic and health surveys 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # uses the prevR package, which automates the interpolation 20 | # kernel density estimation with adaptive bandwidth, rather than kriging 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # exact interview locations within the country 28 | 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/demographic%20and%20health%20surveys%20%28dhs%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/Demographic%20and%20Health%20Surveys 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # infant mortality 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # egypt's nile-dwelling population can distort estimates 56 | # requires microdata with gps coordinates, often not available 57 | 58 | 59 | # # # # # # # # # # # # # # # # # # # # # 60 | # # step 1: load the survey microdata # # 61 | 62 | library(downloader) 63 | 64 | # download the 2008 egyptian demographic and health survey microdata onto the local disk 65 | # note that this requires (free) registration before the download will work 66 | # http://dhsprogram.com/data/Access-Instructions.cfm 67 | your.username <- "username" 68 | your.password <- "password" 69 | your.project <- "project" 70 | source_url( "https://raw.github.com/ajdamico/asdfree/master/Demographic%20and%20Health%20Surveys/download%20and%20import.R" , prompt = FALSE , echo = TRUE ) 71 | 72 | 73 | # # end of step 1 # # 74 | # # # # # # # # # # # 75 | 76 | 77 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 78 | # # step 2: calculate interview cluster-specific values # # 79 | 80 | library(foreign) 81 | library(sqldf) 82 | 83 | # load the children's recode file as a data.frame `ch` 84 | ch <- read.dta( "./Egypt/Standard DHS 2008/Children's Recode/stata/EGKR5AFL.DTA" ) 85 | 86 | # retain only the columns necessary for the analysis 87 | # v001 cluster number 88 | # v005 Sample weight 89 | # b2 year of birth 90 | # b5 child is alive 91 | # b6 age at death (1xx in days, 2xx in months, 3xx in years) 92 | ch <- ch[ , c( 'v001' , 'v005' , 'b2' , 'b5' , 'b6' ) ] 93 | 94 | # limit the sample to only children 
born between 2004 and 2008
95 | # with a non-missing age at death
96 | ch <- subset( ch , b2 %in% 2004:2008 & !( b6 %in% 997:999 ) )
97 | # that is, births during roughly the five years preceding the survey.
98 | 
99 | # create a binary infant mortality variable
100 | ch$im <- as.numeric( ch$b5 %in% 'no' & ch$b6 %in% 100:301 )
101 | 
102 | # create a weight variable
103 | ch$w <- ch$v005 / 1000000
104 | 
105 | # note that this is very close to (but not exactly the same as)
106 | # the nationwide egyptian infant mortality rate given on their report
107 | # table 10.1 http://dhsprogram.com/pubs/pdf/FR220/FR220.pdf#page=148
108 | weighted.mean( ch$im , ch$w ) * 1000
109 | # their report says 24.5 per 1,000.
110 | # the current microdata shows 24.2 per 1,000. big whoop.
111 | 
112 | # calculate four statistics, grouped by survey cluster
113 | # count, weighted count, infant deaths, weighted infant deaths
114 | cl <- sqldf( "select v001 as dhsclust , count(*) as denom , sum( im ) as numer , sum( w ) as wdenom , sum( im * w ) as wnumer from ch group by v001" )
115 | 
116 | # that was easy, huh? want to look at your resultant cluster-level information?
117 | head( cl )
118 | 
119 | tail( cl )
120 | 
121 | # # end of step 2 # #
122 | # # # # # # # # # # #
123 | 
124 | 
125 | # # # # # # # # # # # # # # # # # # # # #
126 | # # step 3: prepare your prevR object # #
127 | 
128 | library(prevR)
129 | library(foreign)
130 | 
131 | # what country are you mapping? egypt?
132 | # oh, okay. well then let's grab egypt's boundaries.
133 | bounds <- create.boundary( "Egypt" )
134 | # that was it? that was too easy.
135 | 
136 | # import the longitude and latitude data from the
137 | # geographical information system (gis) files (this is a special request)
138 | longlat <- read.dbf( "./Egypt/Standard DHS 2008/Supplemental/flat/EGGE5DFL.dbf" )
139 | 
140 | # convert all column names to lowercase
141 | names( longlat ) <- tolower( names( longlat ) )
142 | 
143 | # merge this cluster information onto the cluster-level results data.frame
144 | x <- merge( cl , longlat[ , c( 'dhsclust' , 'longnum' , 'latnum' , 'source' ) ] )
145 | 
146 | # confirm that every cluster that you have infant mortality information for
147 | # also has a longitude & latitude variable now
148 | stopifnot( nrow( x ) == nrow( cl ) )
149 | 
150 | # check how many clusters are missing coordinates
151 | miss.coord <- nrow( subset( x , source == 'MIS' ) )
152 | 
153 | # discard records with missing longitudes & latitudes
154 | x <- subset( x , source != 'MIS' )
155 | 
156 | # confirm you've tossed the correct number of records
157 | stopifnot( nrow( x ) + miss.coord == nrow( cl ) )
158 | 
159 | # identify which columns are integer types
160 | ic <- sapply( x , is.integer )
161 | 
162 | # coerce every integer column to numeric
163 | x[ , ic ] <- sapply( x[ , ic ] , as.numeric )
164 | 
165 | # create a prevR object like a professional.
166 | pro <- 
167 | 	as.prevR(
168 | 		x , 
169 | 		c( id = "dhsclust" , x = "longnum" , y = "latnum" , n = "denom" , wn = "wdenom" , pos = "numer" , wpos = "wnumer" ) ,
170 | 		bounds
171 | 	)
172 | 
173 | # want to take a first glance at the sampling clusters
174 | # of the 2008 egypt demographic and health surveys?
175 | plot( pro )
176 | # woah.
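# one more note on the `ch$w <- ch$v005 / 1000000` line back in step 2:
# dhs recode files store the sample weight (v005) as an integer scaled by one million,
# so dividing by 1e6 simply recovers the actual weight. a toy check with made-up values:
toy.v005 <- c( 1234567 , 876543 )
toy.v005 / 1000000
# 1.234567 and 0.876543 -- weights that average out to about one across the full sample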
177 | 178 | # # end of step 3 # # 179 | # # # # # # # # # # # 180 | 181 | 182 | # # # # # # # # # # # # # # # # # 183 | # # step 4: make a simple map # # 184 | 185 | # compute bandwidths 186 | pro <- rings( pro , N = 1000 ) 187 | 188 | # compute surfaces 189 | pro.map <- kde( pro , N = 1000 , nb.cells = 250 ) 190 | 191 | # plot a simple map comparing 192 | # weighted and unweighted surfaces 193 | spplot( pro.map ) 194 | 195 | # re-create a simple weighted surface, 196 | # but with a prevR palette 197 | spplot( 198 | pro.map, 199 | 'k.wprev.N1000.RInf' , 200 | cuts = 100 , 201 | col.regions = prevR.colors.red( 101 ) , 202 | main = "regional trends of infant mortality" 203 | ) 204 | 205 | # # end of step 4 # # 206 | # # # # # # # # # # # 207 | 208 | 209 | # # # # # # # # # # # # 210 | # # step 5: outline # # 211 | 212 | library(rgeos) 213 | library(raster) 214 | library(rgdal) 215 | library(ggplot2) 216 | library(downloader) 217 | 218 | # load the download_cached and related functions 219 | # to prevent re-downloading of files once they've been downloaded. 220 | source_url( 221 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 222 | prompt = FALSE , 223 | echo = FALSE 224 | ) 225 | 226 | # # note on outline geography selection 227 | # # i am intentionally using sites that host 228 | # # data from many/most/every country worldwide 229 | # # so that this script can be easily extended 230 | # # to whatever country you're working on ;) 231 | 232 | # # # map of the world # # # 233 | 234 | # initiate a temporary file 235 | tf <- tempfile() 236 | 237 | # use eurostat's map of the world 238 | world.fn <- "http://epp.eurostat.ec.europa.eu/cache/GISCO/geodatafiles/CNTR_2014_03M_SH.zip" 239 | 240 | # store it to the local disk 241 | download_cached( world.fn , tf ) 242 | 243 | # unzip it 244 | world.uz <- unzip( tf , exdir = tempdir() ) 245 | 246 | # identify the shapefile 247 | world.sfn <- grep( 'CNTR_RG(.*)shp$' , world.uz , value = TRUE ) 248 | 249 | # read it in 250 | world.shp <- readOGR( world.sfn , layer = gsub( "\\.shp" , "" , basename( world.sfn ) ) ) 251 | world.shp <- spTransform( world.shp , CRS( "+proj=longlat" ) ) 252 | 253 | # here's the outline of every country in the world 254 | plot( world.shp , fill = 'gray' ) 255 | 256 | # # # map of the nile # # # 257 | 258 | # use ucdavis's map of egypt's waterways 259 | ucd.fn <- "http://biogeo.ucdavis.edu/data/diva/wat/EGY_wat.zip" 260 | 261 | # store it to the local disk 262 | download_cached( ucd.fn , tf ) 263 | 264 | # unzip it 265 | ucd.uz <- unzip( tf , exdir = tempdir() ) 266 | 267 | # this file contains lots of information. 
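# (an aside: grep( ... , value = TRUE ) returns the matching text itself rather than
#  its position, which is how the shapefile gets picked out of each unzipped archive.
#  a toy with invented file names:
#  grep( "water_areas(.*)shp$" , c( "EGY_water_areas_dcw.shp" , "EGY_water_areas_dcw.dbf" ) , value = TRUE )
#  returns only the first element.)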
268 | ucd.uz 269 | 270 | # identify the waterways shapefile 271 | waterways.sfn <- grep( "water_areas(.*)shp$" , ucd.uz , value = TRUE ) 272 | 273 | # read it in 274 | waterways.shp <- readOGR( waterways.sfn , layer = gsub( "\\.shp" , "" , basename( waterways.sfn ) ) ) 275 | waterways.shp <- spTransform( waterways.shp , CRS( "+proj=longlat" ) ) 276 | 277 | # here's the outline of all water in egypt 278 | plot( waterways.shp ) 279 | 280 | # here's the outline of only the nile + lakes + suez canal 281 | plot( subset( waterways.shp , grepl( "LAKE|CANAL|NILE" , NAME ) ) ) 282 | 283 | # # # keep going into sudan # # # 284 | 285 | # use ucdavis's map of sudan's waterways 286 | sucd.fn <- "http://biogeo.ucdavis.edu/data/diva/wat/SDN_wat.zip" 287 | 288 | # store it to the local disk 289 | download_cached( sucd.fn , tf ) 290 | 291 | # unzip it 292 | sucd.uz <- unzip( tf , exdir = tempdir() ) 293 | 294 | # this file contains lots of information. 295 | sucd.uz 296 | 297 | # identify the waterways shapefile 298 | sudanwater.sfn <- grep( "water_areas(.*)shp$" , sucd.uz , value = TRUE ) 299 | 300 | # read it in 301 | sudanwater.shp <- readOGR( sudanwater.sfn , layer = gsub( "\\.shp" , "" , basename( sudanwater.sfn ) ) ) 302 | sudanwater.shp <- spTransform( sudanwater.shp , CRS( "+proj=longlat" ) ) 303 | 304 | # here's the outline of all water in sudan 305 | plot( sudanwater.shp ) 306 | 307 | # here's the outline of the nile into sudan 308 | plot( subset( sudanwater.shp , grepl( "NILE|LAKE" , NAME ) ) ) 309 | 310 | 311 | # # # map of egyptian states # # # 312 | 313 | # use uc davis's administrative regions 314 | admin.fn <- "http://biogeo.ucdavis.edu/data/diva/adm/EGY_adm.zip" 315 | 316 | # store it to the local disk 317 | download_cached( admin.fn , tf ) 318 | 319 | # unzip it 320 | admin.uz <- unzip( tf , exdir = tempdir() ) 321 | 322 | # this file contains a few different levels 323 | # of administrative borders. 324 | admin.uz 325 | 326 | # identify the national and state border shapefiles 327 | nation.sfn <- grep( "adm0(.*)shp$" , admin.uz , value = TRUE ) 328 | states.sfn <- grep( "adm1(.*)shp$" , admin.uz , value = TRUE ) 329 | 330 | # read in both 331 | nation.shp <- readOGR( nation.sfn , layer = gsub( "\\.shp" , "" , basename( nation.sfn ) ) ) 332 | states.shp <- readOGR( states.sfn , layer = gsub( "\\.shp" , "" , basename( states.sfn ) ) ) 333 | 334 | nation.shp <- spTransform( nation.shp , CRS( "+proj=longlat" ) ) 335 | states.shp <- spTransform( states.shp , CRS( "+proj=longlat" ) ) 336 | 337 | # # ready to stack all four maps? 338 | 339 | # calculate the bounding box of egypt, 10% bigger than the country 340 | bb <- bbox( as( 1.05 * extent( states.shp ) , "SpatialPolygons" ) ) 341 | 342 | # initiate the lowest layer (the world) 343 | plot( world.shp , xlim = bb[ 1 , ] , ylim = bb[ 2 , ] , col = 'gray' , fill = TRUE , border = 'white' ) 344 | 345 | # turn gray off for egypt only, then add state boundaries in gray 346 | plot( states.shp , add = TRUE , col = 'white' , fill = TRUE , border = 'gray' ) 347 | 348 | # add the nile 349 | plot( subset( waterways.shp , grepl( "NILE" , NAME ) ) , add = TRUE , col = 'lightblue' , border = 'lightblue' ) 350 | 351 | # you can add the national border in black if you want 352 | plot( nation.shp , add = TRUE , border = 'black' ) 353 | # but i don't think it's necessary 354 | 355 | # # not bad for a start, huh? 
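# a quick self-contained illustration of what all of those spTransform() calls are doing
# (one made-up point near cairo, nothing read from disk):
library(sp)
toy.point <- SpatialPoints( cbind( 31.24 , 30.04 ) , proj4string = CRS( "+proj=longlat" ) )
proj4string( toy.point )
# every layer above gets pushed onto this same "+proj=longlat" system with spTransform(),
# so the world, the states, the nile, and the ocean can all be drawn on one set of axes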
356 | 
357 | # draw a rectangle 25% bigger than the original state
358 | eg.shp.blank <- as( 1.5 * extent( states.shp ) , "SpatialPolygons" )
359 | 
360 | # draw a rectangle 0% bigger than the original state
361 | eg.box <- as( extent( states.shp ) , "SpatialPolygons" )
362 | 
363 | 
364 | # # worldwide coastlines # #
365 | 
366 | coast.fn <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/physical/ne_10m_ocean.zip"
367 | 
368 | # store it to the local disk
369 | download_cached( coast.fn , tf )
370 | 
371 | # unzip it
372 | coast.uz <- unzip( tf , exdir = tempdir() )
373 | 
374 | # this file contains every coastline everywhere
375 | coast.uz
376 | 
377 | # identify the worldwide ocean
378 | coast.sfn <- grep( "ocean(.*)shp$" , coast.uz , value = TRUE )
379 | 
380 | # read in the ocean
381 | coast.shp <- readOGR( coast.sfn , layer = gsub( "\\.shp" , "" , basename( coast.sfn ) ) )
382 | 
383 | # put the ocean in longlat format
384 | coast.shp <- spTransform( coast.shp , CRS( "+proj=longlat" ) )
385 | 
386 | # limit the egyptian coast shapefile
387 | # to only points within the larger bounding box
388 | eg.coast <- gIntersection( eg.shp.blank , coast.shp )
389 | # so you're not carrying around the whole world's ocean
390 | 
391 | # one more note: `XK` is a disputed region. see the difference?
392 | plot( subset( world.shp , CNTR_ID %in% 'EG' ) )
393 | plot( subset( world.shp , ( CNTR_ID %in% c( 'EG' , 'XK' ) ) ) )
394 | 
395 | # these are huge objects, so in order to conserve ram
396 | # fortify what you need for ggplot2 and toss all other objects.
397 | fcoast <- fortify( eg.coast ) ; rm( eg.coast )
398 | # keep everything but the disputed region plus egypt proper
399 | wshape <- fortify( subset( world.shp , !( CNTR_ID %in% c( 'EG' , 'XK' ) ) ) ) ; rm( world.shp )
400 | fnation <- fortify( nation.shp )
401 | fstate <- fortify( states.shp ) ; rm( states.shp )
402 | fnile <- fortify( subset( waterways.shp , grepl( "NILE|LAKE|CANAL" , NAME ) ) ) ; rm( waterways.shp )
403 | snile <- fortify( subset( sudanwater.shp , grepl( "NILE|LAKE" , NAME ) ) ) ; rm( sudanwater.shp )
404 | # got 'em all, i think. clear up RAM
405 | gc()
406 | 
407 | # # end of step 5 # #
408 | # # # # # # # # # # #
409 | 
410 | 
411 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
412 | # # step 6: re-create the prevR object on a rectangle # #
413 | 
414 | library(prevR)
415 | 
416 | # re-create a prevR object like a professional.
417 | bbpro <- 
418 | 	as.prevR(
419 | 		x , 
420 | 		c( id = "dhsclust" , x = "longnum" , y = "latnum" , n = "denom" , wn = "wdenom" , pos = "numer" , wpos = "wnumer" ) ,
421 | 		as( eg.box , "SpatialPolygons" )
422 | 	)
423 | # note the use of `eg.box` -- the plain rectangle drawn from the state borders above --
424 | # rather than the boundary obtained from the create.boundary function,
425 | # so the smoothed surface fills the entire rectangle
426 | # and can be masked by the border and water layers later on
427 | 
428 | your.N <- 1000
429 | 
430 | # re-compute bandwidths
431 | bbpro <- rings( bbpro , N = your.N )
432 | 
433 | # re-compute surfaces
434 | bbpro.map <- kde( bbpro , N = your.N , nb.cells = 250 )
435 | 
436 | # coerce this result to a data.frame object
437 | map.df <- na.omit( as.data.frame( bbpro.map ) )
438 | 
439 | # name your variable something less mathy
440 | map.df$im <- map.df[ , paste0( "k.wprev.N" , your.N , ".RInf" ) ]
441 | 
442 | # sort and move on.
443 | map.df <- map.df[ order( map.df$x , map.df$y ) , ] 444 | 445 | # # end of step 6 # # 446 | # # # # # # # # # # # 447 | 448 | 449 | # # # # # # # # # # # # # # # # # # # # # # # 450 | # # step 7: decide on your map parameters # # 451 | 452 | library(ggplot2) 453 | library(scales) 454 | library(mapproj) 455 | 456 | 457 | # initiate the simple map 458 | first.map <- 459 | qplot( 460 | x , 461 | y , 462 | data = map.df , 463 | colour = im , 464 | xlab = NULL , 465 | ylab = NULL 466 | ) 467 | 468 | # look at that.. not bad not bad 469 | first.map 470 | 471 | # give this map a geom_tile layer 472 | eg.map <- 473 | ggplot( 474 | map.df , 475 | aes( x = x , y = y ) 476 | ) + geom_tile( aes( fill = im ) ) 477 | 478 | # set the bounding box limits that 479 | # you and i agreed to earlier in the script 480 | # oh, also, remove all map crap 481 | eg.map <- 482 | eg.map + 483 | 484 | xlab( "" ) + ylab( "" ) + 485 | 486 | scale_x_continuous( limits = bb[ 1 , ] , breaks = NULL , oob = squish ) + 487 | 488 | scale_y_continuous( limits = bb[ 2 , ] , breaks = NULL , oob = squish ) + 489 | 490 | theme( 491 | legend.position = "none" , 492 | panel.grid.major = element_blank(), 493 | panel.grid.minor = element_blank(), 494 | panel.background = element_blank(), 495 | panel.border = element_blank(), 496 | axis.ticks = element_blank() 497 | ) 498 | 499 | # cleaner still. 500 | eg.map 501 | 502 | # # end of step 7 # # 503 | # # # # # # # # # # # 504 | 505 | 506 | # # # # # # # # # # # 507 | # # step 8: color # # 508 | 509 | library(ggplot2) 510 | 511 | eg.map + scale_fill_gradient( low = 'green' , high = 'red' ) 512 | 513 | eg.map + scale_fill_gradient( low = 'white' , high = 'blue' ) 514 | 515 | eg.map + scale_fill_gradient( low = muted( 'blue' ) , high = muted( 'red' ) ) 516 | 517 | # and some prevR-custom colors. 
518 | eg.map + scale_fill_gradientn( colours = prevR.colors.gray( 20 ) ) 519 | 520 | eg.map + scale_fill_gradientn( colours = prevR.colors.blue( 20 ) ) 521 | 522 | eg.map + scale_fill_gradientn( colours = prevR.colors.red( 20 ) ) 523 | 524 | # let's save that default 525 | eg.map <- eg.map + scale_fill_gradientn( colours = prevR.colors.red( 20 ) ) 526 | 527 | # # end of step 8 # # 528 | # # # # # # # # # # # 529 | 530 | 531 | # # # # # # # # # # # # # # # # # # # # 532 | # # step 9: draw borders and waters # # 533 | 534 | library(plyr) 535 | library(ggplot2) 536 | 537 | # store this information in a layer 538 | national.border.layer <- geom_path( data = fnation , aes( x = long , y = lat , group = group ) , colour = 'lightgray' ) 539 | 540 | # plot the result if you like 541 | eg.map + national.border.layer 542 | 543 | # store this information in a layer 544 | state.border.layer <- geom_path( data = fstate , aes( x = long , y = lat , group = group ) , colour = 'lightgray' ) 545 | 546 | # plot the result if you like 547 | eg.map + state.border.layer 548 | 549 | # # international borders # # 550 | 551 | # store this information in a layer 552 | international.border.layer <- geom_polygon( data = wshape , aes( x = long , y = lat , group = group ) , fill = 'lightgray' , color = 'white' ) 553 | 554 | # plot the result if you like 555 | eg.map + international.border.layer 556 | 557 | # # coastal areas to blank # # 558 | 559 | # fix islands piecing together 560 | fcoast2 <- ddply( fcoast , .( piece ) , function( x ) rbind( x , fcoast[ 1 , ] ) ) 561 | 562 | # convert this fortified object to a ggplot layer 563 | ocean.layer <- geom_polygon( data = fcoast2 , aes( x = long , y = lat , group = id ) , fill = 'lightblue' ) 564 | 565 | eg.map + ocean.layer 566 | 567 | # construct the nile layer 568 | nile.layer <- 569 | geom_polygon( 570 | data = fnile , 571 | aes( x = long , y = lat , group = group ) , 572 | color = 'lightblue' , fill = 'lightblue' 573 | ) 574 | 575 | # everything you do with the egyptian nile, 576 | # also do with the sudanese nile.. 577 | # to keep the river flowing off of the map 578 | snile.layer <- 579 | geom_polygon( 580 | data = snile , 581 | aes( x = long , y = lat , group = group ) , 582 | color = 'lightblue' , fill = 'lightblue' 583 | ) 584 | 585 | 586 | # closer, eh? 587 | eg.map + nile.layer + snile.layer 588 | 589 | 590 | # # external rectangle to blank # # 591 | 592 | # initiate an external rectangle at the edges of the bounding box 593 | # to blank out the furthest extent of the grid 594 | orect <- 595 | geom_rect( 596 | xmin = bb[ 1 , 1 ] , 597 | xmax = bb[ 1 , 2 ] , 598 | ymin = bb[ 2 , 1 ] , 599 | ymax = bb[ 2 , 2 ] , 600 | color = 'white' , 601 | fill = NA , 602 | size = 4 603 | ) 604 | 605 | eg.map + orect 606 | 607 | # # end of step 9 # # 608 | # # # # # # # # # # # 609 | 610 | 611 | # # # # # # # # # # # # # # # # # 612 | # # step 10: project and save # # 613 | 614 | library(ggplot2) 615 | library(mapproj) 616 | 617 | # save each of the layers, 618 | # in order from bottom to top 619 | final.map <- 620 | eg.map + 621 | international.border.layer + 622 | nile.layer + snile.layer + 623 | ocean.layer + 624 | orect 625 | 626 | # here's the final plot 627 | final.map 628 | 629 | # save the file to your current working directory 630 | ggsave( 631 | "2004-2008 infant mortality rate - unprojected.png" , 632 | plot = final.map 633 | ) 634 | # but that's unprojected. you might prefer a projected map. 
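# before picking a projection below, a minimal sketch of the difference between
# coord_fixed() and coord_map() on two made-up points:
library(ggplot2)
library(mapproj)
toy.df <- data.frame( long = c( 25 , 35 ) , lat = c( 22 , 32 ) )
toy.plot <- ggplot( toy.df , aes( long , lat ) ) + geom_point()
# coord_fixed() keeps the raw longitude/latitude values and only forces the aspect ratio
toy.plot + coord_fixed()
# coord_map() actually re-projects every coordinate through the mapproj package,
# which is also why the projected save takes so much longer to render
toy.plot + coord_map( project = "albers" , lat0 = 22 , lat1 = 32 )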
635 | 636 | # # # pick your projection # # # 637 | 638 | # here are lots of choices. choose wisely. 639 | # final.map + coord_map( project = "albers" , lat0 = bb[ 2 , 1 ] , lat1 = bb[ 2 , 2 ] ) 640 | # final.map + coord_fixed() 641 | # final.map + coord_cartesian() 642 | # final.map + coord_map( "gilbert" ) 643 | # final.map + coord_map( "lagrange" ) 644 | # final.map + coord_map( "stereographic" ) 645 | 646 | 647 | # project this pup. i prefer albers for egypt. 648 | projected.map <- final.map + coord_map( project = "albers" , lat0 = bb[ 2 , 1 ] , lat1 = bb[ 2 , 2 ] ) 649 | 650 | # would you like to save this game? 651 | 652 | # # # fair warning # # # 653 | # the projected map takes hours to render. 654 | # choose carefully from the shapes above, 655 | # then leave this save command running overnight. 656 | 657 | # save the projected plot, which takes longer doesn't it. 658 | ggsave( 659 | "2004-2008 infant mortality rate - projected.png" , 660 | plot = projected.map 661 | ) 662 | 663 | # # end of step ten # # 664 | # # # # # # # # # # # # 665 | -------------------------------------------------------------------------------- /how to map the new york city housing and vacancy survey.R: -------------------------------------------------------------------------------- 1 | # # # # # # # # # # # # # # # # # 2 | # # set the working directory # # 3 | # # # # # # # # # # # # # # # # # 4 | 5 | # setwd( "C:/My Directory/SWMAP/" ) 6 | 7 | 8 | # # # # # # # # # # # # # # # # 9 | # # example survey data set # # 10 | # # # # # # # # # # # # # # # # 11 | 12 | # new york city housing and vacancy survey 13 | 14 | 15 | # # # # # # # # # # # # # # # # # # # # # 16 | # # different from other maps because # # 17 | # # # # # # # # # # # # # # # # # # # # # 18 | 19 | # incorporates shapefiles with more than one starting projection 20 | # binds multiple shapefiles (not layers) to blank out water areas 21 | 22 | 23 | # # # # # # # # # # # # # # # # # # 24 | # # smallest level of geography # # 25 | # # # # # # # # # # # # # # # # # # 26 | 27 | # city boro and subboro areas 28 | # (subboro areas are neighborhoods that map to collections of census tracts) 29 | 30 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 31 | # # asdfree.com blog post for this survey microdata # # 32 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # 33 | 34 | # http://www.asdfree.com/search/label/new%20york%20city%20housing%20and%20vacancy%20survey%20%28nychvs%29 35 | 36 | 37 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 38 | # # r code repository for setup and analysis examples # # 39 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 40 | 41 | # https://github.com/ajdamico/asdfree/tree/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey 42 | 43 | 44 | # # # # # # # # # # # # # 45 | # # value of interest # # 46 | # # # # # # # # # # # # # 47 | 48 | # ratio of persons per room 49 | 50 | 51 | # # # # # # # 52 | # # flaws # # 53 | # # # # # # # 54 | 55 | # the standard errors calculated from this data set are incorrect, so 56 | # the weights used here are even less scientific than in the other maps. 
57 | 58 | 59 | # # # # # # # # # # # # # # # # # # # # # 60 | # # step 1: load the survey microdata # # 61 | 62 | library(downloader) 63 | 64 | # download the 2011 new york city housing and vacancy survey microdata onto the local disk 65 | years.to.download <- 2011 66 | source_url( "https://raw.github.com/ajdamico/asdfree/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE ) 67 | 68 | # # end of step 1 # # 69 | # # # # # # # # # # # 70 | 71 | 72 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 73 | # # step 2: conduct your analysis of interest at the smallest geography allowed # # 74 | 75 | library(survey) 76 | 77 | # following the analysis examples in the r code repository -- 78 | # https://github.com/ajdamico/asdfree/blob/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey/2011%20analysis%20examples.R 79 | # -- calculate the persons per room rate at the smallest available geographic area 80 | 81 | 82 | # note the large cautionary text that the standard errors for this survey are garbage # 83 | 84 | 85 | # set R to produce conservative standard errors instead of crashing 86 | # http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html 87 | options( survey.lonely.psu = "adjust" ) 88 | # this setting matches the MISSUNIT option in SUDAAN 89 | 90 | # load the 2011 data files 91 | load( 'nychvs11.rda' ) 92 | 93 | # create a fake survey design object with just the `hhweight` 94 | # (household weight) variable 95 | occ.d <- svydesign( ~1 , data = occ , weights = ~hhweight ) 96 | # this svydesign() call is incorrect because nychvs 97 | # does not release its sampling clusters 98 | 99 | # this script is useful as an example of how to map city-wide survey data 100 | # but this actual new york city microdata has incorrect standard errors 101 | 102 | 103 | # the persons per room variable has two decimals 104 | occ.d <- update( occ.d , pproom = as.numeric( crowd100 ) / 100 ) 105 | 106 | # calculate the 2011 persons per room rate 107 | svymean( ~ pproom , occ.d ) 108 | 109 | # # examine which geographies are available # # 110 | 111 | # the new york city housing and vacancy survey identifies records 112 | # from subboro (neighborhood) areas within all five boros 113 | svytable( ~ borough + subboro , occ.d ) 114 | 115 | # simply use both of those geographies in the by= argument 116 | # of the `svyby` command, and re-calculate the poverty rates 117 | smallest.area.statistics <- 118 | svyby( 119 | ~ pproom , 120 | ~ borough + subboro , 121 | occ.d , 122 | svymean 123 | ) 124 | # this is the same command as the city-wide calculation above, 125 | # except these results have been broken into smaller areas. 126 | 127 | # these are the statistics to be mapped 128 | print( smallest.area.statistics ) 129 | # the standard errors are a measure of precision, 130 | # their inverse will serve as the mapping weights 131 | 132 | # again! don't forget that 133 | # the new york city housing and vacancy survey's 134 | # standard errors are not computable. 
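# if the design syntax above looks mysterious, here is a minimal sketch of the same
# svydesign() / svymean() pattern on three made-up households (not nychvs microdata):
library(survey)
toy <- data.frame( pproom = c( 0.50 , 1.25 , 0.75 ) , hhweight = c( 100 , 250 , 150 ) )
toy.design <- svydesign( ids = ~ 1 , data = toy , weights = ~ hhweight )
svymean( ~ pproom , toy.design )
# ids = ~ 1 treats every record as its own sampling unit -- fine for a toy,
# but for the real nychvs it's exactly why the standard errors can't be trusted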
135 | 136 | # make this object easier to type 137 | sas <- smallest.area.statistics 138 | 139 | # # end of step 2 # # 140 | # # # # # # # # # # # 141 | 142 | 143 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 144 | # # step 3: download and import necessary geographic crosswalks # # 145 | 146 | library(downloader) 147 | 148 | # load the download_cached and related functions 149 | # to prevent re-downloading of files once they've been downloaded. 150 | source_url( 151 | "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , 152 | prompt = FALSE , 153 | echo = FALSE 154 | ) 155 | 156 | 157 | # create a temporary file containing the census bureau's 158 | # 2010 census summary file #1 for new york state 159 | # then download the file. 160 | sf1ny.tf <- tempfile() 161 | 162 | 163 | download_cached( 164 | "ftp://ftp2.census.gov/census_2010/04-Summary_File_1/New_York/ny2010.sf1.zip" , 165 | sf1ny.tf , 166 | mode = 'wb' 167 | ) 168 | # note: to re-download a file from scratch, add the parameter usecache = FALSE 169 | 170 | 171 | # unzip the summary file #1 files 172 | sf1ny.uz <- unzip( sf1ny.tf , exdir = tempdir() ) 173 | 174 | 175 | # file layout from http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 176 | sf1ny <- 177 | read.fwf( 178 | sf1ny.uz[ grep( "nygeo2010" , sf1ny.uz ) ] , 179 | c( -8 , 3 , -16 , 2 , 3 , -4 , 5 , -4 , 5 , -4 , 6 , 1 , 4 , -106 , 5 , -142 , 9 , -9 , 11 , 12 ) 180 | ) 181 | 182 | # add columns names matching the census bureau, so it's easy to read 183 | names( sf1ny ) <- 184 | c( "sumlev" , "state" , "county" , "cousub" , "place" , "tract" , "blkgrp" , "block" , "zcta5" , "pop100" , "intptlat" , "intptlon" ) 185 | 186 | # summary level 101 has census tracts and census blocks 187 | sf1ny.101 <- subset( sf1ny , sumlev == "101" ) 188 | 189 | # one record per census block in new york state. see? same number. 190 | nrow( sf1ny.101 ) 191 | # https://www.census.gov/geo/maps-data/data/tallies/census_block_tally.html 192 | 193 | # and guess what? the total new york population matches as well. 194 | sum( sf1ny.101$pop100 ) 195 | # http://quickfacts.census.gov/qfd/states/36000.html 196 | 197 | 198 | # separately, read in the crosswalk between new york census tracts and nychvs subboro areas # 199 | # http://www.census.gov/housing/nychvs/data/2011/11subcom1.pdf 200 | # http://www.census.gov/housing/nychvs/data/2011/11subcom2.pdf 201 | nycsb.tf <- tempfile() 202 | 203 | download( "https://raw.githubusercontent.com/ajdamico/asdfree/master/New%20York%20City%20Housing%20and%20Vacancy%20Survey/boro%20and%20subboro%20to%20census%20tract%20crosswalk.csv" , nycsb.tf ) 204 | 205 | nychvs.subboro <- read.csv( nycsb.tf ) 206 | 207 | # also define nyc counties 208 | nyc.counties <- c( '005' , '047' , '061' , '081' , '085' ) 209 | 210 | # ahh, we also need a county fips code to new york city borough match. 
211 | boro.to.county.fips <- 212 | data.frame( 213 | boroname = c( 'Bronx' , 'Brooklyn' , 'Manhattan' , 'Queens' , 'Staten Island' ) , 214 | county = as.numeric( nyc.counties ) 215 | ) 216 | 217 | # merge on the borough county fips codes 218 | bo <- merge( nychvs.subboro , boro.to.county.fips ) 219 | 220 | # confirm no record loss from the previous merge 221 | stopifnot( nrow( bo ) == nrow( nychvs.subboro ) ) 222 | 223 | # rename the `bo` data.frame's census tract column to match `sf1ny.101` 224 | names( bo )[ names( bo ) == 'ctract' ] <- 'tract' 225 | 226 | # rename the `bo` data.frame's boro column to match `sas` 227 | names( bo )[ names( bo ) == 'boro' ] <- 'borough' 228 | 229 | # merge this with the new york state summary file #1.. 230 | sf1.bo <- merge( sf1ny.101 , bo ) 231 | 232 | # ..and guess what? now we have a perfect match with new york city's 2010 population. 233 | sum( sf1.bo$pop100 ) 234 | # http://quickfacts.census.gov/qfd/states/36/3651000.html 235 | 236 | # so now we have a data.frame object with 237 | # one record per census block, 238 | # and also with the two geography-levels 239 | # that match the new york city housing and vacancy survey 240 | head( sf1.bo ) 241 | 242 | # and guess what? 243 | # we've now got the census 2010 weighted populations (field pop100) 244 | # and also each census block's centroid latitude & longitude (fields intptlat + intptlon) 245 | 246 | # # end of step 3 # # 247 | # # # # # # # # # # # 248 | 249 | 250 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 251 | # # step 4: merge the results of your survey analysis with the small-area geography # # 252 | 253 | # confirm that we've created all possible geographies correctly. 254 | 255 | # the number of records in our small area statistics.. 256 | sas.row <- nrow( sas ) 257 | 258 | # ..should equal the number of unique-match-merged records.. 259 | mrow <- nrow( merge( unique( sf1.bo[ , c( 'borough' , 'subboro' ) ] ) , sas ) ) 260 | 261 | # ..and it does/they do. 262 | stopifnot( sas.row == mrow ) 263 | 264 | # now the census block-level new york city census data *could* merge if you wanted it to. 265 | 266 | 267 | # but you don't. yet. 268 | 269 | 270 | # the standard error (the `se` field) is a measure of precision. 271 | print( sas ) 272 | # the smaller the standard error, the more confident you should be 273 | # that the estimate at a particular geography is correct. 274 | 275 | 276 | # so invert it. you heard me. invert it. 277 | sas$invse <- 1 / sas$se 278 | # a smaller standard error indicates more precision. 279 | 280 | # for our purposes, precision can be considered weight! # 281 | 282 | # now we've got the weight that we should give each of our estimates # 283 | 284 | # distribute that weight across all census blocks # 285 | 286 | # aggregate the 2010 census block populations to the geographies that you have. 287 | popsum <- aggregate( sf1.bo$pop100 , by = ( sf1.bo[ , c( 'borough' , 'subboro' ) ] ) , sum ) 288 | 289 | # make the column name meaningful 290 | names( popsum )[ names( popsum ) == 'x' ] <- 'popsum' 291 | 292 | # merge the popsum onto the sasfile 293 | sas <- merge( sas , popsum ) 294 | 295 | # now. 
merge
296 | # the persons per room rate (the variable of interest)
297 | # the inverted standard error (the total weight of the broad geography)
298 | # the population sum (the total population of all census blocks that are part of that geography)
299 | 
300 | x <- merge( sf1.bo , sas )
301 | 
302 | # confirm no record loss
303 | stopifnot( nrow( x ) == nrow( sf1.bo ) )
304 | 
305 | 
306 | # (this is the fun part)
307 | # calculate the weight of each census block
308 | x$weight <- x$invse * ( x$pop100 / x$popsum )
309 | 
310 | # note that weight of all census blocks put together
311 | # sums to the `invse` on the original analysis file
312 | stopifnot( all.equal( sum( x$weight ) , sum( sas$invse ) ) )
313 | 
314 | # remove records with zero population
315 | x <- subset( x , weight > 0 )
316 | 
317 | # scale all weights so that they average to one
318 | x$weight <- x$weight / mean( x$weight )
319 | 
320 | # you're done preparing your data.
321 | # keep only the columns you need.
322 | x <- x[ , c( 'pproom' , 'weight' , 'intptlat' , 'intptlon' ) ]
323 | 
324 | # # end of step 4 # #
325 | # # # # # # # # # # #
326 | 
327 | 
328 | # # # # # # # # # # # # # # # # # # # # # # #
329 | # # step 5: decide on your map parameters # #
330 | 
331 | library(maptools)
332 | library(ggplot2)
333 | library(rgdal)
334 | library(RColorBrewer)
335 | library(scales)
336 | 
337 | # before you ever touch surface smoothing or kriging,
338 | # make some decisions about how you generally want
339 | # your map to look: the projection and coloring
340 | 
341 | # the options below simply use hadley wickham's ggplot2
342 | # with the census block-level persons per room rates and centroids
343 | 
344 | 
345 | # initiate a simple map
346 | nyc.map <-
347 | ggplot( data = x , aes( x = intptlon , y = intptlat ) ) +
348 | geom_point( data = x , aes( colour = pproom ) )
349 | 
350 | # remove all map crap.
351 | nyc.map <-
352 | nyc.map +
353 | 
354 | xlab( "" ) + ylab( "" ) +
355 | 
356 | # force the x and y axis limits to the shape of the city and don't do anything special for off-map values
357 | scale_x_continuous( limits = c( min( x$intptlon ) , max( x$intptlon ) ) , breaks = NULL , oob = squish ) +
358 | # since we're going to add lots of surrounding-area detail!
359 | scale_y_continuous( limits = c( min( x$intptlat ) , max( x$intptlat ) ) , breaks = NULL , oob = squish ) +
360 | 
361 | theme(
362 | legend.position = "none" ,
363 | panel.grid.major = element_blank(),
364 | panel.grid.minor = element_blank(),
365 | panel.background = element_blank(),
366 | panel.border = element_blank(),
367 | axis.ticks = element_blank()
368 | )
369 | 
370 | # print the map without any projection
371 | nyc.map
372 | 
373 | 
374 | # check out some purty colors.
375 | 
376 | # from http://colorbrewer2.org/
377 | 
378 | # three sequential color schemes
379 | YlOrBr.3.p <- colorRampPalette( brewer.pal( 3 , "YlOrBr" ) )
380 | YlOrBr.9.p <- colorRampPalette( brewer.pal( 9 , "YlOrBr" ) )
381 | Purples.9.p <- colorRampPalette( brewer.pal( 9 , "Purples" ) )
382 | 
383 | # one diverging color scheme
384 | RdGy.11.p <- colorRampPalette( rev( brewer.pal( 11 , "RdGy" ) ) )
385 | 
386 | # here's what the map looks like in black n white.
387 | nyc.map + scale_colour_gradient( low = 'white' , high = 'black' ) 388 | 389 | # or one of the colorbrewer schemes 390 | nyc.map + scale_colour_gradientn( colours = YlOrBr.3.p( 100 ) ) 391 | nyc.map + scale_colour_gradientn( colours = YlOrBr.9.p( 100 ) ) 392 | nyc.map + scale_colour_gradientn( colours = RdGy.11.p( 100 ) ) 393 | nyc.map + scale_colour_gradientn( colours = Purples.9.p( 100 ) ) 394 | 395 | # download and read-in the new york city parks shapefile 396 | tf <- tempfile() 397 | 398 | download_cached( "http://www.nyc.gov/html/dpr/nycbigapps/DPR_Parks_001.zip" , tf ) 399 | 400 | z <- unzip( tf , exdir = tempdir() ) 401 | 402 | sfname <- z[ grep( 'shp$' , z ) ] 403 | 404 | # new york city's shapefiles are not in longlat projection format, so 405 | # use `readOGR` instead of `readShapePoly` here to capture the map projection 406 | parks.shp <- readOGR( sfname , layer = gsub( "\\.shp" , "" , basename( sfname ) ) ) 407 | 408 | # convert the shapefile to longlat (which matches the us census bureau's summary file #1) 409 | parks.shp <- spTransform( parks.shp , CRS( "+proj=longlat" ) ) 410 | # now `parks.shp` will overlay properly with us census bureau files. 411 | 412 | # prepare the parks shapefile for ggplot2 413 | parks <- fortify( parks.shp ) 414 | 415 | # the `parks.shp` has a weird circle in the upper-left. 416 | parks <- subset( parks , long > -75 ) 417 | # hack it off by removing all points to the west of -75 longitude 418 | 419 | # create a dark green layer of parks in new york city 420 | park.layer <- 421 | geom_polygon( 422 | data = parks , 423 | aes( x = long , y = lat , group = group ) , 424 | fill = '#abdda4' 425 | ) 426 | 427 | # that's what the map looks like so far.. 428 | nyc.map + park.layer 429 | 430 | # ..save it if you like. 431 | nyc.map <- nyc.map + park.layer 432 | 433 | # oh sorry, if you'd like black & white again, add this segment. 434 | nyc.map + scale_colour_gradient( low = 'white' , high = 'black' ) 435 | 436 | # download and read-in the new york city clipped-to-shoreline borough map 437 | download_cached( "http://www.nyc.gov/html/dcp/download/bytes/nybb_14c.zip" , tf ) 438 | # from http://www.nyc.gov/html/dcp/html/bytes/districts_download_metadata.shtml 439 | 440 | z <- unzip( tf , exdir = tempdir() ) 441 | 442 | sfname <- z[ grep( 'shp$' , z ) ] 443 | 444 | # once again, new york city's shapefiles are not in longlat projection format, so 445 | # use `readOGR` instead of `readShapePoly` here to capture the map projection 446 | boro.shp <- readOGR( sfname , layer = gsub( "\\.shp" , "" , basename( sfname ) ) ) 447 | 448 | # convert the shapefile to longlat (which matches the us census bureau's summary file #1) 449 | boro.shp <- spTransform( boro.shp , CRS( "+proj=longlat" ) ) 450 | # now `boro.shp` will overlay properly with us census bureau files. 451 | 452 | # prepare the boro shapefile for ggplot2 453 | boro <- fortify( boro.shp ) 454 | 455 | # create a borough-border around the city 456 | boro.layer <- 457 | geom_path( 458 | data = boro , 459 | aes( x = long , y = lat , group = group ) 460 | ) 461 | 462 | # and would you look at that? 463 | nyc.map + boro.layer 464 | # some of the city parks are in the ocean. we'll have to snip off those later. 465 | 466 | # let's save all of those attributes, including the coloring 467 | nyc.map <- nyc.map + boro.layer + scale_colour_gradient( low = 'white' , high = 'black' ) 468 | 469 | # and finally, try some projections! 470 | 471 | # which of these do you prefer? 
472 | nyc.map + coord_map( "newyorker" , r = 0 ) 473 | nyc.map + coord_map( "newyorker" , r = 10 ) 474 | nyc.map + coord_map( "newyorker" , r = 20 ) 475 | nyc.map + coord_map( "newyorker" , r = 30 ) 476 | nyc.map + coord_map( "newyorker" , r = 40 ) 477 | 478 | # meh, i like the map better without any. 479 | nyc.map 480 | 481 | # # end of step 5 # # 482 | # # # # # # # # # # # 483 | 484 | 485 | # # # # # # # # # # # # 486 | # # step 6: outline # # 487 | 488 | library(raster) 489 | library(rgdal) 490 | library(RCurl) 491 | 492 | # new york city has lots of water areas within the city limits 493 | # instead of the regular census bureau tiger shapefiles, 494 | # the city government provides the cleanest clipped shapefile. 495 | 496 | # draw a rectangle 5% bigger than the city limits 497 | boro.shp.out <- as( 1.1 * extent( boro.shp ), "SpatialPolygons" ) 498 | 499 | # get the directory listing of the folder with all water layers from census 500 | water.ftp <- 'ftp://ftp2.census.gov/geo/tiger/TIGER2013/AREAWATER/' 501 | 502 | water.files <- 503 | paste0( 504 | water.ftp , 505 | strsplit( 506 | getURL( 507 | water.ftp , 508 | ftp.use.epsv = FALSE , 509 | ftplistonly = TRUE 510 | ) , 511 | '\\s+' )[[ 1 ]] 512 | ) 513 | 514 | # limit these files to new york, new jersey, and connecticut 515 | # you can get more counties than you need, the program will just run a bit slower. 516 | nynjct.water <- 517 | water.files[ 518 | # bergen, hudson, essex, passaic, morris, somerset, union, middlesex, monmouth in new jersey, or 519 | substr( water.files , 61 , 65 ) %in% c( '34003' , '34017' , '34013' , '34031' , '34027' , '34035' , '34039' , '34023' , '34025' ) | 520 | 521 | # rockland, westchester, nassau, suffolk, orange, putnam in new york state 522 | substr( water.files , 61 , 65 ) %in% c( '36087' , '36119' , '36059' , '36103' , '36071' , '36079' ) | 523 | 524 | # fairfield in connecticut 525 | substr( water.files , 61 , 65 ) %in% c( '09001' ) 526 | ] 527 | 528 | # download and extract to a temporary directory 529 | invisible( sapply( nynjct.water , function( x ) { 530 | path <- file.path( tempdir() , basename( x ) ) 531 | download_cached( x , destfile = path , mode = 'wb' ) 532 | unzip( path , exdir = file.path( tempdir() , 'watershps' ) ) 533 | } ) ) 534 | 535 | # read in all shps, and prepend shapefile name to identifiers 536 | shps <- lapply( sub( '\\.zip' , '' , basename( nynjct.water ) ) , function( x ) { 537 | shp <- readOGR( file.path( tempdir() , 'watershps' ) , x ) 538 | shp <- spChFIDs( shp , paste0( x , '_' , sapply( slot( shp , "polygons" ) , slot , "ID" ) ) ) 539 | shp 540 | }) 541 | # this step above removes potential duplicate identifiers 542 | 543 | # rbind to a single object 544 | water.shp <- do.call( rbind , as.list( shps ) ) 545 | 546 | # want to see all of the little waters of the region that can now be blanked out? 547 | plot( water.shp ) 548 | # these water layers do not come with the city's shapefile, so construct them independently. 549 | 550 | # prepare the surrounding region shapefile for ggplot2 551 | water <- fortify( water.shp ) 552 | 553 | 554 | lightblue.water.layer <- 555 | geom_polygon( 556 | data = water , 557 | aes( x = long , y = lat , group = group ) , 558 | fill = 'lightblue' 559 | ) 560 | 561 | # here's the surrounding area water in light blue so you can see it.. 
562 | nyc.map + lightblue.water.layer 563 | 564 | # ..but we actually want to blank this on the final map, 565 | # so make it white on the layer that gets saved 566 | water.layer <- 567 | geom_polygon( 568 | data = water , 569 | aes( x = long , y = lat , group = group ) , 570 | fill = 'white' 571 | ) 572 | 573 | 574 | # outline the surrounding landmasses as well 575 | ccbf.tf <- tempfile() 576 | 577 | # new york city borders the ocean, 578 | # so use the census bureau's cartographic boundary files 579 | # instead of the regular tiger shapefiles 580 | 581 | download_cached( 582 | "http://www2.census.gov/geo/tiger/GENZ2013/cb_2013_us_county_500k.zip" , 583 | ccbf.tf , 584 | mode = 'wb' 585 | ) 586 | 587 | ccbf.uz <- unzip( ccbf.tf , exdir = tempdir() ) 588 | 589 | ccbf.shp <- readShapePoly( ccbf.uz[ grep( 'shp$' , ccbf.uz ) ] ) 590 | 591 | nynjct.shp <- 592 | subset( 593 | ccbf.shp , 594 | GEOID %in% 595 | c( '34003' , '34017' , '34013' , '34031' , '34027' , '34035' , '34039' , '34023' , '34025' , 596 | '36087' , '36119' , '36059' , '36103' , '36071' , '36079' , 597 | '09001' 598 | ) 599 | ) 600 | 601 | # prepare the surrounding region shapefile for ggplot2 602 | nynjct <- fortify( nynjct.shp ) 603 | 604 | # gray out surrounding landmasses 605 | nynjct.layer <- 606 | geom_polygon( 607 | data = nynjct , 608 | aes( x = long , y = lat , group = group ) , 609 | fill = 'lightgray' 610 | ) 611 | 612 | # and would you look at that? 613 | nyc.map + nynjct.layer 614 | # no? we will re-order the layers later. 615 | # but now you've got a blank-gray of the surrounding areas. 616 | 617 | # put it all together, what have you got? 618 | nyc.map + nynjct.layer + lightblue.water.layer 619 | # hey okay. that looks like crap but we can work with it. 620 | 621 | 622 | # # end of step 6 # # 623 | # # # # # # # # # # # 624 | 625 | 626 | # # # # # # # # # # # # # # # # # # 627 | # # step 7: tie knots and krige # # 628 | 629 | library(sqldf) 630 | 631 | # how many knots should you make? # 632 | 633 | # knots are the computationally-intensive part of this process, 634 | # choose as many as your computer and your patience can handle. 635 | 636 | # you should aim for between 100 - 999 knots, 637 | # but numbers closer to 1,000 will overload smaller computers 638 | 639 | # with census microdata, you've often already got easy access to a relevant geographic grouping 640 | # however for new york city, most of the groupings are either too big to be useful as knots, 641 | # or too small that they'll overload smaller computers. 642 | 643 | 644 | # the city of new york contains 645 | nrow( unique( sf1.bo[ , c( 'boroname' , 'neighborhood' ) ] ) ) 646 | # subboro areas, which is too small of a number of knots. 647 | 648 | # the city also has 649 | nrow( unique( sf1.bo[ , c( 'boroname' , 'tract' ) ] ) ) 650 | # census tracts, which will overload most computers 651 | 652 | # county subdivision and place codes do not exist within the city limits. 653 | nrow( unique( sf1.bo[ , c( 'boroname' , 'cousub' , 'place' ) ] ) ) 654 | 655 | # so be creative. take another look at the summary file #1 documentation 656 | # http://www.census.gov/prod/cen2010/doc/sf1.pdf#page=18 657 | # oh hayyyy how about zip code tabulation areas? that could work. 658 | nrow( unique( sf1.bo[ , c( 'boroname' , 'zcta5' ) ] ) ) 659 | # nope, too many for a computer with 3gb of ram. 
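# (hedged aside, not part of the original workflow -- the object name and knot count below are illustrative)
# if a candidate grouping gives you more knots than your computer can handle,
# one crude workaround is to randomly thin that grouping down to a manageable count:
# set.seed( 2011 )
# knot.candidates <- unique( sf1.bo[ , c( 'boroname' , 'zcta5' ) ] )
# knot.candidates <- knot.candidates[ sample( nrow( knot.candidates ) , min( 200 , nrow( knot.candidates ) ) ) , ]
# nrow( knot.candidates )   # now a knot count within the 100 - 999 guideline above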
660 | 
661 | # if you have a powerful computer, you might try tying knots at the census tract-level
662 | # otherwise, the neighborhoods defined by the new york city housing and vacancy survey work.
663 | 
664 | # `sqldf` does not like periods in the data.frame name
665 | # # sf1_bo <- sf1.bo
666 | 
667 | # within each borough x zcta5,
668 | # calculate the population-weighted mean of the coordinates
669 | # and (for smoothing) the weighted share at each borough-zcta5 centroid
670 | # # nyc.knots <-
671 | # # sqldf(
672 | # # "select
673 | # # boroname , zcta5 ,
674 | # # sum( pop100 ) as pop100 ,
675 | # # sum( pop100 * intptlon ) / sum( pop100 ) as intptlon ,
676 | # # sum( pop100 * intptlat ) / sum( pop100 ) as intptlat
677 | # # from sf1_bo
678 | # # group by
679 | # # boroname , zcta5"
680 | # # )
681 | # note: this screws up coordinates that cross the international date line
682 | # or the equator. in the united states, only alaska's aleutian islands do this
683 | # and those geographies will be thrown out later. so it doesn't matter.
684 | 
685 | # clear up RAM
686 | rm( sf1.bo ) ; gc()
687 | 
688 | 
689 | # interpolation option one #
690 | library(fields)
691 | 
692 | # instead, we can let the `fields` package attempt to guess knots for you,
693 | xknots <- cover.design( cbind( x$intptlon , x$intptlat ) , 200 )$design
694 | # # # # note: tying 200 knots instead of 100 knots takes about an hour longer
695 | # # # # but the final map looks a bit better. if you're just passing through, use 100.
696 | 
697 | # you can look at the estimated knots
698 | plot( xknots )
699 | 
700 | krig.fit <-
701 | Krig(
702 | cbind( x$intptlon , x$intptlat ) ,
703 | x$pproom ,
704 | weights = x$weight ,
705 | knots = xknots
706 | # if you computed the knots yourself, you'll need this knots= line instead:
707 | # knots = cbind( nyc.knots$intptlon , nyc.knots$intptlat )
708 | )
709 | 
710 | # that is: what is the (weighted) relationship between
711 | # your variable of interest (persons per room) and
712 | # the x/y points on a grid?
713 | 
714 | # check this out!
715 | surface( krig.fit )
716 | # you're almost there!
717 | 
718 | 
719 | # interpolation option two #
720 | library(mgcv)
721 | 
722 | gam.fit <-
723 | gam(
724 | pproom ~ s(intptlon , intptlat ) ,
725 | weights = weight ,
726 | data = x
727 | )
728 | 
729 | 
730 | # for the third alternative, keep reading.
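# (optional aside, a hedged sketch only)
# before predicting onto a grid in the next step, you can sanity-check the gam fit
# the same way surface() gave a quick look at the krige fit above:
# summary( gam.fit )             # significance and effective degrees of freedom of the 2-d smooth
# plot( gam.fit , scheme = 2 )   # mgcv's filled-contour view of s( intptlon , intptlat )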
731 | 732 | 733 | # # end of step 7 # # 734 | # # # # # # # # # # # 735 | 736 | 737 | # # # # # # # # # # # # # # # # # # # # 738 | # # step 8: make a grid and predict # # 739 | 740 | library(raster) 741 | 742 | x.range <- bbox( boro.shp.out )[ 1 , ] 743 | y.range <- bbox( boro.shp.out )[ 2 , ] 744 | 745 | # add five percent on each side 746 | x.diff <- abs( x.range[ 2 ] - x.range[ 1 ] ) * 0.05 747 | y.diff <- abs( y.range[ 2 ] - y.range[ 1 ] ) * 0.05 748 | 749 | x.range[ 1 ] <- x.range[ 1 ] - x.diff 750 | x.range[ 2 ] <- x.range[ 2 ] + x.diff 751 | y.range[ 1 ] <- y.range[ 1 ] - y.diff 752 | y.range[ 2 ] <- y.range[ 2 ] + y.diff 753 | 754 | # choose the number of ticks (in each direction) on your grid 755 | grid.length <- 500 756 | # # note: smaller grids will render faster 757 | # # (so they're better if you're just playing around) 758 | # # but larger grids will prevent your final plot from 759 | # # being too pixelated, even when zooming in 760 | 761 | 762 | # create some grid data.frame objects, one for each interpolation type 763 | grd <- gam.grd <- krig.grd <- 764 | expand.grid( 765 | intptlon = seq( from = x.range[1] , to = x.range[2] , length = grid.length ) , 766 | intptlat = seq( from = y.range[1] , to = y.range[2] , length = grid.length ) 767 | ) 768 | 769 | 770 | # along your rectangular grid, 771 | # what are the predicted values of 772 | # the number of persons per room? 773 | krig.grd$kout <- predict( krig.fit , krig.grd ) 774 | 775 | # alternate grid using gam.fit 776 | gam.grd$gamout <- predict( gam.fit , gam.grd ) 777 | 778 | # interpolation option three # 779 | library(spatstat) 780 | 781 | smoout <- 782 | Smooth( 783 | ppp( 784 | x$intptlon , 785 | x$intptlat , 786 | x.range , 787 | y.range , 788 | marks = x$pproom 789 | ) , 790 | # here's a good starting point for sigma, but screw around with this value. 791 | sigma = 0.05 , 792 | weights = x$weight 793 | ) 794 | 795 | smoo.grd <- 796 | expand.grid( 797 | intptlon = seq( from = smoout$xrange[1] , to = smoout$xrange[2] , length = smoout$dim[1] ) , 798 | intptlat = seq( from = smoout$yrange[1] , to = smoout$yrange[2] , length = smoout$dim[2] ) 799 | ) 800 | 801 | smoo.grd$smoout <- as.numeric( t( smoout$v ) ) 802 | 803 | # # end of step 8 # # 804 | # # # # # # # # # # # 805 | 806 | 807 | # # # # # # # # # # # # # # # # # # # # # 808 | # # step 9: ggplot and choose options # # 809 | 810 | library(ggplot2) 811 | library(mapproj) 812 | library(scales) 813 | 814 | 815 | # # # psa # # # 816 | # capping your outliers might drastically change your map. 817 | # if you find the 25th percentile and 75th percentile with 818 | # summary( krig.grd$kout ) 819 | # and then replace all `kout` values below the 25th or above the 75th 820 | # with those capped percentile endpoints, i promise promise promise 821 | # your maps will appear quite different. you could cap at the 25th and 75th with.. 822 | # grd.sum <- summary( krig.grd$kout ) 823 | # krig.grd[ krig.grd$kout > grd.sum[ 5 ] , 'kout' ] <- grd.sum[ 5 ] 824 | # krig.grd[ krig.grd$kout < grd.sum[ 2 ] , 'kout' ] <- grd.sum[ 2 ] 825 | # # # end # # # 826 | 827 | 828 | # you don't want to cap at the 25th and 75th? 829 | # well consider one other idea: at least cap at the 5th and 95th of the nation 830 | # this will also increase the visible gradient ultimately plotted. 831 | 832 | # for example, the lowest krigged value is negative. 833 | summary( krig.grd$kout ) 834 | # that's obviously not right. 
835 | 
836 | # if a numeric vector has values below the 5th percentile or above the 95th percentile, cap 'em
837 | minnmax.at.0595 <-
838 | function( z ){
839 | q0595 <- quantile( z , c( 0.05 , 0.95 ) )
840 | z[ z < q0595[ 1 ] ] <- q0595[ 1 ]
841 | z[ z > q0595[ 2 ] ] <- q0595[ 2 ]
842 | z
843 | }
844 | 
845 | # cap all numeric values at the 5th and 95th percentiles.
846 | # krig.grd$kout <- minnmax.at.0595( krig.grd$kout )
847 | # gam.grd$gamout <- minnmax.at.0595( gam.grd$gamout )
848 | # smoo.grd$smoout <- minnmax.at.0595( smoo.grd$smoout )
849 | # sometimes this makes the gradient much more visible.
850 | # but for this statistic, it doesn't do much.
851 | 
852 | # initiate the krige-based plot
853 | krg.plot <-
854 | ggplot( data = krig.grd , aes( x = intptlon , y = intptlat ) ) +
855 | geom_tile( data = krig.grd , aes( fill = kout ) )
856 | 
857 | # initiate the gam-based plot
858 | gam.plot <-
859 | ggplot( data = gam.grd , aes( x = intptlon , y = intptlat ) ) +
860 | geom_tile( data = gam.grd , aes( fill = gamout ) )
861 | 
862 | # initiate the smooth-based plot
863 | smooth.plot <-
864 | ggplot( data = smoo.grd , aes( x = intptlon , y = intptlat ) ) +
865 | geom_tile( data = smoo.grd , aes( fill = smoout ) )
866 | 
867 | # view all three grids!
868 | krg.plot
869 | gam.plot
870 | smooth.plot
871 | 
872 | 
873 | # initiate the entire plot
874 | the.plot <-
875 | 
876 | # choose only one of the three interpolation grids
877 | krg.plot +
878 | # gam.plot +
879 | # smooth.plot +
880 | 
881 | # blank out the legend and axis labels
882 | theme(
883 | legend.position = "none" ,
884 | axis.title.x = element_blank() ,
885 | axis.title.y = element_blank()
886 | ) +
887 | 
888 | xlab( "" ) + ylab( "" ) +
889 | 
890 | # force the x and y axis limits to the shape of the city and don't do anything special for off-map values
891 | scale_x_continuous( limits = c( min( grd$intptlon ) , max( grd$intptlon ) ) , breaks = NULL , oob = squish ) +
892 | # since we're going to add lots of surrounding-area detail!
893 | scale_y_continuous( limits = c( min( grd$intptlat ) , max( grd$intptlat ) ) , breaks = NULL , oob = squish ) +
894 | 
895 | theme(
896 | panel.grid.major = element_blank(),
897 | panel.grid.minor = element_blank(),
898 | panel.background = element_blank(),
899 | panel.border = element_blank(),
900 | axis.ticks = element_blank()
901 | )
902 | 
903 | # print the plot to the screen
904 | the.plot
905 | # this is the bottom layer.
906 | 
907 | # next the park layer, don't forget about the park layer!
908 | the.plot + park.layer
909 | 
910 | # are you alright with saving that? save it.
911 | the.plot <- the.plot + park.layer 912 | 913 | # # end of step 9 # # 914 | # # # # # # # # # # # 915 | 916 | 917 | # # # # # # # # # # # # # # # # # # 918 | # # step 10: blank, color, save # # 919 | 920 | library(ggplot2) 921 | library(scales) 922 | library(raster) 923 | library(plyr) 924 | library(rgeos) 925 | 926 | # draw a rectangle 15% bigger than the original city 927 | boro.shp.blank <- as( 1.3 * extent( boro.shp ), "SpatialPolygons" ) 928 | 929 | # compute the difference between new york city and the rectangle 15% beyond the borders 930 | boro.shp.diff <- gDifference( boro.shp.blank , boro.shp ) 931 | 932 | # prepare the difference layer for ggplot2 933 | outside <- fortify( boro.shp.diff ) 934 | 935 | # fix any weird island polygons 936 | outside2 <- ddply( outside , .(piece) , function(x) rbind( x , outside[ 1 , ] ) ) 937 | 938 | # blank out waterways and coastal areas 939 | blank.layer <- 940 | geom_polygon( 941 | data = outside2 , 942 | aes( x = long , y = lat , group = id ) , 943 | fill = 'white' 944 | ) 945 | 946 | # closer, eh? 947 | the.plot + blank.layer 948 | # that blanks out absolutely everything outside of the city limits 949 | 950 | # store this plot 951 | the.plot <- the.plot + blank.layer 952 | 953 | # but actually, we should add back in the landmasses. like this. 954 | the.plot + nynjct.layer + water.layer 955 | 956 | # store this plot 957 | the.plot <- the.plot + nynjct.layer + water.layer 958 | 959 | # plus the borough outlines. do you want to outline the boroughs? 960 | the.plot + boro.layer 961 | 962 | # store this layer on top of everything. 963 | the.plot <- the.plot + boro.layer 964 | 965 | # print with the same purty colors 966 | the.plot + scale_fill_gradient( low = 'white' , high = 'black' ) 967 | the.plot + scale_fill_gradientn( colours = RdGy.11.p( 100 ) ) 968 | the.plot + scale_fill_gradientn( colours = Purples.9.p( 100 ) ) 969 | the.plot + scale_fill_gradientn( colours = YlOrBr.3.p( 100 ) ) 970 | the.plot + scale_fill_gradientn( colours = YlOrBr.9.p( 100 ) ) 971 | 972 | # ooh i like that one mom, can we keep it can we keep it? 973 | final.plot <- the.plot + scale_fill_gradientn( colours = YlOrBr.9.p( 100 ) ) 974 | 975 | # here's the final plot 976 | final.plot 977 | 978 | # would you like to save this game? 979 | 980 | # use cairo-png as your bitmap type 981 | options( bitmapType = "cairo" ) 982 | 983 | # save the file to your current working directory 984 | ggsave( 985 | "2011 new york city number of persons per room.png" , 986 | plot = final.plot , 987 | scale = 2 , 988 | type = "cairo-png" 989 | ) 990 | # happy? 991 | 992 | # save a silly globular-projected file to your current working directory, 993 | ggsave( 994 | "2011 new york city number of persons per room - globular.png" , 995 | plot = final.plot + coord_map( "globular" , orientation = c( 40.55 , -74.13 , -30 ) ) , 996 | scale = 2 , 997 | type = "cairo-png" 998 | ) 999 | # in case you're into that kinda stuff. 1000 | 1001 | # # end of step ten # # 1002 | # # # # # # # # # # # # 1003 | --------------------------------------------------------------------------------