├── ACKNOWLEDGEMENTS ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── LICENSE_MODEL ├── README.md ├── cubifyanything ├── __init__.py ├── batching.py ├── boxes.py ├── capture_stream.py ├── color.py ├── cubify_transformer.py ├── dataset.py ├── imagelist.py ├── instances.py ├── measurement.py ├── orientation.py ├── pos.py ├── preprocessor.py ├── sensor.py ├── transforms.py └── vit.py ├── data ├── LICENSE_DATA ├── train.txt └── val.txt ├── requirements.txt ├── setup.py ├── teaser.jpg └── tools └── demo.py /ACKNOWLEDGEMENTS: -------------------------------------------------------------------------------- 1 | Portions of this Software may utilize the following copyrighted material, the use of which is hereby acknowledged. 2 | 3 | ------------------------------------------------ 4 | Detectron2 (https://github.com/facebookresearch/detectron2) 5 | Facebook, Inc. and its affiliates. 6 | 7 | Apache License 8 | Version 2.0, January 2004 9 | http://www.apache.org/licenses/ 10 | 11 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 12 | 13 | 1. Definitions. 14 | 15 | "License" shall mean the terms and conditions for use, reproduction, 16 | and distribution as defined by Sections 1 through 9 of this document. 17 | 18 | "Licensor" shall mean the copyright owner or entity authorized by 19 | the copyright owner that is granting the License. 20 | 21 | "Legal Entity" shall mean the union of the acting entity and all 22 | other entities that control, are controlled by, or are under common 23 | control with that entity. For the purposes of this definition, 24 | "control" means (i) the power, direct or indirect, to cause the 25 | direction or management of such entity, whether by contract or 26 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 27 | outstanding shares, or (iii) beneficial ownership of such entity. 28 | 29 | "You" (or "Your") shall mean an individual or Legal Entity 30 | exercising permissions granted by this License. 31 | 32 | "Source" form shall mean the preferred form for making modifications, 33 | including but not limited to software source code, documentation 34 | source, and configuration files. 35 | 36 | "Object" form shall mean any form resulting from mechanical 37 | transformation or translation of a Source form, including but 38 | not limited to compiled object code, generated documentation, 39 | and conversions to other media types. 40 | 41 | "Work" shall mean the work of authorship, whether in Source or 42 | Object form, made available under the License, as indicated by a 43 | copyright notice that is included in or attached to the work 44 | (an example is provided in the Appendix below). 45 | 46 | "Derivative Works" shall mean any work, whether in Source or Object 47 | form, that is based on (or derived from) the Work and for which the 48 | editorial revisions, annotations, elaborations, or other modifications 49 | represent, as a whole, an original work of authorship. For the purposes 50 | of this License, Derivative Works shall not include works that remain 51 | separable from, or merely link (or bind by name) to the interfaces of, 52 | the Work and Derivative Works thereof. 
53 | 54 | "Contribution" shall mean any work of authorship, including 55 | the original version of the Work and any modifications or additions 56 | to that Work or Derivative Works thereof, that is intentionally 57 | submitted to Licensor for inclusion in the Work by the copyright owner 58 | or by an individual or Legal Entity authorized to submit on behalf of 59 | the copyright owner. For the purposes of this definition, "submitted" 60 | means any form of electronic, verbal, or written communication sent 61 | to the Licensor or its representatives, including but not limited to 62 | communication on electronic mailing lists, source code control systems, 63 | and issue tracking systems that are managed by, or on behalf of, the 64 | Licensor for the purpose of discussing and improving the Work, but 65 | excluding communication that is conspicuously marked or otherwise 66 | designated in writing by the copyright owner as "Not a Contribution." 67 | 68 | "Contributor" shall mean Licensor and any individual or Legal Entity 69 | on behalf of whom a Contribution has been received by Licensor and 70 | subsequently incorporated within the Work. 71 | 72 | 2. Grant of Copyright License. Subject to the terms and conditions of 73 | this License, each Contributor hereby grants to You a perpetual, 74 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 75 | copyright license to reproduce, prepare Derivative Works of, 76 | publicly display, publicly perform, sublicense, and distribute the 77 | Work and such Derivative Works in Source or Object form. 78 | 79 | 3. Grant of Patent License. Subject to the terms and conditions of 80 | this License, each Contributor hereby grants to You a perpetual, 81 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 82 | (except as stated in this section) patent license to make, have made, 83 | use, offer to sell, sell, import, and otherwise transfer the Work, 84 | where such license applies only to those patent claims licensable 85 | by such Contributor that are necessarily infringed by their 86 | Contribution(s) alone or by combination of their Contribution(s) 87 | with the Work to which such Contribution(s) was submitted. If You 88 | institute patent litigation against any entity (including a 89 | cross-claim or counterclaim in a lawsuit) alleging that the Work 90 | or a Contribution incorporated within the Work constitutes direct 91 | or contributory patent infringement, then any patent licenses 92 | granted to You under this License for that Work shall terminate 93 | as of the date such litigation is filed. 94 | 95 | 4. Redistribution. 
You may reproduce and distribute copies of the 96 | Work or Derivative Works thereof in any medium, with or without 97 | modifications, and in Source or Object form, provided that You 98 | meet the following conditions: 99 | 100 | (a) You must give any other recipients of the Work or 101 | Derivative Works a copy of this License; and 102 | 103 | (b) You must cause any modified files to carry prominent notices 104 | stating that You changed the files; and 105 | 106 | (c) You must retain, in the Source form of any Derivative Works 107 | that You distribute, all copyright, patent, trademark, and 108 | attribution notices from the Source form of the Work, 109 | excluding those notices that do not pertain to any part of 110 | the Derivative Works; and 111 | 112 | (d) If the Work includes a "NOTICE" text file as part of its 113 | distribution, then any Derivative Works that You distribute must 114 | include a readable copy of the attribution notices contained 115 | within such NOTICE file, excluding those notices that do not 116 | pertain to any part of the Derivative Works, in at least one 117 | of the following places: within a NOTICE text file distributed 118 | as part of the Derivative Works; within the Source form or 119 | documentation, if provided along with the Derivative Works; or, 120 | within a display generated by the Derivative Works, if and 121 | wherever such third-party notices normally appear. The contents 122 | of the NOTICE file are for informational purposes only and 123 | do not modify the License. You may add Your own attribution 124 | notices within Derivative Works that You distribute, alongside 125 | or as an addendum to the NOTICE text from the Work, provided 126 | that such additional attribution notices cannot be construed 127 | as modifying the License. 128 | 129 | You may add Your own copyright statement to Your modifications and 130 | may provide additional or different license terms and conditions 131 | for use, reproduction, or distribution of Your modifications, or 132 | for any such Derivative Works as a whole, provided Your use, 133 | reproduction, and distribution of the Work otherwise complies with 134 | the conditions stated in this License. 135 | 136 | 5. Submission of Contributions. Unless You explicitly state otherwise, 137 | any Contribution intentionally submitted for inclusion in the Work 138 | by You to the Licensor shall be under the terms and conditions of 139 | this License, without any additional terms or conditions. 140 | Notwithstanding the above, nothing herein shall supersede or modify 141 | the terms of any separate license agreement you may have executed 142 | with Licensor regarding such Contributions. 143 | 144 | 6. Trademarks. This License does not grant permission to use the trade 145 | names, trademarks, service marks, or product names of the Licensor, 146 | except as required for reasonable and customary use in describing the 147 | origin of the Work and reproducing the content of the NOTICE file. 148 | 149 | 7. Disclaimer of Warranty. Unless required by applicable law or 150 | agreed to in writing, Licensor provides the Work (and each 151 | Contributor provides its Contributions) on an "AS IS" BASIS, 152 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 153 | implied, including, without limitation, any warranties or conditions 154 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 155 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 156 | appropriateness of using or redistributing the Work and assume any 157 | risks associated with Your exercise of permissions under this License. 158 | 159 | 8. Limitation of Liability. In no event and under no legal theory, 160 | whether in tort (including negligence), contract, or otherwise, 161 | unless required by applicable law (such as deliberate and grossly 162 | negligent acts) or agreed to in writing, shall any Contributor be 163 | liable to You for damages, including any direct, indirect, special, 164 | incidental, or consequential damages of any character arising as a 165 | result of this License or out of the use or inability to use the 166 | Work (including but not limited to damages for loss of goodwill, 167 | work stoppage, computer failure or malfunction, or any and all 168 | other commercial damages or losses), even if such Contributor 169 | has been advised of the possibility of such damages. 170 | 171 | 9. Accepting Warranty or Additional Liability. While redistributing 172 | the Work or Derivative Works thereof, You may choose to offer, 173 | and charge a fee for, acceptance of support, warranty, indemnity, 174 | or other liability obligations and/or rights consistent with this 175 | License. However, in accepting such obligations, You may act only 176 | on Your own behalf and on Your sole responsibility, not on behalf 177 | of any other Contributor, and only if You agree to indemnify, 178 | defend, and hold each Contributor harmless for any liability 179 | incurred by, or claims asserted against, such Contributor by reason 180 | of your accepting any such warranty or additional liability. 181 | 182 | END OF TERMS AND CONDITIONS 183 | 184 | APPENDIX: How to apply the Apache License to your work. 185 | 186 | To apply the Apache License to your work, attach the following 187 | boilerplate notice, with the fields enclosed by brackets "[]" 188 | replaced with your own identifying information. (Don't include 189 | the brackets!) The text should be enclosed in the appropriate 190 | comment syntax for the file format. We also recommend that a 191 | file or class name and description of purpose be included on the 192 | same "printed page" as the copyright notice for easier 193 | identification within third-party archives. 194 | 195 | Copyright [yyyy] [name of copyright owner] 196 | 197 | 198 | Licensed under the Apache License, Version 2.0 (the "License"); 199 | you may not use this file except in compliance with the License. 200 | You may obtain a copy of the License at 201 | 202 | http://www.apache.org/licenses/LICENSE-2.0 203 | 204 | Unless required by applicable law or agreed to in writing, software 205 | distributed under the License is distributed on an "AS IS" BASIS, 206 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 207 | See the License for the specific language governing permissions and 208 | limitations under the License. 209 | 210 | ------------------------------------------------ 211 | Plain-DETR (https://github.com/impiga/Plain-DETR) 212 | 2023 Xi'an Jiaotong University & Microsoft Research Asia. 
213 | 214 | MIT License 215 | 216 | Copyright (c) 2023 Xi'an Jiaotong University and Microsoft Research Asia 217 | 218 | Permission is hereby granted, free of charge, to any person obtaining a copy 219 | of this software and associated documentation files (the "Software"), to deal 220 | in the Software without restriction, including without limitation the rights 221 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 222 | copies of the Software, and to permit persons to whom the Software is 223 | furnished to do so, subject to the following conditions: 224 | 225 | The above copyright notice and this permission notice shall be included in all 226 | copies or substantial portions of the Software. 227 | 228 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 229 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 230 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 231 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 232 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 233 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 234 | SOFTWARE. 235 | 236 | ------------------------------------------------ 237 | WebDataset (https://github.com/webdataset/webdataset) 238 | NVIDIA CORPORATION 239 | 240 | Copyright 2020 NVIDIA CORPORATION. All rights reserved. 241 | 242 | Redistribution and use in source and binary forms, with or without 243 | modification, are permitted provided that the following conditions 244 | are met: 245 | 246 | 1. Redistributions of source code must retain the above copyright notice, 247 | this list of conditions and the following disclaimer. 248 | 249 | 2. Redistributions in binary form must reproduce the above copyright 250 | notice, this list of conditions and the following disclaimer in the 251 | documentation and/or other materials provided with the distribution. 252 | 253 | 3. Neither the name of the copyright holder nor the names of its 254 | contributors may be used to endorse or promote products derived from 255 | this software without specific prior written permission. 256 | 257 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 258 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 259 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 260 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 261 | HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 262 | SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED 263 | TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 264 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 265 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 266 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 267 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 268 | 269 | MMDet3D (https://github.com/open-mmlab/mmdetection3d) 270 | 271 | Copyright 2018-2019 Open-MMLab. All rights reserved. 272 | 273 | Apache License 274 | Version 2.0, January 2004 275 | http://www.apache.org/licenses/ 276 | 277 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 278 | 279 | 1. Definitions. 280 | 281 | "License" shall mean the terms and conditions for use, reproduction, 282 | and distribution as defined by Sections 1 through 9 of this document. 
283 | 284 | "Licensor" shall mean the copyright owner or entity authorized by 285 | the copyright owner that is granting the License. 286 | 287 | "Legal Entity" shall mean the union of the acting entity and all 288 | other entities that control, are controlled by, or are under common 289 | control with that entity. For the purposes of this definition, 290 | "control" means (i) the power, direct or indirect, to cause the 291 | direction or management of such entity, whether by contract or 292 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 293 | outstanding shares, or (iii) beneficial ownership of such entity. 294 | 295 | "You" (or "Your") shall mean an individual or Legal Entity 296 | exercising permissions granted by this License. 297 | 298 | "Source" form shall mean the preferred form for making modifications, 299 | including but not limited to software source code, documentation 300 | source, and configuration files. 301 | 302 | "Object" form shall mean any form resulting from mechanical 303 | transformation or translation of a Source form, including but 304 | not limited to compiled object code, generated documentation, 305 | and conversions to other media types. 306 | 307 | "Work" shall mean the work of authorship, whether in Source or 308 | Object form, made available under the License, as indicated by a 309 | copyright notice that is included in or attached to the work 310 | (an example is provided in the Appendix below). 311 | 312 | "Derivative Works" shall mean any work, whether in Source or Object 313 | form, that is based on (or derived from) the Work and for which the 314 | editorial revisions, annotations, elaborations, or other modifications 315 | represent, as a whole, an original work of authorship. For the purposes 316 | of this License, Derivative Works shall not include works that remain 317 | separable from, or merely link (or bind by name) to the interfaces of, 318 | the Work and Derivative Works thereof. 319 | 320 | "Contribution" shall mean any work of authorship, including 321 | the original version of the Work and any modifications or additions 322 | to that Work or Derivative Works thereof, that is intentionally 323 | submitted to Licensor for inclusion in the Work by the copyright owner 324 | or by an individual or Legal Entity authorized to submit on behalf of 325 | the copyright owner. For the purposes of this definition, "submitted" 326 | means any form of electronic, verbal, or written communication sent 327 | to the Licensor or its representatives, including but not limited to 328 | communication on electronic mailing lists, source code control systems, 329 | and issue tracking systems that are managed by, or on behalf of, the 330 | Licensor for the purpose of discussing and improving the Work, but 331 | excluding communication that is conspicuously marked or otherwise 332 | designated in writing by the copyright owner as "Not a Contribution." 333 | 334 | "Contributor" shall mean Licensor and any individual or Legal Entity 335 | on behalf of whom a Contribution has been received by Licensor and 336 | subsequently incorporated within the Work. 337 | 338 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 339 | this License, each Contributor hereby grants to You a perpetual, 340 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 341 | copyright license to reproduce, prepare Derivative Works of, 342 | publicly display, publicly perform, sublicense, and distribute the 343 | Work and such Derivative Works in Source or Object form. 344 | 345 | 3. Grant of Patent License. Subject to the terms and conditions of 346 | this License, each Contributor hereby grants to You a perpetual, 347 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 348 | (except as stated in this section) patent license to make, have made, 349 | use, offer to sell, sell, import, and otherwise transfer the Work, 350 | where such license applies only to those patent claims licensable 351 | by such Contributor that are necessarily infringed by their 352 | Contribution(s) alone or by combination of their Contribution(s) 353 | with the Work to which such Contribution(s) was submitted. If You 354 | institute patent litigation against any entity (including a 355 | cross-claim or counterclaim in a lawsuit) alleging that the Work 356 | or a Contribution incorporated within the Work constitutes direct 357 | or contributory patent infringement, then any patent licenses 358 | granted to You under this License for that Work shall terminate 359 | as of the date such litigation is filed. 360 | 361 | 4. Redistribution. You may reproduce and distribute copies of the 362 | Work or Derivative Works thereof in any medium, with or without 363 | modifications, and in Source or Object form, provided that You 364 | meet the following conditions: 365 | 366 | (a) You must give any other recipients of the Work or 367 | Derivative Works a copy of this License; and 368 | 369 | (b) You must cause any modified files to carry prominent notices 370 | stating that You changed the files; and 371 | 372 | (c) You must retain, in the Source form of any Derivative Works 373 | that You distribute, all copyright, patent, trademark, and 374 | attribution notices from the Source form of the Work, 375 | excluding those notices that do not pertain to any part of 376 | the Derivative Works; and 377 | 378 | (d) If the Work includes a "NOTICE" text file as part of its 379 | distribution, then any Derivative Works that You distribute must 380 | include a readable copy of the attribution notices contained 381 | within such NOTICE file, excluding those notices that do not 382 | pertain to any part of the Derivative Works, in at least one 383 | of the following places: within a NOTICE text file distributed 384 | as part of the Derivative Works; within the Source form or 385 | documentation, if provided along with the Derivative Works; or, 386 | within a display generated by the Derivative Works, if and 387 | wherever such third-party notices normally appear. The contents 388 | of the NOTICE file are for informational purposes only and 389 | do not modify the License. You may add Your own attribution 390 | notices within Derivative Works that You distribute, alongside 391 | or as an addendum to the NOTICE text from the Work, provided 392 | that such additional attribution notices cannot be construed 393 | as modifying the License. 
394 | 395 | You may add Your own copyright statement to Your modifications and 396 | may provide additional or different license terms and conditions 397 | for use, reproduction, or distribution of Your modifications, or 398 | for any such Derivative Works as a whole, provided Your use, 399 | reproduction, and distribution of the Work otherwise complies with 400 | the conditions stated in this License. 401 | 402 | 5. Submission of Contributions. Unless You explicitly state otherwise, 403 | any Contribution intentionally submitted for inclusion in the Work 404 | by You to the Licensor shall be under the terms and conditions of 405 | this License, without any additional terms or conditions. 406 | Notwithstanding the above, nothing herein shall supersede or modify 407 | the terms of any separate license agreement you may have executed 408 | with Licensor regarding such Contributions. 409 | 410 | 6. Trademarks. This License does not grant permission to use the trade 411 | names, trademarks, service marks, or product names of the Licensor, 412 | except as required for reasonable and customary use in describing the 413 | origin of the Work and reproducing the content of the NOTICE file. 414 | 415 | 7. Disclaimer of Warranty. Unless required by applicable law or 416 | agreed to in writing, Licensor provides the Work (and each 417 | Contributor provides its Contributions) on an "AS IS" BASIS, 418 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 419 | implied, including, without limitation, any warranties or conditions 420 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 421 | PARTICULAR PURPOSE. You are solely responsible for determining the 422 | appropriateness of using or redistributing the Work and assume any 423 | risks associated with Your exercise of permissions under this License. 424 | 425 | 8. Limitation of Liability. In no event and under no legal theory, 426 | whether in tort (including negligence), contract, or otherwise, 427 | unless required by applicable law (such as deliberate and grossly 428 | negligent acts) or agreed to in writing, shall any Contributor be 429 | liable to You for damages, including any direct, indirect, special, 430 | incidental, or consequential damages of any character arising as a 431 | result of this License or out of the use or inability to use the 432 | Work (including but not limited to damages for loss of goodwill, 433 | work stoppage, computer failure or malfunction, or any and all 434 | other commercial damages or losses), even if such Contributor 435 | has been advised of the possibility of such damages. 436 | 437 | 9. Accepting Warranty or Additional Liability. While redistributing 438 | the Work or Derivative Works thereof, You may choose to offer, 439 | and charge a fee for, acceptance of support, warranty, indemnity, 440 | or other liability obligations and/or rights consistent with this 441 | License. However, in accepting such obligations, You may act only 442 | on Your own behalf and on Your sole responsibility, not on behalf 443 | of any other Contributor, and only if You agree to indemnify, 444 | defend, and hold each Contributor harmless for any liability 445 | incurred by, or claims asserted against, such Contributor by reason 446 | of your accepting any such warranty or additional liability. 447 | 448 | END OF TERMS AND CONDITIONS 449 | 450 | APPENDIX: How to apply the Apache License to your work. 
451 | 452 | To apply the Apache License to your work, attach the following 453 | boilerplate notice, with the fields enclosed by brackets "[]" 454 | replaced with your own identifying information. (Don't include 455 | the brackets!) The text should be enclosed in the appropriate 456 | comment syntax for the file format. We also recommend that a 457 | file or class name and description of purpose be included on the 458 | same "printed page" as the copyright notice for easier 459 | identification within third-party archives. 460 | 461 | Copyright 2018-2019 Open-MMLab. 462 | 463 | Licensed under the Apache License, Version 2.0 (the "License"); 464 | you may not use this file except in compliance with the License. 465 | You may obtain a copy of the License at 466 | 467 | http://www.apache.org/licenses/LICENSE-2.0 468 | 469 | Unless required by applicable law or agreed to in writing, software 470 | distributed under the License is distributed on an "AS IS" BASIS, 471 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 472 | See the License for the specific language governing permissions and 473 | limitations under the License. 474 | 475 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, sex characteristics, gender identity and expression, 9 | level of experience, education, socio-economic status, nationality, personal 10 | appearance, race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 
45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies within all project spaces, and it also applies when 49 | an individual is representing the project or its community in public spaces. 50 | Examples of representing a project or community include using an official 51 | project e-mail address, posting via an official social media account, or acting 52 | as an appointed representative at an online or offline event. Representation of 53 | a project may be further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the open source team at [opensource-conduct@group.apple.com](mailto:opensource-conduct@group.apple.com). All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4, 71 | available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html) -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contribution Guide 2 | 3 | Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository. 4 | 5 | While we welcome new pull requests and issues please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged. 6 | 7 | ## Before you get started 8 | 9 | By submitting a pull request, you represent that you have the right to license your contribution to Apple and the community, and agree by submitting the patch that your contributions are licensed under the [LICENSE](LICENSE). 10 | 11 | We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md). 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (C) 2025 Apple Inc. All Rights Reserved. 2 | 3 | IMPORTANT: This Apple software is supplied to you by Apple 4 | Inc. ("Apple") in consideration of your agreement to the following 5 | terms, and your use, installation, modification or redistribution of 6 | this Apple software constitutes acceptance of these terms. If you do 7 | not agree with these terms, please do not use, install, modify or 8 | redistribute this Apple software. 
9 | 10 | In consideration of your agreement to abide by the following terms, and 11 | subject to these terms, Apple grants you a personal, non-exclusive 12 | license, under Apple's copyrights in this original Apple software (the 13 | "Apple Software"), to use, reproduce, modify and redistribute the Apple 14 | Software, with or without modifications, in source and/or binary forms; 15 | provided that if you redistribute the Apple Software in its entirety and 16 | without modifications, you must retain this notice and the following 17 | text and disclaimers in all such redistributions of the Apple Software. 18 | Neither the name, trademarks, service marks or logos of Apple Inc. may 19 | be used to endorse or promote products derived from the Apple Software 20 | without specific prior written permission from Apple. Except as 21 | expressly stated in this notice, no other rights or licenses, express or 22 | implied, are granted by Apple herein, including but not limited to any 23 | patent rights that may be infringed by your derivative works or by other 24 | works in which the Apple Software may be incorporated. 25 | 26 | The Apple Software is provided by Apple on an "AS IS" basis. APPLE 27 | MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION 28 | THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS 29 | FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND 30 | OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS. 31 | 32 | IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL 33 | OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 34 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 35 | INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION, 36 | MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED 37 | AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE), 38 | STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE 39 | POSSIBILITY OF SUCH DAMAGE. 40 | 41 | ------------------------------------------------------------------------------- 42 | SOFTWARE DISTRIBUTED WITH ml-cubifyanything 43 | 44 | The ml-cubifyanything software includes a number of subcomponents with separate 45 | copyright notices and license terms - please see the file ACKNOWLEDGEMENTS. 46 | ------------------------------------------------------------------------------- 47 | -------------------------------------------------------------------------------- /LICENSE_MODEL: -------------------------------------------------------------------------------- 1 | Disclaimer: IMPORTANT: This Apple Machine Learning Research Model is specifically developed and released by Apple Inc. ("Apple") for the sole purpose of scientific research of artificial intelligence and machine-learning technology. “Apple Machine Learning Research Model” means the model, including but not limited to algorithms, formulas, trained model weights, parameters, configurations, checkpoints, and any related materials (including documentation). 2 | This Apple Machine Learning Research Model is provided to You by Apple in consideration of your agreement to the following terms, and your use, modification, creation of Model Derivatives, and or redistribution of the Apple Machine Learning Research Model constitutes acceptance of this Agreement. 
If You do not agree with these terms, please do not use, modify, create Model Derivatives of, or distribute this Apple Machine Learning Research Model or Model Derivatives. 3 | 1. License Scope: In consideration of your agreement to abide by the following terms, and subject to these terms, Apple hereby grants you a personal, non- exclusive, worldwide, non-transferable, royalty-free, revocable, and limited license, to use, copy, modify, distribute, and create Model Derivatives (defined below) of the Apple Machine Learning Research Model exclusively for Research Purposes. You agree that any Model Derivatives You may create or that may be created for You will be limited to Research Purposes as well. “Research Purposes” means non-commercial scientific research and academic development activities, such as experimentation, analysis, testing conducted by You with the sole intent to advance scientific knowledge and research. “Research Purposes” does not include any commercial exploitation, product development or use in any commercial product or service. 4 | 2. Distribution of Apple Machine Learning Research Model and Model Derivatives: If you choose to redistribute Apple Machine Learning Research Model or its Model Derivatives, you must provide a copy of this Agreement to such third party, and ensure that the following attribution notice be provided: “Apple Machine Learning Research Model is licensed under the Apple Machine Learning Research Model License Agreement.” Additionally, all Model Derivatives must clearly be identified as such, including disclosure of modifications and changes made to the Apple Machine Learning Research Model. The name, trademarks, service marks or logos of Apple may not be used to endorse or promote Model Derivatives or the relationship between You and Apple. “Model Derivatives” means any models or any other artifacts created by modifications, improvements, adaptations, alterations to the architecture, 5 | 6 | algorithm or training processes of the Apple Machine Learning Research Model, or by any retraining, fine-tuning of the Apple Machine Learning Research Model. 7 | 3. No Other License: Except as expressly stated in this notice, no other rights or licenses, express or implied, are granted by Apple herein, including but not limited to any patent, trademark, and similar intellectual property rights worldwide that may be infringed by the Apple Machine Learning Research Model, the Model Derivatives or by other works in which the Apple Machine Learning Research Model may be incorporated. 8 | 4. Compliance with Laws: Your use of Apple Machine Learning Research Model must be in compliance with all applicable laws and regulations. 9 | 5. Term and Termination: The term of this Agreement will begin upon your acceptance of this Agreement or use of the Apple Machine Learning Research Model and will continue until terminated in accordance with the following terms. Apple may terminate this Agreement at any time if You are in breach of any term or condition of this Agreement. Upon termination of this Agreement, You must cease to use all Apple Machine Learning Research Models and Model Derivatives and permanently delete any copy thereof. Sections 3, 6 and 7 will survive termination. 10 | 6. Disclaimer and Limitation of Liability: This Apple Machine Learning Research Model and any outputs generated by the Apple Machine Learning Research Model are provided on an “AS IS” basis. 
APPLE MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, REGARDING THE APPLE MACHINE LEARNING RESEARCH MODEL OR OUTPUTS GENERATED BY THE APPLE MACHINE LEARNING RESEARCH MODEL. You are solely responsible for determining the appropriateness of using or redistributing the Apple Machine Learning Research Model and any outputs of the Apple Machine Learning Research Model and assume any risks associated with Your use of the Apple Machine Learning Research Model and any output and results. IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION, MODIFICATION AND/OR DISTRIBUTION OF THE APPLE MACHINE LEARNING RESEARCH MODEL AND ANY OUTPUTS OF THE APPLE MACHINE LEARNING RESEARCH MODEL, HOWEVER CAUSED AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 11 | 7. Governing Law: This Agreement will be governed by and construed under the laws of the State of California without regard to its choice of law principles. The Convention on Contracts for the International Sale of Goods shall not apply to the Agreement except that the arbitration clause and any arbitration hereunder shall be governed by the Federal Arbitration Act, Chapters 1 and 2. 12 | Copyright (c) 2025 Apple Inc. All Rights Reserved. 13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CA-1M and Cubify Anything 2 | 3 | This repository includes the public implementation of Cubify Transformer and the 4 | associated CA-1M dataset. 5 | 6 | ## Paper 7 | 8 | **Apple** 9 | 10 | [Cubify Anything: Scaling Indoor 3D Object Detection](https://arxiv.org/abs/2412.04458) 11 | 12 | Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, Afshin Dehghan 13 | 14 | **CVPR 2025** 15 | 16 | ![Teaser](teaser.jpg?raw=true "Teaser") 17 | 18 | ## Repository Overview 19 | 20 | This repository includes: 21 | 22 | 1. Links to the underlying data and annotations of the CA-1M dataset. 23 | 2. Links to the released Cubify Transformer (CuTR) models from the Cubify Anything paper. 24 | 3. Basic readers and inference code to run CuTR on the provided data. 25 | 4. Basic support for using images captured from your own device using the NeRF Capture app. 26 | 27 | ## Installation 28 | 29 | We recommend Python 3.10 and a recent 2.x build of PyTorch. We include a `requirements.txt` which should cover 30 | all necessary dependencies. Please make sure you have `torch` installed first, e.g.,: 31 | 32 | ``` 33 | pip install torch torchvision 34 | ``` 35 | 36 | Then, within the root of the repository: 37 | 38 | ``` 39 | pip install -r requirements.txt 40 | pip install -e . 41 | ``` 42 | 43 | ## CA-1M versus ARKitScenes? 44 | 45 | This work is related to [ARKitScenes](https://machinelearning.apple.com/research/arkitscenes). We generally share 46 | the same underlying captures. Some notable differences in CA-1M: 47 | 48 | 1. Each scene has been exhaustively annotated with class-agnostic 3D boxes. We release these in the laser scanner's coordinate frame. 49 | 2. For each frame in each capture, we include "per-frame" 3D box ground-truth which was produced using the rendering 50 | process outlined in the Cubify Anything paper.
These annotations are, therefore, *independent* of any pose. 51 | 52 | Some other nice things: 53 | 54 | 1. We release the GT poses (registered to laser scanner) for every frame in each capture. 55 | 2. We release the GT depth (rendered from laser scanner) at 512 x 384 for every frame in each capture. 56 | 3. Each frame has already been oriented into an upright position. 57 | 58 | **NOTE:** CA-1M will only include captures which were successfully registered to the laser scanner. Therefore 59 | not every capture included in ARKitScenes will be present in CA-1M. 60 | 61 | ## Downloading and using the CA-1M data 62 | 63 | ### Data License 64 | 65 | **All data is released under the [CC-by-NC-ND](data/LICENSE_DATA).** 66 | 67 | All links to the data are contained in `data/train.txt` and `data/val.txt`. You can use `curl` to download all files 68 | listed. If you don't need the whole dataset in advance, you can either pass these 69 | links explicitly or pass the split's `txt` file itself and use the `--video-ids` argument to filter the desired videos. 70 | 71 | If you pass the `txt` file, please note that the downloaded files will be cached under `data/[split]`. 72 | 73 | ## Understanding the CA-1M data 74 | 75 | CA-1M is released in WebDataset format, so it is essentially a fancy tar archive 76 | *per* capture (i.e., a video). A single archive `ca1m-[split]-XXXXXXXX.tar` therefore corresponds to all data 77 | of capture XXXXXXXX. 78 | 79 | Both splits are released at full frame rate. 80 | 81 | All data should be neatly loaded by `CubifyAnythingDataset`. Please refer to `dataset.py` for more 82 | specifics on how to read/parse data on disk. Some general pointers: 83 | 84 | ```python 85 | [video_id]/[integer_timestamp].wide/image.png # A 1024x768 RGB image corresponding to the main camera. 86 | [video_id]/[integer_timestamp].wide/depth.png # A 256x192 depth image stored as a UInt16 (as millimeters) derived from the capture device's onboard LiDAR (ARKit depth). 87 | [video_id]/[integer_timestamp].wide/depth/confidence.tiff # A 256x192 confidence image storing the [0, 1] confidence value of each depth measurement (currently unused). 88 | [video_id]/[integer_timestamp].wide/instances.json # A list of GT instances alongside their 3D boxes (i.e., the result of the GT rendering process). 89 | [video_id]/[integer_timestamp].wide/T_gravity.json # A rotation matrix which encodes the pitch/roll of the camera, which we assume is known (e.g., from the IMU). 90 | 91 | [video_id]/[integer_timestamp].gt/RT.json # A 4x4 (row major) JSON-encoded matrix corresponding to the registered pose in the laser-scanner space. 92 | [video_id]/[integer_timestamp].gt/depth.png # A 512x384 depth image stored as a UInt16 (as millimeters) derived from the FARO laser scanner registration. 93 | 94 | ``` 95 | 96 | Note that since we have already oriented the images, these dimensions may be transposed. GT depth may have 0 values which correspond to unregistered points. 97 | 98 | An additional file is included as `[video_id]/world.gt/instances.json` which corresponds to the full world set of 3D annotations from which 99 | the per-frame labels are generated. These instances include some structural labels: `wall`, `floor`, `ceiling`, and `door_frame`, which 100 | might aid in rendering. 101 | 102 | ## Visualization 103 | 104 | We include visualization support using [rerun](https://rerun.io). Visualization should happen 105 | automatically. If you only want to visualize the data without running any models, use `--viz-only`.
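If you just want to poke at a downloaded archive directly (for example, to sanity-check the depth encoding or skim a frame's `instances.json`) without going through `CubifyAnythingDataset` or the visualizer, a minimal sketch using only the standard library, NumPy, and Pillow is shown below. The archive path is hypothetical, and the sketch relies only on the layout and UInt16-millimeter depth encoding documented above; the exact schema of `instances.json` is best checked against `dataset.py`.

```python
import io
import json
import tarfile

import numpy as np
from PIL import Image

# Hypothetical path to one downloaded capture archive.
tar_path = "ca1m-val-42898570.tar"

with tarfile.open(tar_path) as tar:
    for member in tar.getmembers():
        if member.name.endswith(".wide/instances.json"):
            # Per-frame GT instances (a JSON list; see dataset.py for the exact schema).
            instances = json.load(tar.extractfile(member))
            print(member.name, len(instances), "instances")
        elif member.name.endswith(".gt/depth.png"):
            # UInt16 depth in millimeters; 0 marks unregistered points.
            depth_mm = np.array(Image.open(io.BytesIO(tar.extractfile(member).read())))
            depth_m = depth_mm.astype(np.float32) / 1000.0
            print(member.name, depth_m.shape, "max %.2f m" % depth_m.max())
```

For anything beyond a quick inspection, prefer `CubifyAnythingDataset` (see `dataset.py`), which handles this decoding for you.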
106 | 107 | During inference, you may wish to inspect the 3D accuracy of the predictions. We support 108 | visualizing the predictions on the GT point cloud (derived from Faro depth) when using 109 | the `--viz-on-gt-points` flag. 110 | 111 | ### Sample command 112 | 113 | ``` bash 114 | python tools/demo.py [path_to_downloaded_data]/ca1m-val-42898570.tar --viz-only 115 | ``` 116 | 117 | ``` bash 118 | python tools/demo.py data/train.txt --viz-only --video-ids 45261548 119 | ``` 120 | 121 | ## Skipping Frames 122 | 123 | The data is provided at a high frame rate, so using `--every-nth-frame N` will only 124 | process every Nth frame. 125 | 126 | ## Running the CuTR models 127 | 128 | **All models are released under the Apple ML Research Model Terms of Use in [LICENSE_MODEL](LICENSE_MODEL).** 129 | 130 | 1. [RGB-D](https://ml-site.cdn-apple.com/models/cutr/cutr_rgbd.pth) 131 | 2. [RGB](https://ml-site.cdn-apple.com/models/cutr/cutr_rgb.pth) 132 | 133 | Models can be provided to `demo.py` using the `--model-path` argument. We detect whether this is an RGB 134 | or RGB-D model and disable depth accordingly. 135 | 136 | ### RGB-D 137 | 138 | The first variant of CuTR expects an RGB image and a metric depth map. We train on ARKit depth, 139 | although you may find it works with other metric depth estimators as well. 140 | 141 | #### Sample Command 142 | 143 | If your computer is MPS enabled: 144 | 145 | ``` bash 146 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device mps 147 | ``` 148 | 149 | If your computer is CUDA enabled: 150 | 151 | ``` bash 152 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device cuda 153 | ``` 154 | 155 | Otherwise: 156 | 157 | ``` bash 158 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device cpu 159 | ``` 160 | 161 | ### RGB Only 162 | 163 | The second variant of CuTR expects an RGB image alone and attempts to derive the metric scale of 164 | the scene from the image itself. 165 | 166 | #### Sample Command 167 | 168 | If your device is MPS enabled: 169 | 170 | ``` bash 171 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgb.pth --viz-on-gt-points --device mps 172 | ``` 173 | 174 | ## Run on captures from your own device 175 | 176 | We also have basic support for running on RGB/Depth captured from your own device. 177 | 178 | 1. Make sure you have [NeRF Capture](https://apps.apple.com/au/app/nerfcapture/id6446518379) installed on your device. 179 | 2. Start the NeRF Capture app *before* running `demo.py` (force quit and reopen if for some reason things stop working or a connection is not made). 180 | 3. Run the normal commands but pass "stream" instead of the usual tar/folder path. 181 | 4. Hit "Send" in the app to send a frame for inference. This will be visualized in the rerun window. 182 | 183 | We will continue to print "Still waiting" to show liveness. 184 | 185 | If you have a device equipped with LiDAR, you can use this combined with the RGB-D models; otherwise, you can 186 | only use the RGB-only model.
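The streaming commands below pin `--device mps`; the same `--device cuda` / `--device cpu` variants shown earlier apply here as well. If you end up wrapping `demo.py` in your own script, a small device picker built on standard PyTorch checks avoids hard-coding the flag (a sketch, not part of this repository):

```python
import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())  # e.g., pass this value to demo.py via --device
```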
187 | 188 | #### RGB-D (on MPS) 189 | 190 | ``` bash 191 | python tools/demo.py stream --model-path [path_to_models]/cutr_rgbd.pth --device mps 192 | ``` 193 | 194 | #### RGB (on MPS) 195 | 196 | ``` bash 197 | python tools/demo.py stream --model-path [path_to_models]/cutr_rgb.pth --device mps 198 | ``` 199 | 200 | ## Citation 201 | 202 | If you use CA-1M or CuTR in your research, please use the following entry: 203 | 204 | ``` 205 | @article{lazarow2024cubify, 206 | title={Cubify Anything: Scaling Indoor 3D Object Detection}, 207 | author={Lazarow, Justin and Griffiths, David and Kohavi, Gefen and Crespo, Francisco and Dehghan, Afshin}, 208 | journal={arXiv preprint arXiv:2412.04458}, 209 | year={2024} 210 | } 211 | ``` 212 | 213 | ## Licenses 214 | 215 | The sample code is released under [Apple Sample Code License](LICENSE). 216 | 217 | The data is released under [CC-by-NC-ND](data/LICENSE_DATA). 218 | 219 | The models are released under [Apple ML Research Model Terms of Use](LICENSE_MODEL). 220 | 221 | ## Acknowledgements 222 | 223 | We use and acknowledge contributions from multiple open-source projects in [ACKNOWLEDGEMENTS](ACKNOWLEDGEMENTS). 224 | -------------------------------------------------------------------------------- /cubifyanything/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apple/ml-cubifyanything/7419eb0cb9b19cb5257b4a1dc905476c155cd343/cubifyanything/__init__.py -------------------------------------------------------------------------------- /cubifyanything/batching.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import copy 5 | import torch 6 | 7 | from typing import Any, Dict, Generic, List, Optional, Tuple, TypeVar, Union 8 | from typing_extensions import TypeAlias 9 | 10 | from cubifyanything.measurement import ( 11 | MeasurementInfo, 12 | DepthMeasurementInfo, 13 | ImageMeasurementInfo) 14 | 15 | from cubifyanything.sensor import ( 16 | SensorInfo, 17 | PosedSensorInfo) 18 | 19 | from cubifyanything.imagelist import ImageList 20 | from cubifyanything.instances import Instances3D 21 | 22 | T = TypeVar("T") 23 | I = TypeVar("I", bound=MeasurementInfo) 24 | S = TypeVar("S", bound=SensorInfo) 25 | 26 | class Measurement(Generic[T, I, S]): 27 | def __init__(self, data: T, info: I, sensor: S): 28 | self.data = data 29 | self.info = info 30 | self.sensor = sensor 31 | 32 | # This is painful, but stems from lack of multiple dispatch. 
33 | @classmethod 34 | def batch(cls, args: List["Measurement"], **kwargs) -> "BatchedMeasurement": 35 | if isinstance(args[0].info, (DepthMeasurementInfo,)): 36 | return BatchedPosedDepth( 37 | ImageList.from_tensors( 38 | [a.data for a in args], 39 | **kwargs), 40 | [a.info for a in args], 41 | [a.sensor for a in args]) 42 | elif isinstance(args[0].info, (ImageMeasurementInfo,)): 43 | return BatchedPosedImage( 44 | ImageList.from_tensors( 45 | [a.data for a in args], 46 | **kwargs), 47 | [a.info for a in args], 48 | [a.sensor for a in args]) 49 | else: 50 | raise NotImplementedError 51 | 52 | def to(self, *args: Any, **kwargs: Any) -> "Measurement": 53 | return self.__orig_class__( 54 | self.data.to(*args, **kwargs), 55 | self.info.to(*args, **kwargs), 56 | self.sensor.to(*args, **kwargs)) 57 | 58 | class BatchedMeasurement(Generic[T, I, S]): 59 | def __init__(self, data: T, info: List[I], sensor: List[S]): 60 | self.data = data 61 | self.info = info 62 | self.sensor = sensor 63 | 64 | @property 65 | def padding(self) -> int: 66 | raise NotImplementedError 67 | 68 | def __getitem__(self, index): 69 | # TODO: Also give data back (sliced). 70 | return self.__orig_class__( 71 | data=self.data if isinstance(self.data, ImageList) else self.data[index], 72 | info=self.info[index], 73 | sensor=self.sensor[index]) 74 | 75 | # For now, only shallow copy sensor itself (since has recursive references). 76 | def clone(self): 77 | return self.__orig_class__( 78 | [data_.clone() if hasattr(data_, "clone") else copy.copy(data_) for data_ in self.data], 79 | [info_.clone() for info_ in self.info], 80 | copy.copy(self.sensor)) 81 | 82 | PosedImage: TypeAlias = Measurement[torch.Tensor, ImageMeasurementInfo, PosedSensorInfo] 83 | PosedDepth: TypeAlias = Measurement[torch.Tensor, DepthMeasurementInfo, PosedSensorInfo] 84 | 85 | BatchedPosedImage: TypeAlias = BatchedMeasurement[ImageList, ImageMeasurementInfo, PosedSensorInfo] 86 | BatchedPosedDepth: TypeAlias = BatchedMeasurement[ImageList, DepthMeasurementInfo, PosedSensorInfo] 87 | 88 | Sensors: TypeAlias = Dict[str, Dict[str, Measurement]] 89 | BatchedSensors: TypeAlias = Dict[str, Dict[str, BatchedMeasurement]] 90 | BatchedPosedSensor: TypeAlias = Dict[str, Union[BatchedPosedImage, BatchedPosedDepth]] 91 | -------------------------------------------------------------------------------- /cubifyanything/boxes.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import numpy as np 5 | import torch 6 | import warnings 7 | 8 | from abc import abstractmethod 9 | from scipy.spatial.transform import Rotation 10 | from torch import Tensor 11 | from typing import Iterator, Optional, Sequence, Tuple, Union 12 | 13 | 14 | from enum import Enum 15 | class BoxDOF(Enum): 16 | All = 1 17 | GravityAligned = 2 18 | 19 | # Based on MMDet3D. 20 | def rotation_3d_in_axis( 21 | points: Union[np.ndarray, Tensor], 22 | angles: Union[np.ndarray, Tensor, float], 23 | axis: int = 0, 24 | return_mat: bool = False, 25 | clockwise: bool = False 26 | ) -> Union[Tuple[np.ndarray, np.ndarray], Tuple[Tensor, Tensor], np.ndarray, 27 | Tensor]: 28 | """Rotate points by angles according to axis. 29 | 30 | Args: 31 | points (np.ndarray or Tensor): Points with shape (N, M, 3). 32 | angles (np.ndarray or Tensor or float): Vector of angles with shape 33 | (N, ). 34 | axis (int): The axis to be rotated. Defaults to 0. 
35 | return_mat (bool): Whether or not to return the rotation matrix 36 | (transposed). Defaults to False. 37 | clockwise (bool): Whether the rotation is clockwise. Defaults to False. 38 | 39 | Raises: 40 | ValueError: When the axis is not in range [-3, -2, -1, 0, 1, 2], it 41 | will raise ValueError. 42 | 43 | Returns: 44 | Tuple[np.ndarray, np.ndarray] or Tuple[Tensor, Tensor] or np.ndarray or 45 | Tensor: Rotated points with shape (N, M, 3) and rotation matrix with 46 | shape (N, 3, 3). 47 | """ 48 | batch_free = len(points.shape) == 2 49 | if batch_free: 50 | points = points[None] 51 | 52 | if isinstance(angles, float) or len(angles.shape) == 0: 53 | angles = torch.full(points.shape[:1], angles) 54 | 55 | assert len(points.shape) == 3 and len(angles.shape) == 1 and \ 56 | points.shape[0] == angles.shape[0], 'Incorrect shape of points ' \ 57 | f'angles: {points.shape}, {angles.shape}' 58 | 59 | assert points.shape[-1] in [2, 3], \ 60 | f'Points size should be 2 or 3 instead of {points.shape[-1]}' 61 | 62 | rot_sin = torch.sin(angles) 63 | rot_cos = torch.cos(angles) 64 | ones = torch.ones_like(rot_cos) 65 | zeros = torch.zeros_like(rot_cos) 66 | 67 | if points.shape[-1] == 3: 68 | if axis == 1 or axis == -2: 69 | rot_mat_T = torch.stack([ 70 | torch.stack([rot_cos, zeros, -rot_sin]), 71 | torch.stack([zeros, ones, zeros]), 72 | torch.stack([rot_sin, zeros, rot_cos]) 73 | ]) 74 | elif axis == 2 or axis == -1: 75 | rot_mat_T = torch.stack([ 76 | torch.stack([rot_cos, rot_sin, zeros]), 77 | torch.stack([-rot_sin, rot_cos, zeros]), 78 | torch.stack([zeros, zeros, ones]) 79 | ]) 80 | elif axis == 0 or axis == -3: 81 | rot_mat_T = torch.stack([ 82 | torch.stack([ones, zeros, zeros]), 83 | torch.stack([zeros, rot_cos, rot_sin]), 84 | torch.stack([zeros, -rot_sin, rot_cos]) 85 | ]) 86 | else: 87 | raise ValueError( 88 | f'axis should in range [-3, -2, -1, 0, 1, 2], got {axis}') 89 | else: 90 | rot_mat_T = torch.stack([ 91 | torch.stack([rot_cos, rot_sin]), 92 | torch.stack([-rot_sin, rot_cos]) 93 | ]) 94 | 95 | if clockwise: 96 | rot_mat_T = rot_mat_T.transpose(0, 1) 97 | 98 | if points.shape[0] == 0: 99 | points_new = points 100 | else: 101 | points_new = torch.einsum('aij,jka->aik', points, rot_mat_T) 102 | 103 | if batch_free: 104 | points_new = points_new.squeeze(0) 105 | 106 | if return_mat: 107 | rot_mat_T = torch.einsum('jka->ajk', rot_mat_T) 108 | if batch_free: 109 | rot_mat_T = rot_mat_T.squeeze(0) 110 | return points_new, rot_mat_T 111 | else: 112 | return points_new 113 | 114 | # from MMDet3D. 115 | class BaseInstance3DBoxes: 116 | """Base class for 3D Boxes. 117 | 118 | Note: 119 | The box is bottom centered, i.e. the relative position of origin in the 120 | box is (0.5, 0.5, 0). 121 | 122 | Args: 123 | tensor (Tensor or np.ndarray or Sequence[Sequence[float]]): The boxes 124 | data with shape (N, box_dim). 125 | box_dim (int): Number of the dimension of a box. Each row is 126 | (x, y, z, x_size, y_size, z_size, yaw). Defaults to 7. 127 | with_yaw (bool): Whether the box is with yaw rotation. If False, the 128 | value of yaw will be set to 0 as minmax boxes. Defaults to True. 129 | origin (Tuple[float]): Relative position of the box origin. 130 | Defaults to (0.5, 0.5, 0). This will guide the box be converted to 131 | (0.5, 0.5, 0) mode. 132 | 133 | Attributes: 134 | tensor (Tensor): Float matrix with shape (N, box_dim). 135 | box_dim (int): Integer indicating the dimension of a box. Each row is 136 | (x, y, z, x_size, y_size, z_size, yaw, ...). 
137 | with_yaw (bool): If True, the value of yaw will be set to 0 as minmax 138 | boxes. 139 | """ 140 | 141 | YAW_AXIS: int = 0 142 | 143 | def __init__( 144 | self, 145 | tensor: Union[Tensor, np.ndarray, Sequence[Sequence[float]]], 146 | box_dim: int = 7, 147 | with_yaw: bool = True, 148 | origin: Tuple[float, float, float] = (0.5, 0.5, 0) 149 | ) -> None: 150 | if isinstance(tensor, Tensor): 151 | device = tensor.device 152 | else: 153 | device = torch.device('cpu') 154 | tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device) 155 | if tensor.numel() == 0: 156 | # Use reshape, so we don't end up creating a new tensor that does 157 | # not depend on the inputs (and consequently confuses jit) 158 | tensor = tensor.reshape((-1, box_dim)) 159 | assert tensor.dim() == 2 and tensor.size(-1) == box_dim, \ 160 | ('The box dimension must be 2 and the length of the last ' 161 | f'dimension must be {box_dim}, but got boxes with shape ' 162 | f'{tensor.shape}.') 163 | 164 | if tensor.shape[-1] == 6: 165 | # If the dimension of boxes is 6, we expand box_dim by padding 0 as 166 | # a fake yaw and set with_yaw to False 167 | assert box_dim == 6 168 | fake_rot = tensor.new_zeros(tensor.shape[0], 1) 169 | tensor = torch.cat((tensor, fake_rot), dim=-1) 170 | self.box_dim = box_dim + 1 171 | self.with_yaw = False 172 | else: 173 | self.box_dim = box_dim 174 | self.with_yaw = with_yaw 175 | self.tensor = tensor.clone() 176 | 177 | if origin != (0.5, 0.5, 0): 178 | dst = self.tensor.new_tensor((0.5, 0.5, 0)) 179 | src = self.tensor.new_tensor(origin) 180 | self.tensor[:, :3] += self.tensor[:, 3:6] * (dst - src) 181 | 182 | @property 183 | def shape(self) -> torch.Size: 184 | """torch.Size: Shape of boxes.""" 185 | return self.tensor.shape 186 | 187 | @property 188 | def volume(self) -> Tensor: 189 | """Tensor: A vector with volume of each box in shape (N, ).""" 190 | return self.tensor[:, 3] * self.tensor[:, 4] * self.tensor[:, 5] 191 | 192 | @property 193 | def dims(self) -> Tensor: 194 | """Tensor: Size dimensions of each box in shape (N, 3).""" 195 | return self.tensor[:, 3:6] 196 | 197 | @property 198 | def yaw(self) -> Tensor: 199 | """Tensor: A vector with yaw of each box in shape (N, ).""" 200 | return self.tensor[:, 6] 201 | 202 | @property 203 | def height(self) -> Tensor: 204 | """Tensor: A vector with height of each box in shape (N, ).""" 205 | return self.tensor[:, 5] 206 | 207 | @property 208 | def top_height(self) -> Tensor: 209 | """Tensor: A vector with top height of each box in shape (N, ).""" 210 | return self.bottom_height + self.height 211 | 212 | @property 213 | def bottom_height(self) -> Tensor: 214 | """Tensor: A vector with bottom height of each box in shape (N, ).""" 215 | return self.tensor[:, 2] 216 | 217 | @property 218 | def center(self) -> Tensor: 219 | """Calculate the center of all the boxes. 220 | 221 | Note: 222 | In MMDetection3D's convention, the bottom center is usually taken 223 | as the default center. 224 | 225 | The relative position of the centers in different kinds of boxes 226 | are different, e.g., the relative center of a boxes is 227 | (0.5, 1.0, 0.5) in camera and (0.5, 0.5, 0) in lidar. It is 228 | recommended to use ``bottom_center`` or ``gravity_center`` for 229 | clearer usage. 230 | 231 | Returns: 232 | Tensor: A tensor with center of each box in shape (N, 3). 
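        Example (an illustrative sketch; ``DepthInstance3DBoxes`` is defined later in
        this module and uses the default (0.5, 0.5, 0) origin):

            >>> boxes = DepthInstance3DBoxes(torch.tensor([[0., 0., 0., 1., 1., 2., 0.]]))
            >>> boxes.bottom_center   # tensor([[0., 0., 0.]])
            >>> boxes.gravity_center  # tensor([[0., 0., 1.]]) -- bottom z plus half the height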
233 | """ 234 | return self.bottom_center 235 | 236 | @property 237 | def bottom_center(self) -> Tensor: 238 | """Tensor: A tensor with center of each box in shape (N, 3).""" 239 | return self.tensor[:, :3] 240 | 241 | @property 242 | def gravity_center(self) -> Tensor: 243 | """Tensor: A tensor with center of each box in shape (N, 3).""" 244 | bottom_center = self.bottom_center 245 | gravity_center = torch.zeros_like(bottom_center) 246 | gravity_center[:, :2] = bottom_center[:, :2] 247 | gravity_center[:, 2] = bottom_center[:, 2] + self.tensor[:, 5] * 0.5 248 | return gravity_center 249 | 250 | @property 251 | def corners(self) -> Tensor: 252 | """Tensor: A tensor with 8 corners of each box in shape (N, 8, 3).""" 253 | pass 254 | 255 | @abstractmethod 256 | def rotate( 257 | self, 258 | angle: Union[Tensor, np.ndarray, float], 259 | points: Optional[Union[Tensor, np.ndarray]] = None 260 | ) -> Union[Tuple[Tensor, Tensor], Tuple[np.ndarray, np.ndarray], Tuple[ 261 | Tensor], None]: 262 | """Rotate boxes with points (optional) with the given angle or rotation 263 | matrix. 264 | 265 | Args: 266 | angle (Tensor or np.ndarray or float): Rotation angle or rotation 267 | matrix. 268 | points (Tensor or np.ndarray or :obj:`BasePoints`, optional): 269 | Points to rotate. Defaults to None. 270 | 271 | Returns: 272 | tuple or None: When ``points`` is None, the function returns None, 273 | otherwise it returns the rotated points and the rotation matrix 274 | ``rot_mat_T``. 275 | """ 276 | pass 277 | 278 | def translate(self, trans_vector: Union[Tensor, np.ndarray]) -> None: 279 | """Translate boxes with the given translation vector. 280 | 281 | Args: 282 | trans_vector (Tensor or np.ndarray): Translation vector of size 283 | 1x3. 284 | """ 285 | if not isinstance(trans_vector, Tensor): 286 | trans_vector = self.tensor.new_tensor(trans_vector) 287 | 288 | self.tensor[:, :3] += trans_vector 289 | 290 | return self 291 | 292 | def in_range_3d( 293 | self, box_range: Union[Tensor, np.ndarray, 294 | Sequence[float]]) -> Tensor: 295 | """Check whether the boxes are in the given range. 296 | 297 | Args: 298 | box_range (Tensor or np.ndarray or Sequence[float]): The range of 299 | box (x_min, y_min, z_min, x_max, y_max, z_max). 300 | 301 | Note: 302 | In the original implementation of SECOND, checking whether a box in 303 | the range checks whether the points are in a convex polygon, we try 304 | to reduce the burden for simpler cases. 305 | 306 | Returns: 307 | Tensor: A binary vector indicating whether each point is inside the 308 | reference range. 309 | """ 310 | in_range_flags = ((self.tensor[:, 0] > box_range[0]) 311 | & (self.tensor[:, 1] > box_range[1]) 312 | & (self.tensor[:, 2] > box_range[2]) 313 | & (self.tensor[:, 0] < box_range[3]) 314 | & (self.tensor[:, 1] < box_range[4]) 315 | & (self.tensor[:, 2] < box_range[5])) 316 | return in_range_flags 317 | 318 | @abstractmethod 319 | def convert_to(self, 320 | dst: int, 321 | rt_mat: Optional[Union[Tensor, np.ndarray]] = None, 322 | correct_yaw: bool = False) -> 'BaseInstance3DBoxes': 323 | """Convert self to ``dst`` mode. 324 | 325 | Args: 326 | dst (int): The target Box mode. 327 | rt_mat (Tensor or np.ndarray, optional): The rotation and 328 | translation matrix between different coordinates. 329 | Defaults to None. The conversion from ``src`` coordinates to 330 | ``dst`` coordinates usually comes along the change of sensors, 331 | e.g., from camera to LiDAR. This requires a transformation 332 | matrix. 
333 | correct_yaw (bool): Whether to convert the yaw angle to the target 334 | coordinate. Defaults to False. 335 | 336 | Returns: 337 | :obj:`BaseInstance3DBoxes`: The converted box of the same type in 338 | the ``dst`` mode. 339 | """ 340 | pass 341 | 342 | def scale(self, scale_factor: float) -> None: 343 | """Scale the box with horizontal and vertical scaling factors. 344 | 345 | Args: 346 | scale_factors (float): Scale factors to scale the boxes. 347 | """ 348 | self.tensor[:, :6] *= scale_factor 349 | self.tensor[:, 7:] *= scale_factor # velocity 350 | 351 | def nonempty(self, threshold: float = 0.0) -> Tensor: 352 | """Find boxes that are non-empty. 353 | 354 | A box is considered empty if either of its side is no larger than 355 | threshold. 356 | 357 | Args: 358 | threshold (float): The threshold of minimal sizes. Defaults to 0.0. 359 | 360 | Returns: 361 | Tensor: A binary vector which represents whether each box is empty 362 | (False) or non-empty (True). 363 | """ 364 | box = self.tensor 365 | size_x = box[..., 3] 366 | size_y = box[..., 4] 367 | size_z = box[..., 5] 368 | keep = ((size_x > threshold) 369 | & (size_y > threshold) & (size_z > threshold)) 370 | return keep 371 | 372 | def __getitem__( 373 | self, item: Union[int, slice, np.ndarray, 374 | Tensor]) -> 'BaseInstance3DBoxes': 375 | """ 376 | Args: 377 | item (int or slice or np.ndarray or Tensor): Index of boxes. 378 | 379 | Note: 380 | The following usage are allowed: 381 | 382 | 1. `new_boxes = boxes[3]`: Return a `Boxes` that contains only one 383 | box. 384 | 2. `new_boxes = boxes[2:10]`: Return a slice of boxes. 385 | 3. `new_boxes = boxes[vector]`: Where vector is a 386 | torch.BoolTensor with `length = len(boxes)`. Nonzero elements in 387 | the vector will be selected. 388 | 389 | Note that the returned Boxes might share storage with this Boxes, 390 | subject to PyTorch's indexing semantics. 391 | 392 | Returns: 393 | :obj:`BaseInstance3DBoxes`: A new object of 394 | :class:`BaseInstance3DBoxes` after indexing. 395 | """ 396 | original_type = type(self) 397 | if isinstance(item, int): 398 | return original_type( 399 | self.tensor[item].view(1, -1), 400 | box_dim=self.box_dim, 401 | with_yaw=self.with_yaw) 402 | b = self.tensor[item] 403 | assert b.dim() == 2, \ 404 | f'Indexing on Boxes with {item} failed to return a matrix!' 405 | return original_type(b, box_dim=self.box_dim, with_yaw=self.with_yaw) 406 | 407 | def __len__(self) -> int: 408 | """int: Number of boxes in the current object.""" 409 | return self.tensor.shape[0] 410 | 411 | def __repr__(self) -> str: 412 | """str: Return a string that describes the object.""" 413 | return self.__class__.__name__ + '(\n ' + str(self.tensor) + ')' 414 | 415 | def clone(self) -> 'BaseInstance3DBoxes': 416 | """Clone the boxes. 417 | 418 | Returns: 419 | :obj:`BaseInstance3DBoxes`: Box object with the same properties as 420 | self. 421 | """ 422 | original_type = type(self) 423 | return original_type( 424 | self.tensor.clone(), box_dim=self.box_dim, with_yaw=self.with_yaw) 425 | 426 | @classmethod 427 | def cat(cls, boxes_list: Sequence['BaseInstance3DBoxes'] 428 | ) -> 'BaseInstance3DBoxes': 429 | """Concatenate a list of Boxes into a single Boxes. 430 | 431 | Args: 432 | boxes_list (Sequence[:obj:`BaseInstance3DBoxes`]): List of boxes. 433 | 434 | Returns: 435 | :obj:`BaseInstance3DBoxes`: The concatenated boxes. 
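        Example (an illustrative sketch)::

            >>> a = DepthInstance3DBoxes(torch.rand(2, 7))
            >>> b = DepthInstance3DBoxes(torch.rand(3, 7))
            >>> merged = DepthInstance3DBoxes.cat([a, b])
            >>> len(merged)  # 5; storage is never shared with the inputs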
436 | """ 437 | assert isinstance(boxes_list, (list, tuple)) 438 | if len(boxes_list) == 0: 439 | return cls(torch.empty(0)) 440 | assert all(isinstance(box, cls) for box in boxes_list) 441 | 442 | # use torch.cat (v.s. layers.cat) 443 | # so the returned boxes never share storage with input 444 | cat_boxes = cls( 445 | torch.cat([b.tensor for b in boxes_list], dim=0), 446 | box_dim=boxes_list[0].box_dim, 447 | with_yaw=boxes_list[0].with_yaw) 448 | return cat_boxes 449 | 450 | @property 451 | def bev(self) -> Tensor: 452 | """Tensor: 2D BEV box of each box with rotation in XYWHR format, in 453 | shape (N, 5).""" 454 | return self.tensor[:, [0, 1, 3, 4, 6]] 455 | 456 | def new_box( 457 | self, data: Union[Tensor, np.ndarray, Sequence[Sequence[float]]] 458 | ) -> 'BaseInstance3DBoxes': 459 | """Create a new box object with data. 460 | 461 | The new box and its tensor has the similar properties as self and 462 | self.tensor, respectively. 463 | 464 | Args: 465 | data (Tensor or np.ndarray or Sequence[Sequence[float]]): Data to 466 | be copied. 467 | 468 | Returns: 469 | :obj:`BaseInstance3DBoxes`: A new bbox object with ``data``, the 470 | object's other properties are similar to ``self``. 471 | """ 472 | new_tensor = self.tensor.new_tensor(data) \ 473 | if not isinstance(data, Tensor) else data.to(self.device) 474 | original_type = type(self) 475 | return original_type( 476 | new_tensor, box_dim=self.box_dim, with_yaw=self.with_yaw) 477 | 478 | def numpy(self) -> np.ndarray: 479 | """Reload ``numpy`` from self.tensor.""" 480 | return self.tensor.numpy() 481 | 482 | def to(self, device: Union[str, torch.device], *args, 483 | **kwargs) -> 'BaseInstance3DBoxes': 484 | """Convert current boxes to a specific device. 485 | 486 | Args: 487 | device (str or :obj:`torch.device`): The name of the device. 488 | 489 | Returns: 490 | :obj:`BaseInstance3DBoxes`: A new boxes object on the specific 491 | device. 492 | """ 493 | original_type = type(self) 494 | return original_type( 495 | self.tensor.to(device, *args, **kwargs), 496 | box_dim=self.box_dim, 497 | with_yaw=self.with_yaw) 498 | 499 | @property 500 | def device(self) -> torch.device: 501 | """torch.device: The device of the boxes are on.""" 502 | return self.tensor.device 503 | 504 | def __iter__(self) -> Iterator[Tensor]: 505 | """Yield a box as a Tensor at a time. 506 | 507 | Returns: 508 | Iterator[Tensor]: A box of shape (box_dim, ). 
509 | """ 510 | yield from self.tensor 511 | 512 | class DepthInstance3DBoxes(BaseInstance3DBoxes): 513 | YAW_AXIS = 2 514 | 515 | @property 516 | def gravity_center(self): 517 | """torch.Tensor: A tensor with center of each box in shape (N, 3).""" 518 | bottom_center = self.bottom_center 519 | gravity_center = torch.zeros_like(bottom_center) 520 | gravity_center[:, :2] = bottom_center[:, :2] 521 | gravity_center[:, 2] = bottom_center[:, 2] + self.tensor[:, 5] * 0.5 522 | return gravity_center 523 | 524 | @property 525 | def corners(self): 526 | if self.tensor.numel() == 0: 527 | return torch.empty([0, 8, 3], device=self.tensor.device) 528 | 529 | dims = self.dims 530 | corners_norm = torch.from_numpy( 531 | np.stack(np.unravel_index(np.arange(8), [2] * 3), axis=1)).to( 532 | device=dims.device, dtype=dims.dtype) 533 | 534 | corners_norm = corners_norm[[0, 1, 3, 2, 4, 5, 7, 6]] 535 | # use relative origin (0.5, 0.5, 0) 536 | corners_norm = corners_norm - dims.new_tensor([0.5, 0.5, 0]) 537 | corners = dims.view([-1, 1, 3]) * corners_norm.reshape([1, 8, 3]) 538 | 539 | # rotate around z axis 540 | corners = rotation_3d_in_axis( 541 | corners, self.tensor[:, 6], axis=self.YAW_AXIS) 542 | corners += self.tensor[:, :3].view(-1, 1, 3) 543 | return corners 544 | 545 | def rotate(self, angle): 546 | """Rotate boxes 547 | 548 | Args: 549 | angle (float | torch.Tensor | np.ndarray): 550 | Rotation angle or rotation matrix. 551 | points (torch.Tensor | np.ndarray | :obj:`BasePoints`, optional): 552 | Points to rotate. Defaults to None. 553 | 554 | Returns: 555 | tuple or None: When ``points`` is None, the function returns 556 | None, otherwise it returns the rotated points and the 557 | rotation matrix ``rot_mat_T``. 558 | """ 559 | if not isinstance(angle, torch.Tensor): 560 | angle = self.tensor.new_tensor(angle) 561 | 562 | assert angle.shape == torch.Size([3, 3]) or angle.numel() == 1, \ 563 | f'invalid rotation angle shape {angle.shape}' 564 | 565 | if angle.numel() == 1: 566 | self.tensor[:, 0:3], rot_mat_T = rotation_3d_in_axis( 567 | self.tensor[:, 0:3], 568 | angle, 569 | axis=self.YAW_AXIS, 570 | return_mat=True) 571 | else: 572 | rot_mat_T = angle 573 | rot_sin = rot_mat_T[0, 1] 574 | rot_cos = rot_mat_T[0, 0] 575 | angle = torch.arctan2(rot_sin, rot_cos) 576 | self.tensor[:, 0:3] = self.tensor[:, 0:3] @ rot_mat_T 577 | 578 | if self.with_yaw: 579 | self.tensor[:, 6] += angle 580 | else: 581 | # for axis-aligned boxes, we take the new 582 | # enclosing axis-aligned boxes after rotation 583 | corners_rot = self.corners @ rot_mat_T 584 | new_x_size = corners_rot[..., 0].max( 585 | dim=1, keepdim=True)[0] - corners_rot[..., 0].min( 586 | dim=1, keepdim=True)[0] 587 | new_y_size = corners_rot[..., 1].max( 588 | dim=1, keepdim=True)[0] - corners_rot[..., 1].min( 589 | dim=1, keepdim=True)[0] 590 | self.tensor[:, 3:5] = torch.cat((new_x_size, new_y_size), dim=-1) 591 | 592 | # I've modified this to remove point support and return self so this can be chained (usually you want a clone() first). 593 | return self 594 | 595 | def flip(self, bev_direction='horizontal', points=None): 596 | """Flip the boxes in BEV along given BEV direction. 597 | 598 | In Depth coordinates, it flips x (horizontal) or y (vertical) axis. 599 | 600 | Args: 601 | bev_direction (str, optional): Flip direction 602 | (horizontal or vertical). Defaults to 'horizontal'. 603 | points (torch.Tensor | np.ndarray | :obj:`BasePoints`, optional): 604 | Points to flip. Defaults to None. 
605 | 606 | Returns: 607 | torch.Tensor, numpy.ndarray or None: Flipped points. 608 | """ 609 | assert bev_direction in ('horizontal', 'vertical') 610 | if bev_direction == 'horizontal': 611 | self.tensor[:, 0::7] = -self.tensor[:, 0::7] 612 | if self.with_yaw: 613 | self.tensor[:, 6] = -self.tensor[:, 6] + np.pi 614 | elif bev_direction == 'vertical': 615 | self.tensor[:, 1::7] = -self.tensor[:, 1::7] 616 | if self.with_yaw: 617 | self.tensor[:, 6] = -self.tensor[:, 6] 618 | 619 | if points is not None: 620 | assert isinstance(points, (torch.Tensor, np.ndarray, BasePoints)) 621 | if isinstance(points, (torch.Tensor, np.ndarray)): 622 | if bev_direction == 'horizontal': 623 | points[:, 0] = -points[:, 0] 624 | elif bev_direction == 'vertical': 625 | points[:, 1] = -points[:, 1] 626 | elif isinstance(points, BasePoints): 627 | points.flip(bev_direction) 628 | return points 629 | 630 | def enlarged_box(self, extra_width): 631 | """Enlarge the length, width and height boxes. 632 | Args: 633 | extra_width (float | torch.Tensor): Extra width to enlarge the box. 634 | Returns: 635 | :obj:`DepthInstance3DBoxes`: Enlarged boxes. 636 | """ 637 | enlarged_boxes = self.tensor.clone() 638 | enlarged_boxes[:, 3:6] += extra_width * 2 639 | # bottom center z minus extra_width 640 | if isinstance(extra_width, torch.Tensor) and (extra_width.shape[-1] == 3): 641 | enlarged_boxes[:, 2] -= extra_width[:, 2] 642 | else: 643 | enlarged_boxes[:, 2] -= extra_width 644 | 645 | return self.new_box(enlarged_boxes) 646 | 647 | def to_camera(self, RT) -> 'GeneralInstance3DBoxes': 648 | # corners -> expected permutation. 649 | corners = self.corners[:, [1, 5, 4, 0, 2, 6, 7, 3]] 650 | corners = torch.cat((corners, torch.ones_like(corners[..., :1])), dim=-1) 651 | corners = torch.linalg.inv(RT) @ corners.permute(0, 2, 1) 652 | corners = corners[:, :3].permute(0, 2, 1) 653 | 654 | return corners_to_camera(corners) 655 | 656 | class GeneralInstance3DBoxes(object): 657 | def __init__(self, xyzlhw, R, box_dim=6 + 3 * 3, origin=(0.5, 0.5, 0), dof=BoxDOF.All): 658 | if isinstance(xyzlhw, torch.Tensor): 659 | device = xyzlhw.device 660 | else: 661 | device = torch.device('cpu') 662 | 663 | xyzlhw = torch.as_tensor(xyzlhw, dtype=torch.float32, device=device) 664 | R = torch.as_tensor(R, dtype=torch.float32, device=device) 665 | 666 | self.dof = dof 667 | self.box_dim = box_dim 668 | self.tensor = xyzlhw.clone() 669 | self.R = R.clone() 670 | 671 | @classmethod 672 | def empty(cls, dof=BoxDOF.GravityAligned): 673 | return GeneralInstance3DBoxes( 674 | torch.zeros((0, 6)), 675 | torch.zeros((0, 3, 3)), 676 | dof=dof) 677 | 678 | @property 679 | def volume(self): 680 | """torch.Tensor: A vector with volume of each box.""" 681 | return self.tensor[:, 3] * self.tensor[:, 4] * self.tensor[:, 5] 682 | 683 | @property 684 | def dims(self): 685 | """torch.Tensor: Size dimensions of each box in shape (N, 3).""" 686 | return self.tensor[:, 3:6] 687 | 688 | @property 689 | def whl(self): 690 | return self.tensor[:, [5, 4, 3]] 691 | 692 | @property 693 | def xyzwhl(self): 694 | return self.tensor[:, [0, 1, 2, 5, 4, 3]] 695 | 696 | @property 697 | def center(self): 698 | """Calculate the center of all the boxes. 699 | 700 | Note: 701 | In MMDetection3D's convention, the bottom center is 702 | usually taken as the default center. 703 | 704 | The relative position of the centers in different kinds of 705 | boxes are different, e.g., the relative center of a boxes is 706 | (0.5, 1.0, 0.5) in camera and (0.5, 0.5, 0) in lidar. 
707 | It is recommended to use ``bottom_center`` or ``gravity_center`` 708 | for clearer usage. 709 | 710 | Returns: 711 | torch.Tensor: A tensor with center of each box in shape (N, 3). 712 | """ 713 | return self.gravity_center 714 | 715 | @property 716 | def bottom_center(self): 717 | """torch.Tensor: A tensor with center of each box in shape (N, 3).""" 718 | raise ValueError("not supported") 719 | 720 | @property 721 | def gravity_center(self): 722 | """torch.Tensor: A tensor with center of each box in shape (N, 3).""" 723 | return self.tensor[:, :3] 724 | 725 | @property 726 | def corners(self): 727 | """torch.Tensor: 728 | a tensor with 8 corners of each box in shape (N, 8, 3).""" 729 | x3d = self.tensor[:, 0].unsqueeze(1) 730 | y3d = self.tensor[:, 1].unsqueeze(1) 731 | z3d = self.tensor[:, 2].unsqueeze(1) 732 | w3d = self.tensor[:, 5].unsqueeze(1) 733 | h3d = self.tensor[:, 4].unsqueeze(1) 734 | l3d = self.tensor[:, 3].unsqueeze(1) 735 | 736 | ''' 737 | v4_____________________v5 738 | /| /| 739 | / | / | 740 | / | / | 741 | /___|_________________/ | 742 | v0| | |v1 | 743 | | | | | 744 | | | | | 745 | | | | | 746 | | |_________________|___| 747 | | / v7 | /v6 748 | | / | / 749 | | / | / 750 | |/_____________________|/ 751 | v3 v2 752 | ''' 753 | 754 | verts = torch.zeros([len(self), 3, 8], device=self.device) 755 | 756 | # setup X 757 | verts[:, 0, [0, 3, 4, 7]] = -l3d / 2 758 | verts[:, 0, [1, 2, 5, 6]] = l3d / 2 759 | 760 | # setup Y 761 | verts[:, 1, [0, 1, 4, 5]] = -h3d / 2 762 | verts[:, 1, [2, 3, 6, 7]] = h3d / 2 763 | 764 | # setup Z 765 | verts[:, 2, [0, 1, 2, 3]] = -w3d / 2 766 | verts[:, 2, [4, 5, 6, 7]] = w3d / 2 767 | 768 | # rotate 769 | verts = self.R @ verts 770 | 771 | # translate 772 | verts[:, 0, :] += x3d 773 | verts[:, 1, :] += y3d 774 | verts[:, 2, :] += z3d 775 | 776 | verts = verts.transpose(1, 2) 777 | return verts 778 | 779 | @property 780 | def bev(self): 781 | """torch.Tensor: 2D BEV box of each box with rotation 782 | in XYWHR format, in shape (N, 5).""" 783 | pass 784 | 785 | @property 786 | def nearest_bev(self): 787 | """torch.Tensor: A tensor of 2D BEV box of each box 788 | without rotation.""" 789 | pass 790 | 791 | def in_range_bev(self, box_range): 792 | """Check whether the boxes are in the given range. 793 | 794 | Args: 795 | box_range (list | torch.Tensor): the range of box 796 | (x_min, y_min, x_max, y_max) 797 | 798 | Note: 799 | The original implementation of SECOND checks whether boxes in 800 | a range by checking whether the points are in a convex 801 | polygon, we reduce the burden for simpler cases. 802 | 803 | Returns: 804 | torch.Tensor: Whether each box is inside the reference range. 805 | """ 806 | raise ValueError("not supported") 807 | 808 | @abstractmethod 809 | def rotate(self, angle, points=None): 810 | """Rotate boxes with points (optional) with the given angle or rotation 811 | matrix. 812 | 813 | Args: 814 | angle (float | torch.Tensor | np.ndarray): 815 | Rotation angle or rotation matrix. 816 | points (torch.Tensor | numpy.ndarray | 817 | :obj:`BasePoints`, optional): 818 | Points to rotate. Defaults to None. 819 | """ 820 | pass 821 | 822 | def translate(self, trans_vector): 823 | """Translate boxes with the given translation vector. 824 | 825 | Args: 826 | trans_vector (torch.Tensor): Translation vector of size (1, 3). 
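        Example (an illustrative sketch; a (3,) vector broadcasts over all N boxes)::

            >>> boxes.translate(torch.tensor([0.1, 0.0, 0.0]))  # shift every center by +0.1 along x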
827 | """ 828 | if not isinstance(trans_vector, torch.Tensor): 829 | trans_vector = self.tensor.new_tensor(trans_vector) 830 | self.tensor[:, :3] += trans_vector 831 | 832 | def __getitem__(self, item): 833 | original_type = type(self) 834 | if isinstance(item, int): 835 | return original_type( 836 | self.tensor[item].view(1, -1), 837 | self.R[item].view(1, 3, 3), 838 | dof=self.dof) 839 | 840 | b = self.tensor[item] 841 | r = self.R[item] 842 | assert b.dim() == 2, \ 843 | f'Indexing on Boxes with {item} failed to return a matrix!' 844 | return original_type(b, r, dof=self.dof) 845 | 846 | def __len__(self): 847 | """int: Number of boxes in the current object.""" 848 | return self.tensor.shape[0] 849 | 850 | def __repr__(self): 851 | """str: Return a strings that describes the object.""" 852 | return self.__class__.__name__ + '(\n ' + str(self.tensor) + ')' 853 | 854 | @classmethod 855 | def cat(cls, boxes_list): 856 | """Concatenate a list of Boxes into a single Boxes. 857 | 858 | Args: 859 | boxes_list (list[:obj:`BaseInstance3DBoxes`]): List of boxes. 860 | 861 | Returns: 862 | :obj:`BaseInstance3DBoxes`: The concatenated Boxes. 863 | """ 864 | assert isinstance(boxes_list, (list, tuple)) 865 | if len(boxes_list) == 0: 866 | return cls(torch.empty(0)) 867 | assert all(isinstance(box, cls) for box in boxes_list) 868 | 869 | first_dof = boxes_list[0].dof 870 | assert all(box.dof == first_dof for box in boxes_list) 871 | 872 | # use torch.cat (v.s. layers.cat) 873 | # so the returned boxes never share storage with input 874 | cat_boxes = cls( 875 | xyzlhw=torch.cat([b.tensor for b in boxes_list], dim=0), 876 | R=torch.cat([b.R for b in boxes_list], dim=0), 877 | dof=boxes_list[0].dof) 878 | 879 | return cat_boxes 880 | 881 | def split(self, split_size_or_sections): 882 | tensors = torch.split(self.tensor, split_size_or_sections) 883 | Rs = torch.split(self.R, split_size_or_sections) 884 | 885 | return [ 886 | type(self)( 887 | xyzlhw=tensor, 888 | R=R, 889 | dof=self.dof 890 | ) for tensor, R in zip(tensors, Rs) 891 | ] 892 | 893 | def to(self, device): 894 | """Convert current boxes to a specific device. 895 | 896 | Args: 897 | device (str | :obj:`torch.device`): The name of the device. 898 | 899 | Returns: 900 | :obj:`BaseInstance3DBoxes`: A new boxes object on the 901 | specific device. 902 | """ 903 | return GeneralInstance3DBoxes( 904 | self.tensor.to(device), 905 | R=self.R.to(device), 906 | dof=self.dof) 907 | 908 | def clone(self) -> 'GeneralInstance3DBoxes': 909 | """Clone the boxes. 910 | 911 | Returns: 912 | :obj:`GeneralInstance3DBoxes`: Box object with the same properties as 913 | self. 914 | """ 915 | original_type = type(self) 916 | return original_type( 917 | self.tensor.clone(), self.R.clone(), dof=self.dof) 918 | 919 | @property 920 | def device(self): 921 | """str: The device of the boxes are on.""" 922 | return self.tensor.device 923 | 924 | def __iter__(self): 925 | """Yield a box as a Tensor of shape (4,) at a time. 926 | 927 | Returns: 928 | torch.Tensor: A box of shape (4,). 
929 | """ 930 | yield from self.tensor 931 | -------------------------------------------------------------------------------- /cubifyanything/capture_stream.py: -------------------------------------------------------------------------------- 1 | """ 2 | Dataset to stream RGB-D data from the NeRFCapture iOS App -> Cubify Transformer 3 | 4 | Adapted from SplaTaM: https://github.com/spla-tam/SplaTAM 5 | """ 6 | 7 | import numpy as np 8 | import time 9 | import torch 10 | 11 | import cyclonedds.idl as idl 12 | import cyclonedds.idl.annotations as annotate 13 | import cyclonedds.idl.types as types 14 | 15 | from dataclasses import dataclass 16 | from cyclonedds.domain import DomainParticipant, Domain 17 | from cyclonedds.core import Qos, Policy 18 | from cyclonedds.sub import DataReader 19 | from cyclonedds.topic import Topic 20 | from cyclonedds.util import duration 21 | 22 | from PIL import Image 23 | from scipy.spatial.transform import Rotation 24 | from torch.utils.data import IterableDataset 25 | 26 | from cubifyanything.boxes import DepthInstance3DBoxes 27 | from cubifyanything.measurement import ImageMeasurementInfo, DepthMeasurementInfo 28 | from cubifyanything.orientation import ImageOrientation, rotate_tensor, ROT_Z 29 | from cubifyanything.sensor import SensorArrayInfo, SensorInfo, PosedSensorInfo 30 | 31 | # DDS 32 | # ================================================================================================== 33 | @dataclass 34 | @annotate.final 35 | @annotate.autoid("sequential") 36 | class CaptureFrame(idl.IdlStruct, typename="CaptureData.CaptureFrame"): 37 | id: types.uint32 38 | annotate.key("id") 39 | timestamp: types.float64 40 | fl_x: types.float32 41 | fl_y: types.float32 42 | cx: types.float32 43 | cy: types.float32 44 | transform_matrix: types.array[types.float32, 16] 45 | width: types.uint32 46 | height: types.uint32 47 | image: types.sequence[types.uint8] 48 | has_depth: bool 49 | depth_width: types.uint32 50 | depth_height: types.uint32 51 | depth_scale: types.float32 52 | depth_image: types.sequence[types.uint8] 53 | 54 | # 8 MB seems to work for me, but not 10 MB. 55 | dds_config = """ \ 56 | \ 57 | \ 58 | \ 59 | 8MB \ 60 | \ 61 | \ 62 | config \ 63 | stdout \ 64 | \ 65 | \ 66 | \ 67 | """ 68 | 69 | T_RW_to_VW = np.array([[0, 0, -1, 0], 70 | [-1, 0, 0, 0], 71 | [0, 1, 0, 0], 72 | [ 0, 0, 0, 1]]).reshape((4,4)).astype(np.float32) 73 | 74 | T_RC_to_VC = np.array([[1, 0, 0, 0], 75 | [0, -1, 0, 0], 76 | [0, 0, -1, 0], 77 | [0, 0, 0, 1]]).reshape((4,4)).astype(np.float32) 78 | 79 | T_VC_to_RC = np.array([[1, 0, 0, 0], 80 | [0, -1, 0, 0], 81 | [0, 0, -1, 0], 82 | [0, 0, 0, 1]]).reshape((4,4)).astype(np.float32) 83 | 84 | def compute_VC2VW_from_RC2RW(T_RC_to_RW): 85 | T_vc2rw = np.matmul(T_RC_to_RW,T_VC_to_RC) 86 | T_vc2vw = np.matmul(T_RW_to_VW,T_vc2rw) 87 | return T_vc2vw 88 | 89 | def get_camera_to_gravity_transform(pose, current, target=ImageOrientation.UPRIGHT): 90 | z_rot_4x4 = torch.eye(4).float() 91 | z_rot_4x4[:3, :3] = ROT_Z[(current, target)] 92 | pose = pose @ torch.linalg.inv(z_rot_4x4.to(pose)) 93 | 94 | # This is somewhat lazy. 
95 | fake_corners = DepthInstance3DBoxes( 96 | np.array([[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]])).corners[:, [1, 5, 4, 0, 2, 6, 7, 3]] 97 | fake_corners = torch.cat((fake_corners, torch.ones_like(fake_corners[..., :1])), dim=-1).to(pose) 98 | 99 | fake_corners = (torch.linalg.inv(pose) @ fake_corners.permute(0, 2, 1)).permute(0, 2, 1)[..., :3] 100 | fake_basis = torch.stack([ 101 | (fake_corners[:, 1] - fake_corners[:, 0]) / torch.linalg.norm(fake_corners[:, 1] - fake_corners[:, 0], dim=-1)[:, None], 102 | (fake_corners[:, 3] - fake_corners[:, 0]) / torch.linalg.norm(fake_corners[:, 3] - fake_corners[:, 0], dim=-1)[:, None], 103 | (fake_corners[:, 4] - fake_corners[:, 0]) / torch.linalg.norm(fake_corners[:, 4] - fake_corners[:, 0], dim=-1)[:, None], 104 | ], dim=1).permute(0, 2, 1) 105 | 106 | # this gets applied _after_ predictions to put it in camera space. 107 | T = Rotation.from_euler("xz", Rotation.from_matrix(fake_basis[-1].cpu().numpy()).as_euler("yxz")[1:]).as_matrix() 108 | 109 | return torch.tensor(T).to(pose) 110 | 111 | MAX_LONG_SIDE = 1024 112 | 113 | # Acts like CubifyAnythingDataset but reads from the NeRFCapture stream. 114 | class CaptureDataset(IterableDataset): 115 | def __init__(self, load_arkit_depth=True): 116 | super(CaptureDataset, self).__init__() 117 | 118 | self.load_arkit_depth = load_arkit_depth 119 | 120 | self.domain = Domain(domain_id=0, config=dds_config) 121 | self.participant = DomainParticipant() 122 | self.qos = Qos(Policy.Reliability.Reliable( 123 | max_blocking_time=duration(seconds=1))) 124 | self.topic = Topic(self.participant, "Frames", CaptureFrame, qos=self.qos) 125 | self.reader = DataReader(self.participant, self.topic) 126 | 127 | def __iter__(self): 128 | print("Waiting for frames...") 129 | video_id = 0 130 | 131 | # Start DDS Loop 132 | while True: 133 | sample = self.reader.read_next() 134 | if not sample: 135 | print("Still waiting...") 136 | time.sleep(0.05) 137 | continue 138 | 139 | result = dict(wide=dict()) 140 | wide = PosedSensorInfo() 141 | 142 | # OK, we have a frame. Fill on the requisite data/fields. 143 | image_info = ImageMeasurementInfo( 144 | size=(sample.width, sample.height), 145 | K=torch.tensor([ 146 | [sample.fl_x, 0.0, sample.cx], 147 | [0.0, sample.fl_y, sample.cy], 148 | [0.0, 0.0, 1.0] 149 | ])[None]) 150 | 151 | print(image_info.size) 152 | 153 | image = np.asarray(sample.image, dtype=np.uint8).reshape((sample.height, sample.width, 3)) 154 | wide.image = image_info 155 | result["wide"]["image"] = torch.tensor(np.moveaxis(image, -1, 0))[None] 156 | 157 | if self.load_arkit_depth and not sample.has_depth: 158 | raise ValueError("Depth was not found, you likely can only run the RGB only model with your device") 159 | 160 | depth_info = None 161 | if sample.has_depth: 162 | # We'll eventually ensure this is 1/4. 163 | rgb_depth_ratio = sample.width / sample.depth_width 164 | depth_info = DepthMeasurementInfo( 165 | size=(sample.depth_width, sample.depth_height), 166 | K=torch.tensor([ 167 | [sample.fl_x / rgb_depth_ratio , 0.0, sample.cx / rgb_depth_ratio], 168 | [0.0, sample.fl_y / rgb_depth_ratio, sample.cy / rgb_depth_ratio], 169 | [0.0, 0.0, 1.0] 170 | ])[None]) 171 | 172 | # Is this an encoding thing? 173 | depth_scale = sample.depth_scale 174 | print(depth_scale) 175 | wide.depth = depth_info 176 | 177 | # If I understand this correctly, it looks like this might just want the lower 16 bits? 
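                # The stream sends depth as raw float32 bytes; reinterpret the uint8 buffer as
                # float32 and reshape it to (depth_height, depth_width). No depth_scale is
                # applied here, so the values are assumed to already be metric.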
178 | depth = torch.tensor( 179 | np.asarray(sample.depth_image, dtype=np.uint8).view(dtype=np.float32).reshape((sample.depth_height, sample.depth_width)))[None].float() 180 | result["wide"]["depth"] = depth 181 | 182 | desired_image_size = (4 * depth_info.size[0], 4 * depth_info.size[1]) 183 | wide.image = wide.image.resize(desired_image_size) 184 | result["wide"]["image"] = torch.tensor(np.moveaxis(np.array(Image.fromarray(image).resize(desired_image_size)), -1, 0))[None] 185 | else: 186 | # Even for RGB-only, only support a certain long size. 187 | if max(wide.image.size) > MAX_LONG_SIDE: 188 | scale_factor = MAX_LONG_SIDE / max(wide.image.size) 189 | 190 | new_size = (int(wide.image.size[0] * scale_factor), int(wide.image.size[1] * scale_factor)) 191 | wide.image = wide.image.resize(new_size) 192 | result["wide"]["image"] = torch.tensor(np.moveaxis(np.array(Image.fromarray(image).resize(new_size)), -1, 0))[None] 193 | 194 | # ARKit sends W2C? 195 | # While we don't necessarily care about pose, we use it to derive the orientation 196 | # and T_gravity. 197 | RT = torch.tensor( 198 | compute_VC2VW_from_RC2RW(np.asarray(sample.transform_matrix).astype(np.float32).reshape((4, 4)).T)) 199 | wide.RT = RT[None] 200 | 201 | current_orientation = wide.orientation 202 | target_orientation = ImageOrientation.UPRIGHT 203 | 204 | T_gravity = get_camera_to_gravity_transform(wide.RT[-1], current_orientation, target=target_orientation) 205 | wide = wide.orient(current_orientation, target_orientation) 206 | 207 | result["wide"]["image"] = rotate_tensor(result["wide"]["image"], current_orientation, target=target_orientation) 208 | if wide.has("depth"): 209 | result["wide"]["depth"] = rotate_tensor(result["wide"]["depth"], current_orientation, target=target_orientation) 210 | 211 | # No need for pose anymore. 212 | wide.RT = torch.eye(4)[None] 213 | wide.T_gravity = T_gravity[None] 214 | 215 | sensor_info = SensorArrayInfo() 216 | sensor_info.wide = wide 217 | 218 | result["meta"] = dict(video_id=video_id, timestamp=sample.timestamp) 219 | result["sensor_info"] = sensor_info 220 | 221 | yield result 222 | -------------------------------------------------------------------------------- /cubifyanything/color.py: -------------------------------------------------------------------------------- 1 | # Detectron2's colors. 
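# _COLORS below is a fixed palette of 74 RGB triplets in [0, 1]. random_color() samples one row
# and scales it to the requested maximum (255 or 1), flipping RGB -> BGR unless rgb=True, e.g.
# (illustrative):
#
#     color = random_color(rgb=True, maximum=1)  # -> np.ndarray of 3 floats in [0, 1]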
2 | import numpy as np 3 | 4 | _COLORS = np.array( 5 | [ 6 | 0.000, 0.447, 0.741, 7 | 0.850, 0.325, 0.098, 8 | 0.929, 0.694, 0.125, 9 | 0.494, 0.184, 0.556, 10 | 0.466, 0.674, 0.188, 11 | 0.301, 0.745, 0.933, 12 | 0.635, 0.078, 0.184, 13 | 0.300, 0.300, 0.300, 14 | 0.600, 0.600, 0.600, 15 | 1.000, 0.000, 0.000, 16 | 1.000, 0.500, 0.000, 17 | 0.749, 0.749, 0.000, 18 | 0.000, 1.000, 0.000, 19 | 0.000, 0.000, 1.000, 20 | 0.667, 0.000, 1.000, 21 | 0.333, 0.333, 0.000, 22 | 0.333, 0.667, 0.000, 23 | 0.333, 1.000, 0.000, 24 | 0.667, 0.333, 0.000, 25 | 0.667, 0.667, 0.000, 26 | 0.667, 1.000, 0.000, 27 | 1.000, 0.333, 0.000, 28 | 1.000, 0.667, 0.000, 29 | 1.000, 1.000, 0.000, 30 | 0.000, 0.333, 0.500, 31 | 0.000, 0.667, 0.500, 32 | 0.000, 1.000, 0.500, 33 | 0.333, 0.000, 0.500, 34 | 0.333, 0.333, 0.500, 35 | 0.333, 0.667, 0.500, 36 | 0.333, 1.000, 0.500, 37 | 0.667, 0.000, 0.500, 38 | 0.667, 0.333, 0.500, 39 | 0.667, 0.667, 0.500, 40 | 0.667, 1.000, 0.500, 41 | 1.000, 0.000, 0.500, 42 | 1.000, 0.333, 0.500, 43 | 1.000, 0.667, 0.500, 44 | 1.000, 1.000, 0.500, 45 | 0.000, 0.333, 1.000, 46 | 0.000, 0.667, 1.000, 47 | 0.000, 1.000, 1.000, 48 | 0.333, 0.000, 1.000, 49 | 0.333, 0.333, 1.000, 50 | 0.333, 0.667, 1.000, 51 | 0.333, 1.000, 1.000, 52 | 0.667, 0.000, 1.000, 53 | 0.667, 0.333, 1.000, 54 | 0.667, 0.667, 1.000, 55 | 0.667, 1.000, 1.000, 56 | 1.000, 0.000, 1.000, 57 | 1.000, 0.333, 1.000, 58 | 1.000, 0.667, 1.000, 59 | 0.333, 0.000, 0.000, 60 | 0.500, 0.000, 0.000, 61 | 0.667, 0.000, 0.000, 62 | 0.833, 0.000, 0.000, 63 | 1.000, 0.000, 0.000, 64 | 0.000, 0.167, 0.000, 65 | 0.000, 0.333, 0.000, 66 | 0.000, 0.500, 0.000, 67 | 0.000, 0.667, 0.000, 68 | 0.000, 0.833, 0.000, 69 | 0.000, 1.000, 0.000, 70 | 0.000, 0.000, 0.167, 71 | 0.000, 0.000, 0.333, 72 | 0.000, 0.000, 0.500, 73 | 0.000, 0.000, 0.667, 74 | 0.000, 0.000, 0.833, 75 | 0.000, 0.000, 1.000, 76 | 0.000, 0.000, 0.000, 77 | 0.143, 0.143, 0.143, 78 | 0.857, 0.857, 0.857, 79 | 1.000, 1.000, 1.000 80 | ] 81 | ).astype(np.float32).reshape(-1, 3) 82 | 83 | def random_color(rgb=False, maximum=255): 84 | """ 85 | Args: 86 | rgb (bool): whether to return RGB colors or BGR colors. 87 | maximum (int): either 255 or 1 88 | 89 | Returns: 90 | ndarray: a vector of 3 numbers 91 | """ 92 | idx = np.random.randint(0, len(_COLORS)) 93 | ret = _COLORS[idx] * maximum 94 | if not rgb: 95 | ret = ret[::-1] 96 | return ret 97 | -------------------------------------------------------------------------------- /cubifyanything/dataset.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import functools 5 | import io 6 | import numpy as np 7 | import torch 8 | import json 9 | import tifffile 10 | import webdataset 11 | 12 | from pathlib import Path 13 | from PIL import Image 14 | from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Set, Tuple 15 | 16 | from webdataset.cache import cached_url_opener 17 | from webdataset.handlers import reraise_exception 18 | 19 | from cubifyanything.boxes import GeneralInstance3DBoxes, BoxDOF 20 | from cubifyanything.instances import Instances3D 21 | from cubifyanything.measurement import ImageMeasurementInfo, DepthMeasurementInfo 22 | from cubifyanything.sensor import SensorArrayInfo, SensorInfo, PosedSensorInfo 23 | 24 | def custom_pipe_cleaner(spec): 25 | # This should only be called when using links directly to the MLR CDN, so assume some stuff. 
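    # Shards are cached under data/ using only the last two path components of the URL, e.g. a
    # (hypothetical) ".../cubify-anything/val-000.tar" is stored as "cubify-anything/val-000.tar".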
26 | return "/".join(Path(spec).parts[-2:]) 27 | 28 | custom_cached_url_opener = functools.partial(cached_url_opener, cache_dir="data", url_to_name=custom_pipe_cleaner) 29 | 30 | PREFIX_SEPARATOR = "." 31 | WORLD_PREFIX = "world" 32 | 33 | def split_into_prefix_suffix(name): 34 | return name.split(PREFIX_SEPARATOR)[:2] 35 | 36 | # All samples should be stored with keys like [video_id]/[integer timestamp].[sensor_name]/[measurement_name] (or world/). 37 | def group_by_video_and_timestamp( 38 | data: Iterable[Dict[str, Any]], 39 | keys: Callable[[str], Tuple[str, str]] = split_into_prefix_suffix, 40 | lcase: bool = True, 41 | suffixes: Optional[Set[str]] = None, 42 | handler: Callable[[Exception], bool] = reraise_exception, 43 | ) -> Iterator[Dict[str, Any]]: 44 | return webdataset.tariterators.group_by_keys(data, keys, lcase, suffixes, handler) 45 | 46 | TIME_SCALE = 1e9 47 | MM_TO_M = 1000.0 48 | 49 | # Parsers. 50 | def parse_json(data, key): 51 | return json.loads(data[key].decode("utf-8")) 52 | 53 | def parse_size(data): 54 | return tuple(int(x) for x in data.decode("utf-8").strip("[]").split(", ")) 55 | 56 | def parse_transform_3x3(data): 57 | return torch.tensor(np.array(json.loads(data)).reshape(3, 3).astype(np.float32)) 58 | 59 | def parse_transform_4x4(data): 60 | return torch.tensor(np.array(json.loads(data)).reshape(4, 4).astype(np.float32)) 61 | 62 | def read_image_bytes(image_bytes, expected_size, channels_first=True): 63 | if image_bytes.startswith(b"\x89PNG"): 64 | # PNG. 65 | image = np.array(Image.open(io.BytesIO(image_bytes))) 66 | elif image_bytes.startswith(b"II*\x00") or image_bytes.startswith(b"MM\x00*"): 67 | # TIFF. 68 | image = tifffile.imread(io.BytesIO(image_bytes)) 69 | else: 70 | raise ValueError("Unknown image format") 71 | 72 | assert (image.shape[1], image.shape[0]) == expected_size 73 | 74 | if channels_first and (image.ndim > 2): 75 | image = np.moveaxis(image, -1, 0) 76 | 77 | return torch.tensor(image) 78 | 79 | def read_instances(data): 80 | instances_data = json.loads(data) 81 | instances = Instances3D() 82 | 83 | if len(instances_data) == 0: 84 | # Empty. 
85 | instances.set("gt_ids", []) 86 | instances.set("gt_names", []) 87 | instances.set("gt_boxes_3d", empty_box(box_type)) 88 | for src_key_2d, dst_key_2d in [("box_2d_rend", "gt_boxes_2d_trunc"), ("box_2d_proj", "gt_boxes_2d_proj")]: 89 | instances.set(dst_key_2d, np.empty((0, 4))) 90 | 91 | return instances 92 | 93 | instances.set("gt_ids", [bi["id"] for bi in instances_data]) 94 | instances.set("gt_names", [bi["category"] for bi in instances_data]) 95 | instances.set("gt_boxes_3d", GeneralInstance3DBoxes( 96 | np.concatenate(( 97 | np.array([bi["position"] for bi in instances_data]), 98 | np.array([bi["scale"] for bi in instances_data])), axis=-1), 99 | np.array([bi["R"] for bi in instances_data]))) 100 | 101 | return instances 102 | 103 | class CubifyAnythingDataset(webdataset.DataPipeline): 104 | def __init__(self, url, box_dof=BoxDOF.GravityAligned, yield_world_instances=False, load_arkit_depth=True, use_cache=False): 105 | self._url = url 106 | self._yield_world_instances = yield_world_instances 107 | self._use_cache = use_cache 108 | 109 | super(CubifyAnythingDataset, self).__init__( 110 | webdataset.SimpleShardList(url), 111 | custom_cached_url_opener if self._use_cache else webdataset.tariterators.url_opener, 112 | webdataset.tariterators.tar_file_expander, 113 | group_by_video_and_timestamp, 114 | self._map_samples) 115 | 116 | self.load_arkit_depth = load_arkit_depth 117 | 118 | def _map_sample(self, sample): 119 | video_id, timestamp = sample["__key__"].split("/") 120 | video_id = int(video_id) 121 | if timestamp == "world": 122 | return dict( 123 | world=dict(instances=read_instances(sample["gt/instances"])), 124 | meta=dict(video_id=video_id)) 125 | 126 | gt_depth_size = parse_size(sample["_gt/depth/size"]) 127 | 128 | timestamp = float(timestamp) / 1e9 129 | 130 | # At this point, everything is in camera coordinates. 131 | wide = PosedSensorInfo() 132 | wide.RT = torch.eye(4)[None] 133 | wide.image = ImageMeasurementInfo( 134 | size=parse_size(sample["_wide/image/size"]), 135 | K=parse_transform_3x3(sample["wide/image/k"])[None], 136 | ) 137 | 138 | if self.load_arkit_depth: 139 | wide.depth = DepthMeasurementInfo( 140 | size=parse_size(sample["_wide/depth/size"]), 141 | K=parse_transform_3x3(sample["wide/depth/k"])[None]) 142 | 143 | wide.T_gravity = parse_transform_3x3(sample["wide/t_gravity"])[None] 144 | 145 | gt = PosedSensorInfo() 146 | gt.RT = parse_transform_4x4(sample["gt/rt"])[None] 147 | gt.depth = DepthMeasurementInfo( 148 | size=parse_size(sample["_gt/depth/size"]), 149 | K=parse_transform_3x3(sample["gt/depth/k"])[None]) 150 | 151 | sensor_info = SensorArrayInfo() 152 | sensor_info.wide = wide 153 | sensor_info.gt = gt 154 | 155 | result = dict( 156 | sensor_info=sensor_info, 157 | wide=dict( 158 | image=read_image_bytes(sample["wide/image"], expected_size=wide.image.size)[None], 159 | instances=read_instances(sample["wide/instances"])), 160 | gt=dict( 161 | # NOTE: 0.0 values here correspond to failed registration areas. 162 | depth=read_image_bytes(sample["gt/depth"], expected_size=gt.depth.size)[None].float() / MM_TO_M), 163 | meta=dict(video_id=video_id, timestamp=timestamp)) 164 | 165 | if self.load_arkit_depth: 166 | result["wide"]["depth"] = read_image_bytes(sample["wide/depth"], expected_size=wide.depth.size)[None].float() / MM_TO_M 167 | 168 | return result 169 | 170 | def _map_samples(self, samples): 171 | for sample in samples: 172 | # Don't map the world instances unless requested to (since these are timeless). 
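            # Per-frame samples are keyed "[video_id]/[timestamp]"; the timeless, per-video
            # annotations arrive once under "[video_id]/world".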
173 | if sample["__key__"].endswith("/world"): 174 | if not self._yield_world_instances: 175 | continue 176 | 177 | yield self._map_sample(sample) 178 | 179 | if __name__ == "__main__": 180 | dataset = CubifyAnythingDataset("file:/tmp/lupine-train-49739919.tar") 181 | for blah in iter(dataset): 182 | import pdb 183 | pdb.set_trace() 184 | -------------------------------------------------------------------------------- /cubifyanything/imagelist.py: -------------------------------------------------------------------------------- 1 | # Detectron2's ImageList. 2 | from typing import Any, Dict, List, Optional, Tuple 3 | import torch 4 | from torch import device 5 | from torch.nn import functional as F 6 | 7 | class ImageList: 8 | """ 9 | Structure that holds a list of images (of possibly 10 | varying sizes) as a single tensor. 11 | This works by padding the images to the same size. 12 | The original sizes of each image is stored in `image_sizes`. 13 | 14 | Attributes: 15 | image_sizes (list[tuple[int, int]]): each tuple is (h, w). 16 | During tracing, it becomes list[Tensor] instead. 17 | """ 18 | 19 | def __init__(self, tensor: torch.Tensor, image_sizes: List[Tuple[int, int]]): 20 | """ 21 | Arguments: 22 | tensor (Tensor): of shape (N, H, W) or (N, C_1, ..., C_K, H, W) where K >= 1 23 | image_sizes (list[tuple[int, int]]): Each tuple is (h, w). It can 24 | be smaller than (H, W) due to padding. 25 | """ 26 | self.tensor = tensor 27 | self.image_sizes = image_sizes 28 | 29 | def __len__(self) -> int: 30 | return len(self.image_sizes) 31 | 32 | def __getitem__(self, idx) -> torch.Tensor: 33 | """ 34 | Access the individual image in its original size. 35 | 36 | Args: 37 | idx: int or slice 38 | 39 | Returns: 40 | Tensor: an image of shape (H, W) or (C_1, ..., C_K, H, W) where K >= 1 41 | """ 42 | size = self.image_sizes[idx] 43 | return self.tensor[idx, ..., : size[0], : size[1]] 44 | 45 | @torch.jit.unused 46 | def to(self, *args: Any, **kwargs: Any) -> "ImageList": 47 | cast_tensor = self.tensor.to(*args, **kwargs) 48 | return ImageList(cast_tensor, self.image_sizes) 49 | 50 | @property 51 | def device(self) -> device: 52 | return self.tensor.device 53 | 54 | @staticmethod 55 | def from_tensors( 56 | tensors: List[torch.Tensor], 57 | size_divisibility: int = 0, 58 | pad_value: float = 0.0, 59 | padding_constraints: Optional[Dict[str, int]] = None, 60 | ) -> "ImageList": 61 | """ 62 | Args: 63 | tensors: a tuple or list of `torch.Tensor`, each of shape (Hi, Wi) or 64 | (C_1, ..., C_K, Hi, Wi) where K >= 1. The Tensors will be padded 65 | to the same shape with `pad_value`. 66 | size_divisibility (int): If `size_divisibility > 0`, add padding to ensure 67 | the common height and width is divisible by `size_divisibility`. 68 | This depends on the model and many models need a divisibility of 32. 69 | pad_value (float): value to pad. 70 | padding_constraints (optional[Dict]): If given, it would follow the format as 71 | {"size_divisibility": int, "square_size": int}, where `size_divisibility` will 72 | overwrite the above one if presented and `square_size` indicates the 73 | square padding size if `square_size` > 0. 74 | Returns: 75 | an `ImageList`. 
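        Example (an illustrative sketch; note this trimmed copy only handles a single tensor)::

            >>> il = ImageList.from_tensors([torch.rand(3, 475, 630)], size_divisibility=32)
            >>> il.tensor.shape   # torch.Size([1, 3, 480, 640]) -- padded up to multiples of 32
            >>> il.image_sizes    # [(475, 630)] -- original (h, w) kept for unpadding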
76 | """ 77 | assert len(tensors) > 0 78 | assert isinstance(tensors, (tuple, list)) 79 | for t in tensors: 80 | assert isinstance(t, torch.Tensor), type(t) 81 | assert t.shape[:-2] == tensors[0].shape[:-2], t.shape 82 | 83 | image_sizes = [(im.shape[-2], im.shape[-1]) for im in tensors] 84 | image_sizes_tensor = [torch.as_tensor(x) for x in image_sizes] 85 | max_size = torch.stack(image_sizes_tensor).max(0).values 86 | 87 | if padding_constraints is not None: 88 | square_size = padding_constraints.get("square_size", 0) 89 | if square_size > 0: 90 | # pad to square. 91 | max_size[0] = max_size[1] = square_size 92 | if "size_divisibility" in padding_constraints: 93 | size_divisibility = padding_constraints["size_divisibility"] 94 | if size_divisibility > 1: 95 | stride = size_divisibility 96 | # the last two dims are H,W, both subject to divisibility requirement 97 | max_size = (max_size + (stride - 1)).div(stride, rounding_mode="floor") * stride 98 | 99 | # handle weirdness of scripting and tracing ... 100 | if torch.jit.is_scripting(): 101 | max_size: List[int] = max_size.to(dtype=torch.long).tolist() 102 | else: 103 | if torch.jit.is_tracing(): 104 | image_sizes = image_sizes_tensor 105 | 106 | if len(tensors) == 1: 107 | # This seems slightly (2%) faster. 108 | # TODO: check whether it's faster for multiple images as well 109 | image_size = image_sizes[0] 110 | padding_size = [0, max_size[-1] - image_size[1], 0, max_size[-2] - image_size[0]] 111 | batched_imgs = F.pad(tensors[0], padding_size, value=pad_value).unsqueeze_(0) 112 | else: 113 | raise NotImplementedError 114 | 115 | return ImageList(batched_imgs.contiguous(), image_sizes) 116 | -------------------------------------------------------------------------------- /cubifyanything/instances.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | # Based on D2's Instances. 5 | import itertools 6 | import warnings 7 | from typing import Any, Dict, List, Tuple, Union 8 | 9 | import numpy as np 10 | import torch 11 | 12 | # Provides basic compatibility with D2. 13 | class Instances3D: 14 | """ 15 | This class represents a list of instances in _the world_. 16 | """ 17 | def __init__(self, image_size: Tuple[int, int] = (0, 0), **kwargs: Any): 18 | # image_size is here for Detectron2 compatibility. 19 | self._image_size = image_size 20 | self._fields: Dict[str, Any] = {} 21 | for k, v in kwargs.items(): 22 | self.set(k, v) 23 | 24 | @property 25 | def image_size(self) -> Tuple[int, int]: 26 | """ 27 | Returns: 28 | tuple: height, width (note: opposite of cubifycore). 29 | 30 | Here for D2 compatibility. You probably shouldn't be using this. 31 | """ 32 | return self._image_size 33 | 34 | def __setattr__(self, name: str, val: Any) -> None: 35 | if name.startswith("_"): 36 | super().__setattr__(name, val) 37 | else: 38 | self.set(name, val) 39 | 40 | def __getattr__(self, name: str) -> Any: 41 | if name == "_fields" or name not in self._fields: 42 | raise AttributeError("Cannot find field '{}' in the given Instances3D!".format(name)) 43 | return self._fields[name] 44 | 45 | def set(self, name: str, value: Any) -> None: 46 | """ 47 | Set the field named `name` to `value`. 48 | The length of `value` must be the number of instances, 49 | and must agree with other existing fields in this object. 
50 | """ 51 | with warnings.catch_warnings(record=True): 52 | data_len = len(value) 53 | if len(self._fields): 54 | assert ( 55 | len(self) == data_len 56 | ), "Adding a field of length {} to a Instances3D of length {}".format(data_len, len(self)) 57 | self._fields[name] = value 58 | 59 | def has(self, name: str) -> bool: 60 | """ 61 | Returns: 62 | bool: whether the field called `name` exists. 63 | """ 64 | return name in self._fields 65 | 66 | def remove(self, name: str) -> None: 67 | """ 68 | Remove the field called `name`. 69 | """ 70 | del self._fields[name] 71 | 72 | def get(self, name: str) -> Any: 73 | """ 74 | Returns the field called `name`. 75 | """ 76 | return self._fields[name] 77 | 78 | def get_fields(self) -> Dict[str, Any]: 79 | """ 80 | Returns: 81 | dict: a dict which maps names (str) to data of the fields 82 | 83 | Modifying the returned dict will modify this instance. 84 | """ 85 | return self._fields 86 | 87 | # Tensor-like methods 88 | def to(self, *args: Any, **kwargs: Any) -> "Instances3D": 89 | """ 90 | Returns: 91 | Instances: all fields are called with a `to(device)`, if the field has this method. 92 | """ 93 | ret = Instances3D(image_size=self._image_size) 94 | # Copy fields that were explicitly added to this object (e.g., hidden fields) 95 | for name, value in self.__dict__.items(): 96 | if (name not in ["_fields"]) and name.startswith("_"): 97 | setattr(ret, name, value.to(*args, **kwargs) if hasattr(value, "to") else value) 98 | 99 | for k, v in self._fields.items(): 100 | if hasattr(v, "to"): 101 | v = v.to(*args, **kwargs) 102 | ret.set(k, v) 103 | 104 | return ret 105 | 106 | def __getitem__(self, item: Union[int, slice, torch.BoolTensor]) -> "Instances3D": 107 | """ 108 | Args: 109 | item: an index-like object and will be used to index all the fields. 110 | 111 | Returns: 112 | If `item` is a string, return the data in the corresponding field. 113 | Otherwise, returns an `Instances3D` where all fields are indexed by `item`. 114 | """ 115 | if type(item) == int: 116 | if item >= len(self) or item < -len(self): 117 | raise IndexError("Instances3D index out of range!") 118 | else: 119 | item = slice(item, None, len(self)) 120 | 121 | ret = Instances3D(image_size=self.image_size) 122 | for name, value in self.__dict__.items(): 123 | if (name not in ["_fields"]) and name.startswith("_"): 124 | setattr(ret, name, value) 125 | 126 | for k, v in self._fields.items(): 127 | if isinstance(v, (torch.Tensor, np.ndarray)) or hasattr(v, "tensor"): 128 | # assume if has .tensor, then this is piped into __getitem__. 129 | # Make sure to match underlying types. 130 | if isinstance(v, np.ndarray) and isinstance(item, torch.Tensor): 131 | ret.set(k, v[item.cpu().numpy()]) 132 | else: 133 | ret.set(k, v[item]) 134 | elif hasattr(v, "__iter__"): 135 | # handle non-Tensor types like lists, etc. 136 | if isinstance(item, np.ndarray) and (item.dtype == np.bool_): 137 | ret.set(k, [v_ for i_, v_ in enumerate(v) if item[i_]]) 138 | elif isinstance(item, torch.BoolTensor) or (isinstance(item, torch.Tensor) and (item.dtype == torch.bool)): 139 | ret.set(k, [v_ for i_, v_ in enumerate(v) if item[i_].item()]) 140 | elif isinstance(item, torch.LongTensor) or (isinstance(item, torch.Tensor) and (item.dtype == torch.int64)): 141 | # Can this be right? 
142 | ret.set(k, [v[i_.item()] for i_ in item]) 143 | elif isinstance(item, slice): 144 | ret.set(k, v[item]) 145 | else: 146 | raise ValueError("Expected Bool or Long Tensor") 147 | else: 148 | raise ValueError("Not supported!") 149 | 150 | return ret 151 | 152 | def __len__(self) -> int: 153 | for v in self._fields.values(): 154 | # use __len__ because len() has to be int and is not friendly to tracing 155 | return v.__len__() 156 | raise NotImplementedError("Empty Instances3D does not support __len__!") 157 | 158 | def __iter__(self): 159 | raise NotImplementedError("`Instances3D` object is not iterable!") 160 | 161 | def split(self, split_size_or_sections): 162 | indexes = torch.arange(len(self)) 163 | splits = torch.split(indexes, split_size_or_sections) 164 | 165 | return [self[split] for split in splits] 166 | 167 | def clone(self): 168 | import copy 169 | 170 | ret = Instances3D(image_size=self._image_size) 171 | for k, v in self._fields.items(): 172 | if hasattr(v, "clone"): 173 | v = v.clone() 174 | elif isinstance(v, np.ndarray): 175 | v = np.copy(v) 176 | elif isinstance(v, (str, list, tuple)): 177 | v = copy.copy(v) 178 | elif hasattr(v, "tensor"): 179 | v = type(v)(v.tensor.clone()) 180 | else: 181 | raise NotImplementedError 182 | 183 | ret.set(k, v) 184 | 185 | return ret 186 | 187 | @staticmethod 188 | def cat(instance_lists: List["Instances3D"]) -> "Instances3D": 189 | """ 190 | Args: 191 | instance_lists (list[Instances]) 192 | 193 | Returns: 194 | Instances 195 | """ 196 | assert all(isinstance(i, Instances3D) for i in instance_lists) 197 | assert len(instance_lists) > 0 198 | if len(instance_lists) == 1: 199 | return instance_lists[0] 200 | 201 | ret = Instances3D(image_size=instance_lists[0]._image_size) 202 | for k in instance_lists[0]._fields.keys(): 203 | values = [i.get(k) for i in instance_lists] 204 | v0 = values[0] 205 | if isinstance(v0, torch.Tensor): 206 | values = torch.cat(values, dim=0) 207 | elif isinstance(v0, list): 208 | values = list(itertools.chain(*values)) 209 | elif hasattr(type(v0), "cat"): 210 | values = type(v0).cat(values) 211 | else: 212 | raise ValueError("Unsupported type {} for concatenation".format(type(v0))) 213 | ret.set(k, values) 214 | return ret 215 | 216 | def translate(self, translation): 217 | # in-place. 218 | for field_name, field in self._fields.items(): 219 | if hasattr(field, "translate"): 220 | field.translate(translation) 221 | 222 | def __str__(self) -> str: 223 | s = self.__class__.__name__ + "(" 224 | s += "num_instances={}, ".format(len(self)) 225 | s += "fields=[{}])".format(", ".join((f"{k}: {v}" for k, v in self._fields.items()))) 226 | return s 227 | 228 | __repr__ = __str__ 229 | -------------------------------------------------------------------------------- /cubifyanything/measurement.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 
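#
# A measurement info pairs a pixel size with (batched) camera intrinsics. By convention here,
# `size` is (width, height) and `K` is a (1, 3, 3) pinhole matrix, e.g. (values illustrative):
#
#     info = ImageMeasurementInfo(size=(1920, 1440),
#                                 K=torch.tensor([[fx, 0.0, cx],
#                                                 [0.0, fy, cy],
#                                                 [0.0, 0.0, 1.0]])[None])
#     info = info.resize((960, 720))  # rescales K to match the new resolution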
3 | 4 | import numpy as np 5 | import torch 6 | 7 | from typing import Any, Dict, List, Tuple, Union 8 | 9 | from cubifyanything.orientation import ImageOrientation, rotate_K 10 | 11 | class BaseMeasurementInfo(object): 12 | def __init__(self, meta=None, **kwargs): 13 | super(BaseMeasurementInfo, self).__init__() 14 | self.meta = meta 15 | 16 | @property 17 | def ts(self): 18 | if (self.meta is not None) and hasattr(self.meta, "ts"): 19 | return self.meta.ts 20 | 21 | return None 22 | 23 | class MeasurementInfo(BaseMeasurementInfo): 24 | pass 25 | 26 | class ImageMeasurementInfo(MeasurementInfo): 27 | def __init__(self, size, K, meta=None, original_size=None): 28 | super(ImageMeasurementInfo, self).__init__(meta=meta) 29 | self.size = size 30 | if isinstance(self.size, torch.Tensor) and not torch.jit.is_tracing(): 31 | self.size = (self.size[0].item(), self.size[1].item()) 32 | 33 | self.original_size = original_size or self.size 34 | 35 | # check for normalized. 36 | if ((K[..., 2] >= 0) & (K[..., 2] < 1)).all(): 37 | raise ValueError("Normalized intrinsics are not supported") 38 | 39 | # No float64 support on MPS. 40 | self.K = K.float() 41 | 42 | @property 43 | def device(self): 44 | return self.K.device 45 | 46 | def _get_fields(self): 47 | # Don't support anything fancy for now. 48 | return dict( 49 | size=torch.tensor(self.size), 50 | K=self.K) 51 | 52 | def __len__(self): 53 | return len(self.K) 54 | 55 | def __getitem__(self, item): 56 | ret = type(self)(self.size, self.K.__getitem__(item), meta=self.meta, original_size=self.original_size) 57 | return ret 58 | 59 | def to(self, *args: Any, **kwargs: Any) -> "ImageMeasurementInfo": 60 | ret = type(self)(self.size, self.K.to(*args, **kwargs), meta=self.meta, original_size=self.original_size) 61 | return ret 62 | 63 | @classmethod 64 | def cat(self, info_list): 65 | return type(info_list[0])( 66 | size=info_list[0].size, 67 | K=torch.cat([info_.K for info_ in info_list]), 68 | ) 69 | 70 | def _get_oriented_size(self, current_orientation, target_orientation, size): 71 | if (target_orientation != ImageOrientation.UPRIGHT) and (current_orientation != ImageOrientation.UPRIGHT): 72 | raise NotImplementedError 73 | 74 | if ((current_orientation, target_orientation) in [ 75 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT), 76 | (ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT), 77 | (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN), 78 | (ImageOrientation.LEFT, ImageOrientation.RIGHT), 79 | (ImageOrientation.RIGHT, ImageOrientation.LEFT) 80 | ]): 81 | # Nothing changes. 82 | new_size = size 83 | else: 84 | # Swap. 
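# (A 90-degree orientation change, e.g. LEFT -> UPRIGHT, exchanges width and height.)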
85 | new_size = (size[1], size[0]) 86 | 87 | return new_size 88 | 89 | def orient(self, current_orientation, target_orientation): 90 | if (target_orientation != ImageOrientation.UPRIGHT) and (current_orientation != ImageOrientation.UPRIGHT): 91 | raise NotImplementedError 92 | 93 | new_K = rotate_K(self.K, current_orientation, self.size, target=target_orientation) 94 | new_size = self._get_oriented_size(current_orientation, target_orientation, self.size) 95 | 96 | ret = type(self)( 97 | new_size, 98 | new_K, 99 | meta=self.meta, 100 | original_size=self._get_oriented_size(current_orientation, target_orientation, self.original_size)) 101 | 102 | return ret 103 | 104 | def rescale(self, factor): 105 | old_size = self.size 106 | new_size = (int(old_size[0] * factor), int(old_size[1] * factor)) 107 | 108 | new_K = self.K.clone() 109 | new_K[..., :2, :] = new_K[..., :2, :] * factor 110 | 111 | return type(self)(new_size, new_K, meta=self.meta, original_size=self.original_size) 112 | 113 | def resize(self, new_size): 114 | if isinstance(new_size, float): 115 | return self.rescale(new_size) 116 | 117 | width_scale = new_size[0] / self.size[0] 118 | height_scale = new_size[1] / self.size[1] 119 | 120 | # Might be some some pixel errors. 121 | if not np.isclose(height_scale, width_scale, atol=0.025): 122 | print(f"Rescaling from {self.size} to {new_size}. This does not seem uniform but may be due to discretization error.") 123 | 124 | result = self.rescale(height_scale) 125 | # Even if it's not the best idea, always make sure the given size is 126 | # reflected. 127 | result.size = tuple(new_size) 128 | return result 129 | 130 | class DepthMeasurementInfo(ImageMeasurementInfo): 131 | def normalize(self, parameters): 132 | return WhitenedDepthMeasurementInfo( 133 | size=self.size, 134 | K=self.K, 135 | meta=self.meta, 136 | parameters=parameters, 137 | original_size=self.original_size) 138 | 139 | class WhitenedDepthMeasurementInfo(DepthMeasurementInfo): 140 | def __init__(self, size, K, meta=None, parameters=None, original_size=None): 141 | super(WhitenedDepthMeasurementInfo, self).__init__(size, K, meta=meta, original_size=original_size) 142 | 143 | # Whitening parameters. 144 | self.parameters = parameters 145 | 146 | def _get_fields(self): 147 | return dict( 148 | size=torch.tensor(self.size), 149 | K=self.K, 150 | parameters=self.parameters) 151 | 152 | -------------------------------------------------------------------------------- /cubifyanything/orientation.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import numpy as np 5 | import torch 6 | 7 | from enum import Enum 8 | from scipy.spatial.transform import Rotation 9 | 10 | class ImageOrientation(Enum): 11 | UPRIGHT = 0 12 | LEFT = 1 13 | UPSIDE_DOWN = 2 14 | RIGHT = 3 15 | ORIGINAL = 4 16 | 17 | ROT_Z = { 18 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', 0).as_matrix()).float(), 19 | (ImageOrientation.LEFT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', np.pi / 2).as_matrix()).float(), 20 | (ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', np.pi).as_matrix()).float(), 21 | (ImageOrientation.RIGHT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', -np.pi / 2).as_matrix()).float(), 22 | 23 | # Inverses. 
24 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', 0).as_matrix()).float(), 25 | (ImageOrientation.UPRIGHT, ImageOrientation.LEFT): torch.tensor(Rotation.from_euler('z', -np.pi / 2).as_matrix()).float(), 26 | (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN): torch.tensor(Rotation.from_euler('z', -np.pi).as_matrix()).float(), 27 | (ImageOrientation.UPRIGHT, ImageOrientation.RIGHT): torch.tensor(Rotation.from_euler('z', np.pi / 2).as_matrix()).float(), 28 | } 29 | 30 | ROT_K = { 31 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): 0, 32 | (ImageOrientation.LEFT, ImageOrientation.UPRIGHT): -1, 33 | (ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT): 2, 34 | (ImageOrientation.RIGHT, ImageOrientation.UPRIGHT): 1, 35 | 36 | # Inverses. 37 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): 0, 38 | (ImageOrientation.UPRIGHT, ImageOrientation.LEFT): 1, 39 | (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN): -2, 40 | (ImageOrientation.UPRIGHT, ImageOrientation.RIGHT): -1 41 | } 42 | 43 | def get_orientation(pose): 44 | z_vec = pose[..., 2, :3] 45 | z_orien = torch.tensor(np.array( 46 | [ 47 | [0.0, -1.0, 0.0], # upright 48 | [-1.0, 0.0, 0.0], # left 49 | [0.0, 1.0, 0.0], # upside-down 50 | [1.0, 0.0, 0.0], 51 | ] # right 52 | )).to(pose) 53 | 54 | corr = (z_orien @ z_vec.T).T 55 | corr_max = corr.argmax(dim=-1) 56 | 57 | return corr_max 58 | 59 | def rotate_K(K, current, image_size, target=ImageOrientation.UPRIGHT): 60 | # TODO: use image_size to properly compute the new (cx, cy) 61 | if (current, target) in [(ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT)]: 62 | return K.clone() 63 | elif (current, target) in [(ImageOrientation.LEFT, ImageOrientation.UPRIGHT), (ImageOrientation.UPRIGHT, ImageOrientation.RIGHT)]: 64 | return torch.stack([ 65 | torch.stack([K[:, 1, 1], K[:, 0, 1], K[:, 1, 2]], dim=1), 66 | torch.stack([K[:, 1, 0], K[:, 0, 0], K[:, 0, 2]], dim=1), 67 | torch.stack([K[:, 2, 0], K[:, 2, 1], K[:, 2, 2]], dim=1) 68 | ], dim=1).to(K) 69 | elif (current, target) in [(ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT), (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN)]: 70 | return torch.stack([ 71 | torch.stack([K[:, 0, 0], K[:, 0, 1], image_size[0] - K[:, 0, 2]], dim=1), 72 | torch.stack([K[:, 1, 0], K[:, 1, 1], image_size[1] - K[:, 1, 2]], dim=1), 73 | torch.stack([K[:, 2, 0], K[:, 2, 1], K[:, 2, 2]], dim=1) 74 | ], dim=1).to(K) 75 | elif (current, target) in [(ImageOrientation.RIGHT, ImageOrientation.UPRIGHT), (ImageOrientation.UPRIGHT, ImageOrientation.LEFT)]: 76 | return torch.stack([ 77 | torch.stack([K[:, 1, 1], K[:, 0, 1], K[:, 1, 2]], dim=1), 78 | torch.stack([K[:, 1, 0], K[:, 0, 0], K[:, 0, 2]], dim=1), 79 | torch.stack([K[:, 2, 0], K[:, 2, 1], K[:, 2, 2]], dim=1) 80 | ], dim=1).to(K) 81 | 82 | raise ValueError("unknown orientation") 83 | 84 | def rotate_pose(pose, current, target=ImageOrientation.UPRIGHT): 85 | rot_z = ROT_Z[(current, target)].to(pose) 86 | rot_z_4x4 = torch.eye(4, device=pose.device).float() 87 | rot_z_4x4[:3, :3] = rot_z 88 | 89 | return pose @ torch.linalg.inv(rot_z_4x4) 90 | 91 | def rotate_xyz(xyz, current, target=ImageOrientation.UPRIGHT): 92 | rot_z = ROT_Z[(current, target)].to(xyz) 93 | return rot_z @ xyz 94 | 95 | def rotate_tensor(tensor, current, target=ImageOrientation.UPRIGHT): 96 | return torch.rot90(tensor, ROT_K[(current, target)], dims=(-2, -1)) 97 | -------------------------------------------------------------------------------- 
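A short sketch exercising ImageMeasurementInfo from measurement.py together with the orientation helpers above. The import paths and intrinsics values are illustrative assumptions:

import torch
from cubifyanything.measurement import ImageMeasurementInfo      # assumed module paths
from cubifyanything.orientation import ImageOrientation, rotate_tensor

K = torch.tensor([[[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]]])
info = ImageMeasurementInfo(size=(640, 480), K=K)   # size appears to be (width, height)

half = info.rescale(0.5)                            # size -> (320, 240); fx, fy, cx, cy halved
upright = info.orient(ImageOrientation.LEFT, ImageOrientation.UPRIGHT)
# For the 90-degree cases, rotate_K swaps fx/fy and cx/cy (note the in-code TODO about
# using image_size for the principal point), and the reported size is swapped as well.

image = torch.rand(3, 480, 640)
image_upright = rotate_tensor(image, ImageOrientation.LEFT)      # -> shape (3, 640, 480)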
/cubifyanything/pos.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | import warnings 4 | 5 | from torch import nn 6 | from torch.nn import functional as F 7 | 8 | from math import log2, pi 9 | 10 | class PositionEmbeddingSine(nn.Module): 11 | """ 12 | This is a more standard version of the position embedding, very similar to the one 13 | used by the Attention is all you need paper, generalized to work on images. 14 | """ 15 | 16 | def __init__( 17 | self, num_pos_feats=64, temperature=10000, normalize=False, scale=None 18 | ): 19 | super().__init__() 20 | self.num_pos_feats = num_pos_feats 21 | self.temperature = temperature 22 | self.normalize = normalize 23 | if scale is not None and normalize is False: 24 | raise ValueError("normalize should be True if scale is passed") 25 | if scale is None: 26 | scale = 2 * math.pi 27 | self.scale = scale 28 | 29 | def forward(self, tensor_list, sensor): 30 | x = tensor_list.tensors 31 | mask = tensor_list.mask 32 | assert mask is not None 33 | not_mask = ~mask 34 | y_embed = not_mask.cumsum(1, dtype=torch.float32) 35 | x_embed = not_mask.cumsum(2, dtype=torch.float32) 36 | if self.normalize: 37 | eps = 1e-6 38 | y_embed = (y_embed - 0.5) / (y_embed[:, -1:, :] + eps) * self.scale 39 | x_embed = (x_embed - 0.5) / (x_embed[:, :, -1:] + eps) * self.scale 40 | else: 41 | y_embed = (y_embed - 0.5) * self.scale 42 | x_embed = (x_embed - 0.5) * self.scale 43 | 44 | dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device) 45 | with warnings.catch_warnings(): 46 | warnings.simplefilter("ignore") 47 | dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats) 48 | 49 | pos_x = x_embed[:, :, :, None] / dim_t 50 | pos_y = y_embed[:, :, :, None] / dim_t 51 | pos_x = torch.stack( 52 | (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4 53 | ).flatten(3) 54 | pos_y = torch.stack( 55 | (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4 56 | ).flatten(3) 57 | pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2) 58 | 59 | return pos 60 | 61 | def generate_rays( 62 | info, image_shape, noisy: bool = False 63 | ): 64 | camera_intrinsics = info.K[-1][None] 65 | batch_size, device, dtype = ( 66 | camera_intrinsics.shape[0], 67 | camera_intrinsics.device, 68 | camera_intrinsics.dtype, 69 | ) 70 | height, width = image_shape 71 | # Generate grid of pixel coordinates 72 | pixel_coords_x = torch.linspace(0, width - 1, width, device=device, dtype=dtype) 73 | pixel_coords_y = torch.linspace(0, height - 1, height, device=device, dtype=dtype) 74 | if noisy: 75 | pixel_coords_x += torch.rand_like(pixel_coords_x) - 0.5 76 | pixel_coords_y += torch.rand_like(pixel_coords_y) - 0.5 77 | pixel_coords = torch.stack( 78 | [pixel_coords_x.repeat(height, 1), pixel_coords_y.repeat(width, 1).t()], dim=2 79 | ) # (H, W, 2) 80 | pixel_coords = pixel_coords + 0.5 81 | 82 | # Handle radial distortion. 
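# (No distortion model is applied in this version; every ray is simply marked valid below.)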
83 | ray_is_valid = torch.ones((height, width), dtype=torch.bool, device=device) 84 | 85 | # Calculate ray directions 86 | intrinsics_inv = torch.eye(3, device=device).unsqueeze(0).repeat(batch_size, 1, 1) 87 | intrinsics_inv[:, 0, 0] = 1.0 / camera_intrinsics[:, 0, 0] 88 | intrinsics_inv[:, 1, 1] = 1.0 / camera_intrinsics[:, 1, 1] 89 | intrinsics_inv[:, 0, 2] = -camera_intrinsics[:, 0, 2] / camera_intrinsics[:, 0, 0] 90 | intrinsics_inv[:, 1, 2] = -camera_intrinsics[:, 1, 2] / camera_intrinsics[:, 1, 1] 91 | homogeneous_coords = torch.cat( 92 | [pixel_coords, torch.ones_like(pixel_coords[:, :, :1])], dim=2 93 | ) # (H, W, 3) 94 | 95 | ray_directions = torch.matmul( 96 | intrinsics_inv, homogeneous_coords.permute(2, 0, 1).flatten(-2)).view( 97 | 3, height, width).permute(1, 2, 0) # (3, H*W) 98 | 99 | ray_directions = F.normalize(ray_directions, dim=-1) # (B, 3, H*W) 100 | theta = torch.atan2(ray_directions[..., 0], ray_directions[..., -1]) 101 | phi = torch.acos(ray_directions[..., 1]) 102 | angles = torch.stack([theta, phi], dim=-1) 103 | 104 | # Ensure we set anything invalid to just 0? 105 | ray_directions[~ray_is_valid] = 0.0 106 | angles[~ray_is_valid] = 0.0 107 | 108 | return ray_directions, angles 109 | 110 | def generate_fourier_features( 111 | x: torch.Tensor, 112 | dim: int = 256, 113 | max_freq: int = 64, 114 | use_cos: bool = False, 115 | use_log: bool = False, 116 | cat_orig: bool = False, 117 | ): 118 | x_orig = x 119 | device, dtype, input_dim = x.device, x.dtype, x.shape[-1] 120 | num_bands = dim // (2 * input_dim) if use_cos else dim // input_dim 121 | 122 | if use_log: 123 | scales = 2.0 ** torch.linspace( 124 | 0.0, log2(max_freq), steps=num_bands, device=device, dtype=dtype 125 | ) 126 | else: 127 | scales = torch.linspace( 128 | 1.0, max_freq / 2, num_bands, device=device, dtype=dtype 129 | ) 130 | 131 | x = x.unsqueeze(-1) 132 | scales = scales[(*((None,) * (len(x.shape) - 1)), Ellipsis)] 133 | 134 | x = x * scales * pi 135 | x = torch.cat( 136 | ( 137 | [x.sin(), x.cos()] 138 | if use_cos 139 | else [ 140 | x.sin(), 141 | ] 142 | ), 143 | dim=-1, 144 | ) 145 | 146 | if cat_orig: 147 | raise NotImplementedError 148 | 149 | return x.flatten(3) 150 | 151 | # Adopted from UniDepth. I don't think this is necessary, but keeping until we re-train models. 152 | class CameraRayEmbedding(nn.Module): 153 | def __init__(self, dim): 154 | super().__init__() 155 | 156 | self.dim = dim 157 | self.proj = nn.Linear(255, self.dim) 158 | 159 | def forward(self, tensor_list, sensor): 160 | x = tensor_list.tensors 161 | 162 | feat_size = tensor_list.tensors.shape[-1] 163 | # Hard-coded stride. 164 | square_pad = feat_size * 16 165 | 166 | # Generate the rays for the original images. 
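# Each view's intrinsics yield a dense ray-direction map, padded to the square input size and later resized to the backbone's feature grid.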
167 | ray_dirs = [] 168 | for info_ in sensor["image"].info: 169 | ray_dirs_, angles_ = generate_rays(info_, (info_.size[1], info_.size[0])) 170 | ray_dirs_ = F.pad(ray_dirs_, (0, 0, 0, square_pad - ray_dirs_.shape[1], 0, square_pad - ray_dirs_.shape[0])) 171 | ray_dirs.append(ray_dirs_) 172 | 173 | ray_dirs = torch.stack(ray_dirs) 174 | 175 | rays_embedding = F.interpolate(ray_dirs.permute(0, 3, 1, 2), (feat_size, feat_size), mode="nearest").permute(0, 2, 3, 1) 176 | rays_embedding = F.normalize(rays_embedding, dim=-1) 177 | rays_embedding = generate_fourier_features( 178 | rays_embedding, 179 | dim=self.dim, 180 | max_freq=feat_size // 2, 181 | use_log=True, 182 | cat_orig=False, 183 | ) 184 | 185 | rays_embedding = self.proj(rays_embedding) 186 | return rays_embedding.permute(0, 3, 1, 2).contiguous() 187 | -------------------------------------------------------------------------------- /cubifyanything/preprocessor.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import copy 5 | import os 6 | import torch 7 | 8 | from cubifyanything.measurement import ( 9 | DepthMeasurementInfo, 10 | ImageMeasurementInfo) 11 | 12 | from cubifyanything.batching import ( 13 | Measurement, 14 | PosedImage, 15 | PosedDepth, 16 | BatchedSensors, 17 | Sensors) 18 | 19 | from typing import Dict, List 20 | 21 | IGNORE_KEYS = ["sensor_info", "__key__", "gt", "video_info", "meta"] 22 | 23 | def move_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor: 24 | try: 25 | return src.to(dst) 26 | except: 27 | return src.to(dst.device) 28 | 29 | def move_to_current_device(x, t): 30 | if isinstance(x, (list, tuple)): 31 | return [move_device_like(x_, t) for x_ in x] 32 | 33 | return move_device_like(x, t) 34 | 35 | def move_input_to_current_device(batched_input: Sensors, t: torch.Tensor): 36 | # Assume only two levels of nesting for now. 37 | return { name: { name_: move_to_current_device(m, t) for name_, m in s.items() } for name, s in batched_input.items() } 38 | 39 | class Augmentor(object): 40 | def __init__(self, measurement_keys=None): 41 | self.measurement_keys = measurement_keys 42 | 43 | def package(self, sample) -> Dict[str, Dict[str, Measurement]]: 44 | # Simply pack everything into "Packages" to make it more amenable to a training pipeline. 45 | # Essentially returns Dict[str, Dict[str, Measurement]]. 46 | # Make sure everything is contiguous. channels -> first. 47 | result = {} 48 | for sensor_name, sensor_data in sample.items(): 49 | if sensor_name in IGNORE_KEYS: 50 | continue 51 | 52 | if not isinstance(sensor_data, dict): 53 | continue 54 | 55 | sensor_result = {} 56 | sensor_info = copy.deepcopy(getattr(sample["sensor_info"], sensor_name)) 57 | for measurement_name, measurement in sensor_data.items(): 58 | measurement_key = os.path.join(sensor_name, measurement_name) 59 | if (self.measurement_keys is not None) and (measurement_key not in self.measurement_keys): 60 | # Make sure to delete from sensor info as well.
61 | if sensor_info.has(measurement_name): 62 | sensor_info.remove(measurement_name) 63 | 64 | continue 65 | 66 | measurement_info = getattr(sensor_info, measurement_name) 67 | if isinstance(measurement_info, DepthMeasurementInfo): 68 | sensor_result[measurement_name] = PosedDepth( 69 | sample[sensor_name][measurement_name][-1], 70 | measurement_info, 71 | sensor_info) 72 | elif isinstance(measurement_info, ImageMeasurementInfo): 73 | sensor_result[measurement_name] = PosedImage( 74 | sample[sensor_name][measurement_name][-1], 75 | measurement_info, 76 | sensor_info) 77 | 78 | # Don't include if empty. 79 | if sensor_result: 80 | result[sensor_name] = sensor_result 81 | 82 | return result 83 | 84 | class Preprocessor(object): 85 | def __init__(self, 86 | square_pad=[256, 384, 512, 640, 768, 896, 1024], 87 | size_divisibility=32, 88 | pixel_mean=[123.675, 116.28, 103.53], 89 | pixel_std=[58.395, 57.12, 57.375], 90 | device=None): 91 | self.square_pad = square_pad 92 | self.size_divisibility = size_divisibility 93 | self.pixel_mean = torch.tensor(pixel_mean).view(-1, 1, 1) 94 | self.pixel_std = torch.tensor(pixel_std).view(-1, 1, 1) 95 | self.device = device 96 | 97 | @staticmethod 98 | def standardize_depth_map(img, trunc_value=0.1): 99 | # Always do this on CPU! MPS has some surprising behavior. 100 | device = img.device 101 | img = img.cpu() 102 | img[img <= 0.0] = torch.nan 103 | 104 | sorted_img = torch.sort(torch.flatten(img))[0] 105 | # Remove nan, nan at the end of sort 106 | num_nan = sorted_img.isnan().sum() 107 | if num_nan > 0: 108 | sorted_img = sorted_img[:-num_nan] 109 | # Remove outliers 110 | trunc_img = sorted_img[int(trunc_value * len(sorted_img)): int((1 - trunc_value) * len(sorted_img))] 111 | if len(trunc_img) <= 1: 112 | # guard against no valid Jasper. 113 | trunc_mean = torch.tensor(0.0).to(img) 114 | trunc_std = torch.tensor(1.0).to(img) 115 | else: 116 | trunc_mean = trunc_img.mean() 117 | trunc_var = trunc_img.var() 118 | 119 | eps = 1e-2 120 | trunc_std = torch.sqrt(trunc_var + eps) 121 | 122 | # Replace nan by mean 123 | img = torch.nan_to_num(img, nan=trunc_mean) 124 | 125 | # Standardize 126 | img = (img - trunc_mean) / trunc_std 127 | 128 | # return the scale parameters for encoding. 129 | return img.to(device), torch.tensor([trunc_mean, trunc_std]).to(device) 130 | 131 | def normalize(self, batched_input: Sensors): 132 | # Happens in-place. 133 | for sensor_name, sensor in batched_input.items(): 134 | for measurement_name, measurement in sensor.items(): 135 | if measurement_name in ["features"]: 136 | continue 137 | 138 | if measurement.__orig_class__ in (PosedDepth,): 139 | measurement.data, scaling = Preprocessor.standardize_depth_map(measurement.data) 140 | measurement.info = measurement.info.normalize(scaling[None]) 141 | elif measurement.__orig_class__ in (PosedImage,): 142 | measurement.data = (measurement.data.float() - self.pixel_mean.to(measurement.data)) / self.pixel_std.to(measurement.data) 143 | 144 | return batched_input 145 | 146 | def batch(self, batched_inputs: List[Sensors]) -> List[BatchedSensors]: 147 | sensor_names = batched_inputs[0].keys() 148 | result = {} 149 | for sensor_name in sensor_names: 150 | measurement_names = batched_inputs[0][sensor_name].keys() 151 | sensor_result = {} 152 | for measurement_name in measurement_names: 153 | batched_measurements = [bi[sensor_name][measurement_name] for bi in batched_inputs] 154 | if measurement_name in ["features"]: 155 | # TODO! 
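# For now, precomputed features are taken from the first sample as-is rather than being batched.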
156 | sensor_result["features"] = batched_measurements[0] 157 | continue 158 | 159 | # Hacky way to pass some additional constraints. 160 | batching_kwargs = {} 161 | if batched_measurements[0].__orig_class__ in (PosedDepth,): 162 | # Very bad, but assume the PosedImage here gets processed first, so that 163 | # square_pad and rgb_size are assigned. 164 | rgb_to_depth_ratio = round(rgb_size[0] / batched_measurements[0].info.size[0]) 165 | if rgb_to_depth_ratio not in [1, 2, 4]: 166 | raise ValueError(f"Unsupported rgb -> depth ratio: {rgb_to_depth_ratio}") 167 | 168 | # note: square_pad should always be divisible by the given ratios: e.g. 1, 2, 4. 169 | batching_kwargs = dict( 170 | size_divisibility=self.size_divisibility, 171 | padding_constraints={ 172 | "size_divisibility": self.size_divisibility, 173 | "square_size": square_pad // rgb_to_depth_ratio 174 | }) 175 | elif batched_measurements[0].__orig_class__ in (PosedImage,): 176 | # Backbone sizes are computed w.r.t image. We may need 177 | # to adjust them to depth or other sensors with different sizes. 178 | square_pad = self.square_pad 179 | rgb_size = batched_measurements[0].info.size 180 | if isinstance(square_pad, (list,)): 181 | longest_edge = max([max(bm.info.size) for bm in batched_measurements]) 182 | square_pad = int(min([s for s in square_pad if s >= longest_edge])) 183 | 184 | batching_kwargs = dict( 185 | size_divisibility=self.size_divisibility, 186 | padding_constraints={ 187 | "size_divisibility": self.size_divisibility, 188 | "square_size": square_pad 189 | }) 190 | 191 | batched_measurements = Measurement.batch( 192 | batched_measurements, 193 | **batching_kwargs) 194 | 195 | sensor_result[measurement_name] = batched_measurements 196 | 197 | result[sensor_name] = sensor_result 198 | 199 | return result 200 | 201 | def __call__(self, batches): 202 | for batch in batches: 203 | if isinstance(batch, tuple): 204 | # Probably inference with GT. 205 | input_, gt_ = batch 206 | if self.device is not None: 207 | input_ = move_input_to_current_device(input_, self.device) 208 | 209 | yield self.preprocess([input_]), gt_ 210 | else: 211 | yield self.preprocess(batch) 212 | 213 | def preprocess(self, batched_inputs: List[Sensors]) -> List[Sensors]: 214 | batched_inputs = [self.normalize(bi) for bi in batched_inputs] 215 | 216 | return self.batch(batched_inputs) 217 | -------------------------------------------------------------------------------- /cubifyanything/sensor.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import numpy as np 5 | import torch 6 | import warnings 7 | 8 | from typing import Any, Dict, List, Tuple, Union 9 | 10 | from cubifyanything.measurement import BaseMeasurementInfo 11 | from cubifyanything.orientation import ImageOrientation, get_orientation, rotate_pose 12 | 13 | # Extends some of the ideas of D2's "Instances" to more broad sensors. 
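A small worked sketch of Preprocessor.standardize_depth_map from preprocessor.py above. The tensor shape and values are illustrative assumptions:

import torch
from cubifyanything.preprocessor import Preprocessor  # assumed module path

depth = torch.rand(1, 192, 256) * 5.0   # illustrative metric depth map
depth[0, :8] = 0.0                      # zero pixels are treated as invalid
whitened, params = Preprocessor.standardize_depth_map(depth)
trunc_mean, trunc_std = params.tolist()
# Invalid pixels are replaced by the truncated mean before whitening (so they end up near 0);
# `params` holds the (mean, std) pair that normalize() later attaches to the measurement info.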
14 | class SensorInfo(object): 15 | def __init__(self, **kwargs): 16 | self._measurements: Dict[str, Any] = {} 17 | self._other = {} 18 | self._meta_keys = [] 19 | 20 | for k, v in kwargs.items(): 21 | self.set(k, v) 22 | 23 | def __setattr__(self, name: str, val: Any) -> None: 24 | if name in ["_other", "_measurements", "_RT", "_meta_keys"]: 25 | super(SensorInfo, self).__setattr__(name, val) 26 | elif name.startswith("_"): 27 | self._other[name] = val 28 | else: 29 | self.set(name, val) 30 | 31 | def __getattr__(self, name: str) -> Any: 32 | if name == "_other": 33 | return self.__getattribute__("_other") 34 | 35 | if name.startswith("_"): 36 | if name in self._other: 37 | return self._other[name] 38 | 39 | return self.__getattribute__(name) 40 | 41 | if name not in self._measurements: 42 | raise AttributeError("Cannot find field '{}' in the given measurements!".format(name)) 43 | 44 | return self._measurements[name] 45 | 46 | def __getstate__(self): 47 | return {"_measurements": self._measurements, "_other": self._other, "_meta_keys": self._meta_keys} 48 | 49 | def __setstate__(self, s): 50 | self._measurements = s["_measurements"] 51 | self._other = s["_other"] 52 | self._meta_keys = s["_meta_keys"] 53 | 54 | @property 55 | def ts(self): 56 | if isinstance(self, PosedSensorInfo): 57 | return self._RT_meta.ts 58 | 59 | # TODO: Take first measurement and ask for ts? 60 | return None 61 | 62 | def translate(self, t): 63 | raise NotImplementedError 64 | 65 | def set(self, name: str, value: Any) -> None: 66 | """ 67 | Set the field named `name` to `value`. 68 | The length of `value` must be the number of instances, 69 | and must agree with other existing fields in this object. 70 | """ 71 | with warnings.catch_warnings(record=True): 72 | data_len = len(value) 73 | 74 | if len(self._measurements): 75 | assert ( 76 | len(self) == data_len 77 | ), "Adding a field of length {} to a measurement of length {}".format(data_len, len(self)) 78 | 79 | self._measurements[name] = value 80 | 81 | def has(self, name: str) -> bool: 82 | """ 83 | Returns: 84 | bool: whether the field called `name` exists. 85 | """ 86 | return name in self._measurements 87 | 88 | def remove(self, name: str) -> None: 89 | """ 90 | Remove the field called `name`. 91 | """ 92 | del self._measurements[name] 93 | 94 | def get(self, name: str) -> Any: 95 | """ 96 | Returns the field called `name`. 97 | """ 98 | return self._measurements[name] 99 | 100 | @classmethod 101 | def cat(self, sensor_list): 102 | # TODO: Flesh this out better. 103 | measurement_names = sensor_list[0].get_measurements().keys() 104 | measurements = {} 105 | 106 | for measurement_name in measurement_names: 107 | info_list = [getattr(sensor_list_, measurement_name) for sensor_list_ in sensor_list] 108 | measurements[measurement_name] = type(info_list[0]).cat(info_list) 109 | 110 | return type(sensor_list[0])(**measurements) 111 | 112 | def __len__(self) -> int: 113 | for v in self._measurements.values(): 114 | # use __len__ because len() has to be int and is not friendly to tracing 115 | return v.__len__() 116 | 117 | def get_measurements(self) -> Dict[str, Any]: 118 | """ 119 | Returns: 120 | dict: a dict which maps names (str) to data of the fields 121 | 122 | Modifying the returned dict will modify this instance. 
123 | """ 124 | # for now, only return subclasses of MeasurementInfo 125 | return { k: m for k, m in self._measurements.items() if isinstance(m, (MeasurementInfo,)) } 126 | 127 | def orient(self, current_orientation, target_orientation): 128 | new_sensor_info = type(self)() 129 | new_sensor_info._other = dict(self._other) 130 | 131 | # Save this for the ability to restore? 132 | new_sensor_info._original_orientation = current_orientation 133 | 134 | # One of these needs to be UPRIGHT for now. 135 | if (current_orientation != ImageOrientation.UPRIGHT) and (target_orientation != ImageOrientation.UPRIGHT): 136 | raise NotImplementedError 137 | 138 | for measurement_name, measurement in self._measurements.items(): 139 | # TODO: fix this as an _other_? 140 | if measurement_name == "RT": 141 | new_sensor_info.RT = rotate_pose(self.RT, current_orientation, target=target_orientation) 142 | elif measurement_name == "ts": 143 | new_sensor_info.ts = self.ts.clone() 144 | elif isinstance(measurement, BaseMeasurementInfo): 145 | setattr(new_sensor_info, measurement_name, measurement.orient(current_orientation, target_orientation)) 146 | 147 | if isinstance(self, PosedSensorInfo): 148 | new_sensor_info._RT = self._RT.clone() 149 | 150 | # Make sure we continue to use the override. 151 | if hasattr(self, "_orientation"): 152 | setattr(new_sensor_info, "_orientation", target_orientation) 153 | 154 | return new_sensor_info 155 | 156 | def to(self, *args: Any, **kwargs: Any) -> "SensorInfo": 157 | ret = type(self)() 158 | for k, v in self._measurements.items(): 159 | if hasattr(v, "to"): 160 | v = v.to(*args, **kwargs) 161 | ret.set(k, v) 162 | 163 | ret._other = dict(self._other) 164 | for meta_key in self._meta_keys: 165 | ret._meta_keys.append(meta_key) 166 | setattr(ret, meta_key, getattr(self, meta_key)) 167 | 168 | return ret 169 | 170 | # TODO: this should enforce "RT" (i.e. pose) existing. 171 | class PosedSensorInfo(SensorInfo): 172 | @property 173 | def orientation(self): 174 | # Allow override. 175 | if hasattr(self, "_orientation"): 176 | return self._orientation 177 | 178 | # for now, assume we're dealing with a single orientation. majority vote. 
179 | if len(self.RT) == 1: 180 | return ImageOrientation(get_orientation(self.RT)[-1].item()) 181 | 182 | orientations = get_orientation(self.RT).cpu().numpy() 183 | unique_orientations, counts = np.unique(orientations, return_counts=True) 184 | most_frequent_orientation = unique_orientations[np.argmax(counts)] 185 | 186 | return ImageOrientation(most_frequent_orientation) 187 | 188 | @property 189 | def device(self): 190 | return self.RT.device 191 | 192 | def set(self, name: str, value: Any) -> None: 193 | if name == "RT": 194 | # only write if we don't already have 195 | if not hasattr(self, "_RT"): 196 | self._RT = value.clone() 197 | 198 | super(PosedSensorInfo, self).set(name, value) 199 | 200 | def apply_transform(self, transform_4x4): 201 | new_sensor_info = PosedSensorInfo() 202 | new_sensor_info._RT = self._RT.clone() 203 | new_sensor_info._other = dict(self._other) 204 | 205 | for measurement_name, measurement in self._measurements.items(): 206 | if measurement_name == "RT": 207 | new_sensor_info.RT = transform_4x4 @ self.RT 208 | elif measurement_name == "ts": 209 | new_sensor_info.ts = self.ts.clone() 210 | elif hasattr(measurement, "apply_transform"): 211 | setattr(new_sensor_info, measurement_name, measurement.apply_transform(transform_4x4)) 212 | else: 213 | 214 | setattr(new_sensor_info, measurement_name, measurement) 215 | 216 | return new_sensor_info 217 | 218 | def translate(self, t): 219 | translation_4x4 = torch.eye(4)[None, ...].to(t.device) 220 | translation_4x4[:, :3, -1] = t 221 | 222 | return self.apply_transform(translation_4x4) 223 | 224 | @classmethod 225 | def cat(cls, sensor_list): 226 | new_sensor_info = SensorInfo.cat(sensor_list) 227 | new_sensor_info.RT = torch.cat([sensor_info.RT for sensor_info in sensor_list]) 228 | 229 | return new_sensor_info 230 | 231 | class SensorArrayInfo(object): 232 | def __init__(self, **kwargs: Any): 233 | self._sensors: Dict[str, SensorInfo] = {} 234 | self._rel_transforms: Dict[Tuple[str, str], torch.Tensor] = {} 235 | 236 | for k, v in kwargs.items(): 237 | self.set(k, v) 238 | 239 | def __setattr__(self, name: str, val: Any) -> None: 240 | if name.startswith("_"): 241 | super().__setattr__(name, val) 242 | else: 243 | self.set(name, val) 244 | 245 | def __getattr__(self, name: str) -> Any: 246 | if name == "_sensors" or name not in self._sensors: 247 | raise AttributeError("Cannot find field '{}' in the given sensors!".format(name)) 248 | 249 | return self._sensors[name] 250 | 251 | def __getstate__(self): 252 | return self._sensors 253 | 254 | def __setstate__(self, d): 255 | self._sensors = d 256 | 257 | def set(self, name: str, value: Any) -> None: 258 | self._sensors[name] = value 259 | 260 | def has(self, name: str) -> bool: 261 | """ 262 | Returns: 263 | bool: whether the field called `name` exists. 264 | """ 265 | return name in self._sensors 266 | 267 | def remove(self, name: str) -> None: 268 | """ 269 | Remove the field called `name`. 270 | """ 271 | del self._sensors[name] 272 | 273 | def get(self, name: str) -> Any: 274 | """ 275 | Returns the field called `name`. 276 | """ 277 | return self._sensors[name] 278 | 279 | # This is not really always a good idea because sensor's don't _have_ to 280 | # have the same length (although they often do). 
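# The value below is simply the length of the first sensor in the array.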
281 | def uniform_length(self) -> int: 282 | for v in self._sensors.values(): 283 | # use __len__ because len() has to be int and is not friendly to tracing 284 | return v.__len__() 285 | 286 | def to(self, *args: Any, **kwargs: Any) -> "SensorArrayInfo": 287 | ret = type(self)() 288 | for k, v in self._sensors.items(): 289 | if hasattr(v, "to"): 290 | v = v.to(*args, **kwargs) 291 | ret.set(k, v) 292 | 293 | return ret 294 | 295 | -------------------------------------------------------------------------------- /cubifyanything/transforms.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | # These functions are taken from PyTorch3D. 4 | 5 | def _axis_angle_rotation(axis: str, angle: torch.Tensor) -> torch.Tensor: 6 | """ 7 | Return the rotation matrices for one of the rotations about an axis 8 | of which Euler angles describe, for each value of the angle given. 9 | 10 | Args: 11 | axis: Axis label "X" or "Y or "Z". 12 | angle: any shape tensor of Euler angles in radians 13 | 14 | Returns: 15 | Rotation matrices as tensor of shape (..., 3, 3). 16 | """ 17 | 18 | cos = torch.cos(angle) 19 | sin = torch.sin(angle) 20 | one = torch.ones_like(angle) 21 | zero = torch.zeros_like(angle) 22 | 23 | if axis == "X": 24 | R_flat = (one, zero, zero, zero, cos, -sin, zero, sin, cos) 25 | elif axis == "Y": 26 | R_flat = (cos, zero, sin, zero, one, zero, -sin, zero, cos) 27 | elif axis == "Z": 28 | R_flat = (cos, -sin, zero, sin, cos, zero, zero, zero, one) 29 | else: 30 | raise ValueError("letter must be either X, Y or Z.") 31 | 32 | return torch.stack(R_flat, -1).reshape(angle.shape + (3, 3)) 33 | 34 | def euler_angles_to_matrix(euler_angles: torch.Tensor, convention: str) -> torch.Tensor: 35 | """ 36 | Convert rotations given as Euler angles in radians to rotation matrices. 37 | 38 | Args: 39 | euler_angles: Euler angles in radians as tensor of shape (..., 3). 40 | convention: Convention string of three uppercase letters from 41 | {"X", "Y", and "Z"}. 42 | 43 | Returns: 44 | Rotation matrices as tensor of shape (..., 3, 3). 45 | """ 46 | if euler_angles.dim() == 0 or euler_angles.shape[-1] != 3: 47 | raise ValueError("Invalid input euler angles.") 48 | if len(convention) != 3: 49 | raise ValueError("Convention must have 3 letters.") 50 | if convention[1] in (convention[0], convention[2]): 51 | raise ValueError(f"Invalid convention {convention}.") 52 | for letter in convention: 53 | if letter not in ("X", "Y", "Z"): 54 | raise ValueError(f"Invalid letter {letter} in convention string.") 55 | matrices = [ 56 | _axis_angle_rotation(c, e) 57 | for c, e in zip(convention, torch.unbind(euler_angles, -1)) 58 | ] 59 | # return functools.reduce(torch.matmul, matrices) 60 | return torch.matmul(torch.matmul(matrices[0], matrices[1]), matrices[2]) 61 | -------------------------------------------------------------------------------- /cubifyanything/vit.py: -------------------------------------------------------------------------------- 1 | # This is a self-contained version of Detectron2's ViT with additional modifications (only meant for inference). 2 | import math 3 | import numpy as np 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | 8 | from timm.layers import Mlp 9 | from typing import Union 10 | 11 | from cubifyanything.batching import BatchedPosedSensor 12 | 13 | __all__ = ["ViT"] 14 | 15 | # NOTE: We replicate some functions here which need modifications for tracing. 
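A brief sketch of building a PosedSensorInfo from sensor.py above, reusing ImageMeasurementInfo. Import paths and values are illustrative assumptions:

import torch
from cubifyanything.measurement import ImageMeasurementInfo  # assumed module paths
from cubifyanything.sensor import PosedSensorInfo

K = torch.tensor([[[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]]])
rgb_info = ImageMeasurementInfo(size=(640, 480), K=K)
sensor = PosedSensorInfo(RT=torch.eye(4)[None], image=rgb_info)

print(sensor.orientation)                                  # ImageOrientation.UPRIGHT for an identity pose
shifted = sensor.translate(torch.tensor([0.0, 0.0, 1.0]))  # returns a new, transformed PosedSensorInfo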
16 | def window_partition(x, window_size): 17 | """ 18 | Partition into non-overlapping windows with padding if needed. 19 | Args: 20 | x (tensor): input tokens with [B, H, W, C]. 21 | window_size (int): window size. 22 | 23 | Returns: 24 | windows: windows after partition with [B * num_windows, window_size, window_size, C]. 25 | (Hp, Wp): padded height and width before partition 26 | """ 27 | B, H, W, C = x.shape 28 | 29 | pad_h = (window_size - H % window_size) % window_size 30 | pad_w = (window_size - W % window_size) % window_size 31 | 32 | x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h)) 33 | Hp, Wp = H + pad_h, W + pad_w 34 | 35 | x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C) 36 | windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C) 37 | return windows, (Hp, Wp) 38 | 39 | def window_unpartition(windows, window_size, pad_hw, hw): 40 | """ 41 | Window unpartition into original sequences and removing padding. 42 | Args: 43 | x (tensor): input tokens with [B * num_windows, window_size, window_size, C]. 44 | window_size (int): window size. 45 | pad_hw (Tuple): padded height and width (Hp, Wp). 46 | hw (Tuple): original height and width (H, W) before padding. 47 | 48 | Returns: 49 | x: unpartitioned sequences with [B, H, W, C]. 50 | """ 51 | Hp, Wp = pad_hw 52 | H, W = hw 53 | B = windows.shape[0] // (Hp * Wp // window_size // window_size) 54 | x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1) 55 | x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1) 56 | x = x[:, :H, :W, :].contiguous() 57 | 58 | return x 59 | 60 | def get_abs_pos(abs_pos, has_cls_token, hw): 61 | """ 62 | Calculate absolute positional embeddings. If needed, resize embeddings and remove cls_token 63 | dimension for the original embeddings. 64 | Args: 65 | abs_pos (Tensor): absolute positional embeddings with (1, num_position, C). 66 | has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token. 67 | hw (Tuple): size of input image tokens. 68 | 69 | Returns: 70 | Absolute positional embeddings after processing with shape (1, H, W, C) 71 | """ 72 | h, w = hw 73 | if has_cls_token: 74 | abs_pos = abs_pos[:, 1:] 75 | xy_num = abs_pos.shape[1] 76 | size = int(math.sqrt(xy_num)) 77 | assert size * size == xy_num 78 | 79 | new_abs_pos = F.interpolate( 80 | abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2), 81 | size=(h, w), 82 | mode="bicubic", 83 | align_corners=False, 84 | ) 85 | 86 | return new_abs_pos.permute(0, 2, 3, 1) 87 | 88 | class LayerScale(nn.Module): 89 | def __init__( 90 | self, 91 | dim: int, 92 | init_values: Union[float, torch.Tensor] = 1e-5, 93 | inplace: bool = False, 94 | ) -> None: 95 | super().__init__() 96 | self.inplace = inplace 97 | self.gamma = nn.Parameter(init_values * torch.ones(dim)) 98 | 99 | def forward(self, x: torch.Tensor) -> torch.Tensor: 100 | return x.mul_(self.gamma) if self.inplace else x * self.gamma 101 | 102 | class PatchEmbed(nn.Module): 103 | """ 104 | Image to Patch Embedding. 105 | """ 106 | 107 | def __init__( 108 | self, kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768, bias=True 109 | ): 110 | """ 111 | Args: 112 | kernel_size (Tuple): kernel size of the projection layer. 113 | stride (Tuple): stride of the projection layer. 114 | padding (Tuple): padding size of the projection layer. 115 | in_chans (int): Number of input image channels. 116 | embed_dim (int): embed_dim (int): Patch embedding dimension. 
117 | """ 118 | super().__init__() 119 | 120 | self.proj = nn.Conv2d( 121 | in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding, bias=bias 122 | ) 123 | 124 | def forward(self, x): 125 | x = self.proj(x) 126 | # B C H W -> B H W C 127 | x = x.permute(0, 2, 3, 1) 128 | return x 129 | 130 | class Attention(nn.Module): 131 | """Multi-head Attention block with relative position embeddings.""" 132 | 133 | def __init__( 134 | self, 135 | dim, 136 | num_heads=8, 137 | qkv_bias=True, 138 | proj_bias=True, 139 | use_rel_pos=False, 140 | rel_pos_zero_init=True, 141 | input_size=None, 142 | depth_modality=False, 143 | depth_input_size=None, 144 | ): 145 | """ 146 | Args: 147 | dim (int): Number of input channels. 148 | num_heads (int): Number of attention heads. 149 | qkv_bias (bool: If True, add a learnable bias to query, key, value. 150 | rel_pos (bool): If True, add relative positional embeddings to the attention map. 151 | rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. 152 | input_size (int or None): Input resolution for calculating the relative positional 153 | parameter size. 154 | """ 155 | super().__init__() 156 | self.num_heads = num_heads 157 | head_dim = dim // num_heads 158 | self.scale = head_dim**-0.5 159 | 160 | self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias) 161 | self.proj = nn.Linear(dim, dim, bias=proj_bias) 162 | 163 | self.use_rel_pos = use_rel_pos 164 | if self.use_rel_pos: 165 | # Not supported. 166 | raise NotImplementedError 167 | 168 | self.depth_modality = depth_modality 169 | 170 | def forward(self, x, depth=None): 171 | B, H, W, _ = x.shape 172 | # qkv with shape (3, B, nHead, H * W, C) 173 | qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4) 174 | 175 | # q, k, v with shape (B * nHead, H * W, C) 176 | q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0) 177 | 178 | if self.depth_modality and (depth is not None): 179 | B, H_d, W_d, _ = depth.shape 180 | 181 | # qkv with shape (3, B, nHead, H * W, C) 182 | qkv_depth = self.qkv(depth).reshape(B, H_d * W_d, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4) 183 | 184 | # q, k, v with shape (B * nHead, H * W, C) 185 | q_d, k_d, v_d = qkv_depth.reshape(3, B * self.num_heads, H_d * W_d, -1).unbind(0) 186 | q, k, v = torch.cat((q, q_d), dim=1), torch.cat((k, k_d), dim=1), torch.cat((v, v_d), dim=1) 187 | 188 | # presumably, concatenate q, k. split (and then reconcatenate) attn. 
189 | 190 | attn = (q * self.scale) @ k.transpose(-2, -1) 191 | if self.depth_modality and (depth is not None): 192 | attn, attn_d = torch.split(attn, (H * W, H_d * W_d), dim=1) 193 | 194 | attn = attn.softmax(dim=-1) 195 | x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1) 196 | 197 | if self.depth_modality and (depth is not None): 198 | attn_d = attn_d.softmax(dim=-1) 199 | depth = (attn_d @ v).view(B, self.num_heads, H_d, W_d, -1).permute(0, 2, 3, 1, 4).reshape(B, H_d, W_d, -1) 200 | depth = self.proj(depth) 201 | 202 | x = self.proj(x) 203 | return x, depth 204 | 205 | DEPTH_WINDOW_SIZES = [4, 8, 16] 206 | class Block(nn.Module): 207 | """Transformer blocks with support of window attention and residual propagation blocks""" 208 | 209 | def __init__( 210 | self, 211 | dim, 212 | num_heads, 213 | mlp_ratio=4.0, 214 | qkv_bias=True, 215 | proj_bias=True, 216 | mlp_bias=True, 217 | norm_layer=nn.LayerNorm, 218 | act_layer=nn.GELU, 219 | use_rel_pos=False, 220 | rel_pos_zero_init=True, 221 | window_size=0, 222 | use_residual_block=False, 223 | input_size=None, 224 | depth_modality=False, 225 | depth_window_size=0, 226 | layer_scale=False 227 | ): 228 | """ 229 | Args: 230 | dim (int): Number of input channels. 231 | num_heads (int): Number of attention heads in each ViT block. 232 | mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. 233 | qkv_bias (bool): If True, add a learnable bias to query, key, value. 234 | norm_layer (nn.Module): Normalization layer. 235 | act_layer (nn.Module): Activation layer. 236 | use_rel_pos (bool): If True, add relative positional embeddings to the attention map. 237 | rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. 238 | window_size (int): Window size for window attention blocks. If it equals 0, then not 239 | use window attention. 240 | use_residual_block (bool): If True, use a residual block after the MLP block. 241 | input_size (int or None): Input resolution for calculating the relative positional 242 | parameter size. 243 | """ 244 | super().__init__() 245 | 246 | if depth_modality and (depth_window_size == 0): 247 | raise ValueError("unsupported") 248 | 249 | self.norm1 = norm_layer(dim) 250 | self.attn = Attention( 251 | dim, 252 | num_heads=num_heads, 253 | qkv_bias=qkv_bias, 254 | proj_bias=proj_bias, 255 | use_rel_pos=use_rel_pos, 256 | rel_pos_zero_init=rel_pos_zero_init, 257 | input_size=input_size if window_size == 0 else (window_size, window_size), 258 | depth_modality=depth_modality, 259 | depth_input_size=(depth_window_size, depth_window_size) if depth_modality else None, 260 | ) 261 | 262 | self.ls1 = None 263 | self.ls2 = None 264 | 265 | if layer_scale: 266 | self.ls1 = LayerScale(dim, 1.) 267 | self.ls2 = LayerScale(dim, 1.) 
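# Note: LayerScale is initialized at 1.0 here (an identity scaling), rather than the class default of 1e-5.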
268 | 269 | self.depth_modality = depth_modality 270 | 271 | self.norm2 = norm_layer(dim) 272 | mlp_hidden_dim = int(dim * mlp_ratio) 273 | 274 | self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, bias=mlp_bias) 275 | self.drop_path = nn.Identity() 276 | 277 | self.window_size = window_size 278 | self.depth_window_size = depth_window_size 279 | 280 | def forward(self, x, depth=None): 281 | shortcut = x 282 | 283 | x = self.norm1(x) 284 | # Window partition 285 | if self.window_size > 0: 286 | H, W = x.shape[1], x.shape[2] 287 | x, pad_hw = window_partition(x, self.window_size) 288 | 289 | if self.depth_modality and (depth is not None): 290 | shortcut_depth = depth 291 | depth = self.norm1(depth) 292 | 293 | H_depth, W_depth = depth.shape[1], depth.shape[2] 294 | 295 | # Aggressive checking for now. 296 | depth_window_size = self.depth_window_size or (self.window_size // (H / H_depth)) 297 | if isinstance(depth_window_size, torch.Tensor): 298 | depth_window_size = depth_window_size.int() 299 | if not depth_window_size.item() in DEPTH_WINDOW_SIZES: 300 | raise ValueError(f"Unexpected window size {depth_window_size}") 301 | else: 302 | depth_window_size = int(depth_window_size) 303 | if not depth_window_size in DEPTH_WINDOW_SIZES: 304 | raise ValueError(f"Unexpected window size {depth_window_size}") 305 | 306 | # if depth_window_size is not given, dynamically compute it based on the RGB window size and relative scale. 307 | depth, pad_hw_depth = window_partition(depth, depth_window_size) 308 | 309 | x, depth = self.attn(x, depth=depth) 310 | 311 | if self.depth_modality and (depth is not None): 312 | if self.window_size > 0: 313 | depth = window_unpartition(depth, depth_window_size, pad_hw_depth, (H_depth, W_depth)) 314 | 315 | # Reverse window partition 316 | if self.window_size > 0: 317 | x = window_unpartition(x, self.window_size, pad_hw, (H, W)) 318 | 319 | if self.ls1 is not None: 320 | x = self.ls1(x) 321 | if self.depth_modality and (depth is not None): 322 | depth = self.ls1(depth) 323 | 324 | x = shortcut + self.drop_path(x) 325 | shortcut = x 326 | x = self.mlp(self.norm2(x)) 327 | 328 | if self.ls2 is not None: 329 | x = self.ls2(x) 330 | 331 | x = shortcut + self.drop_path(x) 332 | 333 | if self.depth_modality and (depth is not None): 334 | depth = shortcut_depth + self.drop_path(depth) 335 | shortcut_depth = depth 336 | depth = self.mlp(self.norm2(depth)) 337 | if self.ls2 is not None: 338 | depth = self.ls2(depth) 339 | 340 | depth = shortcut_depth + self.drop_path(depth) 341 | 342 | return x, depth 343 | 344 | class ViT(nn.Module): 345 | """ 346 | This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`. 
347 | "Exploring Plain Vision Transformer Backbones for Object Detection", 348 | https://arxiv.org/abs/2203.16527 349 | """ 350 | 351 | def __init__( 352 | self, 353 | img_size=None, 354 | patch_size=16, 355 | in_chans=3, 356 | embed_dim=768, 357 | depth=12, 358 | num_heads=12, 359 | mlp_ratio=4.0, 360 | qkv_bias=True, 361 | proj_bias=True, 362 | mlp_bias=True, 363 | patch_embed_bias=True, 364 | drop_path_rate=0.0, 365 | norm_layer=nn.LayerNorm, 366 | act_layer=nn.GELU, 367 | gated_mlp=False, 368 | use_abs_pos=True, 369 | use_rel_pos=False, 370 | rel_pos_zero_init=True, 371 | window_size=0, 372 | window_block_indexes=(), 373 | residual_block_indexes=(), 374 | use_act_checkpoint=False, 375 | pretrain_img_size=224, 376 | pretrain_use_cls_token=True, 377 | out_feature="last_feat", 378 | depth_modality=False, 379 | depth_window_size=0, 380 | encoder_norm=False, 381 | layer_scale=False, 382 | image_name="image", 383 | depth_name="depth" 384 | ): 385 | """ 386 | Args: 387 | img_size (int): Input image size. 388 | patch_size (int): Patch size. 389 | in_chans (int): Number of input image channels. 390 | embed_dim (int): Patch embedding dimension. 391 | depth (int): Depth of ViT. 392 | num_heads (int): Number of attention heads in each ViT block. 393 | mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. 394 | qkv_bias (bool): If True, add a learnable bias to query, key, value. 395 | drop_path_rate (float): Stochastic depth rate. 396 | norm_layer (nn.Module): Normalization layer. 397 | act_layer (nn.Module): Activation layer. 398 | use_abs_pos (bool): If True, use absolute positional embeddings. 399 | use_rel_pos (bool): If True, add relative positional embeddings to the attention map. 400 | rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. 401 | window_size (int): Window size for window attention blocks. 402 | window_block_indexes (list): Indexes for blocks using window attention. 403 | residual_block_indexes (list): Indexes for blocks using conv propagation. 404 | use_act_checkpoint (bool): If True, use activation checkpointing. 405 | pretrain_img_size (int): input image size for pretraining models. 406 | pretrain_use_cls_token (bool): If True, pretrainig models use class token. 407 | out_feature (str): name of the feature from the last block. 408 | """ 409 | super().__init__() 410 | self.pretrain_use_cls_token = pretrain_use_cls_token 411 | self.depth_modality = depth_modality 412 | 413 | self.image_name = image_name 414 | self.depth_name = depth_name 415 | 416 | self.patch_embed = PatchEmbed( 417 | kernel_size=(patch_size, patch_size), 418 | stride=(patch_size, patch_size), 419 | in_chans=in_chans, 420 | embed_dim=embed_dim, 421 | bias=patch_embed_bias, 422 | ) 423 | 424 | if use_abs_pos: 425 | # Initialize absolute positional embedding with pretrain image size. 426 | num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size) 427 | num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches 428 | self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim)) 429 | nn.init.trunc_normal_(self.pos_embed, std=0.02) 430 | else: 431 | self.pos_embed = None 432 | 433 | self.pos_embed_depth = None 434 | if self.depth_modality: 435 | self.patch_embed_depth = PatchEmbed( 436 | kernel_size=(16, 16), 437 | stride=(16, 16), 438 | in_chans=1, 439 | embed_dim=embed_dim, 440 | ) 441 | 442 | if use_abs_pos: 443 | # note, depth gets its own pos embed. 
444 | # Initialize absolute positional embedding with pretrain image size. 445 | # at some point, this size may differ from RGB's size. 446 | num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size) 447 | num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches 448 | self.pos_embed_depth = nn.Parameter(torch.zeros(1, num_positions, embed_dim)) 449 | 450 | self.blocks = nn.ModuleList() 451 | for i in range(depth): 452 | block = Block( 453 | dim=embed_dim, 454 | num_heads=num_heads, 455 | mlp_ratio=mlp_ratio, 456 | qkv_bias=qkv_bias, 457 | proj_bias=proj_bias, 458 | mlp_bias=mlp_bias, 459 | norm_layer=norm_layer, 460 | act_layer=act_layer, 461 | use_rel_pos=use_rel_pos, 462 | rel_pos_zero_init=rel_pos_zero_init, 463 | window_size=window_size if i in window_block_indexes else 0, 464 | use_residual_block=i in residual_block_indexes, 465 | input_size=img_size, 466 | depth_modality=depth_modality and (i in window_block_indexes), # (for now, only attend to depth if windowing) 467 | depth_window_size=depth_window_size if i in window_block_indexes else 0, 468 | layer_scale=layer_scale 469 | ) 470 | 471 | self.blocks.append(block) 472 | 473 | self.encoder_norm = norm_layer(embed_dim) if encoder_norm else nn.Identity() 474 | 475 | self._out_feature_channels = {out_feature: embed_dim} 476 | self._out_feature_strides = {out_feature: patch_size} 477 | self._out_features = [out_feature] 478 | self.window_block_indexes = window_block_indexes 479 | 480 | self.drop_path = nn.Identity() 481 | 482 | self._square_pad = [256, 384, 512, 640, 768, 896, 1024, 1280] 483 | 484 | @property 485 | def num_channels(self): 486 | return list(self._out_feature_channels.values()) 487 | 488 | @property 489 | def size_divisibility(self): 490 | return next(iter(self._out_feature_strides.values())) 491 | 492 | def forward(self, s: BatchedPosedSensor): 493 | x = s[self.image_name].data.tensor 494 | image_shape = (x.shape[2], x.shape[3]) 495 | x = self.patch_embed(x) 496 | if self.pos_embed is not None: 497 | x = x + get_abs_pos(self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])) 498 | 499 | has_depth = self.depth_name in s 500 | has_depth_dropped = self.depth_modality and not has_depth 501 | 502 | if self.depth_modality: 503 | depth = s[self.depth_name].data.tensor[:, None] 504 | depth = self.patch_embed_depth(depth) 505 | if self.pos_embed_depth is not None: 506 | depth = depth + get_abs_pos( 507 | self.pos_embed_depth, self.pretrain_use_cls_token, (depth.shape[1], depth.shape[2])) 508 | else: 509 | depth = None 510 | 511 | for i, blk in enumerate(self.blocks): 512 | if blk.depth_modality and has_depth: 513 | x, depth = blk(x, depth=depth) 514 | else: 515 | x, *_ = blk(x) 516 | 517 | x = self.encoder_norm(x) 518 | 519 | outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)} 520 | return outputs 521 | 522 | -------------------------------------------------------------------------------- /data/LICENSE_DATA: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial-NoDerivatives 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. 
Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 58 | International Public License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial-NoDerivatives 4.0 International Public 63 | License ("Public License"). To the extent this Public License may be 64 | interpreted as a contract, You are granted the Licensed Rights in 65 | consideration of Your acceptance of these terms and conditions, and the 66 | Licensor grants You such rights in consideration of benefits the 67 | Licensor receives from making the Licensed Material available under 68 | these terms and conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. 
Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Copyright and Similar Rights means copyright and/or similar rights 84 | closely related to copyright including, without limitation, 85 | performance, broadcast, sound recording, and Sui Generis Database 86 | Rights, without regard to how the rights are labeled or 87 | categorized. For purposes of this Public License, the rights 88 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 89 | Rights. 90 | 91 | c. Effective Technological Measures means those measures that, in the 92 | absence of proper authority, may not be circumvented under laws 93 | fulfilling obligations under Article 11 of the WIPO Copyright 94 | Treaty adopted on December 20, 1996, and/or similar international 95 | agreements. 96 | 97 | d. Exceptions and Limitations means fair use, fair dealing, and/or 98 | any other exception or limitation to Copyright and Similar Rights 99 | that applies to Your use of the Licensed Material. 100 | 101 | e. Licensed Material means the artistic or literary work, database, 102 | or other material to which the Licensor applied this Public 103 | License. 104 | 105 | f. Licensed Rights means the rights granted to You subject to the 106 | terms and conditions of this Public License, which are limited to 107 | all Copyright and Similar Rights that apply to Your use of the 108 | Licensed Material and that the Licensor has authority to license. 109 | 110 | g. Licensor means the individual(s) or entity(ies) granting rights 111 | under this Public License. 112 | 113 | h. NonCommercial means not primarily intended for or directed towards 114 | commercial advantage or monetary compensation. For purposes of 115 | this Public License, the exchange of the Licensed Material for 116 | other material subject to Copyright and Similar Rights by digital 117 | file-sharing or similar means is NonCommercial provided there is 118 | no payment of monetary compensation in connection with the 119 | exchange. 120 | 121 | i. Share means to provide material to the public by any means or 122 | process that requires permission under the Licensed Rights, such 123 | as reproduction, public display, public performance, distribution, 124 | dissemination, communication, or importation, and to make material 125 | available to the public including in ways that members of the 126 | public may access the material from a place and at a time 127 | individually chosen by them. 128 | 129 | j. Sui Generis Database Rights means rights other than copyright 130 | resulting from Directive 96/9/EC of the European Parliament and of 131 | the Council of 11 March 1996 on the legal protection of databases, 132 | as amended and/or succeeded, as well as other essentially 133 | equivalent rights anywhere in the world. 134 | 135 | k. You means the individual or entity exercising the Licensed Rights 136 | under this Public License. Your has a corresponding meaning. 137 | 138 | 139 | Section 2 -- Scope. 140 | 141 | a. License grant. 
142 | 143 | 1. Subject to the terms and conditions of this Public License, 144 | the Licensor hereby grants You a worldwide, royalty-free, 145 | non-sublicensable, non-exclusive, irrevocable license to 146 | exercise the Licensed Rights in the Licensed Material to: 147 | 148 | a. reproduce and Share the Licensed Material, in whole or 149 | in part, for NonCommercial purposes only; and 150 | 151 | b. produce and reproduce, but not Share, Adapted Material 152 | for NonCommercial purposes only. 153 | 154 | 2. Exceptions and Limitations. For the avoidance of doubt, where 155 | Exceptions and Limitations apply to Your use, this Public 156 | License does not apply, and You do not need to comply with 157 | its terms and conditions. 158 | 159 | 3. Term. The term of this Public License is specified in Section 160 | 6(a). 161 | 162 | 4. Media and formats; technical modifications allowed. The 163 | Licensor authorizes You to exercise the Licensed Rights in 164 | all media and formats whether now known or hereafter created, 165 | and to make technical modifications necessary to do so. The 166 | Licensor waives and/or agrees not to assert any right or 167 | authority to forbid You from making technical modifications 168 | necessary to exercise the Licensed Rights, including 169 | technical modifications necessary to circumvent Effective 170 | Technological Measures. For purposes of this Public License, 171 | simply making modifications authorized by this Section 2(a) 172 | (4) never produces Adapted Material. 173 | 174 | 5. Downstream recipients. 175 | 176 | a. Offer from the Licensor -- Licensed Material. Every 177 | recipient of the Licensed Material automatically 178 | receives an offer from the Licensor to exercise the 179 | Licensed Rights under the terms and conditions of this 180 | Public License. 181 | 182 | b. No downstream restrictions. You may not offer or impose 183 | any additional or different terms or conditions on, or 184 | apply any Effective Technological Measures to, the 185 | Licensed Material if doing so restricts exercise of the 186 | Licensed Rights by any recipient of the Licensed 187 | Material. 188 | 189 | 6. No endorsement. Nothing in this Public License constitutes or 190 | may be construed as permission to assert or imply that You 191 | are, or that Your use of the Licensed Material is, connected 192 | with, or sponsored, endorsed, or granted official status by, 193 | the Licensor or others designated to receive attribution as 194 | provided in Section 3(a)(1)(A)(i). 195 | 196 | b. Other rights. 197 | 198 | 1. Moral rights, such as the right of integrity, are not 199 | licensed under this Public License, nor are publicity, 200 | privacy, and/or other similar personality rights; however, to 201 | the extent possible, the Licensor waives and/or agrees not to 202 | assert any such rights held by the Licensor to the limited 203 | extent necessary to allow You to exercise the Licensed 204 | Rights, but not otherwise. 205 | 206 | 2. Patent and trademark rights are not licensed under this 207 | Public License. 208 | 209 | 3. To the extent possible, the Licensor waives any right to 210 | collect royalties from You for the exercise of the Licensed 211 | Rights, whether directly or through a collecting society 212 | under any voluntary or waivable statutory or compulsory 213 | licensing scheme. In all other cases the Licensor expressly 214 | reserves any right to collect such royalties, including when 215 | the Licensed Material is used other than for NonCommercial 216 | purposes. 
217 | 218 | 219 | Section 3 -- License Conditions. 220 | 221 | Your exercise of the Licensed Rights is expressly made subject to the 222 | following conditions. 223 | 224 | a. Attribution. 225 | 226 | 1. If You Share the Licensed Material, You must: 227 | 228 | a. retain the following if it is supplied by the Licensor 229 | with the Licensed Material: 230 | 231 | i. identification of the creator(s) of the Licensed 232 | Material and any others designated to receive 233 | attribution, in any reasonable manner requested by 234 | the Licensor (including by pseudonym if 235 | designated); 236 | 237 | ii. a copyright notice; 238 | 239 | iii. a notice that refers to this Public License; 240 | 241 | iv. a notice that refers to the disclaimer of 242 | warranties; 243 | 244 | v. a URI or hyperlink to the Licensed Material to the 245 | extent reasonably practicable; 246 | 247 | b. indicate if You modified the Licensed Material and 248 | retain an indication of any previous modifications; and 249 | 250 | c. indicate the Licensed Material is licensed under this 251 | Public License, and include the text of, or the URI or 252 | hyperlink to, this Public License. 253 | 254 | For the avoidance of doubt, You do not have permission under 255 | this Public License to Share Adapted Material. 256 | 257 | 2. You may satisfy the conditions in Section 3(a)(1) in any 258 | reasonable manner based on the medium, means, and context in 259 | which You Share the Licensed Material. For example, it may be 260 | reasonable to satisfy the conditions by providing a URI or 261 | hyperlink to a resource that includes the required 262 | information. 263 | 264 | 3. If requested by the Licensor, You must remove any of the 265 | information required by Section 3(a)(1)(A) to the extent 266 | reasonably practicable. 267 | 268 | 269 | Section 4 -- Sui Generis Database Rights. 270 | 271 | Where the Licensed Rights include Sui Generis Database Rights that 272 | apply to Your use of the Licensed Material: 273 | 274 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 275 | to extract, reuse, reproduce, and Share all or a substantial 276 | portion of the contents of the database for NonCommercial purposes 277 | only and provided You do not Share Adapted Material; 278 | 279 | b. if You include all or a substantial portion of the database 280 | contents in a database in which You have Sui Generis Database 281 | Rights, then the database in which You have Sui Generis Database 282 | Rights (but not its individual contents) is Adapted Material; and 283 | 284 | c. You must comply with the conditions in Section 3(a) if You Share 285 | all or a substantial portion of the contents of the database. 286 | 287 | For the avoidance of doubt, this Section 4 supplements and does not 288 | replace Your obligations under this Public License where the Licensed 289 | Rights include other Copyright and Similar Rights. 290 | 291 | 292 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 293 | 294 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 295 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 296 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 297 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 298 | IMPLIED, STATUTORY, OR OTHER. 
THIS INCLUDES, WITHOUT LIMITATION, 299 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 300 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 301 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 302 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 303 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 304 | 305 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 306 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 307 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 308 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 309 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 310 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 311 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 312 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 313 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 314 | 315 | c. The disclaimer of warranties and limitation of liability provided 316 | above shall be interpreted in a manner that, to the extent 317 | possible, most closely approximates an absolute disclaimer and 318 | waiver of all liability. 319 | 320 | 321 | Section 6 -- Term and Termination. 322 | 323 | a. This Public License applies for the term of the Copyright and 324 | Similar Rights licensed here. However, if You fail to comply with 325 | this Public License, then Your rights under this Public License 326 | terminate automatically. 327 | 328 | b. Where Your right to use the Licensed Material has terminated under 329 | Section 6(a), it reinstates: 330 | 331 | 1. automatically as of the date the violation is cured, provided 332 | it is cured within 30 days of Your discovery of the 333 | violation; or 334 | 335 | 2. upon express reinstatement by the Licensor. 336 | 337 | For the avoidance of doubt, this Section 6(b) does not affect any 338 | right the Licensor may have to seek remedies for Your violations 339 | of this Public License. 340 | 341 | c. For the avoidance of doubt, the Licensor may also offer the 342 | Licensed Material under separate terms or conditions or stop 343 | distributing the Licensed Material at any time; however, doing so 344 | will not terminate this Public License. 345 | 346 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 347 | License. 348 | 349 | 350 | Section 7 -- Other Terms and Conditions. 351 | 352 | a. The Licensor shall not be bound by any additional or different 353 | terms or conditions communicated by You unless expressly agreed. 354 | 355 | b. Any arrangements, understandings, or agreements regarding the 356 | Licensed Material not stated herein are separate from and 357 | independent of the terms and conditions of this Public License. 358 | 359 | 360 | Section 8 -- Interpretation. 361 | 362 | a. For the avoidance of doubt, this Public License does not, and 363 | shall not be interpreted to, reduce, limit, restrict, or impose 364 | conditions on any use of the Licensed Material that could lawfully 365 | be made without permission under this Public License. 366 | 367 | b. To the extent possible, if any provision of this Public License is 368 | deemed unenforceable, it shall be automatically reformed to the 369 | minimum extent necessary to make it enforceable. If the provision 370 | cannot be reformed, it shall be severed from this Public License 371 | without affecting the enforceability of the remaining terms and 372 | conditions. 373 | 374 | c. 
No term or condition of this Public License will be waived and no 375 | failure to comply consented to unless expressly agreed to by the 376 | Licensor. 377 | 378 | d. Nothing in this Public License constitutes or may be interpreted 379 | as a limitation upon, or waiver of, any privileges and immunities 380 | that apply to the Licensor or You, including from the legal 381 | processes of any jurisdiction or authority. 382 | 383 | ======================================================================= 384 | 385 | Creative Commons is not a party to its public 386 | licenses. Notwithstanding, Creative Commons may elect to apply one of 387 | its public licenses to material it publishes and in those instances 388 | will be considered the "Licensor". The text of the Creative Commons 389 | public licenses is dedicated to the public domain under the CC0 Public 390 | Domain Dedication. Except for the limited purpose of indicating that 391 | material is shared under a Creative Commons public license or as 392 | otherwise permitted by the Creative Commons policies published at 393 | creativecommons.org/policies, Creative Commons does not authorize the 394 | use of the trademark "Creative Commons" or any other trademark or logo 395 | of Creative Commons without its prior written consent including, 396 | without limitation, in connection with any unauthorized modifications 397 | to any of its public licenses or any other arrangements, 398 | understandings, or agreements concerning use of licensed material. For 399 | the avoidance of doubt, this paragraph does not form part of the 400 | public licenses. 401 | 402 | Creative Commons may be contacted at creativecommons.org. 403 | -------------------------------------------------------------------------------- /data/val.txt: -------------------------------------------------------------------------------- 1 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662921.tar 2 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261179.tar 3 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115543.tar 4 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261143.tar 5 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261615.tar 6 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897545.tar 7 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261133.tar 8 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897552.tar 9 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45663113.tar 10 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897521.tar 11 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897501.tar 12 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261587.tar 13 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260903.tar 14 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42446540.tar 15 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204559.tar 16 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897561.tar 17 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331068.tar 18 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897538.tar 19 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331262.tar 20 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898486.tar 21 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204552.tar 22 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897599.tar 23 | 
https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332893.tar 24 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897692.tar 25 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331311.tar 26 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897647.tar 27 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333923.tar 28 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898811.tar 29 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204573.tar 30 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898521.tar 31 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331651.tar 32 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898538.tar 33 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115452.tar 34 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204605.tar 35 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897688.tar 36 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898570.tar 37 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331319.tar 38 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898867.tar 39 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331661.tar 40 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331971.tar 41 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899617.tar 42 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333452.tar 43 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899611.tar 44 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898849.tar 45 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332000.tar 46 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899459.tar 47 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332885.tar 48 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899698.tar 49 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331988.tar 50 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899679.tar 51 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331963.tar 52 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899691.tar 53 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333431.tar 54 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899725.tar 55 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333898.tar 56 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899729.tar 57 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332915.tar 58 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-43896260.tar 59 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333440.tar 60 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-43896321.tar 61 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-43896330.tar 62 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333916.tar 63 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261631.tar 64 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899736.tar 65 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-44358442.tar 66 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334107.tar 67 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260898.tar 68 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458415.tar 69 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260854.tar 70 | 
https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334239.tar 71 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899712.tar 72 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-44358451.tar 73 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333927.tar 74 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895552.tar 75 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261121.tar 76 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333934.tar 77 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260920.tar 78 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334115.tar 79 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261575.tar 80 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662942.tar 81 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115469.tar 82 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018730.tar 83 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662981.tar 84 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018375.tar 85 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115525.tar 86 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662970.tar 87 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334234.tar 88 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45663164.tar 89 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45663149.tar 90 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334256.tar 91 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47430475.tar 92 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895534.tar 93 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895341.tar 94 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895542.tar 95 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47430485.tar 96 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018345.tar 97 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018367.tar 98 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018559.tar 99 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018566.tar 100 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018382.tar 101 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018947.tar 102 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895364.tar 103 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018737.tar 104 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458481.tar 105 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458427.tar 106 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458654.tar 107 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458647.tar -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | cyclonedds 2 | rerun-sdk 3 | scipy 4 | tifffile 5 | timm 6 | torch 7 | torchvision 8 | webdataset==0.2.86 9 | Pillow 10 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | import platform 4 | import shutil 5 | import sys 6 | import warnings 7 | from os import path as osp 8 | from pkg_resources import DistributionNotFound, get_distribution 9 | from 
setuptools import find_packages, setup 10 | 11 | 12 | if __name__ == '__main__': 13 | setup( 14 | name='cubifyanything', 15 | version='0.0.1', 16 | description=("Public release of Cubify Anything"), 17 | author='Apple Inc.', 18 | author_email='jlazarow@apple.com', 19 | url='https://github.com/apple/ml-cubifyanything', 20 | packages=find_packages(), 21 | include_package_data=True, 22 | zip_safe=False) 23 | -------------------------------------------------------------------------------- /teaser.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apple/ml-cubifyanything/7419eb0cb9b19cb5257b4a1dc905476c155cd343/teaser.jpg -------------------------------------------------------------------------------- /tools/demo.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | import os 4 | os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" 5 | 6 | import argparse 7 | import glob 8 | import itertools 9 | import numpy as np 10 | import rerun 11 | import rerun.blueprint as rrb 12 | import torch 13 | import torchvision 14 | import sys 15 | import uuid 16 | 17 | from pathlib import Path 18 | from PIL import Image 19 | from scipy.spatial.transform import Rotation 20 | 21 | from cubifyanything.batching import Sensors 22 | from cubifyanything.boxes import GeneralInstance3DBoxes 23 | from cubifyanything.capture_stream import CaptureDataset 24 | from cubifyanything.color import random_color 25 | from cubifyanything.cubify_transformer import make_cubify_transformer 26 | from cubifyanything.dataset import CubifyAnythingDataset 27 | from cubifyanything.instances import Instances3D 28 | from cubifyanything.preprocessor import Augmentor, Preprocessor 29 | 30 | def move_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor: 31 | try: 32 | return src.to(dst) 33 | except Exception: 34 | return src.to(dst.device) 35 | 36 | def move_to_current_device(x, t): 37 | if isinstance(x, (list, tuple)): 38 | return [move_device_like(x_, t) for x_ in x] 39 | 40 | return move_device_like(x, t) 41 | 42 | def move_input_to_current_device(batched_input: Sensors, t: torch.Tensor): 43 | # Assume only two levels of nesting for now. 44 | return { name: { name_: move_to_current_device(m, t) for name_, m in s.items() } for name, s in batched_input.items() } 45 | 46 | # A global dictionary we use for consistent colors for instances across frames. 47 | ID_TO_COLOR = {} 48 | 49 | def log_instances(instances, prefix, boxes_3d_name="gt_boxes_3d", ids_name="gt_ids", log_instances_name="instances", **kwargs): 50 | global ID_TO_COLOR 51 | boxes_3d = instances.get(boxes_3d_name) 52 | 53 | colors = [] 54 | if instances.has(ids_name): 55 | ids = instances.get(ids_name) 56 | for id_ in ids: 57 | ID_TO_COLOR[id_] = ID_TO_COLOR.get(id_, random_color(rgb=True)) 58 | colors.append(ID_TO_COLOR[id_]) 59 | else: 60 | ids = None 61 | colors = [random_color(rgb=True) for _ in range(len(instances))] 62 | 63 | quaternions = [ 64 | rerun.Quaternion( 65 | xyzw=Rotation.from_matrix(r).as_quat() 66 | ) 67 | 68 | for r in boxes_3d.R.cpu().numpy() 69 | ] 70 | 71 | # Hard-code these suffixes. 
72 | rerun.log( 73 | f"{prefix}/{log_instances_name}", 74 | rerun.Boxes3D( 75 | centers=boxes_3d.gravity_center.cpu().numpy(), 76 | sizes=boxes_3d.dims.cpu().numpy(), 77 | quaternions=quaternions, 78 | colors=colors, 79 | labels=ids, 80 | show_labels=False), 81 | **kwargs) 82 | 83 | def load_data_and_visualize(dataset): 84 | blueprint = rrb.Blueprint( 85 | rrb.Vertical( 86 | contents=[ 87 | rrb.Spatial3DView( 88 | name="World", 89 | origin="/world"), 90 | rrb.Horizontal( 91 | contents=[ 92 | rrb.Spatial2DView( 93 | name="Image", 94 | origin="/device/wide/image", 95 | contents=[ 96 | "+ $origin/**", 97 | "+ /device/wide/instances/**" 98 | ]), 99 | rrb.Spatial2DView( 100 | name="Depth", 101 | origin="/device/wide/depth"), 102 | rrb.Spatial2DView( 103 | name="Depth (GT)", 104 | origin="/device/gt/depth"), 105 | ], 106 | name="Wide") 107 | ])) 108 | 109 | recording = None 110 | video_id = None 111 | for sample in dataset: 112 | sample_video_id = sample["meta"]["video_id"] 113 | if (recording is None) or (video_id != sample_video_id): 114 | new_recording = rerun.new_recording( 115 | application_id=str(sample_video_id), recording_id=uuid.uuid4(), make_default=True) 116 | 117 | new_recording.send_blueprint(blueprint, make_active=True) 118 | rerun.spawn() 119 | 120 | recording = new_recording 121 | video_id = sample_video_id 122 | 123 | # Check for the world. Note that this may not show if --every-nth-frame is used. 124 | if "world" in sample: 125 | world_instances = sample["world"]["instances"] 126 | log_instances(world_instances, prefix="/world", static=True) 127 | continue 128 | 129 | rerun.set_time_seconds("pts", sample["meta"]["timestamp"], recording=recording) 130 | 131 | # -> channels last. 132 | image = np.moveaxis(sample["wide"]["image"][-1].numpy(), 0, -1) 133 | camera = rerun.Pinhole( 134 | image_from_camera=sample["sensor_info"].wide.image.K[-1].numpy(), resolution=sample["sensor_info"].wide.image.size) 135 | 136 | # Log this to both the device (per-frame) and to the world. 137 | rerun.log("/device/wide/image", rerun.Image(image).compress()) 138 | rerun.log("/device/wide/image", camera) 139 | 140 | # RT here corresponds to the laser-scanner space, as registered to the capture device, so this allows us 141 | # to visualize the camera with respect to the annotation space. 142 | RT = sample["sensor_info"].gt.RT[-1].numpy() 143 | pose_transform = rerun.Transform3D( 144 | translation=RT[:3, 3], 145 | rotation=rerun.Quaternion(xyzw=Rotation.from_matrix(RT[:3, :3]).as_quat())) 146 | 147 | rerun.log("/world/image", pose_transform) 148 | rerun.log("/world/image", camera) 149 | rerun.log("/world/image/image", rerun.Image(image, opacity=0.5)) 150 | 151 | rerun.log("/device/wide/depth", rerun.DepthImage(sample["wide"]["depth"][-1].numpy())) 152 | rerun.log("/device/gt/depth", rerun.DepthImage(sample["gt"]["depth"][-1].numpy())) 153 | 154 | per_frame_instances = sample["wide"]["instances"] 155 | log_instances(per_frame_instances, prefix="/device/wide") 156 | 157 | def get_camera_coords(depth): 158 | height, width = depth.shape 159 | device = depth.device 160 | 161 | # camera xy. 
162 | camera_coords = torch.stack( 163 | torch.meshgrid( 164 | torch.arange(0, width, device=device), 165 | torch.arange(0, height, device=device), indexing="xy"), 166 | dim=-1) 167 | 168 | return camera_coords 169 | 170 | def unproject(depth, K, RT, max_depth=10.0): 171 | camera_coords = get_camera_coords(depth) * depth[..., None] 172 | 173 | intrinsics_4x4 = torch.eye(4, device=depth.device) 174 | intrinsics_4x4[:3, :3] = K 175 | 176 | valid = depth > 0 177 | if max_depth is not None: 178 | valid &= (depth < max_depth) 179 | 180 | depth = depth[..., None] 181 | uvd = torch.cat((camera_coords, depth, torch.ones_like(depth)), dim=-1) 182 | 183 | camera_xyz = torch.linalg.inv(intrinsics_4x4) @ uvd.view(-1, 4).T 184 | world_xyz = RT @ camera_xyz 185 | 186 | return world_xyz.T[..., :-1].reshape(uvd.shape[0], uvd.shape[1], 3), valid 187 | 188 | def load_data_and_execute_model(model, dataset, augmentor, preprocessor, score_thresh=0.0, viz_on_gt_points=False): 189 | is_depth_model = "wide/depth" in augmentor.measurement_keys 190 | blueprint = rrb.Blueprint( 191 | rrb.Vertical( 192 | contents=[ 193 | rrb.Spatial3DView( 194 | name="World", 195 | contents=[ 196 | "+ $origin/**", 197 | "+ /device/wide/pred_instances/**" 198 | ], 199 | origin="/world"), 200 | rrb.Horizontal( 201 | contents=([ 202 | rrb.Spatial2DView( 203 | name="Image", 204 | origin="/device/wide/image", 205 | contents=[ 206 | "+ $origin/**", 207 | "+ /device/wide/pred_instances/**" 208 | ]) 209 | ] + ([ 210 | # Only show this for RGB-D. 211 | rrb.Spatial2DView( 212 | name="Depth", 213 | origin="/device/wide/depth") 214 | ] if is_depth_model else [])), 215 | name="Wide") 216 | ])) 217 | 218 | recording = None 219 | video_id = None 220 | 221 | device = model.pixel_mean 222 | for sample in dataset: 223 | sample_video_id = sample["meta"]["video_id"] 224 | if (recording is None) or (video_id != sample_video_id): 225 | new_recording = rerun.new_recording( 226 | application_id=str(sample_video_id), recording_id=uuid.uuid4(), make_default=True) 227 | new_recording.send_blueprint(blueprint, make_active=True) 228 | rerun.spawn() 229 | 230 | recording = new_recording 231 | video_id = sample_video_id 232 | 233 | # Keep things in image space, so adjust accordingly. 234 | rerun.log("/world", rerun.ViewCoordinates.RIGHT_HAND_Y_DOWN, static=True) 235 | 236 | rerun.set_time_seconds("pts", sample["meta"]["timestamp"], recording=recording) 237 | 238 | # -> channels last. 239 | image = np.moveaxis(sample["wide"]["image"][-1].numpy(), 0, -1) 240 | color_camera = rerun.Pinhole( 241 | image_from_camera=sample["sensor_info"].wide.image.K[-1].numpy(), resolution=sample["sensor_info"].wide.image.size) 242 | 243 | if is_depth_model: 244 | # Show the depth being sent to the model. 245 | depth_camera = rerun.Pinhole( 246 | image_from_camera=sample["sensor_info"].wide.depth.K[-1].numpy(), resolution=sample["sensor_info"].wide.depth.size) 247 | 248 | xyzrgb = None 249 | if viz_on_gt_points and sample["sensor_info"].has("gt"): 250 | # Backproject GT depth to world so we can compare our predictions. 251 | depth_gt = sample["gt"]["depth"][-1] 252 | matched_image = torch.tensor(np.array(Image.fromarray(image).resize((depth_gt.shape[1], depth_gt.shape[0])))) 253 | 254 | # Feel free to change max_depth, but know CA is only trained up to 5m. 
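# unproject() lifts each valid pixel to 3D as X = d * K^-1 [u, v, 1]^T and then applies RT; passing
# torch.eye(4) below keeps the backprojected points in the GT camera frame, which matches the
# camera-space predictions shown in the World view.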
255 | xyz, valid = unproject(depth_gt, sample["sensor_info"].gt.depth.K[-1], torch.eye(4), max_depth=10.0) 256 | xyzrgb = torch.cat((xyz, matched_image / 255.0), dim=-1)[valid] 257 | 258 | packaged = augmentor.package(sample) 259 | packaged = move_input_to_current_device(packaged, device) 260 | packaged = preprocessor.preprocess([packaged]) 261 | 262 | with torch.no_grad(): 263 | pred_instances = model(packaged)[0] 264 | 265 | pred_instances = pred_instances[pred_instances.scores >= score_thresh] 266 | 267 | # Hold off on logging anything until now, since the delay might confuse the user in the visualizer. 268 | rerun.log("/device/wide/image", rerun.Image(image).compress()) 269 | rerun.log("/device/wide/image", color_camera) 270 | 271 | if is_depth_model: 272 | rerun.log("/device/wide/depth", rerun.DepthImage(sample["wide"]["depth"][-1].numpy())) 273 | rerun.log("/device/wide/depth", depth_camera) 274 | 275 | if xyzrgb is not None: 276 | rerun.log("/world/xyz", rerun.Points3D(positions=xyzrgb[..., :3], colors=xyzrgb[..., 3:], radii=None)) 277 | 278 | log_instances(pred_instances, prefix="/device/wide", boxes_3d_name="pred_boxes_3d", ids_name=None, log_instances_name="pred_instances") 279 | 280 | if __name__ == "__main__": 281 | parser = argparse.ArgumentParser() 282 | 283 | parser.add_argument("dataset_path", help="Path to the directory containing the .tar files, the full path to a single tar file (recommended), or a path to a txt file containing HTTP links. Using the value \"stream\" will attempt to stream from your device using the NeRFCapture app") 284 | parser.add_argument("--model-path", help="Path to the model to load") 285 | parser.add_argument("--no-depth", default=False, action="store_true", help="Skip loading depth.") 286 | parser.add_argument("--score-thresh", default=0.25, type=float, help="Threshold for detections") 287 | parser.add_argument("--every-nth-frame", default=None, type=int, help="Load every `n` frames") 288 | parser.add_argument("--viz-only", default=False, action="store_true", help="Skip loading a model and only visualize data.") 289 | parser.add_argument("--viz-on-gt-points", default=False, action="store_true", help="Backproject the GT depth to form a point cloud in order to visualize the predictions") 290 | parser.add_argument("--device", default="cpu", help="Which device to push the model to (cpu, mps, cuda)") 291 | parser.add_argument("--video-ids", nargs="+", help="Subset of videos to execute on. By default, all. Ignored if a tar file is explicitly given or in stream mode.") 292 | 293 | args = parser.parse_args() 294 | print("Command Line Args:", args) 295 | 296 | dataset_path = args.dataset_path 297 | use_cache = False 298 | 299 | if dataset_path == "stream": 300 | dataset = CaptureDataset() 301 | else: 302 | dataset_files = [] 303 | 304 | # Allow the user to specify a single tar or a txt file containing an http link per line. 305 | if os.path.isfile(dataset_path): 306 | if dataset_path.endswith(".txt"): 307 | with open(dataset_path, "r") as dataset_file: 308 | dataset_files = [l.strip() for l in dataset_file.readlines()] 309 | 310 | # Cache these files locally to prevent repeated downloads. 
311 | use_cache = True 312 | else: 313 | args.video_ids = None 314 | dataset_files = [dataset_path] 315 | else: 316 | # Try to glob all files matching ca1m-*.tar 317 | dataset_files = glob.glob(os.path.join(dataset_path, "ca1m-*.tar")) 318 | if len(dataset_files) == 0: 319 | raise ValueError(f"Failed to find any .tar files matching ca1m- prefix at {dataset_path}") 320 | 321 | if args.video_ids is not None: 322 | dataset_files = [df for df in dataset_files if Path(df).with_suffix("").name.split("-")[-1] in args.video_ids] 323 | 324 | if len(dataset_files) == 0: 325 | raise ValueError("No data was found") 326 | 327 | dataset = CubifyAnythingDataset( 328 | [Path(df).as_uri() if not df.startswith("https://") else df for df in dataset_files], 329 | yield_world_instances=args.viz_only, 330 | load_arkit_depth=not args.no_depth, 331 | use_cache=use_cache) 332 | 333 | if args.viz_only: 334 | if args.every_nth_frame is not None: 335 | dataset = itertools.islice(dataset, 0, None, args.every_nth_frame) 336 | 337 | load_data_and_visualize(dataset) 338 | sys.exit(0) 339 | 340 | assert args.model_path is not None 341 | checkpoint = torch.load(args.model_path, map_location=args.device or "cpu")["model"] 342 | 343 | # Figure out which model this is based on the weights. 344 | 345 | # Basic detection of the actual ViT backbone being used (for our setup, dimension is 1:1 with which ViT). 346 | backbone_embedding_dimension = checkpoint["backbone.0.patch_embed.proj.weight"].shape[0] 347 | 348 | # We need to detect RGB-D or RGB-only models so we can disable sending depth. 349 | is_depth_model = any(k.startswith("backbone.0.patch_embed_depth.") for k in checkpoint.keys()) 350 | 351 | model = make_cubify_transformer(dimension=backbone_embedding_dimension, depth_model=is_depth_model).eval() 352 | model.load_state_dict(checkpoint) 353 | 354 | # No need for ARKit depth if running an RGB-only model. 355 | dataset.load_arkit_depth = is_depth_model 356 | if args.every_nth_frame is not None: 357 | dataset = itertools.islice(dataset, 0, None, args.every_nth_frame) 358 | 359 | augmentor = Augmentor(("wide/image", "wide/depth") if is_depth_model else ("wide/image",)) 360 | preprocessor = Preprocessor() 361 | 362 | if args.device is not None: 363 | model = model.to(args.device) 364 | 365 | load_data_and_execute_model(model, dataset, augmentor, preprocessor, score_thresh=args.score_thresh, viz_on_gt_points=args.viz_on_gt_points) 366 | --------------------------------------------------------------------------------