├── ACKNOWLEDGEMENTS ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── LICENSE_MODEL ├── README.md ├── cubifyanything ├── __init__.py ├── batching.py ├── boxes.py ├── capture_stream.py ├── color.py ├── cubify_transformer.py ├── dataset.py ├── imagelist.py ├── instances.py ├── measurement.py ├── orientation.py ├── pos.py ├── preprocessor.py ├── sensor.py ├── transforms.py └── vit.py ├── data ├── LICENSE_DATA ├── train.txt └── val.txt ├── requirements.txt ├── setup.py ├── teaser.jpg └── tools └── demo.py /ACKNOWLEDGEMENTS: -------------------------------------------------------------------------------- 1 | Portions of this Software may utilize the following copyrighted material, the use of which is hereby acknowledged. 2 | 3 | ------------------------------------------------ 4 | Detectron2 (https://github.com/facebookresearch/detectron2) 5 | Facebook, Inc. and its affiliates. 6 | 7 | Apache License 8 | Version 2.0, January 2004 9 | http://www.apache.org/licenses/ 10 | 11 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 12 | 13 | 1. Definitions. 14 | 15 | "License" shall mean the terms and conditions for use, reproduction, 16 | and distribution as defined by Sections 1 through 9 of this document. 17 | 18 | "Licensor" shall mean the copyright owner or entity authorized by 19 | the copyright owner that is granting the License. 20 | 21 | "Legal Entity" shall mean the union of the acting entity and all 22 | other entities that control, are controlled by, or are under common 23 | control with that entity. For the purposes of this definition, 24 | "control" means (i) the power, direct or indirect, to cause the 25 | direction or management of such entity, whether by contract or 26 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 27 | outstanding shares, or (iii) beneficial ownership of such entity. 28 | 29 | "You" (or "Your") shall mean an individual or Legal Entity 30 | exercising permissions granted by this License. 31 | 32 | "Source" form shall mean the preferred form for making modifications, 33 | including but not limited to software source code, documentation 34 | source, and configuration files. 35 | 36 | "Object" form shall mean any form resulting from mechanical 37 | transformation or translation of a Source form, including but 38 | not limited to compiled object code, generated documentation, 39 | and conversions to other media types. 40 | 41 | "Work" shall mean the work of authorship, whether in Source or 42 | Object form, made available under the License, as indicated by a 43 | copyright notice that is included in or attached to the work 44 | (an example is provided in the Appendix below). 45 | 46 | "Derivative Works" shall mean any work, whether in Source or Object 47 | form, that is based on (or derived from) the Work and for which the 48 | editorial revisions, annotations, elaborations, or other modifications 49 | represent, as a whole, an original work of authorship. For the purposes 50 | of this License, Derivative Works shall not include works that remain 51 | separable from, or merely link (or bind by name) to the interfaces of, 52 | the Work and Derivative Works thereof. 
53 | 54 | "Contribution" shall mean any work of authorship, including 55 | the original version of the Work and any modifications or additions 56 | to that Work or Derivative Works thereof, that is intentionally 57 | submitted to Licensor for inclusion in the Work by the copyright owner 58 | or by an individual or Legal Entity authorized to submit on behalf of 59 | the copyright owner. For the purposes of this definition, "submitted" 60 | means any form of electronic, verbal, or written communication sent 61 | to the Licensor or its representatives, including but not limited to 62 | communication on electronic mailing lists, source code control systems, 63 | and issue tracking systems that are managed by, or on behalf of, the 64 | Licensor for the purpose of discussing and improving the Work, but 65 | excluding communication that is conspicuously marked or otherwise 66 | designated in writing by the copyright owner as "Not a Contribution." 67 | 68 | "Contributor" shall mean Licensor and any individual or Legal Entity 69 | on behalf of whom a Contribution has been received by Licensor and 70 | subsequently incorporated within the Work. 71 | 72 | 2. Grant of Copyright License. Subject to the terms and conditions of 73 | this License, each Contributor hereby grants to You a perpetual, 74 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 75 | copyright license to reproduce, prepare Derivative Works of, 76 | publicly display, publicly perform, sublicense, and distribute the 77 | Work and such Derivative Works in Source or Object form. 78 | 79 | 3. Grant of Patent License. Subject to the terms and conditions of 80 | this License, each Contributor hereby grants to You a perpetual, 81 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 82 | (except as stated in this section) patent license to make, have made, 83 | use, offer to sell, sell, import, and otherwise transfer the Work, 84 | where such license applies only to those patent claims licensable 85 | by such Contributor that are necessarily infringed by their 86 | Contribution(s) alone or by combination of their Contribution(s) 87 | with the Work to which such Contribution(s) was submitted. If You 88 | institute patent litigation against any entity (including a 89 | cross-claim or counterclaim in a lawsuit) alleging that the Work 90 | or a Contribution incorporated within the Work constitutes direct 91 | or contributory patent infringement, then any patent licenses 92 | granted to You under this License for that Work shall terminate 93 | as of the date such litigation is filed. 94 | 95 | 4. Redistribution. 
You may reproduce and distribute copies of the 96 | Work or Derivative Works thereof in any medium, with or without 97 | modifications, and in Source or Object form, provided that You 98 | meet the following conditions: 99 | 100 | (a) You must give any other recipients of the Work or 101 | Derivative Works a copy of this License; and 102 | 103 | (b) You must cause any modified files to carry prominent notices 104 | stating that You changed the files; and 105 | 106 | (c) You must retain, in the Source form of any Derivative Works 107 | that You distribute, all copyright, patent, trademark, and 108 | attribution notices from the Source form of the Work, 109 | excluding those notices that do not pertain to any part of 110 | the Derivative Works; and 111 | 112 | (d) If the Work includes a "NOTICE" text file as part of its 113 | distribution, then any Derivative Works that You distribute must 114 | include a readable copy of the attribution notices contained 115 | within such NOTICE file, excluding those notices that do not 116 | pertain to any part of the Derivative Works, in at least one 117 | of the following places: within a NOTICE text file distributed 118 | as part of the Derivative Works; within the Source form or 119 | documentation, if provided along with the Derivative Works; or, 120 | within a display generated by the Derivative Works, if and 121 | wherever such third-party notices normally appear. The contents 122 | of the NOTICE file are for informational purposes only and 123 | do not modify the License. You may add Your own attribution 124 | notices within Derivative Works that You distribute, alongside 125 | or as an addendum to the NOTICE text from the Work, provided 126 | that such additional attribution notices cannot be construed 127 | as modifying the License. 128 | 129 | You may add Your own copyright statement to Your modifications and 130 | may provide additional or different license terms and conditions 131 | for use, reproduction, or distribution of Your modifications, or 132 | for any such Derivative Works as a whole, provided Your use, 133 | reproduction, and distribution of the Work otherwise complies with 134 | the conditions stated in this License. 135 | 136 | 5. Submission of Contributions. Unless You explicitly state otherwise, 137 | any Contribution intentionally submitted for inclusion in the Work 138 | by You to the Licensor shall be under the terms and conditions of 139 | this License, without any additional terms or conditions. 140 | Notwithstanding the above, nothing herein shall supersede or modify 141 | the terms of any separate license agreement you may have executed 142 | with Licensor regarding such Contributions. 143 | 144 | 6. Trademarks. This License does not grant permission to use the trade 145 | names, trademarks, service marks, or product names of the Licensor, 146 | except as required for reasonable and customary use in describing the 147 | origin of the Work and reproducing the content of the NOTICE file. 148 | 149 | 7. Disclaimer of Warranty. Unless required by applicable law or 150 | agreed to in writing, Licensor provides the Work (and each 151 | Contributor provides its Contributions) on an "AS IS" BASIS, 152 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 153 | implied, including, without limitation, any warranties or conditions 154 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 155 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 156 | appropriateness of using or redistributing the Work and assume any 157 | risks associated with Your exercise of permissions under this License. 158 | 159 | 8. Limitation of Liability. In no event and under no legal theory, 160 | whether in tort (including negligence), contract, or otherwise, 161 | unless required by applicable law (such as deliberate and grossly 162 | negligent acts) or agreed to in writing, shall any Contributor be 163 | liable to You for damages, including any direct, indirect, special, 164 | incidental, or consequential damages of any character arising as a 165 | result of this License or out of the use or inability to use the 166 | Work (including but not limited to damages for loss of goodwill, 167 | work stoppage, computer failure or malfunction, or any and all 168 | other commercial damages or losses), even if such Contributor 169 | has been advised of the possibility of such damages. 170 | 171 | 9. Accepting Warranty or Additional Liability. While redistributing 172 | the Work or Derivative Works thereof, You may choose to offer, 173 | and charge a fee for, acceptance of support, warranty, indemnity, 174 | or other liability obligations and/or rights consistent with this 175 | License. However, in accepting such obligations, You may act only 176 | on Your own behalf and on Your sole responsibility, not on behalf 177 | of any other Contributor, and only if You agree to indemnify, 178 | defend, and hold each Contributor harmless for any liability 179 | incurred by, or claims asserted against, such Contributor by reason 180 | of your accepting any such warranty or additional liability. 181 | 182 | END OF TERMS AND CONDITIONS 183 | 184 | APPENDIX: How to apply the Apache License to your work. 185 | 186 | To apply the Apache License to your work, attach the following 187 | boilerplate notice, with the fields enclosed by brackets "[]" 188 | replaced with your own identifying information. (Don't include 189 | the brackets!) The text should be enclosed in the appropriate 190 | comment syntax for the file format. We also recommend that a 191 | file or class name and description of purpose be included on the 192 | same "printed page" as the copyright notice for easier 193 | identification within third-party archives. 194 | 195 | Copyright [yyyy] [name of copyright owner] 196 | 197 | 198 | Licensed under the Apache License, Version 2.0 (the "License"); 199 | you may not use this file except in compliance with the License. 200 | You may obtain a copy of the License at 201 | 202 | http://www.apache.org/licenses/LICENSE-2.0 203 | 204 | Unless required by applicable law or agreed to in writing, software 205 | distributed under the License is distributed on an "AS IS" BASIS, 206 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 207 | See the License for the specific language governing permissions and 208 | limitations under the License. 209 | 210 | ------------------------------------------------ 211 | Plain-DETR (https://github.com/impiga/Plain-DETR) 212 | 2023 Xi'an Jiaotong University & Microsoft Research Asia. 
213 | 214 | MIT License 215 | 216 | Copyright (c) 2023 Xi'an Jiaotong University and Microsoft Research Asia 217 | 218 | Permission is hereby granted, free of charge, to any person obtaining a copy 219 | of this software and associated documentation files (the "Software"), to deal 220 | in the Software without restriction, including without limitation the rights 221 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 222 | copies of the Software, and to permit persons to whom the Software is 223 | furnished to do so, subject to the following conditions: 224 | 225 | The above copyright notice and this permission notice shall be included in all 226 | copies or substantial portions of the Software. 227 | 228 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 229 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 230 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 231 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 232 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 233 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 234 | SOFTWARE. 235 | 236 | ------------------------------------------------ 237 | WebDataset (https://github.com/webdataset/webdataset) 238 | NVIDIA CORPORATION 239 | 240 | Copyright 2020 NVIDIA CORPORATION. All rights reserved. 241 | 242 | Redistribution and use in source and binary forms, with or without 243 | modification, are permitted provided that the following conditions 244 | are met: 245 | 246 | 1. Redistributions of source code must retain the above copyright notice, 247 | this list of conditions and the following disclaimer. 248 | 249 | 2. Redistributions in binary form must reproduce the above copyright 250 | notice, this list of conditions and the following disclaimer in the 251 | documentation and/or other materials provided with the distribution. 252 | 253 | 3. Neither the name of the copyright holder nor the names of its 254 | contributors may be used to endorse or promote products derived from 255 | this software without specific prior written permission. 256 | 257 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 258 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 259 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 260 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 261 | HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 262 | SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED 263 | TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 264 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 265 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 266 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 267 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 268 | 269 | MMDet3D (https://github.com/open-mmlab/mmdetection3d) 270 | 271 | Copyright 2018-2019 Open-MMLab. All rights reserved. 272 | 273 | Apache License 274 | Version 2.0, January 2004 275 | http://www.apache.org/licenses/ 276 | 277 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 278 | 279 | 1. Definitions. 280 | 281 | "License" shall mean the terms and conditions for use, reproduction, 282 | and distribution as defined by Sections 1 through 9 of this document. 
283 | 284 | "Licensor" shall mean the copyright owner or entity authorized by 285 | the copyright owner that is granting the License. 286 | 287 | "Legal Entity" shall mean the union of the acting entity and all 288 | other entities that control, are controlled by, or are under common 289 | control with that entity. For the purposes of this definition, 290 | "control" means (i) the power, direct or indirect, to cause the 291 | direction or management of such entity, whether by contract or 292 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 293 | outstanding shares, or (iii) beneficial ownership of such entity. 294 | 295 | "You" (or "Your") shall mean an individual or Legal Entity 296 | exercising permissions granted by this License. 297 | 298 | "Source" form shall mean the preferred form for making modifications, 299 | including but not limited to software source code, documentation 300 | source, and configuration files. 301 | 302 | "Object" form shall mean any form resulting from mechanical 303 | transformation or translation of a Source form, including but 304 | not limited to compiled object code, generated documentation, 305 | and conversions to other media types. 306 | 307 | "Work" shall mean the work of authorship, whether in Source or 308 | Object form, made available under the License, as indicated by a 309 | copyright notice that is included in or attached to the work 310 | (an example is provided in the Appendix below). 311 | 312 | "Derivative Works" shall mean any work, whether in Source or Object 313 | form, that is based on (or derived from) the Work and for which the 314 | editorial revisions, annotations, elaborations, or other modifications 315 | represent, as a whole, an original work of authorship. For the purposes 316 | of this License, Derivative Works shall not include works that remain 317 | separable from, or merely link (or bind by name) to the interfaces of, 318 | the Work and Derivative Works thereof. 319 | 320 | "Contribution" shall mean any work of authorship, including 321 | the original version of the Work and any modifications or additions 322 | to that Work or Derivative Works thereof, that is intentionally 323 | submitted to Licensor for inclusion in the Work by the copyright owner 324 | or by an individual or Legal Entity authorized to submit on behalf of 325 | the copyright owner. For the purposes of this definition, "submitted" 326 | means any form of electronic, verbal, or written communication sent 327 | to the Licensor or its representatives, including but not limited to 328 | communication on electronic mailing lists, source code control systems, 329 | and issue tracking systems that are managed by, or on behalf of, the 330 | Licensor for the purpose of discussing and improving the Work, but 331 | excluding communication that is conspicuously marked or otherwise 332 | designated in writing by the copyright owner as "Not a Contribution." 333 | 334 | "Contributor" shall mean Licensor and any individual or Legal Entity 335 | on behalf of whom a Contribution has been received by Licensor and 336 | subsequently incorporated within the Work. 337 | 338 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 339 | this License, each Contributor hereby grants to You a perpetual, 340 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 341 | copyright license to reproduce, prepare Derivative Works of, 342 | publicly display, publicly perform, sublicense, and distribute the 343 | Work and such Derivative Works in Source or Object form. 344 | 345 | 3. Grant of Patent License. Subject to the terms and conditions of 346 | this License, each Contributor hereby grants to You a perpetual, 347 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 348 | (except as stated in this section) patent license to make, have made, 349 | use, offer to sell, sell, import, and otherwise transfer the Work, 350 | where such license applies only to those patent claims licensable 351 | by such Contributor that are necessarily infringed by their 352 | Contribution(s) alone or by combination of their Contribution(s) 353 | with the Work to which such Contribution(s) was submitted. If You 354 | institute patent litigation against any entity (including a 355 | cross-claim or counterclaim in a lawsuit) alleging that the Work 356 | or a Contribution incorporated within the Work constitutes direct 357 | or contributory patent infringement, then any patent licenses 358 | granted to You under this License for that Work shall terminate 359 | as of the date such litigation is filed. 360 | 361 | 4. Redistribution. You may reproduce and distribute copies of the 362 | Work or Derivative Works thereof in any medium, with or without 363 | modifications, and in Source or Object form, provided that You 364 | meet the following conditions: 365 | 366 | (a) You must give any other recipients of the Work or 367 | Derivative Works a copy of this License; and 368 | 369 | (b) You must cause any modified files to carry prominent notices 370 | stating that You changed the files; and 371 | 372 | (c) You must retain, in the Source form of any Derivative Works 373 | that You distribute, all copyright, patent, trademark, and 374 | attribution notices from the Source form of the Work, 375 | excluding those notices that do not pertain to any part of 376 | the Derivative Works; and 377 | 378 | (d) If the Work includes a "NOTICE" text file as part of its 379 | distribution, then any Derivative Works that You distribute must 380 | include a readable copy of the attribution notices contained 381 | within such NOTICE file, excluding those notices that do not 382 | pertain to any part of the Derivative Works, in at least one 383 | of the following places: within a NOTICE text file distributed 384 | as part of the Derivative Works; within the Source form or 385 | documentation, if provided along with the Derivative Works; or, 386 | within a display generated by the Derivative Works, if and 387 | wherever such third-party notices normally appear. The contents 388 | of the NOTICE file are for informational purposes only and 389 | do not modify the License. You may add Your own attribution 390 | notices within Derivative Works that You distribute, alongside 391 | or as an addendum to the NOTICE text from the Work, provided 392 | that such additional attribution notices cannot be construed 393 | as modifying the License. 
394 | 395 | You may add Your own copyright statement to Your modifications and 396 | may provide additional or different license terms and conditions 397 | for use, reproduction, or distribution of Your modifications, or 398 | for any such Derivative Works as a whole, provided Your use, 399 | reproduction, and distribution of the Work otherwise complies with 400 | the conditions stated in this License. 401 | 402 | 5. Submission of Contributions. Unless You explicitly state otherwise, 403 | any Contribution intentionally submitted for inclusion in the Work 404 | by You to the Licensor shall be under the terms and conditions of 405 | this License, without any additional terms or conditions. 406 | Notwithstanding the above, nothing herein shall supersede or modify 407 | the terms of any separate license agreement you may have executed 408 | with Licensor regarding such Contributions. 409 | 410 | 6. Trademarks. This License does not grant permission to use the trade 411 | names, trademarks, service marks, or product names of the Licensor, 412 | except as required for reasonable and customary use in describing the 413 | origin of the Work and reproducing the content of the NOTICE file. 414 | 415 | 7. Disclaimer of Warranty. Unless required by applicable law or 416 | agreed to in writing, Licensor provides the Work (and each 417 | Contributor provides its Contributions) on an "AS IS" BASIS, 418 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 419 | implied, including, without limitation, any warranties or conditions 420 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 421 | PARTICULAR PURPOSE. You are solely responsible for determining the 422 | appropriateness of using or redistributing the Work and assume any 423 | risks associated with Your exercise of permissions under this License. 424 | 425 | 8. Limitation of Liability. In no event and under no legal theory, 426 | whether in tort (including negligence), contract, or otherwise, 427 | unless required by applicable law (such as deliberate and grossly 428 | negligent acts) or agreed to in writing, shall any Contributor be 429 | liable to You for damages, including any direct, indirect, special, 430 | incidental, or consequential damages of any character arising as a 431 | result of this License or out of the use or inability to use the 432 | Work (including but not limited to damages for loss of goodwill, 433 | work stoppage, computer failure or malfunction, or any and all 434 | other commercial damages or losses), even if such Contributor 435 | has been advised of the possibility of such damages. 436 | 437 | 9. Accepting Warranty or Additional Liability. While redistributing 438 | the Work or Derivative Works thereof, You may choose to offer, 439 | and charge a fee for, acceptance of support, warranty, indemnity, 440 | or other liability obligations and/or rights consistent with this 441 | License. However, in accepting such obligations, You may act only 442 | on Your own behalf and on Your sole responsibility, not on behalf 443 | of any other Contributor, and only if You agree to indemnify, 444 | defend, and hold each Contributor harmless for any liability 445 | incurred by, or claims asserted against, such Contributor by reason 446 | of your accepting any such warranty or additional liability. 447 | 448 | END OF TERMS AND CONDITIONS 449 | 450 | APPENDIX: How to apply the Apache License to your work. 
451 | 452 | To apply the Apache License to your work, attach the following 453 | boilerplate notice, with the fields enclosed by brackets "[]" 454 | replaced with your own identifying information. (Don't include 455 | the brackets!) The text should be enclosed in the appropriate 456 | comment syntax for the file format. We also recommend that a 457 | file or class name and description of purpose be included on the 458 | same "printed page" as the copyright notice for easier 459 | identification within third-party archives. 460 | 461 | Copyright 2018-2019 Open-MMLab. 462 | 463 | Licensed under the Apache License, Version 2.0 (the "License"); 464 | you may not use this file except in compliance with the License. 465 | You may obtain a copy of the License at 466 | 467 | http://www.apache.org/licenses/LICENSE-2.0 468 | 469 | Unless required by applicable law or agreed to in writing, software 470 | distributed under the License is distributed on an "AS IS" BASIS, 471 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 472 | See the License for the specific language governing permissions and 473 | limitations under the License. 474 | 475 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, sex characteristics, gender identity and expression, 9 | level of experience, education, socio-economic status, nationality, personal 10 | appearance, race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 
45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies within all project spaces, and it also applies when 49 | an individual is representing the project or its community in public spaces. 50 | Examples of representing a project or community include using an official 51 | project e-mail address, posting via an official social media account, or acting 52 | as an appointed representative at an online or offline event. Representation of 53 | a project may be further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the open source team at [opensource-conduct@group.apple.com](mailto:opensource-conduct@group.apple.com). All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4, 71 | available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html) -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contribution Guide 2 | 3 | Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository. 4 | 5 | While we welcome new pull requests and issues please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged. 6 | 7 | ## Before you get started 8 | 9 | By submitting a pull request, you represent that you have the right to license your contribution to Apple and the community, and agree by submitting the patch that your contributions are licensed under the [LICENSE](LICENSE). 10 | 11 | We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md). 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (C) 2025 Apple Inc. All Rights Reserved. 2 | 3 | IMPORTANT: This Apple software is supplied to you by Apple 4 | Inc. ("Apple") in consideration of your agreement to the following 5 | terms, and your use, installation, modification or redistribution of 6 | this Apple software constitutes acceptance of these terms. If you do 7 | not agree with these terms, please do not use, install, modify or 8 | redistribute this Apple software. 
9 | 10 | In consideration of your agreement to abide by the following terms, and 11 | subject to these terms, Apple grants you a personal, non-exclusive 12 | license, under Apple's copyrights in this original Apple software (the 13 | "Apple Software"), to use, reproduce, modify and redistribute the Apple 14 | Software, with or without modifications, in source and/or binary forms; 15 | provided that if you redistribute the Apple Software in its entirety and 16 | without modifications, you must retain this notice and the following 17 | text and disclaimers in all such redistributions of the Apple Software. 18 | Neither the name, trademarks, service marks or logos of Apple Inc. may 19 | be used to endorse or promote products derived from the Apple Software 20 | without specific prior written permission from Apple. Except as 21 | expressly stated in this notice, no other rights or licenses, express or 22 | implied, are granted by Apple herein, including but not limited to any 23 | patent rights that may be infringed by your derivative works or by other 24 | works in which the Apple Software may be incorporated. 25 | 26 | The Apple Software is provided by Apple on an "AS IS" basis. APPLE 27 | MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION 28 | THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS 29 | FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND 30 | OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS. 31 | 32 | IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL 33 | OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 34 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 35 | INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION, 36 | MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED 37 | AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE), 38 | STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE 39 | POSSIBILITY OF SUCH DAMAGE. 40 | 41 | ------------------------------------------------------------------------------- 42 | SOFTWARE DISTRIBUTED WITH ml-cubifyanything 43 | 44 | The ml-cubifyanything software includes a number of subcomponents with separate 45 | copyright notices and license terms - please see the file ACKNOWLEDGEMENTS. 46 | ------------------------------------------------------------------------------- 47 | -------------------------------------------------------------------------------- /LICENSE_MODEL: -------------------------------------------------------------------------------- 1 | Disclaimer: IMPORTANT: This Apple Machine Learning Research Model is specifically developed and released by Apple Inc. ("Apple") for the sole purpose of scientific research of artificial intelligence and machine-learning technology. “Apple Machine Learning Research Model” means the model, including but not limited to algorithms, formulas, trained model weights, parameters, configurations, checkpoints, and any related materials (including documentation). 2 | This Apple Machine Learning Research Model is provided to You by Apple in consideration of your agreement to the following terms, and your use, modification, creation of Model Derivatives, and or redistribution of the Apple Machine Learning Research Model constitutes acceptance of this Agreement. 
If You do not agree with these terms, please do not use, modify, create Model Derivatives of, or distribute this Apple Machine Learning Research Model or Model Derivatives. 3 | 1. License Scope: In consideration of your agreement to abide by the following terms, and subject to these terms, Apple hereby grants you a personal, non- exclusive, worldwide, non-transferable, royalty-free, revocable, and limited license, to use, copy, modify, distribute, and create Model Derivatives (defined below) of the Apple Machine Learning Research Model exclusively for Research Purposes. You agree that any Model Derivatives You may create or that may be created for You will be limited to Research Purposes as well. “Research Purposes” means non-commercial scientific research and academic development activities, such as experimentation, analysis, testing conducted by You with the sole intent to advance scientific knowledge and research. “Research Purposes” does not include any commercial exploitation, product development or use in any commercial product or service. 4 | 2. Distribution of Apple Machine Learning Research Model and Model Derivatives: If you choose to redistribute Apple Machine Learning Research Model or its Model Derivatives, you must provide a copy of this Agreement to such third party, and ensure that the following attribution notice be provided: “Apple Machine Learning Research Model is licensed under the Apple Machine Learning Research Model License Agreement.” Additionally, all Model Derivatives must clearly be identified as such, including disclosure of modifications and changes made to the Apple Machine Learning Research Model. The name, trademarks, service marks or logos of Apple may not be used to endorse or promote Model Derivatives or the relationship between You and Apple. “Model Derivatives” means any models or any other artifacts created by modifications, improvements, adaptations, alterations to the architecture, 5 | 6 | algorithm or training processes of the Apple Machine Learning Research Model, or by any retraining, fine-tuning of the Apple Machine Learning Research Model. 7 | 3. No Other License: Except as expressly stated in this notice, no other rights or licenses, express or implied, are granted by Apple herein, including but not limited to any patent, trademark, and similar intellectual property rights worldwide that may be infringed by the Apple Machine Learning Research Model, the Model Derivatives or by other works in which the Apple Machine Learning Research Model may be incorporated. 8 | 4. Compliance with Laws: Your use of Apple Machine Learning Research Model must be in compliance with all applicable laws and regulations. 9 | 5. Term and Termination: The term of this Agreement will begin upon your acceptance of this Agreement or use of the Apple Machine Learning Research Model and will continue until terminated in accordance with the following terms. Apple may terminate this Agreement at any time if You are in breach of any term or condition of this Agreement. Upon termination of this Agreement, You must cease to use all Apple Machine Learning Research Models and Model Derivatives and permanently delete any copy thereof. Sections 3, 6 and 7 will survive termination. 10 | 6. Disclaimer and Limitation of Liability: This Apple Machine Learning Research Model and any outputs generated by the Apple Machine Learning Research Model are provided on an “AS IS” basis. 
APPLE MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, REGARDING THE APPLE MACHINE LEARNING RESEARCH MODEL OR OUTPUTS GENERATED BY THE APPLE MACHINE LEARNING RESEARCH MODEL. You are solely responsible for determining the appropriateness of using or redistributing the Apple Machine Learning Research Model and any outputs of the Apple Machine Learning Research Model and assume any risks associated with Your use of the Apple Machine Learning Research Model and any output and results. IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION, MODIFICATION AND/OR DISTRIBUTION OF THE APPLE MACHINE LEARNING RESEARCH MODEL AND ANY OUTPUTS OF THE APPLE MACHINE LEARNING RESEARCH MODEL, HOWEVER CAUSED AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 11 | 7. Governing Law: This Agreement will be governed by and construed under the laws of the State of California without regard to its choice of law principles. The Convention on Contracts for the International Sale of Goods shall not apply to the Agreement except that the arbitration clause and any arbitration hereunder shall be governed by the Federal Arbitration Act, Chapters 1 and 2. 12 | Copyright (c) 2025 Apple Inc. All Rights Reserved. 13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CA-1M and Cubify Anything 2 | 3 | This repository includes the public implementation of Cubify Transformer and the 4 | associated CA-1M dataset. 5 | 6 | ## Paper 7 | 8 | **Apple** 9 | 10 | [Cubify Anything: Scaling Indoor 3D Object Detection](https://arxiv.org/abs/2412.04458) 11 | 12 | Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, Afshin Dehghan 13 | 14 | **CVPR 2025** 15 | 16 | ![Teaser](teaser.jpg?raw=true "Teaser") 17 | 18 | ## Repository Overview 19 | 20 | This repository includes: 21 | 22 | 1. Links to the underlying data and annotations of the CA-1M dataset. 23 | 2. Links to the released Cubify Transformer (CuTR) models from the Cubify Anything paper. 24 | 3. Basic readers and inference code to run CuTR on the provided data. 25 | 4. Basic support for using images captured from your own device using the NeRF Capture app. 26 | 27 | ## Installation 28 | 29 | We recommend Python 3.10 and a recent 2.x build of PyTorch. We include a `requirements.txt` which should cover 30 | all necessary dependencies. Please make sure you have `torch` installed first, e.g.,: 31 | 32 | ``` 33 | pip install torch torchvision 34 | ``` 35 | 36 | Then, within the root of the repository: 37 | 38 | ``` 39 | pip install -r requirements.txt 40 | pip install -e . 41 | ``` 42 | 43 | ## CA-1M versus ARKitScenes? 44 | 45 | This work is related to [ARKitScenes](https://machinelearning.apple.com/research/arkitscenes). We generally share 46 | the same underlying captures. Some notable differences in CA-1M: 47 | 48 | 1. Each scene has been exhaustively annotated with class-agnostic 3D boxes. We release these in the laser scanner's coordinate frame. 49 | 2. For each frame in each capture, we include "per-frame" 3D box ground-truth which was produced using the rendering 50 | process outlined in the Cubify Anything paper.
These annotations are, therefore, *independent* of any pose. 51 | 52 | Some other nice things: 53 | 54 | 1. We release the GT poses (registered to laser scanner) for every frame in each capture. 55 | 2. We release the GT depth (rendered from laser scanner) at 512 x 384 for every frame in each capture. 56 | 3. Each frame has already been oriented into an upright position. 57 | 58 | **NOTE:** CA-1M will only include captures which were successfully registered to the laser scanner. Therefore 59 | not every capture included in ARKitScenes will be present in CA-1M. 60 | 61 | ## Downloading and using the CA-1M data 62 | 63 | ### Data License 64 | 65 | **All data is released under the [CC-by-NC-ND](data/LICENSE_DATA).** 66 | 67 | All links to the data are contained in `data/train.txt` and `data/val.txt`. You can use `curl` to download all files 68 | listed. If you don't need the whole dataset in advance, you can either pass these 69 | links explicitly or pass the split's `txt` file itself and use the `--video-ids` argument to filter the desired videos. 70 | 71 | If you pass the `txt` file, please note that the downloaded files will be cached under `data/[split]`. 72 | 73 | ## Understanding the CA-1M data 74 | 75 | CA-1M is released in WebDataset format, so it is essentially a fancy tar archive 76 | *per* capture (i.e., a video). A single archive `ca1m-[split]-XXXXXXXX.tar` therefore corresponds to all data 77 | of capture XXXXXXXX. 78 | 79 | Both splits are released at full frame rate. 80 | 81 | All data should be neatly loaded by `CubifyAnythingDataset`. Please refer to `dataset.py` for more 82 | specifics on how to read/parse data on disk. Some general pointers: 83 | 84 | ```python 85 | [video_id]/[integer_timestamp].wide/image.png # A 1024x768 RGB image corresponding to the main camera. 86 | [video_id]/[integer_timestamp].wide/depth.png # A 256x192 depth image stored as a UInt16 (as millimeters) derived from the capture device's onboard LiDAR (ARKit depth). 87 | [video_id]/[integer_timestamp].wide/depth/confidence.tiff # A 256x192 confidence image storing the [0, 1] confidence value of each depth measurement (currently unused). 88 | [video_id]/[integer_timestamp].wide/instances.json # A list of GT instances alongside their 3D boxes (i.e., the result of the GT rendering process). 89 | [video_id]/[integer_timestamp].wide/T_gravity.json # A rotation matrix which encodes the pitch/roll of the camera, which we assume is known (e.g., from the IMU). 90 | 91 | [video_id]/[integer_timestamp].gt/RT.json # A 4x4 (row major) JSON-encoded matrix corresponding to the registered pose in the laser-scanner space. 92 | [video_id]/[integer_timestamp].gt/depth.png # A 512x384 depth image stored as a UInt16 (as millimeters) derived from the FARO laser scanner registration. 93 | 94 | ``` 95 | 96 | Note that since we have already oriented the images, these dimensions may be transposed. GT depth may have 0 values which correspond to unregistered points. 97 | 98 | An additional file is included as `[video_id]/world.gt/instances.json` which corresponds to the full world set of 3D annotations from which 99 | the per-frame labels are generated. These instances include some structural labels: `wall`, `floor`, `ceiling`, and `door_frame`, which 100 | might aid in rendering. 101 | 102 | ## Visualization 103 | 104 | We include visualization support using [rerun](https://rerun.io). Visualization should happen 105 | automatically. If you only want to visualize the data without running any models, use `--viz-only`.
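If you just want to poke at a downloaded archive directly (for example, to sanity-check the depth encoding or skim a frame's `instances.json`) without going through `CubifyAnythingDataset` or the visualizer, a minimal sketch using only the standard library, NumPy, and Pillow is shown below. The archive path is hypothetical, and the sketch relies only on the layout and UInt16-millimeter depth encoding documented above; the exact schema of `instances.json` is best checked against `dataset.py`.

```python
import io
import json
import tarfile

import numpy as np
from PIL import Image

# Hypothetical path to one downloaded capture archive.
tar_path = "ca1m-val-42898570.tar"

with tarfile.open(tar_path) as tar:
    for member in tar.getmembers():
        if member.name.endswith(".wide/instances.json"):
            # Per-frame GT instances (a JSON list; see dataset.py for the exact schema).
            instances = json.load(tar.extractfile(member))
            print(member.name, len(instances), "instances")
        elif member.name.endswith(".gt/depth.png"):
            # UInt16 depth in millimeters; 0 marks unregistered points.
            depth_mm = np.array(Image.open(io.BytesIO(tar.extractfile(member).read())))
            depth_m = depth_mm.astype(np.float32) / 1000.0
            print(member.name, depth_m.shape, "max %.2f m" % depth_m.max())
```

For anything beyond a quick inspection, prefer `CubifyAnythingDataset` (see `dataset.py`), which handles this decoding for you.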
106 | 107 | During inference, you may wish to inspect the 3D accuracy of the predictions. We support 108 | visualizing the predictions on the GT point cloud (derived from Faro depth) when using 109 | the `--viz-on-gt-points` flag. 110 | 111 | ### Sample command 112 | 113 | ``` bash 114 | python tools/demo.py [path_to_downloaded_data]/ca1m-val-42898570.tar --viz-only 115 | ``` 116 | 117 | ``` bash 118 | python tools/demo.py data/train.txt --viz-only --video-ids 45261548 119 | ``` 120 | 121 | ## Skipping Frames 122 | 123 | The data is provided at a high frame rate, so using `--every-nth-frame N` will only 124 | process every Nth frame. 125 | 126 | ## Running the CuTR models 127 | 128 | **All models are released under the Apple ML Research Model Terms of Use in [LICENSE_MODEL](LICENSE_MODEL).** 129 | 130 | 1. [RGB-D](https://ml-site.cdn-apple.com/models/cutr/cutr_rgbd.pth) 131 | 2. [RGB](https://ml-site.cdn-apple.com/models/cutr/cutr_rgb.pth) 132 | 133 | Models can be provided to `demo.py` using the `--model-path` argument. We detect whether this is an RGB 134 | or RGB-D model and disable depth accordingly. 135 | 136 | ### RGB-D 137 | 138 | The first variant of CuTR expects an RGB image and a metric depth map. We train on ARKit depth, 139 | although you may find it works with other metric depth estimators as well. 140 | 141 | #### Sample Command 142 | 143 | If your computer is MPS enabled: 144 | 145 | ``` bash 146 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device mps 147 | ``` 148 | 149 | If your computer is CUDA enabled: 150 | 151 | ``` bash 152 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device cuda 153 | ``` 154 | 155 | Otherwise: 156 | 157 | ``` bash 158 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device cpu 159 | ``` 160 | 161 | ### RGB Only 162 | 163 | The second variant of CuTR expects an RGB image alone and attempts to derive the metric scale of 164 | the scene from the image itself. 165 | 166 | #### Sample Command 167 | 168 | If your device is MPS enabled: 169 | 170 | ``` bash 171 | python tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgb.pth --viz-on-gt-points --device mps 172 | ``` 173 | 174 | ## Run on captures from your own device 175 | 176 | We also have basic support for running on RGB/Depth captured from your own device. 177 | 178 | 1. Make sure you have [NeRF Capture](https://apps.apple.com/au/app/nerfcapture/id6446518379) installed on your device. 179 | 2. Start the NeRF Capture app *before* running `demo.py` (force quit and reopen if for some reason things stop working or a connection is not made). 180 | 3. Run the normal commands but pass "stream" instead of the usual tar/folder path. 181 | 4. Hit "Send" in the app to send a frame for inference. This will be visualized in the rerun window. 182 | 183 | We will continue to print "Still waiting" to show liveness. 184 | 185 | If you have a device equipped with LiDAR, you can use this combined with the RGB-D models; otherwise, you can 186 | only use the RGB-only model.
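The streaming commands below pin `--device mps`; the same `--device cuda` / `--device cpu` variants shown earlier apply here as well. If you end up wrapping `demo.py` in your own script, a small device picker built on standard PyTorch checks avoids hard-coding the flag (a sketch, not part of this repository):

```python
import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())  # e.g., pass this value to demo.py via --device
```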
187 | 188 | #### RGB-D (on MPS) 189 | 190 | ``` bash 191 | python tools/demo.py stream --model-path [path_to_models]/cutr_rgbd.pth --device mps 192 | ``` 193 | 194 | #### RGB (on MPS) 195 | 196 | ``` bash 197 | python tools/demo.py stream --model-path [path_to_models]/cutr_rgb.pth --device mps 198 | ``` 199 | 200 | ## Citation 201 | 202 | If you use CA-1M or CuTR in your research, please use the following entry: 203 | 204 | ``` 205 | @article{lazarow2024cubify, 206 | title={Cubify Anything: Scaling Indoor 3D Object Detection}, 207 | author={Lazarow, Justin and Griffiths, David and Kohavi, Gefen and Crespo, Francisco and Dehghan, Afshin}, 208 | journal={arXiv preprint arXiv:2412.04458}, 209 | year={2024} 210 | } 211 | ``` 212 | 213 | ## Licenses 214 | 215 | The sample code is released under [Apple Sample Code License](LICENSE). 216 | 217 | The data is released under [CC-by-NC-ND](data/LICENSE_DATA). 218 | 219 | The models are released under [Apple ML Research Model Terms of Use](LICENSE_MODEL). 220 | 221 | ## Acknowledgements 222 | 223 | We use and acknowledge contributions from multiple open-source projects in [ACKNOWLEDGEMENTS](ACKNOWLEDGEMENTS). 224 | -------------------------------------------------------------------------------- /cubifyanything/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apple/ml-cubifyanything/7419eb0cb9b19cb5257b4a1dc905476c155cd343/cubifyanything/__init__.py -------------------------------------------------------------------------------- /cubifyanything/batching.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import copy 5 | import torch 6 | 7 | from typing import Any, Dict, Generic, List, Optional, Tuple, TypeVar, Union 8 | from typing_extensions import TypeAlias 9 | 10 | from cubifyanything.measurement import ( 11 | MeasurementInfo, 12 | DepthMeasurementInfo, 13 | ImageMeasurementInfo) 14 | 15 | from cubifyanything.sensor import ( 16 | SensorInfo, 17 | PosedSensorInfo) 18 | 19 | from cubifyanything.imagelist import ImageList 20 | from cubifyanything.instances import Instances3D 21 | 22 | T = TypeVar("T") 23 | I = TypeVar("I", bound=MeasurementInfo) 24 | S = TypeVar("S", bound=SensorInfo) 25 | 26 | class Measurement(Generic[T, I, S]): 27 | def __init__(self, data: T, info: I, sensor: S): 28 | self.data = data 29 | self.info = info 30 | self.sensor = sensor 31 | 32 | # This is painful, but stems from lack of multiple dispatch. 
33 | @classmethod 34 | def batch(cls, args: List["Measurement"], **kwargs) -> "BatchedMeasurement": 35 | if isinstance(args[0].info, (DepthMeasurementInfo,)): 36 | return BatchedPosedDepth( 37 | ImageList.from_tensors( 38 | [a.data for a in args], 39 | **kwargs), 40 | [a.info for a in args], 41 | [a.sensor for a in args]) 42 | elif isinstance(args[0].info, (ImageMeasurementInfo,)): 43 | return BatchedPosedImage( 44 | ImageList.from_tensors( 45 | [a.data for a in args], 46 | **kwargs), 47 | [a.info for a in args], 48 | [a.sensor for a in args]) 49 | else: 50 | raise NotImplementedError 51 | 52 | def to(self, *args: Any, **kwargs: Any) -> "Measurement": 53 | return self.__orig_class__( 54 | self.data.to(*args, **kwargs), 55 | self.info.to(*args, **kwargs), 56 | self.sensor.to(*args, **kwargs)) 57 | 58 | class BatchedMeasurement(Generic[T, I, S]): 59 | def __init__(self, data: T, info: List[I], sensor: List[S]): 60 | self.data = data 61 | self.info = info 62 | self.sensor = sensor 63 | 64 | @property 65 | def padding(self) -> int: 66 | raise NotImplementedError 67 | 68 | def __getitem__(self, index): 69 | # TODO: Also give data back (sliced). 70 | return self.__orig_class__( 71 | data=self.data if isinstance(self.data, ImageList) else self.data[index], 72 | info=self.info[index], 73 | sensor=self.sensor[index]) 74 | 75 | # For now, only shallow copy sensor itself (since has recursive references). 76 | def clone(self): 77 | return self.__orig_class__( 78 | [data_.clone() if hasattr(data_, "clone") else copy.copy(data_) for data_ in self.data], 79 | [info_.clone() for info_ in self.info], 80 | copy.copy(self.sensor)) 81 | 82 | PosedImage: TypeAlias = Measurement[torch.Tensor, ImageMeasurementInfo, PosedSensorInfo] 83 | PosedDepth: TypeAlias = Measurement[torch.Tensor, DepthMeasurementInfo, PosedSensorInfo] 84 | 85 | BatchedPosedImage: TypeAlias = BatchedMeasurement[ImageList, ImageMeasurementInfo, PosedSensorInfo] 86 | BatchedPosedDepth: TypeAlias = BatchedMeasurement[ImageList, DepthMeasurementInfo, PosedSensorInfo] 87 | 88 | Sensors: TypeAlias = Dict[str, Dict[str, Measurement]] 89 | BatchedSensors: TypeAlias = Dict[str, Dict[str, BatchedMeasurement]] 90 | BatchedPosedSensor: TypeAlias = Dict[str, Union[BatchedPosedImage, BatchedPosedDepth]] 91 | -------------------------------------------------------------------------------- /cubifyanything/boxes.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import numpy as np 5 | import torch 6 | import warnings 7 | 8 | from abc import abstractmethod 9 | from scipy.spatial.transform import Rotation 10 | from torch import Tensor 11 | from typing import Iterator, Optional, Sequence, Tuple, Union 12 | 13 | 14 | from enum import Enum 15 | class BoxDOF(Enum): 16 | All = 1 17 | GravityAligned = 2 18 | 19 | # Based on MMDet3D. 20 | def rotation_3d_in_axis( 21 | points: Union[np.ndarray, Tensor], 22 | angles: Union[np.ndarray, Tensor, float], 23 | axis: int = 0, 24 | return_mat: bool = False, 25 | clockwise: bool = False 26 | ) -> Union[Tuple[np.ndarray, np.ndarray], Tuple[Tensor, Tensor], np.ndarray, 27 | Tensor]: 28 | """Rotate points by angles according to axis. 29 | 30 | Args: 31 | points (np.ndarray or Tensor): Points with shape (N, M, 3). 32 | angles (np.ndarray or Tensor or float): Vector of angles with shape 33 | (N, ). 34 | axis (int): The axis to be rotated. Defaults to 0. 
35 | return_mat (bool): Whether or not to return the rotation matrix 36 | (transposed). Defaults to False. 37 | clockwise (bool): Whether the rotation is clockwise. Defaults to False. 38 | 39 | Raises: 40 | ValueError: When the axis is not in range [-3, -2, -1, 0, 1, 2], it 41 | will raise ValueError. 42 | 43 | Returns: 44 | Tuple[np.ndarray, np.ndarray] or Tuple[Tensor, Tensor] or np.ndarray or 45 | Tensor: Rotated points with shape (N, M, 3) and rotation matrix with 46 | shape (N, 3, 3). 47 | """ 48 | batch_free = len(points.shape) == 2 49 | if batch_free: 50 | points = points[None] 51 | 52 | if isinstance(angles, float) or len(angles.shape) == 0: 53 | angles = torch.full(points.shape[:1], angles) 54 | 55 | assert len(points.shape) == 3 and len(angles.shape) == 1 and \ 56 | points.shape[0] == angles.shape[0], 'Incorrect shape of points ' \ 57 | f'angles: {points.shape}, {angles.shape}' 58 | 59 | assert points.shape[-1] in [2, 3], \ 60 | f'Points size should be 2 or 3 instead of {points.shape[-1]}' 61 | 62 | rot_sin = torch.sin(angles) 63 | rot_cos = torch.cos(angles) 64 | ones = torch.ones_like(rot_cos) 65 | zeros = torch.zeros_like(rot_cos) 66 | 67 | if points.shape[-1] == 3: 68 | if axis == 1 or axis == -2: 69 | rot_mat_T = torch.stack([ 70 | torch.stack([rot_cos, zeros, -rot_sin]), 71 | torch.stack([zeros, ones, zeros]), 72 | torch.stack([rot_sin, zeros, rot_cos]) 73 | ]) 74 | elif axis == 2 or axis == -1: 75 | rot_mat_T = torch.stack([ 76 | torch.stack([rot_cos, rot_sin, zeros]), 77 | torch.stack([-rot_sin, rot_cos, zeros]), 78 | torch.stack([zeros, zeros, ones]) 79 | ]) 80 | elif axis == 0 or axis == -3: 81 | rot_mat_T = torch.stack([ 82 | torch.stack([ones, zeros, zeros]), 83 | torch.stack([zeros, rot_cos, rot_sin]), 84 | torch.stack([zeros, -rot_sin, rot_cos]) 85 | ]) 86 | else: 87 | raise ValueError( 88 | f'axis should in range [-3, -2, -1, 0, 1, 2], got {axis}') 89 | else: 90 | rot_mat_T = torch.stack([ 91 | torch.stack([rot_cos, rot_sin]), 92 | torch.stack([-rot_sin, rot_cos]) 93 | ]) 94 | 95 | if clockwise: 96 | rot_mat_T = rot_mat_T.transpose(0, 1) 97 | 98 | if points.shape[0] == 0: 99 | points_new = points 100 | else: 101 | points_new = torch.einsum('aij,jka->aik', points, rot_mat_T) 102 | 103 | if batch_free: 104 | points_new = points_new.squeeze(0) 105 | 106 | if return_mat: 107 | rot_mat_T = torch.einsum('jka->ajk', rot_mat_T) 108 | if batch_free: 109 | rot_mat_T = rot_mat_T.squeeze(0) 110 | return points_new, rot_mat_T 111 | else: 112 | return points_new 113 | 114 | # from MMDet3D. 115 | class BaseInstance3DBoxes: 116 | """Base class for 3D Boxes. 117 | 118 | Note: 119 | The box is bottom centered, i.e. the relative position of origin in the 120 | box is (0.5, 0.5, 0). 121 | 122 | Args: 123 | tensor (Tensor or np.ndarray or Sequence[Sequence[float]]): The boxes 124 | data with shape (N, box_dim). 125 | box_dim (int): Number of the dimension of a box. Each row is 126 | (x, y, z, x_size, y_size, z_size, yaw). Defaults to 7. 127 | with_yaw (bool): Whether the box is with yaw rotation. If False, the 128 | value of yaw will be set to 0 as minmax boxes. Defaults to True. 129 | origin (Tuple[float]): Relative position of the box origin. 130 | Defaults to (0.5, 0.5, 0). This will guide the box be converted to 131 | (0.5, 0.5, 0) mode. 132 | 133 | Attributes: 134 | tensor (Tensor): Float matrix with shape (N, box_dim). 135 | box_dim (int): Integer indicating the dimension of a box. Each row is 136 | (x, y, z, x_size, y_size, z_size, yaw, ...). 
137 | with_yaw (bool): If True, the value of yaw will be set to 0 as minmax 138 | boxes. 139 | """ 140 | 141 | YAW_AXIS: int = 0 142 | 143 | def __init__( 144 | self, 145 | tensor: Union[Tensor, np.ndarray, Sequence[Sequence[float]]], 146 | box_dim: int = 7, 147 | with_yaw: bool = True, 148 | origin: Tuple[float, float, float] = (0.5, 0.5, 0) 149 | ) -> None: 150 | if isinstance(tensor, Tensor): 151 | device = tensor.device 152 | else: 153 | device = torch.device('cpu') 154 | tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device) 155 | if tensor.numel() == 0: 156 | # Use reshape, so we don't end up creating a new tensor that does 157 | # not depend on the inputs (and consequently confuses jit) 158 | tensor = tensor.reshape((-1, box_dim)) 159 | assert tensor.dim() == 2 and tensor.size(-1) == box_dim, \ 160 | ('The box dimension must be 2 and the length of the last ' 161 | f'dimension must be {box_dim}, but got boxes with shape ' 162 | f'{tensor.shape}.') 163 | 164 | if tensor.shape[-1] == 6: 165 | # If the dimension of boxes is 6, we expand box_dim by padding 0 as 166 | # a fake yaw and set with_yaw to False 167 | assert box_dim == 6 168 | fake_rot = tensor.new_zeros(tensor.shape[0], 1) 169 | tensor = torch.cat((tensor, fake_rot), dim=-1) 170 | self.box_dim = box_dim + 1 171 | self.with_yaw = False 172 | else: 173 | self.box_dim = box_dim 174 | self.with_yaw = with_yaw 175 | self.tensor = tensor.clone() 176 | 177 | if origin != (0.5, 0.5, 0): 178 | dst = self.tensor.new_tensor((0.5, 0.5, 0)) 179 | src = self.tensor.new_tensor(origin) 180 | self.tensor[:, :3] += self.tensor[:, 3:6] * (dst - src) 181 | 182 | @property 183 | def shape(self) -> torch.Size: 184 | """torch.Size: Shape of boxes.""" 185 | return self.tensor.shape 186 | 187 | @property 188 | def volume(self) -> Tensor: 189 | """Tensor: A vector with volume of each box in shape (N, ).""" 190 | return self.tensor[:, 3] * self.tensor[:, 4] * self.tensor[:, 5] 191 | 192 | @property 193 | def dims(self) -> Tensor: 194 | """Tensor: Size dimensions of each box in shape (N, 3).""" 195 | return self.tensor[:, 3:6] 196 | 197 | @property 198 | def yaw(self) -> Tensor: 199 | """Tensor: A vector with yaw of each box in shape (N, ).""" 200 | return self.tensor[:, 6] 201 | 202 | @property 203 | def height(self) -> Tensor: 204 | """Tensor: A vector with height of each box in shape (N, ).""" 205 | return self.tensor[:, 5] 206 | 207 | @property 208 | def top_height(self) -> Tensor: 209 | """Tensor: A vector with top height of each box in shape (N, ).""" 210 | return self.bottom_height + self.height 211 | 212 | @property 213 | def bottom_height(self) -> Tensor: 214 | """Tensor: A vector with bottom height of each box in shape (N, ).""" 215 | return self.tensor[:, 2] 216 | 217 | @property 218 | def center(self) -> Tensor: 219 | """Calculate the center of all the boxes. 220 | 221 | Note: 222 | In MMDetection3D's convention, the bottom center is usually taken 223 | as the default center. 224 | 225 | The relative position of the centers in different kinds of boxes 226 | are different, e.g., the relative center of a boxes is 227 | (0.5, 1.0, 0.5) in camera and (0.5, 0.5, 0) in lidar. It is 228 | recommended to use ``bottom_center`` or ``gravity_center`` for 229 | clearer usage. 230 | 231 | Returns: 232 | Tensor: A tensor with center of each box in shape (N, 3). 
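        Example (an illustrative sketch; ``DepthInstance3DBoxes`` is defined later in
        this module and uses the default (0.5, 0.5, 0) origin):

            >>> boxes = DepthInstance3DBoxes(torch.tensor([[0., 0., 0., 1., 1., 2., 0.]]))
            >>> boxes.bottom_center   # tensor([[0., 0., 0.]])
            >>> boxes.gravity_center  # tensor([[0., 0., 1.]]) -- bottom z plus half the height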
233 | """ 234 | return self.bottom_center 235 | 236 | @property 237 | def bottom_center(self) -> Tensor: 238 | """Tensor: A tensor with center of each box in shape (N, 3).""" 239 | return self.tensor[:, :3] 240 | 241 | @property 242 | def gravity_center(self) -> Tensor: 243 | """Tensor: A tensor with center of each box in shape (N, 3).""" 244 | bottom_center = self.bottom_center 245 | gravity_center = torch.zeros_like(bottom_center) 246 | gravity_center[:, :2] = bottom_center[:, :2] 247 | gravity_center[:, 2] = bottom_center[:, 2] + self.tensor[:, 5] * 0.5 248 | return gravity_center 249 | 250 | @property 251 | def corners(self) -> Tensor: 252 | """Tensor: A tensor with 8 corners of each box in shape (N, 8, 3).""" 253 | pass 254 | 255 | @abstractmethod 256 | def rotate( 257 | self, 258 | angle: Union[Tensor, np.ndarray, float], 259 | points: Optional[Union[Tensor, np.ndarray]] = None 260 | ) -> Union[Tuple[Tensor, Tensor], Tuple[np.ndarray, np.ndarray], Tuple[ 261 | Tensor], None]: 262 | """Rotate boxes with points (optional) with the given angle or rotation 263 | matrix. 264 | 265 | Args: 266 | angle (Tensor or np.ndarray or float): Rotation angle or rotation 267 | matrix. 268 | points (Tensor or np.ndarray or :obj:`BasePoints`, optional): 269 | Points to rotate. Defaults to None. 270 | 271 | Returns: 272 | tuple or None: When ``points`` is None, the function returns None, 273 | otherwise it returns the rotated points and the rotation matrix 274 | ``rot_mat_T``. 275 | """ 276 | pass 277 | 278 | def translate(self, trans_vector: Union[Tensor, np.ndarray]) -> None: 279 | """Translate boxes with the given translation vector. 280 | 281 | Args: 282 | trans_vector (Tensor or np.ndarray): Translation vector of size 283 | 1x3. 284 | """ 285 | if not isinstance(trans_vector, Tensor): 286 | trans_vector = self.tensor.new_tensor(trans_vector) 287 | 288 | self.tensor[:, :3] += trans_vector 289 | 290 | return self 291 | 292 | def in_range_3d( 293 | self, box_range: Union[Tensor, np.ndarray, 294 | Sequence[float]]) -> Tensor: 295 | """Check whether the boxes are in the given range. 296 | 297 | Args: 298 | box_range (Tensor or np.ndarray or Sequence[float]): The range of 299 | box (x_min, y_min, z_min, x_max, y_max, z_max). 300 | 301 | Note: 302 | In the original implementation of SECOND, checking whether a box in 303 | the range checks whether the points are in a convex polygon, we try 304 | to reduce the burden for simpler cases. 305 | 306 | Returns: 307 | Tensor: A binary vector indicating whether each point is inside the 308 | reference range. 309 | """ 310 | in_range_flags = ((self.tensor[:, 0] > box_range[0]) 311 | & (self.tensor[:, 1] > box_range[1]) 312 | & (self.tensor[:, 2] > box_range[2]) 313 | & (self.tensor[:, 0] < box_range[3]) 314 | & (self.tensor[:, 1] < box_range[4]) 315 | & (self.tensor[:, 2] < box_range[5])) 316 | return in_range_flags 317 | 318 | @abstractmethod 319 | def convert_to(self, 320 | dst: int, 321 | rt_mat: Optional[Union[Tensor, np.ndarray]] = None, 322 | correct_yaw: bool = False) -> 'BaseInstance3DBoxes': 323 | """Convert self to ``dst`` mode. 324 | 325 | Args: 326 | dst (int): The target Box mode. 327 | rt_mat (Tensor or np.ndarray, optional): The rotation and 328 | translation matrix between different coordinates. 329 | Defaults to None. The conversion from ``src`` coordinates to 330 | ``dst`` coordinates usually comes along the change of sensors, 331 | e.g., from camera to LiDAR. This requires a transformation 332 | matrix. 
333 | correct_yaw (bool): Whether to convert the yaw angle to the target 334 | coordinate. Defaults to False. 335 | 336 | Returns: 337 | :obj:`BaseInstance3DBoxes`: The converted box of the same type in 338 | the ``dst`` mode. 339 | """ 340 | pass 341 | 342 | def scale(self, scale_factor: float) -> None: 343 | """Scale the box with horizontal and vertical scaling factors. 344 | 345 | Args: 346 | scale_factors (float): Scale factors to scale the boxes. 347 | """ 348 | self.tensor[:, :6] *= scale_factor 349 | self.tensor[:, 7:] *= scale_factor # velocity 350 | 351 | def nonempty(self, threshold: float = 0.0) -> Tensor: 352 | """Find boxes that are non-empty. 353 | 354 | A box is considered empty if either of its side is no larger than 355 | threshold. 356 | 357 | Args: 358 | threshold (float): The threshold of minimal sizes. Defaults to 0.0. 359 | 360 | Returns: 361 | Tensor: A binary vector which represents whether each box is empty 362 | (False) or non-empty (True). 363 | """ 364 | box = self.tensor 365 | size_x = box[..., 3] 366 | size_y = box[..., 4] 367 | size_z = box[..., 5] 368 | keep = ((size_x > threshold) 369 | & (size_y > threshold) & (size_z > threshold)) 370 | return keep 371 | 372 | def __getitem__( 373 | self, item: Union[int, slice, np.ndarray, 374 | Tensor]) -> 'BaseInstance3DBoxes': 375 | """ 376 | Args: 377 | item (int or slice or np.ndarray or Tensor): Index of boxes. 378 | 379 | Note: 380 | The following usage are allowed: 381 | 382 | 1. `new_boxes = boxes[3]`: Return a `Boxes` that contains only one 383 | box. 384 | 2. `new_boxes = boxes[2:10]`: Return a slice of boxes. 385 | 3. `new_boxes = boxes[vector]`: Where vector is a 386 | torch.BoolTensor with `length = len(boxes)`. Nonzero elements in 387 | the vector will be selected. 388 | 389 | Note that the returned Boxes might share storage with this Boxes, 390 | subject to PyTorch's indexing semantics. 391 | 392 | Returns: 393 | :obj:`BaseInstance3DBoxes`: A new object of 394 | :class:`BaseInstance3DBoxes` after indexing. 395 | """ 396 | original_type = type(self) 397 | if isinstance(item, int): 398 | return original_type( 399 | self.tensor[item].view(1, -1), 400 | box_dim=self.box_dim, 401 | with_yaw=self.with_yaw) 402 | b = self.tensor[item] 403 | assert b.dim() == 2, \ 404 | f'Indexing on Boxes with {item} failed to return a matrix!' 405 | return original_type(b, box_dim=self.box_dim, with_yaw=self.with_yaw) 406 | 407 | def __len__(self) -> int: 408 | """int: Number of boxes in the current object.""" 409 | return self.tensor.shape[0] 410 | 411 | def __repr__(self) -> str: 412 | """str: Return a string that describes the object.""" 413 | return self.__class__.__name__ + '(\n ' + str(self.tensor) + ')' 414 | 415 | def clone(self) -> 'BaseInstance3DBoxes': 416 | """Clone the boxes. 417 | 418 | Returns: 419 | :obj:`BaseInstance3DBoxes`: Box object with the same properties as 420 | self. 421 | """ 422 | original_type = type(self) 423 | return original_type( 424 | self.tensor.clone(), box_dim=self.box_dim, with_yaw=self.with_yaw) 425 | 426 | @classmethod 427 | def cat(cls, boxes_list: Sequence['BaseInstance3DBoxes'] 428 | ) -> 'BaseInstance3DBoxes': 429 | """Concatenate a list of Boxes into a single Boxes. 430 | 431 | Args: 432 | boxes_list (Sequence[:obj:`BaseInstance3DBoxes`]): List of boxes. 433 | 434 | Returns: 435 | :obj:`BaseInstance3DBoxes`: The concatenated boxes. 
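        Example (an illustrative sketch)::

            >>> a = DepthInstance3DBoxes(torch.rand(2, 7))
            >>> b = DepthInstance3DBoxes(torch.rand(3, 7))
            >>> merged = DepthInstance3DBoxes.cat([a, b])
            >>> len(merged)  # 5; storage is never shared with the inputs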
436 | """ 437 | assert isinstance(boxes_list, (list, tuple)) 438 | if len(boxes_list) == 0: 439 | return cls(torch.empty(0)) 440 | assert all(isinstance(box, cls) for box in boxes_list) 441 | 442 | # use torch.cat (v.s. layers.cat) 443 | # so the returned boxes never share storage with input 444 | cat_boxes = cls( 445 | torch.cat([b.tensor for b in boxes_list], dim=0), 446 | box_dim=boxes_list[0].box_dim, 447 | with_yaw=boxes_list[0].with_yaw) 448 | return cat_boxes 449 | 450 | @property 451 | def bev(self) -> Tensor: 452 | """Tensor: 2D BEV box of each box with rotation in XYWHR format, in 453 | shape (N, 5).""" 454 | return self.tensor[:, [0, 1, 3, 4, 6]] 455 | 456 | def new_box( 457 | self, data: Union[Tensor, np.ndarray, Sequence[Sequence[float]]] 458 | ) -> 'BaseInstance3DBoxes': 459 | """Create a new box object with data. 460 | 461 | The new box and its tensor has the similar properties as self and 462 | self.tensor, respectively. 463 | 464 | Args: 465 | data (Tensor or np.ndarray or Sequence[Sequence[float]]): Data to 466 | be copied. 467 | 468 | Returns: 469 | :obj:`BaseInstance3DBoxes`: A new bbox object with ``data``, the 470 | object's other properties are similar to ``self``. 471 | """ 472 | new_tensor = self.tensor.new_tensor(data) \ 473 | if not isinstance(data, Tensor) else data.to(self.device) 474 | original_type = type(self) 475 | return original_type( 476 | new_tensor, box_dim=self.box_dim, with_yaw=self.with_yaw) 477 | 478 | def numpy(self) -> np.ndarray: 479 | """Reload ``numpy`` from self.tensor.""" 480 | return self.tensor.numpy() 481 | 482 | def to(self, device: Union[str, torch.device], *args, 483 | **kwargs) -> 'BaseInstance3DBoxes': 484 | """Convert current boxes to a specific device. 485 | 486 | Args: 487 | device (str or :obj:`torch.device`): The name of the device. 488 | 489 | Returns: 490 | :obj:`BaseInstance3DBoxes`: A new boxes object on the specific 491 | device. 492 | """ 493 | original_type = type(self) 494 | return original_type( 495 | self.tensor.to(device, *args, **kwargs), 496 | box_dim=self.box_dim, 497 | with_yaw=self.with_yaw) 498 | 499 | @property 500 | def device(self) -> torch.device: 501 | """torch.device: The device of the boxes are on.""" 502 | return self.tensor.device 503 | 504 | def __iter__(self) -> Iterator[Tensor]: 505 | """Yield a box as a Tensor at a time. 506 | 507 | Returns: 508 | Iterator[Tensor]: A box of shape (box_dim, ). 
509 | """ 510 | yield from self.tensor 511 | 512 | class DepthInstance3DBoxes(BaseInstance3DBoxes): 513 | YAW_AXIS = 2 514 | 515 | @property 516 | def gravity_center(self): 517 | """torch.Tensor: A tensor with center of each box in shape (N, 3).""" 518 | bottom_center = self.bottom_center 519 | gravity_center = torch.zeros_like(bottom_center) 520 | gravity_center[:, :2] = bottom_center[:, :2] 521 | gravity_center[:, 2] = bottom_center[:, 2] + self.tensor[:, 5] * 0.5 522 | return gravity_center 523 | 524 | @property 525 | def corners(self): 526 | if self.tensor.numel() == 0: 527 | return torch.empty([0, 8, 3], device=self.tensor.device) 528 | 529 | dims = self.dims 530 | corners_norm = torch.from_numpy( 531 | np.stack(np.unravel_index(np.arange(8), [2] * 3), axis=1)).to( 532 | device=dims.device, dtype=dims.dtype) 533 | 534 | corners_norm = corners_norm[[0, 1, 3, 2, 4, 5, 7, 6]] 535 | # use relative origin (0.5, 0.5, 0) 536 | corners_norm = corners_norm - dims.new_tensor([0.5, 0.5, 0]) 537 | corners = dims.view([-1, 1, 3]) * corners_norm.reshape([1, 8, 3]) 538 | 539 | # rotate around z axis 540 | corners = rotation_3d_in_axis( 541 | corners, self.tensor[:, 6], axis=self.YAW_AXIS) 542 | corners += self.tensor[:, :3].view(-1, 1, 3) 543 | return corners 544 | 545 | def rotate(self, angle): 546 | """Rotate boxes 547 | 548 | Args: 549 | angle (float | torch.Tensor | np.ndarray): 550 | Rotation angle or rotation matrix. 551 | points (torch.Tensor | np.ndarray | :obj:`BasePoints`, optional): 552 | Points to rotate. Defaults to None. 553 | 554 | Returns: 555 | tuple or None: When ``points`` is None, the function returns 556 | None, otherwise it returns the rotated points and the 557 | rotation matrix ``rot_mat_T``. 558 | """ 559 | if not isinstance(angle, torch.Tensor): 560 | angle = self.tensor.new_tensor(angle) 561 | 562 | assert angle.shape == torch.Size([3, 3]) or angle.numel() == 1, \ 563 | f'invalid rotation angle shape {angle.shape}' 564 | 565 | if angle.numel() == 1: 566 | self.tensor[:, 0:3], rot_mat_T = rotation_3d_in_axis( 567 | self.tensor[:, 0:3], 568 | angle, 569 | axis=self.YAW_AXIS, 570 | return_mat=True) 571 | else: 572 | rot_mat_T = angle 573 | rot_sin = rot_mat_T[0, 1] 574 | rot_cos = rot_mat_T[0, 0] 575 | angle = torch.arctan2(rot_sin, rot_cos) 576 | self.tensor[:, 0:3] = self.tensor[:, 0:3] @ rot_mat_T 577 | 578 | if self.with_yaw: 579 | self.tensor[:, 6] += angle 580 | else: 581 | # for axis-aligned boxes, we take the new 582 | # enclosing axis-aligned boxes after rotation 583 | corners_rot = self.corners @ rot_mat_T 584 | new_x_size = corners_rot[..., 0].max( 585 | dim=1, keepdim=True)[0] - corners_rot[..., 0].min( 586 | dim=1, keepdim=True)[0] 587 | new_y_size = corners_rot[..., 1].max( 588 | dim=1, keepdim=True)[0] - corners_rot[..., 1].min( 589 | dim=1, keepdim=True)[0] 590 | self.tensor[:, 3:5] = torch.cat((new_x_size, new_y_size), dim=-1) 591 | 592 | # I've modified this to remove point support and return self so this can be chained (usually you want a clone() first). 593 | return self 594 | 595 | def flip(self, bev_direction='horizontal', points=None): 596 | """Flip the boxes in BEV along given BEV direction. 597 | 598 | In Depth coordinates, it flips x (horizontal) or y (vertical) axis. 599 | 600 | Args: 601 | bev_direction (str, optional): Flip direction 602 | (horizontal or vertical). Defaults to 'horizontal'. 603 | points (torch.Tensor | np.ndarray | :obj:`BasePoints`, optional): 604 | Points to flip. Defaults to None. 
605 | 606 | Returns: 607 | torch.Tensor, numpy.ndarray or None: Flipped points. 608 | """ 609 | assert bev_direction in ('horizontal', 'vertical') 610 | if bev_direction == 'horizontal': 611 | self.tensor[:, 0::7] = -self.tensor[:, 0::7] 612 | if self.with_yaw: 613 | self.tensor[:, 6] = -self.tensor[:, 6] + np.pi 614 | elif bev_direction == 'vertical': 615 | self.tensor[:, 1::7] = -self.tensor[:, 1::7] 616 | if self.with_yaw: 617 | self.tensor[:, 6] = -self.tensor[:, 6] 618 | 619 | if points is not None: 620 | assert isinstance(points, (torch.Tensor, np.ndarray, BasePoints)) 621 | if isinstance(points, (torch.Tensor, np.ndarray)): 622 | if bev_direction == 'horizontal': 623 | points[:, 0] = -points[:, 0] 624 | elif bev_direction == 'vertical': 625 | points[:, 1] = -points[:, 1] 626 | elif isinstance(points, BasePoints): 627 | points.flip(bev_direction) 628 | return points 629 | 630 | def enlarged_box(self, extra_width): 631 | """Enlarge the length, width and height boxes. 632 | Args: 633 | extra_width (float | torch.Tensor): Extra width to enlarge the box. 634 | Returns: 635 | :obj:`DepthInstance3DBoxes`: Enlarged boxes. 636 | """ 637 | enlarged_boxes = self.tensor.clone() 638 | enlarged_boxes[:, 3:6] += extra_width * 2 639 | # bottom center z minus extra_width 640 | if isinstance(extra_width, torch.Tensor) and (extra_width.shape[-1] == 3): 641 | enlarged_boxes[:, 2] -= extra_width[:, 2] 642 | else: 643 | enlarged_boxes[:, 2] -= extra_width 644 | 645 | return self.new_box(enlarged_boxes) 646 | 647 | def to_camera(self, RT) -> 'GeneralInstance3DBoxes': 648 | # corners -> expected permutation. 649 | corners = self.corners[:, [1, 5, 4, 0, 2, 6, 7, 3]] 650 | corners = torch.cat((corners, torch.ones_like(corners[..., :1])), dim=-1) 651 | corners = torch.linalg.inv(RT) @ corners.permute(0, 2, 1) 652 | corners = corners[:, :3].permute(0, 2, 1) 653 | 654 | return corners_to_camera(corners) 655 | 656 | class GeneralInstance3DBoxes(object): 657 | def __init__(self, xyzlhw, R, box_dim=6 + 3 * 3, origin=(0.5, 0.5, 0), dof=BoxDOF.All): 658 | if isinstance(xyzlhw, torch.Tensor): 659 | device = xyzlhw.device 660 | else: 661 | device = torch.device('cpu') 662 | 663 | xyzlhw = torch.as_tensor(xyzlhw, dtype=torch.float32, device=device) 664 | R = torch.as_tensor(R, dtype=torch.float32, device=device) 665 | 666 | self.dof = dof 667 | self.box_dim = box_dim 668 | self.tensor = xyzlhw.clone() 669 | self.R = R.clone() 670 | 671 | @classmethod 672 | def empty(cls, dof=BoxDOF.GravityAligned): 673 | return GeneralInstance3DBoxes( 674 | torch.zeros((0, 6)), 675 | torch.zeros((0, 3, 3)), 676 | dof=dof) 677 | 678 | @property 679 | def volume(self): 680 | """torch.Tensor: A vector with volume of each box.""" 681 | return self.tensor[:, 3] * self.tensor[:, 4] * self.tensor[:, 5] 682 | 683 | @property 684 | def dims(self): 685 | """torch.Tensor: Size dimensions of each box in shape (N, 3).""" 686 | return self.tensor[:, 3:6] 687 | 688 | @property 689 | def whl(self): 690 | return self.tensor[:, [5, 4, 3]] 691 | 692 | @property 693 | def xyzwhl(self): 694 | return self.tensor[:, [0, 1, 2, 5, 4, 3]] 695 | 696 | @property 697 | def center(self): 698 | """Calculate the center of all the boxes. 699 | 700 | Note: 701 | In MMDetection3D's convention, the bottom center is 702 | usually taken as the default center. 703 | 704 | The relative position of the centers in different kinds of 705 | boxes are different, e.g., the relative center of a boxes is 706 | (0.5, 1.0, 0.5) in camera and (0.5, 0.5, 0) in lidar. 
707 | It is recommended to use ``bottom_center`` or ``gravity_center`` 708 | for clearer usage. 709 | 710 | Returns: 711 | torch.Tensor: A tensor with center of each box in shape (N, 3). 712 | """ 713 | return self.gravity_center 714 | 715 | @property 716 | def bottom_center(self): 717 | """torch.Tensor: A tensor with center of each box in shape (N, 3).""" 718 | raise ValueError("not supported") 719 | 720 | @property 721 | def gravity_center(self): 722 | """torch.Tensor: A tensor with center of each box in shape (N, 3).""" 723 | return self.tensor[:, :3] 724 | 725 | @property 726 | def corners(self): 727 | """torch.Tensor: 728 | a tensor with 8 corners of each box in shape (N, 8, 3).""" 729 | x3d = self.tensor[:, 0].unsqueeze(1) 730 | y3d = self.tensor[:, 1].unsqueeze(1) 731 | z3d = self.tensor[:, 2].unsqueeze(1) 732 | w3d = self.tensor[:, 5].unsqueeze(1) 733 | h3d = self.tensor[:, 4].unsqueeze(1) 734 | l3d = self.tensor[:, 3].unsqueeze(1) 735 | 736 | ''' 737 | v4_____________________v5 738 | /| /| 739 | / | / | 740 | / | / | 741 | /___|_________________/ | 742 | v0| | |v1 | 743 | | | | | 744 | | | | | 745 | | | | | 746 | | |_________________|___| 747 | | / v7 | /v6 748 | | / | / 749 | | / | / 750 | |/_____________________|/ 751 | v3 v2 752 | ''' 753 | 754 | verts = torch.zeros([len(self), 3, 8], device=self.device) 755 | 756 | # setup X 757 | verts[:, 0, [0, 3, 4, 7]] = -l3d / 2 758 | verts[:, 0, [1, 2, 5, 6]] = l3d / 2 759 | 760 | # setup Y 761 | verts[:, 1, [0, 1, 4, 5]] = -h3d / 2 762 | verts[:, 1, [2, 3, 6, 7]] = h3d / 2 763 | 764 | # setup Z 765 | verts[:, 2, [0, 1, 2, 3]] = -w3d / 2 766 | verts[:, 2, [4, 5, 6, 7]] = w3d / 2 767 | 768 | # rotate 769 | verts = self.R @ verts 770 | 771 | # translate 772 | verts[:, 0, :] += x3d 773 | verts[:, 1, :] += y3d 774 | verts[:, 2, :] += z3d 775 | 776 | verts = verts.transpose(1, 2) 777 | return verts 778 | 779 | @property 780 | def bev(self): 781 | """torch.Tensor: 2D BEV box of each box with rotation 782 | in XYWHR format, in shape (N, 5).""" 783 | pass 784 | 785 | @property 786 | def nearest_bev(self): 787 | """torch.Tensor: A tensor of 2D BEV box of each box 788 | without rotation.""" 789 | pass 790 | 791 | def in_range_bev(self, box_range): 792 | """Check whether the boxes are in the given range. 793 | 794 | Args: 795 | box_range (list | torch.Tensor): the range of box 796 | (x_min, y_min, x_max, y_max) 797 | 798 | Note: 799 | The original implementation of SECOND checks whether boxes in 800 | a range by checking whether the points are in a convex 801 | polygon, we reduce the burden for simpler cases. 802 | 803 | Returns: 804 | torch.Tensor: Whether each box is inside the reference range. 805 | """ 806 | raise ValueError("not supported") 807 | 808 | @abstractmethod 809 | def rotate(self, angle, points=None): 810 | """Rotate boxes with points (optional) with the given angle or rotation 811 | matrix. 812 | 813 | Args: 814 | angle (float | torch.Tensor | np.ndarray): 815 | Rotation angle or rotation matrix. 816 | points (torch.Tensor | numpy.ndarray | 817 | :obj:`BasePoints`, optional): 818 | Points to rotate. Defaults to None. 819 | """ 820 | pass 821 | 822 | def translate(self, trans_vector): 823 | """Translate boxes with the given translation vector. 824 | 825 | Args: 826 | trans_vector (torch.Tensor): Translation vector of size (1, 3). 
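        Example (an illustrative sketch; a (3,) vector broadcasts over all N boxes)::

            >>> boxes.translate(torch.tensor([0.1, 0.0, 0.0]))  # shift every center by +0.1 along x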
827 | """ 828 | if not isinstance(trans_vector, torch.Tensor): 829 | trans_vector = self.tensor.new_tensor(trans_vector) 830 | self.tensor[:, :3] += trans_vector 831 | 832 | def __getitem__(self, item): 833 | original_type = type(self) 834 | if isinstance(item, int): 835 | return original_type( 836 | self.tensor[item].view(1, -1), 837 | self.R[item].view(1, 3, 3), 838 | dof=self.dof) 839 | 840 | b = self.tensor[item] 841 | r = self.R[item] 842 | assert b.dim() == 2, \ 843 | f'Indexing on Boxes with {item} failed to return a matrix!' 844 | return original_type(b, r, dof=self.dof) 845 | 846 | def __len__(self): 847 | """int: Number of boxes in the current object.""" 848 | return self.tensor.shape[0] 849 | 850 | def __repr__(self): 851 | """str: Return a strings that describes the object.""" 852 | return self.__class__.__name__ + '(\n ' + str(self.tensor) + ')' 853 | 854 | @classmethod 855 | def cat(cls, boxes_list): 856 | """Concatenate a list of Boxes into a single Boxes. 857 | 858 | Args: 859 | boxes_list (list[:obj:`BaseInstance3DBoxes`]): List of boxes. 860 | 861 | Returns: 862 | :obj:`BaseInstance3DBoxes`: The concatenated Boxes. 863 | """ 864 | assert isinstance(boxes_list, (list, tuple)) 865 | if len(boxes_list) == 0: 866 | return cls(torch.empty(0)) 867 | assert all(isinstance(box, cls) for box in boxes_list) 868 | 869 | first_dof = boxes_list[0].dof 870 | assert all(box.dof == first_dof for box in boxes_list) 871 | 872 | # use torch.cat (v.s. layers.cat) 873 | # so the returned boxes never share storage with input 874 | cat_boxes = cls( 875 | xyzlhw=torch.cat([b.tensor for b in boxes_list], dim=0), 876 | R=torch.cat([b.R for b in boxes_list], dim=0), 877 | dof=boxes_list[0].dof) 878 | 879 | return cat_boxes 880 | 881 | def split(self, split_size_or_sections): 882 | tensors = torch.split(self.tensor, split_size_or_sections) 883 | Rs = torch.split(self.R, split_size_or_sections) 884 | 885 | return [ 886 | type(self)( 887 | xyzlhw=tensor, 888 | R=R, 889 | dof=self.dof 890 | ) for tensor, R in zip(tensors, Rs) 891 | ] 892 | 893 | def to(self, device): 894 | """Convert current boxes to a specific device. 895 | 896 | Args: 897 | device (str | :obj:`torch.device`): The name of the device. 898 | 899 | Returns: 900 | :obj:`BaseInstance3DBoxes`: A new boxes object on the 901 | specific device. 902 | """ 903 | return GeneralInstance3DBoxes( 904 | self.tensor.to(device), 905 | R=self.R.to(device), 906 | dof=self.dof) 907 | 908 | def clone(self) -> 'GeneralInstance3DBoxes': 909 | """Clone the boxes. 910 | 911 | Returns: 912 | :obj:`GeneralInstance3DBoxes`: Box object with the same properties as 913 | self. 914 | """ 915 | original_type = type(self) 916 | return original_type( 917 | self.tensor.clone(), self.R.clone(), dof=self.dof) 918 | 919 | @property 920 | def device(self): 921 | """str: The device of the boxes are on.""" 922 | return self.tensor.device 923 | 924 | def __iter__(self): 925 | """Yield a box as a Tensor of shape (4,) at a time. 926 | 927 | Returns: 928 | torch.Tensor: A box of shape (4,). 
929 | """ 930 | yield from self.tensor 931 | -------------------------------------------------------------------------------- /cubifyanything/capture_stream.py: -------------------------------------------------------------------------------- 1 | """ 2 | Dataset to stream RGB-D data from the NeRFCapture iOS App -> Cubify Transformer 3 | 4 | Adapted from SplaTaM: https://github.com/spla-tam/SplaTAM 5 | """ 6 | 7 | import numpy as np 8 | import time 9 | import torch 10 | 11 | import cyclonedds.idl as idl 12 | import cyclonedds.idl.annotations as annotate 13 | import cyclonedds.idl.types as types 14 | 15 | from dataclasses import dataclass 16 | from cyclonedds.domain import DomainParticipant, Domain 17 | from cyclonedds.core import Qos, Policy 18 | from cyclonedds.sub import DataReader 19 | from cyclonedds.topic import Topic 20 | from cyclonedds.util import duration 21 | 22 | from PIL import Image 23 | from scipy.spatial.transform import Rotation 24 | from torch.utils.data import IterableDataset 25 | 26 | from cubifyanything.boxes import DepthInstance3DBoxes 27 | from cubifyanything.measurement import ImageMeasurementInfo, DepthMeasurementInfo 28 | from cubifyanything.orientation import ImageOrientation, rotate_tensor, ROT_Z 29 | from cubifyanything.sensor import SensorArrayInfo, SensorInfo, PosedSensorInfo 30 | 31 | # DDS 32 | # ================================================================================================== 33 | @dataclass 34 | @annotate.final 35 | @annotate.autoid("sequential") 36 | class CaptureFrame(idl.IdlStruct, typename="CaptureData.CaptureFrame"): 37 | id: types.uint32 38 | annotate.key("id") 39 | timestamp: types.float64 40 | fl_x: types.float32 41 | fl_y: types.float32 42 | cx: types.float32 43 | cy: types.float32 44 | transform_matrix: types.array[types.float32, 16] 45 | width: types.uint32 46 | height: types.uint32 47 | image: types.sequence[types.uint8] 48 | has_depth: bool 49 | depth_width: types.uint32 50 | depth_height: types.uint32 51 | depth_scale: types.float32 52 | depth_image: types.sequence[types.uint8] 53 | 54 | # 8 MB seems to work for me, but not 10 MB. 55 | dds_config = """ \ 56 | \ 57 | \ 58 | \ 59 | 8MB \ 60 | \ 61 | \ 62 | config \ 63 | stdout \ 64 | \ 65 | \ 66 | \ 67 | """ 68 | 69 | T_RW_to_VW = np.array([[0, 0, -1, 0], 70 | [-1, 0, 0, 0], 71 | [0, 1, 0, 0], 72 | [ 0, 0, 0, 1]]).reshape((4,4)).astype(np.float32) 73 | 74 | T_RC_to_VC = np.array([[1, 0, 0, 0], 75 | [0, -1, 0, 0], 76 | [0, 0, -1, 0], 77 | [0, 0, 0, 1]]).reshape((4,4)).astype(np.float32) 78 | 79 | T_VC_to_RC = np.array([[1, 0, 0, 0], 80 | [0, -1, 0, 0], 81 | [0, 0, -1, 0], 82 | [0, 0, 0, 1]]).reshape((4,4)).astype(np.float32) 83 | 84 | def compute_VC2VW_from_RC2RW(T_RC_to_RW): 85 | T_vc2rw = np.matmul(T_RC_to_RW,T_VC_to_RC) 86 | T_vc2vw = np.matmul(T_RW_to_VW,T_vc2rw) 87 | return T_vc2vw 88 | 89 | def get_camera_to_gravity_transform(pose, current, target=ImageOrientation.UPRIGHT): 90 | z_rot_4x4 = torch.eye(4).float() 91 | z_rot_4x4[:3, :3] = ROT_Z[(current, target)] 92 | pose = pose @ torch.linalg.inv(z_rot_4x4.to(pose)) 93 | 94 | # This is somewhat lazy. 
95 | fake_corners = DepthInstance3DBoxes( 96 | np.array([[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]])).corners[:, [1, 5, 4, 0, 2, 6, 7, 3]] 97 | fake_corners = torch.cat((fake_corners, torch.ones_like(fake_corners[..., :1])), dim=-1).to(pose) 98 | 99 | fake_corners = (torch.linalg.inv(pose) @ fake_corners.permute(0, 2, 1)).permute(0, 2, 1)[..., :3] 100 | fake_basis = torch.stack([ 101 | (fake_corners[:, 1] - fake_corners[:, 0]) / torch.linalg.norm(fake_corners[:, 1] - fake_corners[:, 0], dim=-1)[:, None], 102 | (fake_corners[:, 3] - fake_corners[:, 0]) / torch.linalg.norm(fake_corners[:, 3] - fake_corners[:, 0], dim=-1)[:, None], 103 | (fake_corners[:, 4] - fake_corners[:, 0]) / torch.linalg.norm(fake_corners[:, 4] - fake_corners[:, 0], dim=-1)[:, None], 104 | ], dim=1).permute(0, 2, 1) 105 | 106 | # this gets applied _after_ predictions to put it in camera space. 107 | T = Rotation.from_euler("xz", Rotation.from_matrix(fake_basis[-1].cpu().numpy()).as_euler("yxz")[1:]).as_matrix() 108 | 109 | return torch.tensor(T).to(pose) 110 | 111 | MAX_LONG_SIDE = 1024 112 | 113 | # Acts like CubifyAnythingDataset but reads from the NeRFCapture stream. 114 | class CaptureDataset(IterableDataset): 115 | def __init__(self, load_arkit_depth=True): 116 | super(CaptureDataset, self).__init__() 117 | 118 | self.load_arkit_depth = load_arkit_depth 119 | 120 | self.domain = Domain(domain_id=0, config=dds_config) 121 | self.participant = DomainParticipant() 122 | self.qos = Qos(Policy.Reliability.Reliable( 123 | max_blocking_time=duration(seconds=1))) 124 | self.topic = Topic(self.participant, "Frames", CaptureFrame, qos=self.qos) 125 | self.reader = DataReader(self.participant, self.topic) 126 | 127 | def __iter__(self): 128 | print("Waiting for frames...") 129 | video_id = 0 130 | 131 | # Start DDS Loop 132 | while True: 133 | sample = self.reader.read_next() 134 | if not sample: 135 | print("Still waiting...") 136 | time.sleep(0.05) 137 | continue 138 | 139 | result = dict(wide=dict()) 140 | wide = PosedSensorInfo() 141 | 142 | # OK, we have a frame. Fill on the requisite data/fields. 143 | image_info = ImageMeasurementInfo( 144 | size=(sample.width, sample.height), 145 | K=torch.tensor([ 146 | [sample.fl_x, 0.0, sample.cx], 147 | [0.0, sample.fl_y, sample.cy], 148 | [0.0, 0.0, 1.0] 149 | ])[None]) 150 | 151 | print(image_info.size) 152 | 153 | image = np.asarray(sample.image, dtype=np.uint8).reshape((sample.height, sample.width, 3)) 154 | wide.image = image_info 155 | result["wide"]["image"] = torch.tensor(np.moveaxis(image, -1, 0))[None] 156 | 157 | if self.load_arkit_depth and not sample.has_depth: 158 | raise ValueError("Depth was not found, you likely can only run the RGB only model with your device") 159 | 160 | depth_info = None 161 | if sample.has_depth: 162 | # We'll eventually ensure this is 1/4. 163 | rgb_depth_ratio = sample.width / sample.depth_width 164 | depth_info = DepthMeasurementInfo( 165 | size=(sample.depth_width, sample.depth_height), 166 | K=torch.tensor([ 167 | [sample.fl_x / rgb_depth_ratio , 0.0, sample.cx / rgb_depth_ratio], 168 | [0.0, sample.fl_y / rgb_depth_ratio, sample.cy / rgb_depth_ratio], 169 | [0.0, 0.0, 1.0] 170 | ])[None]) 171 | 172 | # Is this an encoding thing? 173 | depth_scale = sample.depth_scale 174 | print(depth_scale) 175 | wide.depth = depth_info 176 | 177 | # If I understand this correctly, it looks like this might just want the lower 16 bits? 
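                # The stream sends depth as raw float32 bytes; reinterpret the uint8 buffer as
                # float32 and reshape it to (depth_height, depth_width). No depth_scale is
                # applied here, so the values are assumed to already be metric.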
178 | depth = torch.tensor( 179 | np.asarray(sample.depth_image, dtype=np.uint8).view(dtype=np.float32).reshape((sample.depth_height, sample.depth_width)))[None].float() 180 | result["wide"]["depth"] = depth 181 | 182 | desired_image_size = (4 * depth_info.size[0], 4 * depth_info.size[1]) 183 | wide.image = wide.image.resize(desired_image_size) 184 | result["wide"]["image"] = torch.tensor(np.moveaxis(np.array(Image.fromarray(image).resize(desired_image_size)), -1, 0))[None] 185 | else: 186 | # Even for RGB-only, only support a certain long size. 187 | if max(wide.image.size) > MAX_LONG_SIDE: 188 | scale_factor = MAX_LONG_SIDE / max(wide.image.size) 189 | 190 | new_size = (int(wide.image.size[0] * scale_factor), int(wide.image.size[1] * scale_factor)) 191 | wide.image = wide.image.resize(new_size) 192 | result["wide"]["image"] = torch.tensor(np.moveaxis(np.array(Image.fromarray(image).resize(new_size)), -1, 0))[None] 193 | 194 | # ARKit sends W2C? 195 | # While we don't necessarily care about pose, we use it to derive the orientation 196 | # and T_gravity. 197 | RT = torch.tensor( 198 | compute_VC2VW_from_RC2RW(np.asarray(sample.transform_matrix).astype(np.float32).reshape((4, 4)).T)) 199 | wide.RT = RT[None] 200 | 201 | current_orientation = wide.orientation 202 | target_orientation = ImageOrientation.UPRIGHT 203 | 204 | T_gravity = get_camera_to_gravity_transform(wide.RT[-1], current_orientation, target=target_orientation) 205 | wide = wide.orient(current_orientation, target_orientation) 206 | 207 | result["wide"]["image"] = rotate_tensor(result["wide"]["image"], current_orientation, target=target_orientation) 208 | if wide.has("depth"): 209 | result["wide"]["depth"] = rotate_tensor(result["wide"]["depth"], current_orientation, target=target_orientation) 210 | 211 | # No need for pose anymore. 212 | wide.RT = torch.eye(4)[None] 213 | wide.T_gravity = T_gravity[None] 214 | 215 | sensor_info = SensorArrayInfo() 216 | sensor_info.wide = wide 217 | 218 | result["meta"] = dict(video_id=video_id, timestamp=sample.timestamp) 219 | result["sensor_info"] = sensor_info 220 | 221 | yield result 222 | -------------------------------------------------------------------------------- /cubifyanything/color.py: -------------------------------------------------------------------------------- 1 | # Detectron2's colors. 
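# _COLORS below is a fixed palette of 74 RGB triplets in [0, 1]. random_color() samples one row
# and scales it to the requested maximum (255 or 1), flipping RGB -> BGR unless rgb=True, e.g.
# (illustrative):
#
#     color = random_color(rgb=True, maximum=1)  # -> np.ndarray of 3 floats in [0, 1]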
2 | import numpy as np 3 | 4 | _COLORS = np.array( 5 | [ 6 | 0.000, 0.447, 0.741, 7 | 0.850, 0.325, 0.098, 8 | 0.929, 0.694, 0.125, 9 | 0.494, 0.184, 0.556, 10 | 0.466, 0.674, 0.188, 11 | 0.301, 0.745, 0.933, 12 | 0.635, 0.078, 0.184, 13 | 0.300, 0.300, 0.300, 14 | 0.600, 0.600, 0.600, 15 | 1.000, 0.000, 0.000, 16 | 1.000, 0.500, 0.000, 17 | 0.749, 0.749, 0.000, 18 | 0.000, 1.000, 0.000, 19 | 0.000, 0.000, 1.000, 20 | 0.667, 0.000, 1.000, 21 | 0.333, 0.333, 0.000, 22 | 0.333, 0.667, 0.000, 23 | 0.333, 1.000, 0.000, 24 | 0.667, 0.333, 0.000, 25 | 0.667, 0.667, 0.000, 26 | 0.667, 1.000, 0.000, 27 | 1.000, 0.333, 0.000, 28 | 1.000, 0.667, 0.000, 29 | 1.000, 1.000, 0.000, 30 | 0.000, 0.333, 0.500, 31 | 0.000, 0.667, 0.500, 32 | 0.000, 1.000, 0.500, 33 | 0.333, 0.000, 0.500, 34 | 0.333, 0.333, 0.500, 35 | 0.333, 0.667, 0.500, 36 | 0.333, 1.000, 0.500, 37 | 0.667, 0.000, 0.500, 38 | 0.667, 0.333, 0.500, 39 | 0.667, 0.667, 0.500, 40 | 0.667, 1.000, 0.500, 41 | 1.000, 0.000, 0.500, 42 | 1.000, 0.333, 0.500, 43 | 1.000, 0.667, 0.500, 44 | 1.000, 1.000, 0.500, 45 | 0.000, 0.333, 1.000, 46 | 0.000, 0.667, 1.000, 47 | 0.000, 1.000, 1.000, 48 | 0.333, 0.000, 1.000, 49 | 0.333, 0.333, 1.000, 50 | 0.333, 0.667, 1.000, 51 | 0.333, 1.000, 1.000, 52 | 0.667, 0.000, 1.000, 53 | 0.667, 0.333, 1.000, 54 | 0.667, 0.667, 1.000, 55 | 0.667, 1.000, 1.000, 56 | 1.000, 0.000, 1.000, 57 | 1.000, 0.333, 1.000, 58 | 1.000, 0.667, 1.000, 59 | 0.333, 0.000, 0.000, 60 | 0.500, 0.000, 0.000, 61 | 0.667, 0.000, 0.000, 62 | 0.833, 0.000, 0.000, 63 | 1.000, 0.000, 0.000, 64 | 0.000, 0.167, 0.000, 65 | 0.000, 0.333, 0.000, 66 | 0.000, 0.500, 0.000, 67 | 0.000, 0.667, 0.000, 68 | 0.000, 0.833, 0.000, 69 | 0.000, 1.000, 0.000, 70 | 0.000, 0.000, 0.167, 71 | 0.000, 0.000, 0.333, 72 | 0.000, 0.000, 0.500, 73 | 0.000, 0.000, 0.667, 74 | 0.000, 0.000, 0.833, 75 | 0.000, 0.000, 1.000, 76 | 0.000, 0.000, 0.000, 77 | 0.143, 0.143, 0.143, 78 | 0.857, 0.857, 0.857, 79 | 1.000, 1.000, 1.000 80 | ] 81 | ).astype(np.float32).reshape(-1, 3) 82 | 83 | def random_color(rgb=False, maximum=255): 84 | """ 85 | Args: 86 | rgb (bool): whether to return RGB colors or BGR colors. 87 | maximum (int): either 255 or 1 88 | 89 | Returns: 90 | ndarray: a vector of 3 numbers 91 | """ 92 | idx = np.random.randint(0, len(_COLORS)) 93 | ret = _COLORS[idx] * maximum 94 | if not rgb: 95 | ret = ret[::-1] 96 | return ret 97 | -------------------------------------------------------------------------------- /cubifyanything/dataset.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import functools 5 | import io 6 | import numpy as np 7 | import torch 8 | import json 9 | import tifffile 10 | import webdataset 11 | 12 | from pathlib import Path 13 | from PIL import Image 14 | from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Set, Tuple 15 | 16 | from webdataset.cache import cached_url_opener 17 | from webdataset.handlers import reraise_exception 18 | 19 | from cubifyanything.boxes import GeneralInstance3DBoxes, BoxDOF 20 | from cubifyanything.instances import Instances3D 21 | from cubifyanything.measurement import ImageMeasurementInfo, DepthMeasurementInfo 22 | from cubifyanything.sensor import SensorArrayInfo, SensorInfo, PosedSensorInfo 23 | 24 | def custom_pipe_cleaner(spec): 25 | # This should only be called when using links directly to the MLR CDN, so assume some stuff. 
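    # Shards are cached under data/ using only the last two path components of the URL, e.g. a
    # (hypothetical) ".../cubify-anything/val-000.tar" is stored as "cubify-anything/val-000.tar".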
26 | return "/".join(Path(spec).parts[-2:]) 27 | 28 | custom_cached_url_opener = functools.partial(cached_url_opener, cache_dir="data", url_to_name=custom_pipe_cleaner) 29 | 30 | PREFIX_SEPARATOR = "." 31 | WORLD_PREFIX = "world" 32 | 33 | def split_into_prefix_suffix(name): 34 | return name.split(PREFIX_SEPARATOR)[:2] 35 | 36 | # All samples should be stored with keys like [video_id]/[integer timestamp].[sensor_name]/[measurement_name] (or world/). 37 | def group_by_video_and_timestamp( 38 | data: Iterable[Dict[str, Any]], 39 | keys: Callable[[str], Tuple[str, str]] = split_into_prefix_suffix, 40 | lcase: bool = True, 41 | suffixes: Optional[Set[str]] = None, 42 | handler: Callable[[Exception], bool] = reraise_exception, 43 | ) -> Iterator[Dict[str, Any]]: 44 | return webdataset.tariterators.group_by_keys(data, keys, lcase, suffixes, handler) 45 | 46 | TIME_SCALE = 1e9 47 | MM_TO_M = 1000.0 48 | 49 | # Parsers. 50 | def parse_json(data, key): 51 | return json.loads(data[key].decode("utf-8")) 52 | 53 | def parse_size(data): 54 | return tuple(int(x) for x in data.decode("utf-8").strip("[]").split(", ")) 55 | 56 | def parse_transform_3x3(data): 57 | return torch.tensor(np.array(json.loads(data)).reshape(3, 3).astype(np.float32)) 58 | 59 | def parse_transform_4x4(data): 60 | return torch.tensor(np.array(json.loads(data)).reshape(4, 4).astype(np.float32)) 61 | 62 | def read_image_bytes(image_bytes, expected_size, channels_first=True): 63 | if image_bytes.startswith(b"\x89PNG"): 64 | # PNG. 65 | image = np.array(Image.open(io.BytesIO(image_bytes))) 66 | elif image_bytes.startswith(b"II*\x00") or image_bytes.startswith(b"MM\x00*"): 67 | # TIFF. 68 | image = tifffile.imread(io.BytesIO(image_bytes)) 69 | else: 70 | raise ValueError("Unknown image format") 71 | 72 | assert (image.shape[1], image.shape[0]) == expected_size 73 | 74 | if channels_first and (image.ndim > 2): 75 | image = np.moveaxis(image, -1, 0) 76 | 77 | return torch.tensor(image) 78 | 79 | def read_instances(data): 80 | instances_data = json.loads(data) 81 | instances = Instances3D() 82 | 83 | if len(instances_data) == 0: 84 | # Empty. 
85 | instances.set("gt_ids", []) 86 | instances.set("gt_names", []) 87 | instances.set("gt_boxes_3d", empty_box(box_type)) 88 | for src_key_2d, dst_key_2d in [("box_2d_rend", "gt_boxes_2d_trunc"), ("box_2d_proj", "gt_boxes_2d_proj")]: 89 | instances.set(dst_key_2d, np.empty((0, 4))) 90 | 91 | return instances 92 | 93 | instances.set("gt_ids", [bi["id"] for bi in instances_data]) 94 | instances.set("gt_names", [bi["category"] for bi in instances_data]) 95 | instances.set("gt_boxes_3d", GeneralInstance3DBoxes( 96 | np.concatenate(( 97 | np.array([bi["position"] for bi in instances_data]), 98 | np.array([bi["scale"] for bi in instances_data])), axis=-1), 99 | np.array([bi["R"] for bi in instances_data]))) 100 | 101 | return instances 102 | 103 | class CubifyAnythingDataset(webdataset.DataPipeline): 104 | def __init__(self, url, box_dof=BoxDOF.GravityAligned, yield_world_instances=False, load_arkit_depth=True, use_cache=False): 105 | self._url = url 106 | self._yield_world_instances = yield_world_instances 107 | self._use_cache = use_cache 108 | 109 | super(CubifyAnythingDataset, self).__init__( 110 | webdataset.SimpleShardList(url), 111 | custom_cached_url_opener if self._use_cache else webdataset.tariterators.url_opener, 112 | webdataset.tariterators.tar_file_expander, 113 | group_by_video_and_timestamp, 114 | self._map_samples) 115 | 116 | self.load_arkit_depth = load_arkit_depth 117 | 118 | def _map_sample(self, sample): 119 | video_id, timestamp = sample["__key__"].split("/") 120 | video_id = int(video_id) 121 | if timestamp == "world": 122 | return dict( 123 | world=dict(instances=read_instances(sample["gt/instances"])), 124 | meta=dict(video_id=video_id)) 125 | 126 | gt_depth_size = parse_size(sample["_gt/depth/size"]) 127 | 128 | timestamp = float(timestamp) / 1e9 129 | 130 | # At this point, everything is in camera coordinates. 131 | wide = PosedSensorInfo() 132 | wide.RT = torch.eye(4)[None] 133 | wide.image = ImageMeasurementInfo( 134 | size=parse_size(sample["_wide/image/size"]), 135 | K=parse_transform_3x3(sample["wide/image/k"])[None], 136 | ) 137 | 138 | if self.load_arkit_depth: 139 | wide.depth = DepthMeasurementInfo( 140 | size=parse_size(sample["_wide/depth/size"]), 141 | K=parse_transform_3x3(sample["wide/depth/k"])[None]) 142 | 143 | wide.T_gravity = parse_transform_3x3(sample["wide/t_gravity"])[None] 144 | 145 | gt = PosedSensorInfo() 146 | gt.RT = parse_transform_4x4(sample["gt/rt"])[None] 147 | gt.depth = DepthMeasurementInfo( 148 | size=parse_size(sample["_gt/depth/size"]), 149 | K=parse_transform_3x3(sample["gt/depth/k"])[None]) 150 | 151 | sensor_info = SensorArrayInfo() 152 | sensor_info.wide = wide 153 | sensor_info.gt = gt 154 | 155 | result = dict( 156 | sensor_info=sensor_info, 157 | wide=dict( 158 | image=read_image_bytes(sample["wide/image"], expected_size=wide.image.size)[None], 159 | instances=read_instances(sample["wide/instances"])), 160 | gt=dict( 161 | # NOTE: 0.0 values here correspond to failed registration areas. 162 | depth=read_image_bytes(sample["gt/depth"], expected_size=gt.depth.size)[None].float() / MM_TO_M), 163 | meta=dict(video_id=video_id, timestamp=timestamp)) 164 | 165 | if self.load_arkit_depth: 166 | result["wide"]["depth"] = read_image_bytes(sample["wide/depth"], expected_size=wide.depth.size)[None].float() / MM_TO_M 167 | 168 | return result 169 | 170 | def _map_samples(self, samples): 171 | for sample in samples: 172 | # Don't map the world instances unless requested to (since these are timeless). 
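            # Per-frame samples are keyed "[video_id]/[timestamp]"; the timeless, per-video
            # annotations arrive once under "[video_id]/world".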
173 | if sample["__key__"].endswith("/world"): 174 | if not self._yield_world_instances: 175 | continue 176 | 177 | yield self._map_sample(sample) 178 | 179 | if __name__ == "__main__": 180 | dataset = CubifyAnythingDataset("file:/tmp/lupine-train-49739919.tar") 181 | for blah in iter(dataset): 182 | import pdb 183 | pdb.set_trace() 184 | -------------------------------------------------------------------------------- /cubifyanything/imagelist.py: -------------------------------------------------------------------------------- 1 | # Detectron2's ImageList. 2 | from typing import Any, Dict, List, Optional, Tuple 3 | import torch 4 | from torch import device 5 | from torch.nn import functional as F 6 | 7 | class ImageList: 8 | """ 9 | Structure that holds a list of images (of possibly 10 | varying sizes) as a single tensor. 11 | This works by padding the images to the same size. 12 | The original sizes of each image is stored in `image_sizes`. 13 | 14 | Attributes: 15 | image_sizes (list[tuple[int, int]]): each tuple is (h, w). 16 | During tracing, it becomes list[Tensor] instead. 17 | """ 18 | 19 | def __init__(self, tensor: torch.Tensor, image_sizes: List[Tuple[int, int]]): 20 | """ 21 | Arguments: 22 | tensor (Tensor): of shape (N, H, W) or (N, C_1, ..., C_K, H, W) where K >= 1 23 | image_sizes (list[tuple[int, int]]): Each tuple is (h, w). It can 24 | be smaller than (H, W) due to padding. 25 | """ 26 | self.tensor = tensor 27 | self.image_sizes = image_sizes 28 | 29 | def __len__(self) -> int: 30 | return len(self.image_sizes) 31 | 32 | def __getitem__(self, idx) -> torch.Tensor: 33 | """ 34 | Access the individual image in its original size. 35 | 36 | Args: 37 | idx: int or slice 38 | 39 | Returns: 40 | Tensor: an image of shape (H, W) or (C_1, ..., C_K, H, W) where K >= 1 41 | """ 42 | size = self.image_sizes[idx] 43 | return self.tensor[idx, ..., : size[0], : size[1]] 44 | 45 | @torch.jit.unused 46 | def to(self, *args: Any, **kwargs: Any) -> "ImageList": 47 | cast_tensor = self.tensor.to(*args, **kwargs) 48 | return ImageList(cast_tensor, self.image_sizes) 49 | 50 | @property 51 | def device(self) -> device: 52 | return self.tensor.device 53 | 54 | @staticmethod 55 | def from_tensors( 56 | tensors: List[torch.Tensor], 57 | size_divisibility: int = 0, 58 | pad_value: float = 0.0, 59 | padding_constraints: Optional[Dict[str, int]] = None, 60 | ) -> "ImageList": 61 | """ 62 | Args: 63 | tensors: a tuple or list of `torch.Tensor`, each of shape (Hi, Wi) or 64 | (C_1, ..., C_K, Hi, Wi) where K >= 1. The Tensors will be padded 65 | to the same shape with `pad_value`. 66 | size_divisibility (int): If `size_divisibility > 0`, add padding to ensure 67 | the common height and width is divisible by `size_divisibility`. 68 | This depends on the model and many models need a divisibility of 32. 69 | pad_value (float): value to pad. 70 | padding_constraints (optional[Dict]): If given, it would follow the format as 71 | {"size_divisibility": int, "square_size": int}, where `size_divisibility` will 72 | overwrite the above one if presented and `square_size` indicates the 73 | square padding size if `square_size` > 0. 74 | Returns: 75 | an `ImageList`. 
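        Example (an illustrative sketch; note this trimmed copy only handles a single tensor)::

            >>> il = ImageList.from_tensors([torch.rand(3, 475, 630)], size_divisibility=32)
            >>> il.tensor.shape   # torch.Size([1, 3, 480, 640]) -- padded up to multiples of 32
            >>> il.image_sizes    # [(475, 630)] -- original (h, w) kept for unpadding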
76 | """ 77 | assert len(tensors) > 0 78 | assert isinstance(tensors, (tuple, list)) 79 | for t in tensors: 80 | assert isinstance(t, torch.Tensor), type(t) 81 | assert t.shape[:-2] == tensors[0].shape[:-2], t.shape 82 | 83 | image_sizes = [(im.shape[-2], im.shape[-1]) for im in tensors] 84 | image_sizes_tensor = [torch.as_tensor(x) for x in image_sizes] 85 | max_size = torch.stack(image_sizes_tensor).max(0).values 86 | 87 | if padding_constraints is not None: 88 | square_size = padding_constraints.get("square_size", 0) 89 | if square_size > 0: 90 | # pad to square. 91 | max_size[0] = max_size[1] = square_size 92 | if "size_divisibility" in padding_constraints: 93 | size_divisibility = padding_constraints["size_divisibility"] 94 | if size_divisibility > 1: 95 | stride = size_divisibility 96 | # the last two dims are H,W, both subject to divisibility requirement 97 | max_size = (max_size + (stride - 1)).div(stride, rounding_mode="floor") * stride 98 | 99 | # handle weirdness of scripting and tracing ... 100 | if torch.jit.is_scripting(): 101 | max_size: List[int] = max_size.to(dtype=torch.long).tolist() 102 | else: 103 | if torch.jit.is_tracing(): 104 | image_sizes = image_sizes_tensor 105 | 106 | if len(tensors) == 1: 107 | # This seems slightly (2%) faster. 108 | # TODO: check whether it's faster for multiple images as well 109 | image_size = image_sizes[0] 110 | padding_size = [0, max_size[-1] - image_size[1], 0, max_size[-2] - image_size[0]] 111 | batched_imgs = F.pad(tensors[0], padding_size, value=pad_value).unsqueeze_(0) 112 | else: 113 | raise NotImplementedError 114 | 115 | return ImageList(batched_imgs.contiguous(), image_sizes) 116 | -------------------------------------------------------------------------------- /cubifyanything/instances.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | # Based on D2's Instances. 5 | import itertools 6 | import warnings 7 | from typing import Any, Dict, List, Tuple, Union 8 | 9 | import numpy as np 10 | import torch 11 | 12 | # Provides basic compatibility with D2. 13 | class Instances3D: 14 | """ 15 | This class represents a list of instances in _the world_. 16 | """ 17 | def __init__(self, image_size: Tuple[int, int] = (0, 0), **kwargs: Any): 18 | # image_size is here for Detectron2 compatibility. 19 | self._image_size = image_size 20 | self._fields: Dict[str, Any] = {} 21 | for k, v in kwargs.items(): 22 | self.set(k, v) 23 | 24 | @property 25 | def image_size(self) -> Tuple[int, int]: 26 | """ 27 | Returns: 28 | tuple: height, width (note: opposite of cubifycore). 29 | 30 | Here for D2 compatibility. You probably shouldn't be using this. 31 | """ 32 | return self._image_size 33 | 34 | def __setattr__(self, name: str, val: Any) -> None: 35 | if name.startswith("_"): 36 | super().__setattr__(name, val) 37 | else: 38 | self.set(name, val) 39 | 40 | def __getattr__(self, name: str) -> Any: 41 | if name == "_fields" or name not in self._fields: 42 | raise AttributeError("Cannot find field '{}' in the given Instances3D!".format(name)) 43 | return self._fields[name] 44 | 45 | def set(self, name: str, value: Any) -> None: 46 | """ 47 | Set the field named `name` to `value`. 48 | The length of `value` must be the number of instances, 49 | and must agree with other existing fields in this object. 
50 | """ 51 | with warnings.catch_warnings(record=True): 52 | data_len = len(value) 53 | if len(self._fields): 54 | assert ( 55 | len(self) == data_len 56 | ), "Adding a field of length {} to a Instances3D of length {}".format(data_len, len(self)) 57 | self._fields[name] = value 58 | 59 | def has(self, name: str) -> bool: 60 | """ 61 | Returns: 62 | bool: whether the field called `name` exists. 63 | """ 64 | return name in self._fields 65 | 66 | def remove(self, name: str) -> None: 67 | """ 68 | Remove the field called `name`. 69 | """ 70 | del self._fields[name] 71 | 72 | def get(self, name: str) -> Any: 73 | """ 74 | Returns the field called `name`. 75 | """ 76 | return self._fields[name] 77 | 78 | def get_fields(self) -> Dict[str, Any]: 79 | """ 80 | Returns: 81 | dict: a dict which maps names (str) to data of the fields 82 | 83 | Modifying the returned dict will modify this instance. 84 | """ 85 | return self._fields 86 | 87 | # Tensor-like methods 88 | def to(self, *args: Any, **kwargs: Any) -> "Instances3D": 89 | """ 90 | Returns: 91 | Instances: all fields are called with a `to(device)`, if the field has this method. 92 | """ 93 | ret = Instances3D(image_size=self._image_size) 94 | # Copy fields that were explicitly added to this object (e.g., hidden fields) 95 | for name, value in self.__dict__.items(): 96 | if (name not in ["_fields"]) and name.startswith("_"): 97 | setattr(ret, name, value.to(*args, **kwargs) if hasattr(value, "to") else value) 98 | 99 | for k, v in self._fields.items(): 100 | if hasattr(v, "to"): 101 | v = v.to(*args, **kwargs) 102 | ret.set(k, v) 103 | 104 | return ret 105 | 106 | def __getitem__(self, item: Union[int, slice, torch.BoolTensor]) -> "Instances3D": 107 | """ 108 | Args: 109 | item: an index-like object and will be used to index all the fields. 110 | 111 | Returns: 112 | If `item` is a string, return the data in the corresponding field. 113 | Otherwise, returns an `Instances3D` where all fields are indexed by `item`. 114 | """ 115 | if type(item) == int: 116 | if item >= len(self) or item < -len(self): 117 | raise IndexError("Instances3D index out of range!") 118 | else: 119 | item = slice(item, None, len(self)) 120 | 121 | ret = Instances3D(image_size=self.image_size) 122 | for name, value in self.__dict__.items(): 123 | if (name not in ["_fields"]) and name.startswith("_"): 124 | setattr(ret, name, value) 125 | 126 | for k, v in self._fields.items(): 127 | if isinstance(v, (torch.Tensor, np.ndarray)) or hasattr(v, "tensor"): 128 | # assume if has .tensor, then this is piped into __getitem__. 129 | # Make sure to match underlying types. 130 | if isinstance(v, np.ndarray) and isinstance(item, torch.Tensor): 131 | ret.set(k, v[item.cpu().numpy()]) 132 | else: 133 | ret.set(k, v[item]) 134 | elif hasattr(v, "__iter__"): 135 | # handle non-Tensor types like lists, etc. 136 | if isinstance(item, np.ndarray) and (item.dtype == np.bool_): 137 | ret.set(k, [v_ for i_, v_ in enumerate(v) if item[i_]]) 138 | elif isinstance(item, torch.BoolTensor) or (isinstance(item, torch.Tensor) and (item.dtype == torch.bool)): 139 | ret.set(k, [v_ for i_, v_ in enumerate(v) if item[i_].item()]) 140 | elif isinstance(item, torch.LongTensor) or (isinstance(item, torch.Tensor) and (item.dtype == torch.int64)): 141 | # Can this be right? 
142 | ret.set(k, [v[i_.item()] for i_ in item]) 143 | elif isinstance(item, slice): 144 | ret.set(k, v[item]) 145 | else: 146 | raise ValueError("Expected Bool or Long Tensor") 147 | else: 148 | raise ValueError("Not supported!") 149 | 150 | return ret 151 | 152 | def __len__(self) -> int: 153 | for v in self._fields.values(): 154 | # use __len__ because len() has to be int and is not friendly to tracing 155 | return v.__len__() 156 | raise NotImplementedError("Empty Instances3D does not support __len__!") 157 | 158 | def __iter__(self): 159 | raise NotImplementedError("`Instances3D` object is not iterable!") 160 | 161 | def split(self, split_size_or_sections): 162 | indexes = torch.arange(len(self)) 163 | splits = torch.split(indexes, split_size_or_sections) 164 | 165 | return [self[split] for split in splits] 166 | 167 | def clone(self): 168 | import copy 169 | 170 | ret = Instances3D(image_size=self._image_size) 171 | for k, v in self._fields.items(): 172 | if hasattr(v, "clone"): 173 | v = v.clone() 174 | elif isinstance(v, np.ndarray): 175 | v = np.copy(v) 176 | elif isinstance(v, (str, list, tuple)): 177 | v = copy.copy(v) 178 | elif hasattr(v, "tensor"): 179 | v = type(v)(v.tensor.clone()) 180 | else: 181 | raise NotImplementedError 182 | 183 | ret.set(k, v) 184 | 185 | return ret 186 | 187 | @staticmethod 188 | def cat(instance_lists: List["Instances3D"]) -> "Instances3D": 189 | """ 190 | Args: 191 | instance_lists (list[Instances]) 192 | 193 | Returns: 194 | Instances 195 | """ 196 | assert all(isinstance(i, Instances3D) for i in instance_lists) 197 | assert len(instance_lists) > 0 198 | if len(instance_lists) == 1: 199 | return instance_lists[0] 200 | 201 | ret = Instances3D(image_size=instance_lists[0]._image_size) 202 | for k in instance_lists[0]._fields.keys(): 203 | values = [i.get(k) for i in instance_lists] 204 | v0 = values[0] 205 | if isinstance(v0, torch.Tensor): 206 | values = torch.cat(values, dim=0) 207 | elif isinstance(v0, list): 208 | values = list(itertools.chain(*values)) 209 | elif hasattr(type(v0), "cat"): 210 | values = type(v0).cat(values) 211 | else: 212 | raise ValueError("Unsupported type {} for concatenation".format(type(v0))) 213 | ret.set(k, values) 214 | return ret 215 | 216 | def translate(self, translation): 217 | # in-place. 218 | for field_name, field in self._fields.items(): 219 | if hasattr(field, "translate"): 220 | field.translate(translation) 221 | 222 | def __str__(self) -> str: 223 | s = self.__class__.__name__ + "(" 224 | s += "num_instances={}, ".format(len(self)) 225 | s += "fields=[{}])".format(", ".join((f"{k}: {v}" for k, v in self._fields.items()))) 226 | return s 227 | 228 | __repr__ = __str__ 229 | -------------------------------------------------------------------------------- /cubifyanything/measurement.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 
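#
# A measurement info pairs a pixel size with (batched) camera intrinsics. By convention here,
# `size` is (width, height) and `K` is a (1, 3, 3) pinhole matrix, e.g. (values illustrative):
#
#     info = ImageMeasurementInfo(size=(1920, 1440),
#                                 K=torch.tensor([[fx, 0.0, cx],
#                                                 [0.0, fy, cy],
#                                                 [0.0, 0.0, 1.0]])[None])
#     info = info.resize((960, 720))  # rescales K to match the new resolution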
3 | 4 | import numpy as np 5 | import torch 6 | 7 | from typing import Any, Dict, List, Tuple, Union 8 | 9 | from cubifyanything.orientation import ImageOrientation, rotate_K 10 | 11 | class BaseMeasurementInfo(object): 12 | def __init__(self, meta=None, **kwargs): 13 | super(BaseMeasurementInfo, self).__init__() 14 | self.meta = meta 15 | 16 | @property 17 | def ts(self): 18 | if (self.meta is not None) and hasattr(self.meta, "ts"): 19 | return self.meta.ts 20 | 21 | return None 22 | 23 | class MeasurementInfo(BaseMeasurementInfo): 24 | pass 25 | 26 | class ImageMeasurementInfo(MeasurementInfo): 27 | def __init__(self, size, K, meta=None, original_size=None): 28 | super(ImageMeasurementInfo, self).__init__(meta=meta) 29 | self.size = size 30 | if isinstance(self.size, torch.Tensor) and not torch.jit.is_tracing(): 31 | self.size = (self.size[0].item(), self.size[1].item()) 32 | 33 | self.original_size = original_size or self.size 34 | 35 | # check for normalized. 36 | if ((K[..., 2] >= 0) & (K[..., 2] < 1)).all(): 37 | raise ValueError("Normalized intrinsics are not supported") 38 | 39 | # No float64 support on MPS. 40 | self.K = K.float() 41 | 42 | @property 43 | def device(self): 44 | return self.K.device 45 | 46 | def _get_fields(self): 47 | # Don't support anything fancy for now. 48 | return dict( 49 | size=torch.tensor(self.size), 50 | K=self.K) 51 | 52 | def __len__(self): 53 | return len(self.K) 54 | 55 | def __getitem__(self, item): 56 | ret = type(self)(self.size, self.K.__getitem__(item), meta=self.meta, original_size=self.original_size) 57 | return ret 58 | 59 | def to(self, *args: Any, **kwargs: Any) -> "ImageMeasurementInfo": 60 | ret = type(self)(self.size, self.K.to(*args, **kwargs), meta=self.meta, original_size=self.original_size) 61 | return ret 62 | 63 | @classmethod 64 | def cat(self, info_list): 65 | return type(info_list[0])( 66 | size=info_list[0].size, 67 | K=torch.cat([info_.K for info_ in info_list]), 68 | ) 69 | 70 | def _get_oriented_size(self, current_orientation, target_orientation, size): 71 | if (target_orientation != ImageOrientation.UPRIGHT) and (current_orientation != ImageOrientation.UPRIGHT): 72 | raise NotImplementedError 73 | 74 | if ((current_orientation, target_orientation) in [ 75 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT), 76 | (ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT), 77 | (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN), 78 | (ImageOrientation.LEFT, ImageOrientation.RIGHT), 79 | (ImageOrientation.RIGHT, ImageOrientation.LEFT) 80 | ]): 81 | # Nothing changes. 82 | new_size = size 83 | else: 84 | # Swap. 
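# (A 90-degree orientation change, e.g. LEFT -> UPRIGHT, exchanges width and height.)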
85 | new_size = (size[1], size[0]) 86 | 87 | return new_size 88 | 89 | def orient(self, current_orientation, target_orientation): 90 | if (target_orientation != ImageOrientation.UPRIGHT) and (current_orientation != ImageOrientation.UPRIGHT): 91 | raise NotImplementedError 92 | 93 | new_K = rotate_K(self.K, current_orientation, self.size, target=target_orientation) 94 | new_size = self._get_oriented_size(current_orientation, target_orientation, self.size) 95 | 96 | ret = type(self)( 97 | new_size, 98 | new_K, 99 | meta=self.meta, 100 | original_size=self._get_oriented_size(current_orientation, target_orientation, self.original_size)) 101 | 102 | return ret 103 | 104 | def rescale(self, factor): 105 | old_size = self.size 106 | new_size = (int(old_size[0] * factor), int(old_size[1] * factor)) 107 | 108 | new_K = self.K.clone() 109 | new_K[..., :2, :] = new_K[..., :2, :] * factor 110 | 111 | return type(self)(new_size, new_K, meta=self.meta, original_size=self.original_size) 112 | 113 | def resize(self, new_size): 114 | if isinstance(new_size, float): 115 | return self.rescale(new_size) 116 | 117 | width_scale = new_size[0] / self.size[0] 118 | height_scale = new_size[1] / self.size[1] 119 | 120 | # Might be some some pixel errors. 121 | if not np.isclose(height_scale, width_scale, atol=0.025): 122 | print(f"Rescaling from {self.size} to {new_size}. This does not seem uniform but may be due to discretization error.") 123 | 124 | result = self.rescale(height_scale) 125 | # Even if it's not the best idea, always make sure the given size is 126 | # reflected. 127 | result.size = tuple(new_size) 128 | return result 129 | 130 | class DepthMeasurementInfo(ImageMeasurementInfo): 131 | def normalize(self, parameters): 132 | return WhitenedDepthMeasurementInfo( 133 | size=self.size, 134 | K=self.K, 135 | meta=self.meta, 136 | parameters=parameters, 137 | original_size=self.original_size) 138 | 139 | class WhitenedDepthMeasurementInfo(DepthMeasurementInfo): 140 | def __init__(self, size, K, meta=None, parameters=None, original_size=None): 141 | super(WhitenedDepthMeasurementInfo, self).__init__(size, K, meta=meta, original_size=original_size) 142 | 143 | # Whitening parameters. 144 | self.parameters = parameters 145 | 146 | def _get_fields(self): 147 | return dict( 148 | size=torch.tensor(self.size), 149 | K=self.K, 150 | parameters=self.parameters) 151 | 152 | -------------------------------------------------------------------------------- /cubifyanything/orientation.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import numpy as np 5 | import torch 6 | 7 | from enum import Enum 8 | from scipy.spatial.transform import Rotation 9 | 10 | class ImageOrientation(Enum): 11 | UPRIGHT = 0 12 | LEFT = 1 13 | UPSIDE_DOWN = 2 14 | RIGHT = 3 15 | ORIGINAL = 4 16 | 17 | ROT_Z = { 18 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', 0).as_matrix()).float(), 19 | (ImageOrientation.LEFT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', np.pi / 2).as_matrix()).float(), 20 | (ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', np.pi).as_matrix()).float(), 21 | (ImageOrientation.RIGHT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', -np.pi / 2).as_matrix()).float(), 22 | 23 | # Inverses. 
24 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): torch.tensor(Rotation.from_euler('z', 0).as_matrix()).float(), 25 | (ImageOrientation.UPRIGHT, ImageOrientation.LEFT): torch.tensor(Rotation.from_euler('z', -np.pi / 2).as_matrix()).float(), 26 | (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN): torch.tensor(Rotation.from_euler('z', -np.pi).as_matrix()).float(), 27 | (ImageOrientation.UPRIGHT, ImageOrientation.RIGHT): torch.tensor(Rotation.from_euler('z', np.pi / 2).as_matrix()).float(), 28 | } 29 | 30 | ROT_K = { 31 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): 0, 32 | (ImageOrientation.LEFT, ImageOrientation.UPRIGHT): -1, 33 | (ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT): 2, 34 | (ImageOrientation.RIGHT, ImageOrientation.UPRIGHT): 1, 35 | 36 | # Inverses. 37 | (ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT): 0, 38 | (ImageOrientation.UPRIGHT, ImageOrientation.LEFT): 1, 39 | (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN): -2, 40 | (ImageOrientation.UPRIGHT, ImageOrientation.RIGHT): -1 41 | } 42 | 43 | def get_orientation(pose): 44 | z_vec = pose[..., 2, :3] 45 | z_orien = torch.tensor(np.array( 46 | [ 47 | [0.0, -1.0, 0.0], # upright 48 | [-1.0, 0.0, 0.0], # left 49 | [0.0, 1.0, 0.0], # upside-down 50 | [1.0, 0.0, 0.0], 51 | ] # right 52 | )).to(pose) 53 | 54 | corr = (z_orien @ z_vec.T).T 55 | corr_max = corr.argmax(dim=-1) 56 | 57 | return corr_max 58 | 59 | def rotate_K(K, current, image_size, target=ImageOrientation.UPRIGHT): 60 | # TODO: use image_size to properly compute the new (cx, cy) 61 | if (current, target) in [(ImageOrientation.UPRIGHT, ImageOrientation.UPRIGHT)]: 62 | return K.clone() 63 | elif (current, target) in [(ImageOrientation.LEFT, ImageOrientation.UPRIGHT), (ImageOrientation.UPRIGHT, ImageOrientation.RIGHT)]: 64 | return torch.stack([ 65 | torch.stack([K[:, 1, 1], K[:, 0, 1], K[:, 1, 2]], dim=1), 66 | torch.stack([K[:, 1, 0], K[:, 0, 0], K[:, 0, 2]], dim=1), 67 | torch.stack([K[:, 2, 0], K[:, 2, 1], K[:, 2, 2]], dim=1) 68 | ], dim=1).to(K) 69 | elif (current, target) in [(ImageOrientation.UPSIDE_DOWN, ImageOrientation.UPRIGHT), (ImageOrientation.UPRIGHT, ImageOrientation.UPSIDE_DOWN)]: 70 | return torch.stack([ 71 | torch.stack([K[:, 0, 0], K[:, 0, 1], image_size[0] - K[:, 0, 2]], dim=1), 72 | torch.stack([K[:, 1, 0], K[:, 1, 1], image_size[1] - K[:, 1, 2]], dim=1), 73 | torch.stack([K[:, 2, 0], K[:, 2, 1], K[:, 2, 2]], dim=1) 74 | ], dim=1).to(K) 75 | elif (current, target) in [(ImageOrientation.RIGHT, ImageOrientation.UPRIGHT), (ImageOrientation.UPRIGHT, ImageOrientation.LEFT)]: 76 | return torch.stack([ 77 | torch.stack([K[:, 1, 1], K[:, 0, 1], K[:, 1, 2]], dim=1), 78 | torch.stack([K[:, 1, 0], K[:, 0, 0], K[:, 0, 2]], dim=1), 79 | torch.stack([K[:, 2, 0], K[:, 2, 1], K[:, 2, 2]], dim=1) 80 | ], dim=1).to(K) 81 | 82 | raise ValueError("unknown orientation") 83 | 84 | def rotate_pose(pose, current, target=ImageOrientation.UPRIGHT): 85 | rot_z = ROT_Z[(current, target)].to(pose) 86 | rot_z_4x4 = torch.eye(4, device=pose.device).float() 87 | rot_z_4x4[:3, :3] = rot_z 88 | 89 | return pose @ torch.linalg.inv(rot_z_4x4) 90 | 91 | def rotate_xyz(xyz, current, target=ImageOrientation.UPRIGHT): 92 | rot_z = ROT_Z[(current, target)].to(xyz) 93 | return rot_z @ xyz 94 | 95 | def rotate_tensor(tensor, current, target=ImageOrientation.UPRIGHT): 96 | return torch.rot90(tensor, ROT_K[(current, target)], dims=(-2, -1)) 97 | -------------------------------------------------------------------------------- 
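A short sketch exercising ImageMeasurementInfo from measurement.py together with the orientation helpers above. The import paths and intrinsics values are illustrative assumptions:

import torch
from cubifyanything.measurement import ImageMeasurementInfo      # assumed module paths
from cubifyanything.orientation import ImageOrientation, rotate_tensor

K = torch.tensor([[[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]]])
info = ImageMeasurementInfo(size=(640, 480), K=K)   # size appears to be (width, height)

half = info.rescale(0.5)                            # size -> (320, 240); fx, fy, cx, cy halved
upright = info.orient(ImageOrientation.LEFT, ImageOrientation.UPRIGHT)
# For the 90-degree cases, rotate_K swaps fx/fy and cx/cy (note the in-code TODO about
# using image_size for the principal point), and the reported size is swapped as well.

image = torch.rand(3, 480, 640)
image_upright = rotate_tensor(image, ImageOrientation.LEFT)      # -> shape (3, 640, 480)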
/cubifyanything/pos.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | import warnings 4 | 5 | from torch import nn 6 | from torch.nn import functional as F 7 | 8 | from math import log2, pi 9 | 10 | class PositionEmbeddingSine(nn.Module): 11 | """ 12 | This is a more standard version of the position embedding, very similar to the one 13 | used by the Attention is all you need paper, generalized to work on images. 14 | """ 15 | 16 | def __init__( 17 | self, num_pos_feats=64, temperature=10000, normalize=False, scale=None 18 | ): 19 | super().__init__() 20 | self.num_pos_feats = num_pos_feats 21 | self.temperature = temperature 22 | self.normalize = normalize 23 | if scale is not None and normalize is False: 24 | raise ValueError("normalize should be True if scale is passed") 25 | if scale is None: 26 | scale = 2 * math.pi 27 | self.scale = scale 28 | 29 | def forward(self, tensor_list, sensor): 30 | x = tensor_list.tensors 31 | mask = tensor_list.mask 32 | assert mask is not None 33 | not_mask = ~mask 34 | y_embed = not_mask.cumsum(1, dtype=torch.float32) 35 | x_embed = not_mask.cumsum(2, dtype=torch.float32) 36 | if self.normalize: 37 | eps = 1e-6 38 | y_embed = (y_embed - 0.5) / (y_embed[:, -1:, :] + eps) * self.scale 39 | x_embed = (x_embed - 0.5) / (x_embed[:, :, -1:] + eps) * self.scale 40 | else: 41 | y_embed = (y_embed - 0.5) * self.scale 42 | x_embed = (x_embed - 0.5) * self.scale 43 | 44 | dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device) 45 | with warnings.catch_warnings(): 46 | warnings.simplefilter("ignore") 47 | dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats) 48 | 49 | pos_x = x_embed[:, :, :, None] / dim_t 50 | pos_y = y_embed[:, :, :, None] / dim_t 51 | pos_x = torch.stack( 52 | (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4 53 | ).flatten(3) 54 | pos_y = torch.stack( 55 | (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4 56 | ).flatten(3) 57 | pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2) 58 | 59 | return pos 60 | 61 | def generate_rays( 62 | info, image_shape, noisy: bool = False 63 | ): 64 | camera_intrinsics = info.K[-1][None] 65 | batch_size, device, dtype = ( 66 | camera_intrinsics.shape[0], 67 | camera_intrinsics.device, 68 | camera_intrinsics.dtype, 69 | ) 70 | height, width = image_shape 71 | # Generate grid of pixel coordinates 72 | pixel_coords_x = torch.linspace(0, width - 1, width, device=device, dtype=dtype) 73 | pixel_coords_y = torch.linspace(0, height - 1, height, device=device, dtype=dtype) 74 | if noisy: 75 | pixel_coords_x += torch.rand_like(pixel_coords_x) - 0.5 76 | pixel_coords_y += torch.rand_like(pixel_coords_y) - 0.5 77 | pixel_coords = torch.stack( 78 | [pixel_coords_x.repeat(height, 1), pixel_coords_y.repeat(width, 1).t()], dim=2 79 | ) # (H, W, 2) 80 | pixel_coords = pixel_coords + 0.5 81 | 82 | # Handle radial distortion. 
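# (No distortion model is applied in this version; every ray is simply marked valid below.)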
83 | ray_is_valid = torch.ones((height, width), dtype=torch.bool, device=device) 84 | 85 | # Calculate ray directions 86 | intrinsics_inv = torch.eye(3, device=device).unsqueeze(0).repeat(batch_size, 1, 1) 87 | intrinsics_inv[:, 0, 0] = 1.0 / camera_intrinsics[:, 0, 0] 88 | intrinsics_inv[:, 1, 1] = 1.0 / camera_intrinsics[:, 1, 1] 89 | intrinsics_inv[:, 0, 2] = -camera_intrinsics[:, 0, 2] / camera_intrinsics[:, 0, 0] 90 | intrinsics_inv[:, 1, 2] = -camera_intrinsics[:, 1, 2] / camera_intrinsics[:, 1, 1] 91 | homogeneous_coords = torch.cat( 92 | [pixel_coords, torch.ones_like(pixel_coords[:, :, :1])], dim=2 93 | ) # (H, W, 3) 94 | 95 | ray_directions = torch.matmul( 96 | intrinsics_inv, homogeneous_coords.permute(2, 0, 1).flatten(-2)).view( 97 | 3, height, width).permute(1, 2, 0) # (3, H*W) 98 | 99 | ray_directions = F.normalize(ray_directions, dim=-1) # (B, 3, H*W) 100 | theta = torch.atan2(ray_directions[..., 0], ray_directions[..., -1]) 101 | phi = torch.acos(ray_directions[..., 1]) 102 | angles = torch.stack([theta, phi], dim=-1) 103 | 104 | # Ensure we set anything invalid to just 0? 105 | ray_directions[~ray_is_valid] = 0.0 106 | angles[~ray_is_valid] = 0.0 107 | 108 | return ray_directions, angles 109 | 110 | def generate_fourier_features( 111 | x: torch.Tensor, 112 | dim: int = 256, 113 | max_freq: int = 64, 114 | use_cos: bool = False, 115 | use_log: bool = False, 116 | cat_orig: bool = False, 117 | ): 118 | x_orig = x 119 | device, dtype, input_dim = x.device, x.dtype, x.shape[-1] 120 | num_bands = dim // (2 * input_dim) if use_cos else dim // input_dim 121 | 122 | if use_log: 123 | scales = 2.0 ** torch.linspace( 124 | 0.0, log2(max_freq), steps=num_bands, device=device, dtype=dtype 125 | ) 126 | else: 127 | scales = torch.linspace( 128 | 1.0, max_freq / 2, num_bands, device=device, dtype=dtype 129 | ) 130 | 131 | x = x.unsqueeze(-1) 132 | scales = scales[(*((None,) * (len(x.shape) - 1)), Ellipsis)] 133 | 134 | x = x * scales * pi 135 | x = torch.cat( 136 | ( 137 | [x.sin(), x.cos()] 138 | if use_cos 139 | else [ 140 | x.sin(), 141 | ] 142 | ), 143 | dim=-1, 144 | ) 145 | 146 | if cat_orig: 147 | raise NotImplementedError 148 | 149 | return x.flatten(3) 150 | 151 | # Adopted from UniDepth. I don't think this is necessary, but keeping until we re-train models. 152 | class CameraRayEmbedding(nn.Module): 153 | def __init__(self, dim): 154 | super().__init__() 155 | 156 | self.dim = dim 157 | self.proj = nn.Linear(255, self.dim) 158 | 159 | def forward(self, tensor_list, sensor): 160 | x = tensor_list.tensors 161 | 162 | feat_size = tensor_list.tensors.shape[-1] 163 | # Hard-coded stride. 164 | square_pad = feat_size * 16 165 | 166 | # Generate the rays for the original images. 
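# Each view's intrinsics yield a dense ray-direction map, padded to the square input size and later resized to the backbone's feature grid.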
167 | ray_dirs = [] 168 | for info_ in sensor["image"].info: 169 | ray_dirs_, angles_ = generate_rays(info_, (info_.size[1], info_.size[0])) 170 | ray_dirs_ = F.pad(ray_dirs_, (0, 0, 0, square_pad - ray_dirs_.shape[1], 0, square_pad - ray_dirs_.shape[0])) 171 | ray_dirs.append(ray_dirs_) 172 | 173 | ray_dirs = torch.stack(ray_dirs) 174 | 175 | rays_embedding = F.interpolate(ray_dirs.permute(0, 3, 1, 2), (feat_size, feat_size), mode="nearest").permute(0, 2, 3, 1) 176 | rays_embedding = F.normalize(rays_embedding, dim=-1) 177 | rays_embedding = generate_fourier_features( 178 | rays_embedding, 179 | dim=self.dim, 180 | max_freq=feat_size // 2, 181 | use_log=True, 182 | cat_orig=False, 183 | ) 184 | 185 | rays_embedding = self.proj(rays_embedding) 186 | return rays_embedding.permute(0, 3, 1, 2).contiguous() 187 | -------------------------------------------------------------------------------- /cubifyanything/preprocessor.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import copy 5 | import os 6 | import torch 7 | 8 | from cubifyanything.measurement import ( 9 | DepthMeasurementInfo, 10 | ImageMeasurementInfo) 11 | 12 | from cubifyanything.batching import ( 13 | Measurement, 14 | PosedImage, 15 | PosedDepth, 16 | BatchedSensors, 17 | Sensors) 18 | 19 | from typing import Dict, List 20 | 21 | IGNORE_KEYS = ["sensor_info", "__key__", "gt", "video_info", "meta"] 22 | 23 | def move_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor: 24 | try: 25 | return src.to(dst) 26 | except: 27 | return src.to(dst.device) 28 | 29 | def move_to_current_device(x, t): 30 | if isinstance(x, (list, tuple)): 31 | return [move_device_like(x_, t) for x_ in x] 32 | 33 | return move_device_like(x, t) 34 | 35 | def move_input_to_current_device(batched_input: Sensors, t: torch.Tensor): 36 | # Assume only two levels of nesting for now. 37 | return { name: { name_: move_to_current_device(m, t) for name_, m in s.items() } for name, s in batched_input.items() } 38 | 39 | class Augmentor(object): 40 | def __init__(self, measurement_keys=None): 41 | self.measurement_keys = measurement_keys 42 | 43 | def package(self, sample) -> Dict[str, Dict[str, Measurement]]: 44 | # Simply pack everything into "Packages" to make it more amenable to a training pipeline. 45 | # Essentially returns Dict[str, Dict[str, Measurement]]. 46 | # Make sure everything is contiguous. channels -> first. 47 | result = {} 48 | for sensor_name, sensor_data in sample.items(): 49 | if sensor_name in IGNORE_KEYS: 50 | continue 51 | 52 | if not isinstance(sensor_data, dict): 53 | continue 54 | 55 | sensor_result = {} 56 | sensor_info = copy.deepcopy(getattr(sample["sensor_info"], sensor_name)) 57 | for measurement_name, measurement in sensor_data.items(): 58 | measurement_key = os.path.join(sensor_name, measurement_name) 59 | if (self.measurement_keys is not None) and (measurement_key not in self.measurement_keys): 60 | # Make sure to delete from sensor info as well.
61 | if sensor_info.has(measurement_name): 62 | sensor_info.remove(measurement_name) 63 | 64 | continue 65 | 66 | measurement_info = getattr(sensor_info, measurement_name) 67 | if isinstance(measurement_info, DepthMeasurementInfo): 68 | sensor_result[measurement_name] = PosedDepth( 69 | sample[sensor_name][measurement_name][-1], 70 | measurement_info, 71 | sensor_info) 72 | elif isinstance(measurement_info, ImageMeasurementInfo): 73 | sensor_result[measurement_name] = PosedImage( 74 | sample[sensor_name][measurement_name][-1], 75 | measurement_info, 76 | sensor_info) 77 | 78 | # Don't include if empty. 79 | if sensor_result: 80 | result[sensor_name] = sensor_result 81 | 82 | return result 83 | 84 | class Preprocessor(object): 85 | def __init__(self, 86 | square_pad=[256, 384, 512, 640, 768, 896, 1024], 87 | size_divisibility=32, 88 | pixel_mean=[123.675, 116.28, 103.53], 89 | pixel_std=[58.395, 57.12, 57.375], 90 | device=None): 91 | self.square_pad = square_pad 92 | self.size_divisibility = size_divisibility 93 | self.pixel_mean = torch.tensor(pixel_mean).view(-1, 1, 1) 94 | self.pixel_std = torch.tensor(pixel_std).view(-1, 1, 1) 95 | self.device = device 96 | 97 | @staticmethod 98 | def standardize_depth_map(img, trunc_value=0.1): 99 | # Always do this on CPU! MPS has some surprising behavior. 100 | device = img.device 101 | img = img.cpu() 102 | img[img <= 0.0] = torch.nan 103 | 104 | sorted_img = torch.sort(torch.flatten(img))[0] 105 | # Remove nan, nan at the end of sort 106 | num_nan = sorted_img.isnan().sum() 107 | if num_nan > 0: 108 | sorted_img = sorted_img[:-num_nan] 109 | # Remove outliers 110 | trunc_img = sorted_img[int(trunc_value * len(sorted_img)): int((1 - trunc_value) * len(sorted_img))] 111 | if len(trunc_img) <= 1: 112 | # guard against no valid Jasper. 113 | trunc_mean = torch.tensor(0.0).to(img) 114 | trunc_std = torch.tensor(1.0).to(img) 115 | else: 116 | trunc_mean = trunc_img.mean() 117 | trunc_var = trunc_img.var() 118 | 119 | eps = 1e-2 120 | trunc_std = torch.sqrt(trunc_var + eps) 121 | 122 | # Replace nan by mean 123 | img = torch.nan_to_num(img, nan=trunc_mean) 124 | 125 | # Standardize 126 | img = (img - trunc_mean) / trunc_std 127 | 128 | # return the scale parameters for encoding. 129 | return img.to(device), torch.tensor([trunc_mean, trunc_std]).to(device) 130 | 131 | def normalize(self, batched_input: Sensors): 132 | # Happens in-place. 133 | for sensor_name, sensor in batched_input.items(): 134 | for measurement_name, measurement in sensor.items(): 135 | if measurement_name in ["features"]: 136 | continue 137 | 138 | if measurement.__orig_class__ in (PosedDepth,): 139 | measurement.data, scaling = Preprocessor.standardize_depth_map(measurement.data) 140 | measurement.info = measurement.info.normalize(scaling[None]) 141 | elif measurement.__orig_class__ in (PosedImage,): 142 | measurement.data = (measurement.data.float() - self.pixel_mean.to(measurement.data)) / self.pixel_std.to(measurement.data) 143 | 144 | return batched_input 145 | 146 | def batch(self, batched_inputs: List[Sensors]) -> List[BatchedSensors]: 147 | sensor_names = batched_inputs[0].keys() 148 | result = {} 149 | for sensor_name in sensor_names: 150 | measurement_names = batched_inputs[0][sensor_name].keys() 151 | sensor_result = {} 152 | for measurement_name in measurement_names: 153 | batched_measurements = [bi[sensor_name][measurement_name] for bi in batched_inputs] 154 | if measurement_name in ["features"]: 155 | # TODO! 
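# For now, precomputed features are taken from the first sample as-is rather than being batched.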
156 | sensor_result["features"] = batched_measurements[0] 157 | continue 158 | 159 | # Hacky way to pass some additional constraints. 160 | batching_kwargs = {} 161 | if batched_measurements[0].__orig_class__ in (PosedDepth,): 162 | # Very bad, but assume the PosedImage here gets processed first, so that 163 | # square_pad and rgb_size are assigned. 164 | rgb_to_depth_ratio = round(rgb_size[0] / batched_measurements[0].info.size[0]) 165 | if rgb_to_depth_ratio not in [1, 2, 4]: 166 | raise ValueError(f"Unsupported rgb -> depth ratio: {rgb_to_depth_ratio}") 167 | 168 | # note: square_pad should always be divisible by the given ratios: e.g. 1, 2, 4. 169 | batching_kwargs = dict( 170 | size_divisibility=self.size_divisibility, 171 | padding_constraints={ 172 | "size_divisibility": self.size_divisibility, 173 | "square_size": square_pad // rgb_to_depth_ratio 174 | }) 175 | elif batched_measurements[0].__orig_class__ in (PosedImage,): 176 | # Backbone sizes are computed w.r.t image. We may need 177 | # to adjust them to depth or other sensors with different sizes. 178 | square_pad = self.square_pad 179 | rgb_size = batched_measurements[0].info.size 180 | if isinstance(square_pad, (list,)): 181 | longest_edge = max([max(bm.info.size) for bm in batched_measurements]) 182 | square_pad = int(min([s for s in square_pad if s >= longest_edge])) 183 | 184 | batching_kwargs = dict( 185 | size_divisibility=self.size_divisibility, 186 | padding_constraints={ 187 | "size_divisibility": self.size_divisibility, 188 | "square_size": square_pad 189 | }) 190 | 191 | batched_measurements = Measurement.batch( 192 | batched_measurements, 193 | **batching_kwargs) 194 | 195 | sensor_result[measurement_name] = batched_measurements 196 | 197 | result[sensor_name] = sensor_result 198 | 199 | return result 200 | 201 | def __call__(self, batches): 202 | for batch in batches: 203 | if isinstance(batch, tuple): 204 | # Probably inference with GT. 205 | input_, gt_ = batch 206 | if self.device is not None: 207 | input_ = move_input_to_current_device(input_, self.device) 208 | 209 | yield self.preprocess([input_]), gt_ 210 | else: 211 | yield self.preprocess(batch) 212 | 213 | def preprocess(self, batched_inputs: List[Sensors]) -> List[Sensors]: 214 | batched_inputs = [self.normalize(bi) for bi in batched_inputs] 215 | 216 | return self.batch(batched_inputs) 217 | -------------------------------------------------------------------------------- /cubifyanything/sensor.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | 4 | import numpy as np 5 | import torch 6 | import warnings 7 | 8 | from typing import Any, Dict, List, Tuple, Union 9 | 10 | from cubifyanything.measurement import BaseMeasurementInfo 11 | from cubifyanything.orientation import ImageOrientation, get_orientation, rotate_pose 12 | 13 | # Extends some of the ideas of D2's "Instances" to more broad sensors. 
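A small worked sketch of Preprocessor.standardize_depth_map from preprocessor.py above. The tensor shape and values are illustrative assumptions:

import torch
from cubifyanything.preprocessor import Preprocessor  # assumed module path

depth = torch.rand(1, 192, 256) * 5.0   # illustrative metric depth map
depth[0, :8] = 0.0                      # zero pixels are treated as invalid
whitened, params = Preprocessor.standardize_depth_map(depth)
trunc_mean, trunc_std = params.tolist()
# Invalid pixels are replaced by the truncated mean before whitening (so they end up near 0);
# `params` holds the (mean, std) pair that normalize() later attaches to the measurement info.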
14 | class SensorInfo(object): 15 | def __init__(self, **kwargs): 16 | self._measurements: Dict[str, Any] = {} 17 | self._other = {} 18 | self._meta_keys = [] 19 | 20 | for k, v in kwargs.items(): 21 | self.set(k, v) 22 | 23 | def __setattr__(self, name: str, val: Any) -> None: 24 | if name in ["_other", "_measurements", "_RT", "_meta_keys"]: 25 | super(SensorInfo, self).__setattr__(name, val) 26 | elif name.startswith("_"): 27 | self._other[name] = val 28 | else: 29 | self.set(name, val) 30 | 31 | def __getattr__(self, name: str) -> Any: 32 | if name == "_other": 33 | return self.__getattribute__("_other") 34 | 35 | if name.startswith("_"): 36 | if name in self._other: 37 | return self._other[name] 38 | 39 | return self.__getattribute__(name) 40 | 41 | if name not in self._measurements: 42 | raise AttributeError("Cannot find field '{}' in the given measurements!".format(name)) 43 | 44 | return self._measurements[name] 45 | 46 | def __getstate__(self): 47 | return {"_measurements": self._measurements, "_other": self._other, "_meta_keys": self._meta_keys} 48 | 49 | def __setstate__(self, s): 50 | self._measurements = s["_measurements"] 51 | self._other = s["_other"] 52 | self._meta_keys = s["_meta_keys"] 53 | 54 | @property 55 | def ts(self): 56 | if isinstance(self, PosedSensorInfo): 57 | return self._RT_meta.ts 58 | 59 | # TODO: Take first measurement and ask for ts? 60 | return None 61 | 62 | def translate(self, t): 63 | raise NotImplementedError 64 | 65 | def set(self, name: str, value: Any) -> None: 66 | """ 67 | Set the field named `name` to `value`. 68 | The length of `value` must be the number of instances, 69 | and must agree with other existing fields in this object. 70 | """ 71 | with warnings.catch_warnings(record=True): 72 | data_len = len(value) 73 | 74 | if len(self._measurements): 75 | assert ( 76 | len(self) == data_len 77 | ), "Adding a field of length {} to a measurement of length {}".format(data_len, len(self)) 78 | 79 | self._measurements[name] = value 80 | 81 | def has(self, name: str) -> bool: 82 | """ 83 | Returns: 84 | bool: whether the field called `name` exists. 85 | """ 86 | return name in self._measurements 87 | 88 | def remove(self, name: str) -> None: 89 | """ 90 | Remove the field called `name`. 91 | """ 92 | del self._measurements[name] 93 | 94 | def get(self, name: str) -> Any: 95 | """ 96 | Returns the field called `name`. 97 | """ 98 | return self._measurements[name] 99 | 100 | @classmethod 101 | def cat(self, sensor_list): 102 | # TODO: Flesh this out better. 103 | measurement_names = sensor_list[0].get_measurements().keys() 104 | measurements = {} 105 | 106 | for measurement_name in measurement_names: 107 | info_list = [getattr(sensor_list_, measurement_name) for sensor_list_ in sensor_list] 108 | measurements[measurement_name] = type(info_list[0]).cat(info_list) 109 | 110 | return type(sensor_list[0])(**measurements) 111 | 112 | def __len__(self) -> int: 113 | for v in self._measurements.values(): 114 | # use __len__ because len() has to be int and is not friendly to tracing 115 | return v.__len__() 116 | 117 | def get_measurements(self) -> Dict[str, Any]: 118 | """ 119 | Returns: 120 | dict: a dict which maps names (str) to data of the fields 121 | 122 | Modifying the returned dict will modify this instance. 
123 | """ 124 | # for now, only return subclasses of MeasurementInfo 125 | return { k: m for k, m in self._measurements.items() if isinstance(m, (MeasurementInfo,)) } 126 | 127 | def orient(self, current_orientation, target_orientation): 128 | new_sensor_info = type(self)() 129 | new_sensor_info._other = dict(self._other) 130 | 131 | # Save this for the ability to restore? 132 | new_sensor_info._original_orientation = current_orientation 133 | 134 | # One of these needs to be UPRIGHT for now. 135 | if (current_orientation != ImageOrientation.UPRIGHT) and (target_orientation != ImageOrientation.UPRIGHT): 136 | raise NotImplementedError 137 | 138 | for measurement_name, measurement in self._measurements.items(): 139 | # TODO: fix this as an _other_? 140 | if measurement_name == "RT": 141 | new_sensor_info.RT = rotate_pose(self.RT, current_orientation, target=target_orientation) 142 | elif measurement_name == "ts": 143 | new_sensor_info.ts = self.ts.clone() 144 | elif isinstance(measurement, BaseMeasurementInfo): 145 | setattr(new_sensor_info, measurement_name, measurement.orient(current_orientation, target_orientation)) 146 | 147 | if isinstance(self, PosedSensorInfo): 148 | new_sensor_info._RT = self._RT.clone() 149 | 150 | # Make sure we continue to use the override. 151 | if hasattr(self, "_orientation"): 152 | setattr(new_sensor_info, "_orientation", target_orientation) 153 | 154 | return new_sensor_info 155 | 156 | def to(self, *args: Any, **kwargs: Any) -> "SensorInfo": 157 | ret = type(self)() 158 | for k, v in self._measurements.items(): 159 | if hasattr(v, "to"): 160 | v = v.to(*args, **kwargs) 161 | ret.set(k, v) 162 | 163 | ret._other = dict(self._other) 164 | for meta_key in self._meta_keys: 165 | ret._meta_keys.append(meta_key) 166 | setattr(ret, meta_key, getattr(self, meta_key)) 167 | 168 | return ret 169 | 170 | # TODO: this should enforce "RT" (i.e. pose) existing. 171 | class PosedSensorInfo(SensorInfo): 172 | @property 173 | def orientation(self): 174 | # Allow override. 175 | if hasattr(self, "_orientation"): 176 | return self._orientation 177 | 178 | # for now, assume we're dealing with a single orientation. majority vote. 
179 | if len(self.RT) == 1: 180 | return ImageOrientation(get_orientation(self.RT)[-1].item()) 181 | 182 | orientations = get_orientation(self.RT).cpu().numpy() 183 | unique_orientations, counts = np.unique(orientations, return_counts=True) 184 | most_frequent_orientation = unique_orientations[np.argmax(counts)] 185 | 186 | return ImageOrientation(most_frequent_orientation) 187 | 188 | @property 189 | def device(self): 190 | return self.RT.device 191 | 192 | def set(self, name: str, value: Any) -> None: 193 | if name == "RT": 194 | # only write if we don't already have 195 | if not hasattr(self, "_RT"): 196 | self._RT = value.clone() 197 | 198 | super(PosedSensorInfo, self).set(name, value) 199 | 200 | def apply_transform(self, transform_4x4): 201 | new_sensor_info = PosedSensorInfo() 202 | new_sensor_info._RT = self._RT.clone() 203 | new_sensor_info._other = dict(self._other) 204 | 205 | for measurement_name, measurement in self._measurements.items(): 206 | if measurement_name == "RT": 207 | new_sensor_info.RT = transform_4x4 @ self.RT 208 | elif measurement_name == "ts": 209 | new_sensor_info.ts = self.ts.clone() 210 | elif hasattr(measurement, "apply_transform"): 211 | setattr(new_sensor_info, measurement_name, measurement.apply_transform(transform_4x4)) 212 | else: 213 | 214 | setattr(new_sensor_info, measurement_name, measurement) 215 | 216 | return new_sensor_info 217 | 218 | def translate(self, t): 219 | translation_4x4 = torch.eye(4)[None, ...].to(t.device) 220 | translation_4x4[:, :3, -1] = t 221 | 222 | return self.apply_transform(translation_4x4) 223 | 224 | @classmethod 225 | def cat(cls, sensor_list): 226 | new_sensor_info = SensorInfo.cat(sensor_list) 227 | new_sensor_info.RT = torch.cat([sensor_info.RT for sensor_info in sensor_list]) 228 | 229 | return new_sensor_info 230 | 231 | class SensorArrayInfo(object): 232 | def __init__(self, **kwargs: Any): 233 | self._sensors: Dict[str, SensorInfo] = {} 234 | self._rel_transforms: Dict[Tuple[str, str], torch.Tensor] = {} 235 | 236 | for k, v in kwargs.items(): 237 | self.set(k, v) 238 | 239 | def __setattr__(self, name: str, val: Any) -> None: 240 | if name.startswith("_"): 241 | super().__setattr__(name, val) 242 | else: 243 | self.set(name, val) 244 | 245 | def __getattr__(self, name: str) -> Any: 246 | if name == "_sensors" or name not in self._sensors: 247 | raise AttributeError("Cannot find field '{}' in the given sensors!".format(name)) 248 | 249 | return self._sensors[name] 250 | 251 | def __getstate__(self): 252 | return self._sensors 253 | 254 | def __setstate__(self, d): 255 | self._sensors = d 256 | 257 | def set(self, name: str, value: Any) -> None: 258 | self._sensors[name] = value 259 | 260 | def has(self, name: str) -> bool: 261 | """ 262 | Returns: 263 | bool: whether the field called `name` exists. 264 | """ 265 | return name in self._sensors 266 | 267 | def remove(self, name: str) -> None: 268 | """ 269 | Remove the field called `name`. 270 | """ 271 | del self._sensors[name] 272 | 273 | def get(self, name: str) -> Any: 274 | """ 275 | Returns the field called `name`. 276 | """ 277 | return self._sensors[name] 278 | 279 | # This is not really always a good idea because sensor's don't _have_ to 280 | # have the same length (although they often do). 
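# The value below is simply the length of the first sensor in the array.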
281 | def uniform_length(self) -> int: 282 | for v in self._sensors.values(): 283 | # use __len__ because len() has to be int and is not friendly to tracing 284 | return v.__len__() 285 | 286 | def to(self, *args: Any, **kwargs: Any) -> "SensorArrayInfo": 287 | ret = type(self)() 288 | for k, v in self._sensors.items(): 289 | if hasattr(v, "to"): 290 | v = v.to(*args, **kwargs) 291 | ret.set(k, v) 292 | 293 | return ret 294 | 295 | -------------------------------------------------------------------------------- /cubifyanything/transforms.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | # These functions are taken from PyTorch3D. 4 | 5 | def _axis_angle_rotation(axis: str, angle: torch.Tensor) -> torch.Tensor: 6 | """ 7 | Return the rotation matrices for one of the rotations about an axis 8 | of which Euler angles describe, for each value of the angle given. 9 | 10 | Args: 11 | axis: Axis label "X" or "Y or "Z". 12 | angle: any shape tensor of Euler angles in radians 13 | 14 | Returns: 15 | Rotation matrices as tensor of shape (..., 3, 3). 16 | """ 17 | 18 | cos = torch.cos(angle) 19 | sin = torch.sin(angle) 20 | one = torch.ones_like(angle) 21 | zero = torch.zeros_like(angle) 22 | 23 | if axis == "X": 24 | R_flat = (one, zero, zero, zero, cos, -sin, zero, sin, cos) 25 | elif axis == "Y": 26 | R_flat = (cos, zero, sin, zero, one, zero, -sin, zero, cos) 27 | elif axis == "Z": 28 | R_flat = (cos, -sin, zero, sin, cos, zero, zero, zero, one) 29 | else: 30 | raise ValueError("letter must be either X, Y or Z.") 31 | 32 | return torch.stack(R_flat, -1).reshape(angle.shape + (3, 3)) 33 | 34 | def euler_angles_to_matrix(euler_angles: torch.Tensor, convention: str) -> torch.Tensor: 35 | """ 36 | Convert rotations given as Euler angles in radians to rotation matrices. 37 | 38 | Args: 39 | euler_angles: Euler angles in radians as tensor of shape (..., 3). 40 | convention: Convention string of three uppercase letters from 41 | {"X", "Y", and "Z"}. 42 | 43 | Returns: 44 | Rotation matrices as tensor of shape (..., 3, 3). 45 | """ 46 | if euler_angles.dim() == 0 or euler_angles.shape[-1] != 3: 47 | raise ValueError("Invalid input euler angles.") 48 | if len(convention) != 3: 49 | raise ValueError("Convention must have 3 letters.") 50 | if convention[1] in (convention[0], convention[2]): 51 | raise ValueError(f"Invalid convention {convention}.") 52 | for letter in convention: 53 | if letter not in ("X", "Y", "Z"): 54 | raise ValueError(f"Invalid letter {letter} in convention string.") 55 | matrices = [ 56 | _axis_angle_rotation(c, e) 57 | for c, e in zip(convention, torch.unbind(euler_angles, -1)) 58 | ] 59 | # return functools.reduce(torch.matmul, matrices) 60 | return torch.matmul(torch.matmul(matrices[0], matrices[1]), matrices[2]) 61 | -------------------------------------------------------------------------------- /cubifyanything/vit.py: -------------------------------------------------------------------------------- 1 | # This is a self-contained version of Detectron2's ViT with additional modifications (only meant for inference). 2 | import math 3 | import numpy as np 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | 8 | from timm.layers import Mlp 9 | from typing import Union 10 | 11 | from cubifyanything.batching import BatchedPosedSensor 12 | 13 | __all__ = ["ViT"] 14 | 15 | # NOTE: We replicate some functions here which need modifications for tracing. 
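A brief sketch of building a PosedSensorInfo from sensor.py above, reusing ImageMeasurementInfo. Import paths and values are illustrative assumptions:

import torch
from cubifyanything.measurement import ImageMeasurementInfo  # assumed module paths
from cubifyanything.sensor import PosedSensorInfo

K = torch.tensor([[[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]]])
rgb_info = ImageMeasurementInfo(size=(640, 480), K=K)
sensor = PosedSensorInfo(RT=torch.eye(4)[None], image=rgb_info)

print(sensor.orientation)                                  # ImageOrientation.UPRIGHT for an identity pose
shifted = sensor.translate(torch.tensor([0.0, 0.0, 1.0]))  # returns a new, transformed PosedSensorInfo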
16 | def window_partition(x, window_size): 17 | """ 18 | Partition into non-overlapping windows with padding if needed. 19 | Args: 20 | x (tensor): input tokens with [B, H, W, C]. 21 | window_size (int): window size. 22 | 23 | Returns: 24 | windows: windows after partition with [B * num_windows, window_size, window_size, C]. 25 | (Hp, Wp): padded height and width before partition 26 | """ 27 | B, H, W, C = x.shape 28 | 29 | pad_h = (window_size - H % window_size) % window_size 30 | pad_w = (window_size - W % window_size) % window_size 31 | 32 | x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h)) 33 | Hp, Wp = H + pad_h, W + pad_w 34 | 35 | x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C) 36 | windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C) 37 | return windows, (Hp, Wp) 38 | 39 | def window_unpartition(windows, window_size, pad_hw, hw): 40 | """ 41 | Window unpartition into original sequences and removing padding. 42 | Args: 43 | x (tensor): input tokens with [B * num_windows, window_size, window_size, C]. 44 | window_size (int): window size. 45 | pad_hw (Tuple): padded height and width (Hp, Wp). 46 | hw (Tuple): original height and width (H, W) before padding. 47 | 48 | Returns: 49 | x: unpartitioned sequences with [B, H, W, C]. 50 | """ 51 | Hp, Wp = pad_hw 52 | H, W = hw 53 | B = windows.shape[0] // (Hp * Wp // window_size // window_size) 54 | x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1) 55 | x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1) 56 | x = x[:, :H, :W, :].contiguous() 57 | 58 | return x 59 | 60 | def get_abs_pos(abs_pos, has_cls_token, hw): 61 | """ 62 | Calculate absolute positional embeddings. If needed, resize embeddings and remove cls_token 63 | dimension for the original embeddings. 64 | Args: 65 | abs_pos (Tensor): absolute positional embeddings with (1, num_position, C). 66 | has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token. 67 | hw (Tuple): size of input image tokens. 68 | 69 | Returns: 70 | Absolute positional embeddings after processing with shape (1, H, W, C) 71 | """ 72 | h, w = hw 73 | if has_cls_token: 74 | abs_pos = abs_pos[:, 1:] 75 | xy_num = abs_pos.shape[1] 76 | size = int(math.sqrt(xy_num)) 77 | assert size * size == xy_num 78 | 79 | new_abs_pos = F.interpolate( 80 | abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2), 81 | size=(h, w), 82 | mode="bicubic", 83 | align_corners=False, 84 | ) 85 | 86 | return new_abs_pos.permute(0, 2, 3, 1) 87 | 88 | class LayerScale(nn.Module): 89 | def __init__( 90 | self, 91 | dim: int, 92 | init_values: Union[float, torch.Tensor] = 1e-5, 93 | inplace: bool = False, 94 | ) -> None: 95 | super().__init__() 96 | self.inplace = inplace 97 | self.gamma = nn.Parameter(init_values * torch.ones(dim)) 98 | 99 | def forward(self, x: torch.Tensor) -> torch.Tensor: 100 | return x.mul_(self.gamma) if self.inplace else x * self.gamma 101 | 102 | class PatchEmbed(nn.Module): 103 | """ 104 | Image to Patch Embedding. 105 | """ 106 | 107 | def __init__( 108 | self, kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768, bias=True 109 | ): 110 | """ 111 | Args: 112 | kernel_size (Tuple): kernel size of the projection layer. 113 | stride (Tuple): stride of the projection layer. 114 | padding (Tuple): padding size of the projection layer. 115 | in_chans (int): Number of input image channels. 116 | embed_dim (int): embed_dim (int): Patch embedding dimension. 
117 | """ 118 | super().__init__() 119 | 120 | self.proj = nn.Conv2d( 121 | in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding, bias=bias 122 | ) 123 | 124 | def forward(self, x): 125 | x = self.proj(x) 126 | # B C H W -> B H W C 127 | x = x.permute(0, 2, 3, 1) 128 | return x 129 | 130 | class Attention(nn.Module): 131 | """Multi-head Attention block with relative position embeddings.""" 132 | 133 | def __init__( 134 | self, 135 | dim, 136 | num_heads=8, 137 | qkv_bias=True, 138 | proj_bias=True, 139 | use_rel_pos=False, 140 | rel_pos_zero_init=True, 141 | input_size=None, 142 | depth_modality=False, 143 | depth_input_size=None, 144 | ): 145 | """ 146 | Args: 147 | dim (int): Number of input channels. 148 | num_heads (int): Number of attention heads. 149 | qkv_bias (bool: If True, add a learnable bias to query, key, value. 150 | rel_pos (bool): If True, add relative positional embeddings to the attention map. 151 | rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. 152 | input_size (int or None): Input resolution for calculating the relative positional 153 | parameter size. 154 | """ 155 | super().__init__() 156 | self.num_heads = num_heads 157 | head_dim = dim // num_heads 158 | self.scale = head_dim**-0.5 159 | 160 | self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias) 161 | self.proj = nn.Linear(dim, dim, bias=proj_bias) 162 | 163 | self.use_rel_pos = use_rel_pos 164 | if self.use_rel_pos: 165 | # Not supported. 166 | raise NotImplementedError 167 | 168 | self.depth_modality = depth_modality 169 | 170 | def forward(self, x, depth=None): 171 | B, H, W, _ = x.shape 172 | # qkv with shape (3, B, nHead, H * W, C) 173 | qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4) 174 | 175 | # q, k, v with shape (B * nHead, H * W, C) 176 | q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0) 177 | 178 | if self.depth_modality and (depth is not None): 179 | B, H_d, W_d, _ = depth.shape 180 | 181 | # qkv with shape (3, B, nHead, H * W, C) 182 | qkv_depth = self.qkv(depth).reshape(B, H_d * W_d, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4) 183 | 184 | # q, k, v with shape (B * nHead, H * W, C) 185 | q_d, k_d, v_d = qkv_depth.reshape(3, B * self.num_heads, H_d * W_d, -1).unbind(0) 186 | q, k, v = torch.cat((q, q_d), dim=1), torch.cat((k, k_d), dim=1), torch.cat((v, v_d), dim=1) 187 | 188 | # presumably, concatenate q, k. split (and then reconcatenate) attn. 
189 | 190 | attn = (q * self.scale) @ k.transpose(-2, -1) 191 | if self.depth_modality and (depth is not None): 192 | attn, attn_d = torch.split(attn, (H * W, H_d * W_d), dim=1) 193 | 194 | attn = attn.softmax(dim=-1) 195 | x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1) 196 | 197 | if self.depth_modality and (depth is not None): 198 | attn_d = attn_d.softmax(dim=-1) 199 | depth = (attn_d @ v).view(B, self.num_heads, H_d, W_d, -1).permute(0, 2, 3, 1, 4).reshape(B, H_d, W_d, -1) 200 | depth = self.proj(depth) 201 | 202 | x = self.proj(x) 203 | return x, depth 204 | 205 | DEPTH_WINDOW_SIZES = [4, 8, 16] 206 | class Block(nn.Module): 207 | """Transformer blocks with support of window attention and residual propagation blocks""" 208 | 209 | def __init__( 210 | self, 211 | dim, 212 | num_heads, 213 | mlp_ratio=4.0, 214 | qkv_bias=True, 215 | proj_bias=True, 216 | mlp_bias=True, 217 | norm_layer=nn.LayerNorm, 218 | act_layer=nn.GELU, 219 | use_rel_pos=False, 220 | rel_pos_zero_init=True, 221 | window_size=0, 222 | use_residual_block=False, 223 | input_size=None, 224 | depth_modality=False, 225 | depth_window_size=0, 226 | layer_scale=False 227 | ): 228 | """ 229 | Args: 230 | dim (int): Number of input channels. 231 | num_heads (int): Number of attention heads in each ViT block. 232 | mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. 233 | qkv_bias (bool): If True, add a learnable bias to query, key, value. 234 | norm_layer (nn.Module): Normalization layer. 235 | act_layer (nn.Module): Activation layer. 236 | use_rel_pos (bool): If True, add relative positional embeddings to the attention map. 237 | rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. 238 | window_size (int): Window size for window attention blocks. If it equals 0, then not 239 | use window attention. 240 | use_residual_block (bool): If True, use a residual block after the MLP block. 241 | input_size (int or None): Input resolution for calculating the relative positional 242 | parameter size. 243 | """ 244 | super().__init__() 245 | 246 | if depth_modality and (depth_window_size == 0): 247 | raise ValueError("unsupported") 248 | 249 | self.norm1 = norm_layer(dim) 250 | self.attn = Attention( 251 | dim, 252 | num_heads=num_heads, 253 | qkv_bias=qkv_bias, 254 | proj_bias=proj_bias, 255 | use_rel_pos=use_rel_pos, 256 | rel_pos_zero_init=rel_pos_zero_init, 257 | input_size=input_size if window_size == 0 else (window_size, window_size), 258 | depth_modality=depth_modality, 259 | depth_input_size=(depth_window_size, depth_window_size) if depth_modality else None, 260 | ) 261 | 262 | self.ls1 = None 263 | self.ls2 = None 264 | 265 | if layer_scale: 266 | self.ls1 = LayerScale(dim, 1.) 267 | self.ls2 = LayerScale(dim, 1.) 
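# Note: LayerScale is initialized at 1.0 here (an identity scaling), rather than the class default of 1e-5.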
268 | 269 | self.depth_modality = depth_modality 270 | 271 | self.norm2 = norm_layer(dim) 272 | mlp_hidden_dim = int(dim * mlp_ratio) 273 | 274 | self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, bias=mlp_bias) 275 | self.drop_path = nn.Identity() 276 | 277 | self.window_size = window_size 278 | self.depth_window_size = depth_window_size 279 | 280 | def forward(self, x, depth=None): 281 | shortcut = x 282 | 283 | x = self.norm1(x) 284 | # Window partition 285 | if self.window_size > 0: 286 | H, W = x.shape[1], x.shape[2] 287 | x, pad_hw = window_partition(x, self.window_size) 288 | 289 | if self.depth_modality and (depth is not None): 290 | shortcut_depth = depth 291 | depth = self.norm1(depth) 292 | 293 | H_depth, W_depth = depth.shape[1], depth.shape[2] 294 | 295 | # Aggressive checking for now. 296 | depth_window_size = self.depth_window_size or (self.window_size // (H / H_depth)) 297 | if isinstance(depth_window_size, torch.Tensor): 298 | depth_window_size = depth_window_size.int() 299 | if not depth_window_size.item() in DEPTH_WINDOW_SIZES: 300 | raise ValueError(f"Unexpected window size {depth_window_size}") 301 | else: 302 | depth_window_size = int(depth_window_size) 303 | if not depth_window_size in DEPTH_WINDOW_SIZES: 304 | raise ValueError(f"Unexpected window size {depth_window_size}") 305 | 306 | # if depth_window_size is not given, dynamically compute it based on the RGB window size and relative scale. 307 | depth, pad_hw_depth = window_partition(depth, depth_window_size) 308 | 309 | x, depth = self.attn(x, depth=depth) 310 | 311 | if self.depth_modality and (depth is not None): 312 | if self.window_size > 0: 313 | depth = window_unpartition(depth, depth_window_size, pad_hw_depth, (H_depth, W_depth)) 314 | 315 | # Reverse window partition 316 | if self.window_size > 0: 317 | x = window_unpartition(x, self.window_size, pad_hw, (H, W)) 318 | 319 | if self.ls1 is not None: 320 | x = self.ls1(x) 321 | if self.depth_modality and (depth is not None): 322 | depth = self.ls1(depth) 323 | 324 | x = shortcut + self.drop_path(x) 325 | shortcut = x 326 | x = self.mlp(self.norm2(x)) 327 | 328 | if self.ls2 is not None: 329 | x = self.ls2(x) 330 | 331 | x = shortcut + self.drop_path(x) 332 | 333 | if self.depth_modality and (depth is not None): 334 | depth = shortcut_depth + self.drop_path(depth) 335 | shortcut_depth = depth 336 | depth = self.mlp(self.norm2(depth)) 337 | if self.ls2 is not None: 338 | depth = self.ls2(depth) 339 | 340 | depth = shortcut_depth + self.drop_path(depth) 341 | 342 | return x, depth 343 | 344 | class ViT(nn.Module): 345 | """ 346 | This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`. 
347 | "Exploring Plain Vision Transformer Backbones for Object Detection", 348 | https://arxiv.org/abs/2203.16527 349 | """ 350 | 351 | def __init__( 352 | self, 353 | img_size=None, 354 | patch_size=16, 355 | in_chans=3, 356 | embed_dim=768, 357 | depth=12, 358 | num_heads=12, 359 | mlp_ratio=4.0, 360 | qkv_bias=True, 361 | proj_bias=True, 362 | mlp_bias=True, 363 | patch_embed_bias=True, 364 | drop_path_rate=0.0, 365 | norm_layer=nn.LayerNorm, 366 | act_layer=nn.GELU, 367 | gated_mlp=False, 368 | use_abs_pos=True, 369 | use_rel_pos=False, 370 | rel_pos_zero_init=True, 371 | window_size=0, 372 | window_block_indexes=(), 373 | residual_block_indexes=(), 374 | use_act_checkpoint=False, 375 | pretrain_img_size=224, 376 | pretrain_use_cls_token=True, 377 | out_feature="last_feat", 378 | depth_modality=False, 379 | depth_window_size=0, 380 | encoder_norm=False, 381 | layer_scale=False, 382 | image_name="image", 383 | depth_name="depth" 384 | ): 385 | """ 386 | Args: 387 | img_size (int): Input image size. 388 | patch_size (int): Patch size. 389 | in_chans (int): Number of input image channels. 390 | embed_dim (int): Patch embedding dimension. 391 | depth (int): Depth of ViT. 392 | num_heads (int): Number of attention heads in each ViT block. 393 | mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. 394 | qkv_bias (bool): If True, add a learnable bias to query, key, value. 395 | drop_path_rate (float): Stochastic depth rate. 396 | norm_layer (nn.Module): Normalization layer. 397 | act_layer (nn.Module): Activation layer. 398 | use_abs_pos (bool): If True, use absolute positional embeddings. 399 | use_rel_pos (bool): If True, add relative positional embeddings to the attention map. 400 | rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. 401 | window_size (int): Window size for window attention blocks. 402 | window_block_indexes (list): Indexes for blocks using window attention. 403 | residual_block_indexes (list): Indexes for blocks using conv propagation. 404 | use_act_checkpoint (bool): If True, use activation checkpointing. 405 | pretrain_img_size (int): input image size for pretraining models. 406 | pretrain_use_cls_token (bool): If True, pretrainig models use class token. 407 | out_feature (str): name of the feature from the last block. 408 | """ 409 | super().__init__() 410 | self.pretrain_use_cls_token = pretrain_use_cls_token 411 | self.depth_modality = depth_modality 412 | 413 | self.image_name = image_name 414 | self.depth_name = depth_name 415 | 416 | self.patch_embed = PatchEmbed( 417 | kernel_size=(patch_size, patch_size), 418 | stride=(patch_size, patch_size), 419 | in_chans=in_chans, 420 | embed_dim=embed_dim, 421 | bias=patch_embed_bias, 422 | ) 423 | 424 | if use_abs_pos: 425 | # Initialize absolute positional embedding with pretrain image size. 426 | num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size) 427 | num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches 428 | self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim)) 429 | nn.init.trunc_normal_(self.pos_embed, std=0.02) 430 | else: 431 | self.pos_embed = None 432 | 433 | self.pos_embed_depth = None 434 | if self.depth_modality: 435 | self.patch_embed_depth = PatchEmbed( 436 | kernel_size=(16, 16), 437 | stride=(16, 16), 438 | in_chans=1, 439 | embed_dim=embed_dim, 440 | ) 441 | 442 | if use_abs_pos: 443 | # note, depth gets its own pos embed. 
444 | # Initialize absolute positional embedding with pretrain image size. 445 | # at some point, this size may differ from RGB's size. 446 | num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size) 447 | num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches 448 | self.pos_embed_depth = nn.Parameter(torch.zeros(1, num_positions, embed_dim)) 449 | 450 | self.blocks = nn.ModuleList() 451 | for i in range(depth): 452 | block = Block( 453 | dim=embed_dim, 454 | num_heads=num_heads, 455 | mlp_ratio=mlp_ratio, 456 | qkv_bias=qkv_bias, 457 | proj_bias=proj_bias, 458 | mlp_bias=mlp_bias, 459 | norm_layer=norm_layer, 460 | act_layer=act_layer, 461 | use_rel_pos=use_rel_pos, 462 | rel_pos_zero_init=rel_pos_zero_init, 463 | window_size=window_size if i in window_block_indexes else 0, 464 | use_residual_block=i in residual_block_indexes, 465 | input_size=img_size, 466 | depth_modality=depth_modality and (i in window_block_indexes), # (for now, only attend to depth if windowing) 467 | depth_window_size=depth_window_size if i in window_block_indexes else 0, 468 | layer_scale=layer_scale 469 | ) 470 | 471 | self.blocks.append(block) 472 | 473 | self.encoder_norm = norm_layer(embed_dim) if encoder_norm else nn.Identity() 474 | 475 | self._out_feature_channels = {out_feature: embed_dim} 476 | self._out_feature_strides = {out_feature: patch_size} 477 | self._out_features = [out_feature] 478 | self.window_block_indexes = window_block_indexes 479 | 480 | self.drop_path = nn.Identity() 481 | 482 | self._square_pad = [256, 384, 512, 640, 768, 896, 1024, 1280] 483 | 484 | @property 485 | def num_channels(self): 486 | return list(self._out_feature_channels.values()) 487 | 488 | @property 489 | def size_divisibility(self): 490 | return next(iter(self._out_feature_strides.values())) 491 | 492 | def forward(self, s: BatchedPosedSensor): 493 | x = s[self.image_name].data.tensor 494 | image_shape = (x.shape[2], x.shape[3]) 495 | x = self.patch_embed(x) 496 | if self.pos_embed is not None: 497 | x = x + get_abs_pos(self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])) 498 | 499 | has_depth = self.depth_name in s 500 | has_depth_dropped = self.depth_modality and not has_depth 501 | 502 | if self.depth_modality: 503 | depth = s[self.depth_name].data.tensor[:, None] 504 | depth = self.patch_embed_depth(depth) 505 | if self.pos_embed_depth is not None: 506 | depth = depth + get_abs_pos( 507 | self.pos_embed_depth, self.pretrain_use_cls_token, (depth.shape[1], depth.shape[2])) 508 | else: 509 | depth = None 510 | 511 | for i, blk in enumerate(self.blocks): 512 | if blk.depth_modality and has_depth: 513 | x, depth = blk(x, depth=depth) 514 | else: 515 | x, *_ = blk(x) 516 | 517 | x = self.encoder_norm(x) 518 | 519 | outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)} 520 | return outputs 521 | 522 | -------------------------------------------------------------------------------- /data/LICENSE_DATA: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial-NoDerivatives 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. 
Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 58 | International Public License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial-NoDerivatives 4.0 International Public 63 | License ("Public License"). To the extent this Public License may be 64 | interpreted as a contract, You are granted the Licensed Rights in 65 | consideration of Your acceptance of these terms and conditions, and the 66 | Licensor grants You such rights in consideration of benefits the 67 | Licensor receives from making the Licensed Material available under 68 | these terms and conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. 
Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Copyright and Similar Rights means copyright and/or similar rights 84 | closely related to copyright including, without limitation, 85 | performance, broadcast, sound recording, and Sui Generis Database 86 | Rights, without regard to how the rights are labeled or 87 | categorized. For purposes of this Public License, the rights 88 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 89 | Rights. 90 | 91 | c. Effective Technological Measures means those measures that, in the 92 | absence of proper authority, may not be circumvented under laws 93 | fulfilling obligations under Article 11 of the WIPO Copyright 94 | Treaty adopted on December 20, 1996, and/or similar international 95 | agreements. 96 | 97 | d. Exceptions and Limitations means fair use, fair dealing, and/or 98 | any other exception or limitation to Copyright and Similar Rights 99 | that applies to Your use of the Licensed Material. 100 | 101 | e. Licensed Material means the artistic or literary work, database, 102 | or other material to which the Licensor applied this Public 103 | License. 104 | 105 | f. Licensed Rights means the rights granted to You subject to the 106 | terms and conditions of this Public License, which are limited to 107 | all Copyright and Similar Rights that apply to Your use of the 108 | Licensed Material and that the Licensor has authority to license. 109 | 110 | g. Licensor means the individual(s) or entity(ies) granting rights 111 | under this Public License. 112 | 113 | h. NonCommercial means not primarily intended for or directed towards 114 | commercial advantage or monetary compensation. For purposes of 115 | this Public License, the exchange of the Licensed Material for 116 | other material subject to Copyright and Similar Rights by digital 117 | file-sharing or similar means is NonCommercial provided there is 118 | no payment of monetary compensation in connection with the 119 | exchange. 120 | 121 | i. Share means to provide material to the public by any means or 122 | process that requires permission under the Licensed Rights, such 123 | as reproduction, public display, public performance, distribution, 124 | dissemination, communication, or importation, and to make material 125 | available to the public including in ways that members of the 126 | public may access the material from a place and at a time 127 | individually chosen by them. 128 | 129 | j. Sui Generis Database Rights means rights other than copyright 130 | resulting from Directive 96/9/EC of the European Parliament and of 131 | the Council of 11 March 1996 on the legal protection of databases, 132 | as amended and/or succeeded, as well as other essentially 133 | equivalent rights anywhere in the world. 134 | 135 | k. You means the individual or entity exercising the Licensed Rights 136 | under this Public License. Your has a corresponding meaning. 137 | 138 | 139 | Section 2 -- Scope. 140 | 141 | a. License grant. 
142 | 143 | 1. Subject to the terms and conditions of this Public License, 144 | the Licensor hereby grants You a worldwide, royalty-free, 145 | non-sublicensable, non-exclusive, irrevocable license to 146 | exercise the Licensed Rights in the Licensed Material to: 147 | 148 | a. reproduce and Share the Licensed Material, in whole or 149 | in part, for NonCommercial purposes only; and 150 | 151 | b. produce and reproduce, but not Share, Adapted Material 152 | for NonCommercial purposes only. 153 | 154 | 2. Exceptions and Limitations. For the avoidance of doubt, where 155 | Exceptions and Limitations apply to Your use, this Public 156 | License does not apply, and You do not need to comply with 157 | its terms and conditions. 158 | 159 | 3. Term. The term of this Public License is specified in Section 160 | 6(a). 161 | 162 | 4. Media and formats; technical modifications allowed. The 163 | Licensor authorizes You to exercise the Licensed Rights in 164 | all media and formats whether now known or hereafter created, 165 | and to make technical modifications necessary to do so. The 166 | Licensor waives and/or agrees not to assert any right or 167 | authority to forbid You from making technical modifications 168 | necessary to exercise the Licensed Rights, including 169 | technical modifications necessary to circumvent Effective 170 | Technological Measures. For purposes of this Public License, 171 | simply making modifications authorized by this Section 2(a) 172 | (4) never produces Adapted Material. 173 | 174 | 5. Downstream recipients. 175 | 176 | a. Offer from the Licensor -- Licensed Material. Every 177 | recipient of the Licensed Material automatically 178 | receives an offer from the Licensor to exercise the 179 | Licensed Rights under the terms and conditions of this 180 | Public License. 181 | 182 | b. No downstream restrictions. You may not offer or impose 183 | any additional or different terms or conditions on, or 184 | apply any Effective Technological Measures to, the 185 | Licensed Material if doing so restricts exercise of the 186 | Licensed Rights by any recipient of the Licensed 187 | Material. 188 | 189 | 6. No endorsement. Nothing in this Public License constitutes or 190 | may be construed as permission to assert or imply that You 191 | are, or that Your use of the Licensed Material is, connected 192 | with, or sponsored, endorsed, or granted official status by, 193 | the Licensor or others designated to receive attribution as 194 | provided in Section 3(a)(1)(A)(i). 195 | 196 | b. Other rights. 197 | 198 | 1. Moral rights, such as the right of integrity, are not 199 | licensed under this Public License, nor are publicity, 200 | privacy, and/or other similar personality rights; however, to 201 | the extent possible, the Licensor waives and/or agrees not to 202 | assert any such rights held by the Licensor to the limited 203 | extent necessary to allow You to exercise the Licensed 204 | Rights, but not otherwise. 205 | 206 | 2. Patent and trademark rights are not licensed under this 207 | Public License. 208 | 209 | 3. To the extent possible, the Licensor waives any right to 210 | collect royalties from You for the exercise of the Licensed 211 | Rights, whether directly or through a collecting society 212 | under any voluntary or waivable statutory or compulsory 213 | licensing scheme. In all other cases the Licensor expressly 214 | reserves any right to collect such royalties, including when 215 | the Licensed Material is used other than for NonCommercial 216 | purposes. 
217 | 218 | 219 | Section 3 -- License Conditions. 220 | 221 | Your exercise of the Licensed Rights is expressly made subject to the 222 | following conditions. 223 | 224 | a. Attribution. 225 | 226 | 1. If You Share the Licensed Material, You must: 227 | 228 | a. retain the following if it is supplied by the Licensor 229 | with the Licensed Material: 230 | 231 | i. identification of the creator(s) of the Licensed 232 | Material and any others designated to receive 233 | attribution, in any reasonable manner requested by 234 | the Licensor (including by pseudonym if 235 | designated); 236 | 237 | ii. a copyright notice; 238 | 239 | iii. a notice that refers to this Public License; 240 | 241 | iv. a notice that refers to the disclaimer of 242 | warranties; 243 | 244 | v. a URI or hyperlink to the Licensed Material to the 245 | extent reasonably practicable; 246 | 247 | b. indicate if You modified the Licensed Material and 248 | retain an indication of any previous modifications; and 249 | 250 | c. indicate the Licensed Material is licensed under this 251 | Public License, and include the text of, or the URI or 252 | hyperlink to, this Public License. 253 | 254 | For the avoidance of doubt, You do not have permission under 255 | this Public License to Share Adapted Material. 256 | 257 | 2. You may satisfy the conditions in Section 3(a)(1) in any 258 | reasonable manner based on the medium, means, and context in 259 | which You Share the Licensed Material. For example, it may be 260 | reasonable to satisfy the conditions by providing a URI or 261 | hyperlink to a resource that includes the required 262 | information. 263 | 264 | 3. If requested by the Licensor, You must remove any of the 265 | information required by Section 3(a)(1)(A) to the extent 266 | reasonably practicable. 267 | 268 | 269 | Section 4 -- Sui Generis Database Rights. 270 | 271 | Where the Licensed Rights include Sui Generis Database Rights that 272 | apply to Your use of the Licensed Material: 273 | 274 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 275 | to extract, reuse, reproduce, and Share all or a substantial 276 | portion of the contents of the database for NonCommercial purposes 277 | only and provided You do not Share Adapted Material; 278 | 279 | b. if You include all or a substantial portion of the database 280 | contents in a database in which You have Sui Generis Database 281 | Rights, then the database in which You have Sui Generis Database 282 | Rights (but not its individual contents) is Adapted Material; and 283 | 284 | c. You must comply with the conditions in Section 3(a) if You Share 285 | all or a substantial portion of the contents of the database. 286 | 287 | For the avoidance of doubt, this Section 4 supplements and does not 288 | replace Your obligations under this Public License where the Licensed 289 | Rights include other Copyright and Similar Rights. 290 | 291 | 292 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 293 | 294 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 295 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 296 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 297 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 298 | IMPLIED, STATUTORY, OR OTHER. 
THIS INCLUDES, WITHOUT LIMITATION, 299 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 300 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 301 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 302 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 303 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 304 | 305 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 306 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 307 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 308 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 309 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 310 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 311 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 312 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 313 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 314 | 315 | c. The disclaimer of warranties and limitation of liability provided 316 | above shall be interpreted in a manner that, to the extent 317 | possible, most closely approximates an absolute disclaimer and 318 | waiver of all liability. 319 | 320 | 321 | Section 6 -- Term and Termination. 322 | 323 | a. This Public License applies for the term of the Copyright and 324 | Similar Rights licensed here. However, if You fail to comply with 325 | this Public License, then Your rights under this Public License 326 | terminate automatically. 327 | 328 | b. Where Your right to use the Licensed Material has terminated under 329 | Section 6(a), it reinstates: 330 | 331 | 1. automatically as of the date the violation is cured, provided 332 | it is cured within 30 days of Your discovery of the 333 | violation; or 334 | 335 | 2. upon express reinstatement by the Licensor. 336 | 337 | For the avoidance of doubt, this Section 6(b) does not affect any 338 | right the Licensor may have to seek remedies for Your violations 339 | of this Public License. 340 | 341 | c. For the avoidance of doubt, the Licensor may also offer the 342 | Licensed Material under separate terms or conditions or stop 343 | distributing the Licensed Material at any time; however, doing so 344 | will not terminate this Public License. 345 | 346 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 347 | License. 348 | 349 | 350 | Section 7 -- Other Terms and Conditions. 351 | 352 | a. The Licensor shall not be bound by any additional or different 353 | terms or conditions communicated by You unless expressly agreed. 354 | 355 | b. Any arrangements, understandings, or agreements regarding the 356 | Licensed Material not stated herein are separate from and 357 | independent of the terms and conditions of this Public License. 358 | 359 | 360 | Section 8 -- Interpretation. 361 | 362 | a. For the avoidance of doubt, this Public License does not, and 363 | shall not be interpreted to, reduce, limit, restrict, or impose 364 | conditions on any use of the Licensed Material that could lawfully 365 | be made without permission under this Public License. 366 | 367 | b. To the extent possible, if any provision of this Public License is 368 | deemed unenforceable, it shall be automatically reformed to the 369 | minimum extent necessary to make it enforceable. If the provision 370 | cannot be reformed, it shall be severed from this Public License 371 | without affecting the enforceability of the remaining terms and 372 | conditions. 373 | 374 | c. 
No term or condition of this Public License will be waived and no 375 | failure to comply consented to unless expressly agreed to by the 376 | Licensor. 377 | 378 | d. Nothing in this Public License constitutes or may be interpreted 379 | as a limitation upon, or waiver of, any privileges and immunities 380 | that apply to the Licensor or You, including from the legal 381 | processes of any jurisdiction or authority. 382 | 383 | ======================================================================= 384 | 385 | Creative Commons is not a party to its public 386 | licenses. Notwithstanding, Creative Commons may elect to apply one of 387 | its public licenses to material it publishes and in those instances 388 | will be considered the "Licensor". The text of the Creative Commons 389 | public licenses is dedicated to the public domain under the CC0 Public 390 | Domain Dedication. Except for the limited purpose of indicating that 391 | material is shared under a Creative Commons public license or as 392 | otherwise permitted by the Creative Commons policies published at 393 | creativecommons.org/policies, Creative Commons does not authorize the 394 | use of the trademark "Creative Commons" or any other trademark or logo 395 | of Creative Commons without its prior written consent including, 396 | without limitation, in connection with any unauthorized modifications 397 | to any of its public licenses or any other arrangements, 398 | understandings, or agreements concerning use of licensed material. For 399 | the avoidance of doubt, this paragraph does not form part of the 400 | public licenses. 401 | 402 | Creative Commons may be contacted at creativecommons.org. 403 | -------------------------------------------------------------------------------- /data/val.txt: -------------------------------------------------------------------------------- 1 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662921.tar 2 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261179.tar 3 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115543.tar 4 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261143.tar 5 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261615.tar 6 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897545.tar 7 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261133.tar 8 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897552.tar 9 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45663113.tar 10 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897521.tar 11 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897501.tar 12 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261587.tar 13 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260903.tar 14 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42446540.tar 15 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204559.tar 16 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897561.tar 17 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331068.tar 18 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897538.tar 19 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331262.tar 20 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898486.tar 21 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204552.tar 22 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897599.tar 23 | 
https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332893.tar 24 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897692.tar 25 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331311.tar 26 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897647.tar 27 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333923.tar 28 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898811.tar 29 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204573.tar 30 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898521.tar 31 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331651.tar 32 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898538.tar 33 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115452.tar 34 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47204605.tar 35 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42897688.tar 36 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898570.tar 37 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331319.tar 38 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898867.tar 39 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331661.tar 40 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331971.tar 41 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899617.tar 42 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333452.tar 43 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899611.tar 44 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42898849.tar 45 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332000.tar 46 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899459.tar 47 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332885.tar 48 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899698.tar 49 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331988.tar 50 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899679.tar 51 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47331963.tar 52 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899691.tar 53 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333431.tar 54 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899725.tar 55 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333898.tar 56 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899729.tar 57 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47332915.tar 58 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-43896260.tar 59 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333440.tar 60 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-43896321.tar 61 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-43896330.tar 62 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333916.tar 63 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261631.tar 64 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899736.tar 65 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-44358442.tar 66 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334107.tar 67 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260898.tar 68 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458415.tar 69 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260854.tar 70 | 
https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334239.tar 71 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-42899712.tar 72 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-44358451.tar 73 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333927.tar 74 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895552.tar 75 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261121.tar 76 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47333934.tar 77 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45260920.tar 78 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334115.tar 79 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45261575.tar 80 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662942.tar 81 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115469.tar 82 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018730.tar 83 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662981.tar 84 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018375.tar 85 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47115525.tar 86 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45662970.tar 87 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334234.tar 88 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45663164.tar 89 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-45663149.tar 90 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47334256.tar 91 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47430475.tar 92 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895534.tar 93 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895341.tar 94 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895542.tar 95 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47430485.tar 96 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018345.tar 97 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018367.tar 98 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018559.tar 99 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018566.tar 100 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018382.tar 101 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018947.tar 102 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-47895364.tar 103 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48018737.tar 104 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458481.tar 105 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458427.tar 106 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458654.tar 107 | https://ml-site.cdn-apple.com/datasets/ca1m/val/ca1m-val-48458647.tar -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | cyclonedds 2 | rerun-sdk 3 | scipy 4 | tifffile 5 | timm 6 | torch 7 | torchvision 8 | webdataset==0.2.86 9 | Pillow 10 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | import platform 4 | import shutil 5 | import sys 6 | import warnings 7 | from os import path as osp 8 | from pkg_resources import DistributionNotFound, get_distribution 9 | from 
setuptools import find_packages, setup 10 | 11 | 12 | if __name__ == '__main__': 13 | setup( 14 | name='cubifyanything', 15 | version='0.0.1', 16 | description=("Public release of Cubify Anything"), 17 | author='Apple Inc.', 18 | author_email='jlazarow@apple.com', 19 | url='https://github.com/apple/ml-cubifyanything', 20 | packages=find_packages(), 21 | include_package_data=True, 22 | zip_safe=False) 23 | -------------------------------------------------------------------------------- /teaser.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apple/ml-cubifyanything/7419eb0cb9b19cb5257b4a1dc905476c155cd343/teaser.jpg -------------------------------------------------------------------------------- /tools/demo.py: -------------------------------------------------------------------------------- 1 | # For licensing see accompanying LICENSE file. 2 | # Copyright (C) 2025 Apple Inc. All Rights Reserved. 3 | import os 4 | os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" 5 | 6 | import argparse 7 | import glob 8 | import itertools 9 | import numpy as np 10 | import rerun 11 | import rerun.blueprint as rrb 12 | import torch 13 | import torchvision 14 | import sys 15 | import uuid 16 | 17 | from pathlib import Path 18 | from PIL import Image 19 | from scipy.spatial.transform import Rotation 20 | 21 | from cubifyanything.batching import Sensors 22 | from cubifyanything.boxes import GeneralInstance3DBoxes 23 | from cubifyanything.capture_stream import CaptureDataset 24 | from cubifyanything.color import random_color 25 | from cubifyanything.cubify_transformer import make_cubify_transformer 26 | from cubifyanything.dataset import CubifyAnythingDataset 27 | from cubifyanything.instances import Instances3D 28 | from cubifyanything.preprocessor import Augmentor, Preprocessor 29 | 30 | def move_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor: 31 | try: 32 | return src.to(dst) 33 | except Exception: 34 | return src.to(dst.device) 35 | 36 | def move_to_current_device(x, t): 37 | if isinstance(x, (list, tuple)): 38 | return [move_device_like(x_, t) for x_ in x] 39 | 40 | return move_device_like(x, t) 41 | 42 | def move_input_to_current_device(batched_input: Sensors, t: torch.Tensor): 43 | # Assume only two levels of nesting for now. 44 | return { name: { name_: move_to_current_device(m, t) for name_, m in s.items() } for name, s in batched_input.items() } 45 | 46 | # A global dictionary we use for consistent colors for instances across frames. 47 | ID_TO_COLOR = {} 48 | 49 | def log_instances(instances, prefix, boxes_3d_name="gt_boxes_3d", ids_name="gt_ids", log_instances_name="instances", **kwargs): 50 | global ID_TO_COLOR 51 | boxes_3d = instances.get(boxes_3d_name) 52 | 53 | colors = [] 54 | if instances.has(ids_name): 55 | ids = instances.get(ids_name) 56 | for id_ in ids: 57 | ID_TO_COLOR[id_] = ID_TO_COLOR.get(id_, random_color(rgb=True)) 58 | colors.append(ID_TO_COLOR[id_]) 59 | else: 60 | ids = None 61 | colors = [random_color(rgb=True) for _ in range(len(instances))] 62 | 63 | quaternions = [ 64 | rerun.Quaternion( 65 | xyzw=Rotation.from_matrix(r).as_quat() 66 | ) 67 | 68 | for r in boxes_3d.R.cpu().numpy() 69 | ] 70 | 71 | # Hard-code these suffixes. 
72 | rerun.log( 73 | f"{prefix}/{log_instances_name}", 74 | rerun.Boxes3D( 75 | centers=boxes_3d.gravity_center.cpu().numpy(), 76 | sizes=boxes_3d.dims.cpu().numpy(), 77 | quaternions=quaternions, 78 | colors=colors, 79 | labels=ids, 80 | show_labels=False), 81 | **kwargs) 82 | 83 | def load_data_and_visualize(dataset): 84 | blueprint = rrb.Blueprint( 85 | rrb.Vertical( 86 | contents=[ 87 | rrb.Spatial3DView( 88 | name="World", 89 | origin="/world"), 90 | rrb.Horizontal( 91 | contents=[ 92 | rrb.Spatial2DView( 93 | name="Image", 94 | origin="/device/wide/image", 95 | contents=[ 96 | "+ $origin/**", 97 | "+ /device/wide/instances/**" 98 | ]), 99 | rrb.Spatial2DView( 100 | name="Depth", 101 | origin="/device/wide/depth"), 102 | rrb.Spatial2DView( 103 | name="Depth (GT)", 104 | origin="/device/gt/depth"), 105 | ], 106 | name="Wide") 107 | ])) 108 | 109 | recording = None 110 | video_id = None 111 | for sample in dataset: 112 | sample_video_id = sample["meta"]["video_id"] 113 | if (recording is None) or (video_id != sample_video_id): 114 | new_recording = rerun.new_recording( 115 | application_id=str(sample_video_id), recording_id=uuid.uuid4(), make_default=True) 116 | 117 | new_recording.send_blueprint(blueprint, make_active=True) 118 | rerun.spawn() 119 | 120 | recording = new_recording 121 | video_id = sample_video_id 122 | 123 | # Check for the world. Note that this may not show if --every-nth-frame is used. 124 | if "world" in sample: 125 | world_instances = sample["world"]["instances"] 126 | log_instances(world_instances, prefix="/world", static=True) 127 | continue 128 | 129 | rerun.set_time_seconds("pts", sample["meta"]["timestamp"], recording=recording) 130 | 131 | # -> channels last. 132 | image = np.moveaxis(sample["wide"]["image"][-1].numpy(), 0, -1) 133 | camera = rerun.Pinhole( 134 | image_from_camera=sample["sensor_info"].wide.image.K[-1].numpy(), resolution=sample["sensor_info"].wide.image.size) 135 | 136 | # Log this to both the device (per-frame) and to the world. 137 | rerun.log("/device/wide/image", rerun.Image(image).compress()) 138 | rerun.log("/device/wide/image", camera) 139 | 140 | # RT here corresponds to the laser-scanner space, as registered to the capture device, so this allows us 141 | # to visualize the camera with respect to the annotation space. 142 | RT = sample["sensor_info"].gt.RT[-1].numpy() 143 | pose_transform = rerun.Transform3D( 144 | translation=RT[:3, 3], 145 | rotation=rerun.Quaternion(xyzw=Rotation.from_matrix(RT[:3, :3]).as_quat())) 146 | 147 | rerun.log("/world/image", pose_transform) 148 | rerun.log("/world/image", camera) 149 | rerun.log("/world/image/image", rerun.Image(image, opacity=0.5)) 150 | 151 | rerun.log("/device/wide/depth", rerun.DepthImage(sample["wide"]["depth"][-1].numpy())) 152 | rerun.log("/device/gt/depth", rerun.DepthImage(sample["gt"]["depth"][-1].numpy())) 153 | 154 | per_frame_instances = sample["wide"]["instances"] 155 | log_instances(per_frame_instances, prefix="/device/wide") 156 | 157 | def get_camera_coords(depth): 158 | height, width = depth.shape 159 | device = depth.device 160 | 161 | # camera xy. 
162 | camera_coords = torch.stack( 163 | torch.meshgrid( 164 | torch.arange(0, width, device=device), 165 | torch.arange(0, height, device=device), indexing="xy"), 166 | dim=-1) 167 | 168 | return camera_coords 169 | 170 | def unproject(depth, K, RT, max_depth=10.0): 171 | camera_coords = get_camera_coords(depth) * depth[..., None] 172 | 173 | intrinsics_4x4 = torch.eye(4, device=depth.device) 174 | intrinsics_4x4[:3, :3] = K 175 | 176 | valid = depth > 0 177 | if max_depth is not None: 178 | valid &= (depth < max_depth) 179 | 180 | depth = depth[..., None] 181 | uvd = torch.cat((camera_coords, depth, torch.ones_like(depth)), dim=-1) 182 | 183 | camera_xyz = torch.linalg.inv(intrinsics_4x4) @ uvd.view(-1, 4).T 184 | world_xyz = RT @ camera_xyz 185 | 186 | return world_xyz.T[..., :-1].reshape(uvd.shape[0], uvd.shape[1], 3), valid 187 | 188 | def load_data_and_execute_model(model, dataset, augmentor, preprocessor, score_thresh=0.0, viz_on_gt_points=False): 189 | is_depth_model = "wide/depth" in augmentor.measurement_keys 190 | blueprint = rrb.Blueprint( 191 | rrb.Vertical( 192 | contents=[ 193 | rrb.Spatial3DView( 194 | name="World", 195 | contents=[ 196 | "+ $origin/**", 197 | "+ /device/wide/pred_instances/**" 198 | ], 199 | origin="/world"), 200 | rrb.Horizontal( 201 | contents=([ 202 | rrb.Spatial2DView( 203 | name="Image", 204 | origin="/device/wide/image", 205 | contents=[ 206 | "+ $origin/**", 207 | "+ /device/wide/pred_instances/**" 208 | ]) 209 | ] + ([ 210 | # Only show this for RGB-D. 211 | rrb.Spatial2DView( 212 | name="Depth", 213 | origin="/device/wide/depth") 214 | ] if is_depth_model else [])), 215 | name="Wide") 216 | ])) 217 | 218 | recording = None 219 | video_id = None 220 | 221 | device = model.pixel_mean 222 | for sample in dataset: 223 | sample_video_id = sample["meta"]["video_id"] 224 | if (recording is None) or (video_id != sample_video_id): 225 | new_recording = rerun.new_recording( 226 | application_id=str(sample_video_id), recording_id=uuid.uuid4(), make_default=True) 227 | new_recording.send_blueprint(blueprint, make_active=True) 228 | rerun.spawn() 229 | 230 | recording = new_recording 231 | video_id = sample_video_id 232 | 233 | # Keep things in image space, so adjust accordingly. 234 | rerun.log("/world", rerun.ViewCoordinates.RIGHT_HAND_Y_DOWN, static=True) 235 | 236 | rerun.set_time_seconds("pts", sample["meta"]["timestamp"], recording=recording) 237 | 238 | # -> channels last. 239 | image = np.moveaxis(sample["wide"]["image"][-1].numpy(), 0, -1) 240 | color_camera = rerun.Pinhole( 241 | image_from_camera=sample["sensor_info"].wide.image.K[-1].numpy(), resolution=sample["sensor_info"].wide.image.size) 242 | 243 | if is_depth_model: 244 | # Show the depth being sent to the model. 245 | depth_camera = rerun.Pinhole( 246 | image_from_camera=sample["sensor_info"].wide.depth.K[-1].numpy(), resolution=sample["sensor_info"].wide.depth.size) 247 | 248 | xyzrgb = None 249 | if viz_on_gt_points and sample["sensor_info"].has("gt"): 250 | # Backproject GT depth to world so we can compare our predictions. 251 | depth_gt = sample["gt"]["depth"][-1] 252 | matched_image = torch.tensor(np.array(Image.fromarray(image).resize((depth_gt.shape[1], depth_gt.shape[0])))) 253 | 254 | # Feel free to change max_depth, but know CA is only trained up to 5m. 
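# unproject() lifts each valid pixel to 3D as X = d * K^-1 [u, v, 1]^T and then applies RT; passing
# torch.eye(4) below keeps the backprojected points in the GT camera frame, which matches the
# camera-space predictions shown in the World view.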
255 | xyz, valid = unproject(depth_gt, sample["sensor_info"].gt.depth.K[-1], torch.eye(4), max_depth=10.0) 256 | xyzrgb = torch.cat((xyz, matched_image / 255.0), dim=-1)[valid] 257 | 258 | packaged = augmentor.package(sample) 259 | packaged = move_input_to_current_device(packaged, device) 260 | packaged = preprocessor.preprocess([packaged]) 261 | 262 | with torch.no_grad(): 263 | pred_instances = model(packaged)[0] 264 | 265 | pred_instances = pred_instances[pred_instances.scores >= score_thresh] 266 | 267 | # Hold off on logging anything until now, since the delay might confuse the user in the visualizer. 268 | rerun.log("/device/wide/image", rerun.Image(image).compress()) 269 | rerun.log("/device/wide/image", color_camera) 270 | 271 | if is_depth_model: 272 | rerun.log("/device/wide/depth", rerun.DepthImage(sample["wide"]["depth"][-1].numpy())) 273 | rerun.log("/device/wide/depth", depth_camera) 274 | 275 | if xyzrgb is not None: 276 | rerun.log("/world/xyz", rerun.Points3D(positions=xyzrgb[..., :3], colors=xyzrgb[..., 3:], radii=None)) 277 | 278 | log_instances(pred_instances, prefix="/device/wide", boxes_3d_name="pred_boxes_3d", ids_name=None, log_instances_name="pred_instances") 279 | 280 | if __name__ == "__main__": 281 | parser = argparse.ArgumentParser() 282 | 283 | parser.add_argument("dataset_path", help="Path to the directory containing the .tar files, the full path to a single tar file (recommended), or a path to a txt file containing HTTP links. Using the value \"stream\" will attempt to stream from your device using the NeRFCapture app") 284 | parser.add_argument("--model-path", help="Path to the model to load") 285 | parser.add_argument("--no-depth", default=False, action="store_true", help="Skip loading depth.") 286 | parser.add_argument("--score-thresh", default=0.25, type=float, help="Threshold for detections") 287 | parser.add_argument("--every-nth-frame", default=None, type=int, help="Load every `n` frames") 288 | parser.add_argument("--viz-only", default=False, action="store_true", help="Skip loading a model and only visualize data.") 289 | parser.add_argument("--viz-on-gt-points", default=False, action="store_true", help="Backproject the GT depth to form a point cloud in order to visualize the predictions") 290 | parser.add_argument("--device", default="cpu", help="Which device to push the model to (cpu, mps, cuda)") 291 | parser.add_argument("--video-ids", nargs="+", help="Subset of videos to execute on. By default, all. Ignored if a tar file is explicitly given or in stream mode.") 292 | 293 | args = parser.parse_args() 294 | print("Command Line Args:", args) 295 | 296 | dataset_path = args.dataset_path 297 | use_cache = False 298 | 299 | if dataset_path == "stream": 300 | dataset = CaptureDataset() 301 | else: 302 | dataset_files = [] 303 | 304 | # Allow the user to specify a single tar or a txt file containing an http link per line. 305 | if os.path.isfile(dataset_path): 306 | if dataset_path.endswith(".txt"): 307 | with open(dataset_path, "r") as dataset_file: 308 | dataset_files = [l.strip() for l in dataset_file.readlines()] 309 | 310 | # Cache these files locally to prevent repeated downloads. 
311 | use_cache = True 312 | else: 313 | args.video_ids = None 314 | dataset_files = [dataset_path] 315 | else: 316 | # Try to glob all files matching ca1m-*.tar 317 | dataset_files = glob.glob(os.path.join(dataset_path, "ca1m-*.tar")) 318 | if len(dataset_files) == 0: 319 | raise ValueError(f"Failed to find any .tar files matching ca1m- prefix at {dataset_path}") 320 | 321 | if args.video_ids is not None: 322 | dataset_files = [df for df in dataset_files if Path(df).with_suffix("").name.split("-")[-1] in args.video_ids] 323 | 324 | if len(dataset_files) == 0: 325 | raise ValueError("No data was found") 326 | 327 | dataset = CubifyAnythingDataset( 328 | [Path(df).as_uri() if not df.startswith("https://") else df for df in dataset_files], 329 | yield_world_instances=args.viz_only, 330 | load_arkit_depth=not args.no_depth, 331 | use_cache=use_cache) 332 | 333 | if args.viz_only: 334 | if args.every_nth_frame is not None: 335 | dataset = itertools.islice(dataset, 0, None, args.every_nth_frame) 336 | 337 | load_data_and_visualize(dataset) 338 | sys.exit(0) 339 | 340 | assert args.model_path is not None 341 | checkpoint = torch.load(args.model_path, map_location=args.device or "cpu")["model"] 342 | 343 | # Figure out which model this is based on the weights. 344 | 345 | # Basic detection of the actual ViT backbone being used (for our setup, dimension is 1:1 with which ViT). 346 | backbone_embedding_dimension = checkpoint["backbone.0.patch_embed.proj.weight"].shape[0] 347 | 348 | # We need to detect RGB-D or RGB-only models so we can disable sending depth. 349 | is_depth_model = any(k.startswith("backbone.0.patch_embed_depth.") for k in checkpoint.keys()) 350 | 351 | model = make_cubify_transformer(dimension=backbone_embedding_dimension, depth_model=is_depth_model).eval() 352 | model.load_state_dict(checkpoint) 353 | 354 | # No need for ARKit depth if running an RGB-only model. 355 | dataset.load_arkit_depth = is_depth_model 356 | if args.every_nth_frame is not None: 357 | dataset = itertools.islice(dataset, 0, None, args.every_nth_frame) 358 | 359 | augmentor = Augmentor(("wide/image", "wide/depth") if is_depth_model else ("wide/image",)) 360 | preprocessor = Preprocessor() 361 | 362 | if args.device is not None: 363 | model = model.to(args.device) 364 | 365 | load_data_and_execute_model(model, dataset, augmentor, preprocessor, score_thresh=args.score_thresh, viz_on_gt_points=args.viz_on_gt_points) 366 | --------------------------------------------------------------------------------