├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── demo.py
├── images
│   ├── loc-narr.gif
│   ├── paper_thumb_1.jpeg
│   ├── paper_thumb_10.jpeg
│   ├── paper_thumb_11.jpeg
│   ├── paper_thumb_12.jpeg
│   ├── paper_thumb_13.jpeg
│   ├── paper_thumb_14.jpeg
│   ├── paper_thumb_2.jpeg
│   ├── paper_thumb_3.jpeg
│   ├── paper_thumb_4.jpeg
│   ├── paper_thumb_5.jpeg
│   ├── paper_thumb_6.jpeg
│   ├── paper_thumb_7.jpeg
│   ├── paper_thumb_8.jpeg
│   └── paper_thumb_9.jpeg
├── index.html
├── localized_narratives.py
├── transcription_example.py
└── web.js

/.gitignore:
--------------------------------------------------------------------------------
speech_api_env
.idea/
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# How to Contribute

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

## Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com/> to see
your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

## Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.

## Community Guidelines

This project follows
[Google's Open Source Community Guidelines](https://opensource.google.com/conduct/).
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.
      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Localized Narratives
Visit the [project page](https://google.github.io/localized-narratives) for all the information about Localized Narratives: data downloads, visualizations, and much more.
--------------------------------------------------------------------------------
/demo.py:
--------------------------------------------------------------------------------
# python3
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Demo usage of the Localized Narratives data loader."""
import localized_narratives

# The folder to which the annotation files are downloaded and from which they
# are read back.
local_dir = '/path/to/downloaded/data'

# The DataLoader class lets us download the data and read it from file.
data_loader = localized_narratives.DataLoader(local_dir)

# Downloads the annotation files (skipping any that were already downloaded).
data_loader.download_annotations('coco_val')

# Iterates through all annotations found in the local folder for a given
# dataset and split, or through a limited number of them (1 in this case).
# For `open_images_train`, for example, it will read only one shard if only
# one shard file was downloaded manually.
loc_narr = next(data_loader.load_annotations('coco_val', 1))

print(f'\nLocalized Narrative sample:\n{loc_narr}')

print(f'\nVoice recording URL:\n {loc_narr.voice_recording_url}\n')
--------------------------------------------------------------------------------
/images/loc-narr.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/loc-narr.gif
--------------------------------------------------------------------------------
/images/paper_thumb_1.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_1.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_10.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_10.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_11.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_11.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_12.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_12.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_13.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_13.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_14.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_14.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_2.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_2.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_3.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_3.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_4.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_4.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_5.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_5.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_6.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_6.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_7.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_7.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_8.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_8.jpeg
--------------------------------------------------------------------------------
/images/paper_thumb_9.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google/localized-narratives/5c5b3031bc6feb1b453410b8cedece4541cf6e7c/images/paper_thumb_9.jpeg
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
Localized Narratives
Connecting Vision and Language with Localized Narratives

Publication
Connecting Vision and Language with Localized Narratives
Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari
ECCV (Spotlight), 2020
[PDF] [BibTeX] [1'30'' video] [10' video]

@inproceedings{PontTuset_eccv2020,
  author    = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title     = {Connecting Vision and Language with Localized Narratives},
  booktitle = {ECCV},
  year      = {2020}
}
Abstract

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.

Explore Localized Narratives

Explore some images and play the Localized Narrative annotation: synchronized voice, caption, and mouse trace. Don't forget to turn the sound on!

License

All the annotations available through this website are released under a CC BY 4.0 license. You are free to redistribute and modify the annotations, but we ask you to please keep the original attribution to our paper.

Code

Python Data Loader and Helpers

Visit the GitHub repository to view the code for downloading and working with Localized Narratives. Here is the documentation about the file formats used. Alternatively, you can manually download the data below.

From Traces to Boxes

This colab demonstrates how we get from a trace segment to a bounding box.
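As a rough illustration of the idea (the colab above is the authoritative version, and the exact cleanup steps it applies are not described on this page), one simple way to turn a trace segment into a box is to take the extent of its points, clamped to the image. The function below is a hedged sketch under that assumption, written against the trace-point format documented further down:

def trace_segment_to_box(segment, image_width, image_height):
  """Returns (x_min, y_min, x_max, y_max) in pixels for one trace segment.

  `segment` is a list of {'x': ..., 'y': ..., 't': ...} timed points with
  normalized coordinates, which can fall slightly outside [0, 1].
  """
  xs = [min(max(p['x'], 0.0), 1.0) for p in segment]  # Clamp to the image.
  ys = [min(max(p['y'], 0.0), 1.0) for p in segment]
  return (min(xs) * image_width, min(ys) * image_height,
          max(xs) * image_width, max(ys) * image_height)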
Downloads

Full Localized Narratives

Here you can download the full set of Localized Narratives (format description). Large files are split into shards (a list of them will appear when you click below). In parentheses, the number of Localized Narratives in each split. Please note that some images have more than one Localized Narrative annotation, e.g. 5k images in COCO are annotated 5 times.

File formats

The annotations are in JSON Lines format, that is, each line of the file is an independent valid JSON-encoded object. The largest files are split into smaller sub-files (shards) for ease of download. Since each line of the file is independent, the whole file can be reconstructed by simply concatenating the contents of the shards.

Each line represents one Localized Narrative annotation on one image by one annotator and has the following fields:
  • dataset_id: String identifying the dataset and split where the image belongs, e.g. mscoco_val2017.
  • image_id: String identifier of the image, as specified on each dataset.
  • annotator_id: Integer number uniquely identifying each annotator.
  • caption: Image caption as a string of characters.
  • timed_caption: List of timed utterances, i.e. {utterance, start_time, end_time}, where utterance is a word (or group of words) and (start_time, end_time) is the time during which it was spoken, with respect to the start of the recording.
  • traces: List of trace segments, one between each time the mouse pointer enters the image and goes away from it. Each trace segment is represented as a list of timed points, i.e. {x, y, t}, where x and y are the normalized image coordinates (with origin at the top-left corner of the image) and t is the time in seconds since the start of the recording. Please note that the coordinates can go a bit beyond the image, i.e. <0 or >1, as we recorded the mouse traces including a small band around the image. (See the sketch after the sample below for one way to pair these points with the timed caption.)
  • voice_recording: Relative URL path, with respect to https://storage.googleapis.com/localized-narratives/voice-recordings, where to find the voice recording (in OGG format) for that particular image.

Below is a sample of one Localized Narrative in this format:
{
  dataset_id: 'mscoco_val2017',
  image_id: '137576',
  annotator_id: 93,
  caption: 'In this image there are group of cows standing and eating th...',
  timed_caption: [{'utterance': 'In this', 'start_time': 0.0, 'end_time': 0.4}, ...],
  traces: [[{'x': 0.2086, 'y': -0.0533, 't': 0.022}, ...], ...],
  voice_recording: 'coco_val/coco_val_137576_93.ogg'
}
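To make the format concrete, here is a minimal sketch (plain Python, using only the fields documented above) that reads annotations from downloaded .jsonl shards and pairs each timed utterance with the trace points recorded while it was spoken. Since every line is an independent JSON object, reading the shards one after the other is equivalent to reading the concatenated file; the glob pattern follows the shard names used by localized_narratives.py below. The simple time-window pairing is only an illustration of how the synchronization can be used, not the paper's exact alignment:

import glob
import json

def read_annotations(shard_pattern):
  """Yields one annotation dict per line, across all matching shards."""
  for shard in sorted(glob.glob(shard_pattern)):
    with open(shard, 'r', encoding='utf-8') as f:
      for line in f:
        yield json.loads(line)

def trace_points_per_utterance(annotation):
  """Pairs each timed utterance with the trace points spoken during it."""
  all_points = [p for segment in annotation['traces'] for p in segment]
  return [(utt['utterance'],
           [p for p in all_points
            if utt['start_time'] <= p['t'] <= utt['end_time']])
          for utt in annotation['timed_caption']]

for annotation in read_annotations('coco_val_localized_narratives*.jsonl'):
  for word, points in trace_points_per_utterance(annotation):
    print(f'{word}: {len(points)} trace points')
  break  # Inspect only the first annotation.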
Open Images

  Train (507,444), in 10 shards
  Validation
  Test

COCO

  Train (134,272), in 4 shards
  Validation

Flickr30k

  Train
  Validation
  Test

ADE20k

  Train
  Validation
Textual captions only

To facilitate download, below are the annotations on the same images as above but containing only the textual caption, in case you are only interested in this part of Localized Narratives. Downloads are available for Open Images, COCO, Flickr30k, and ADE20k.
Automatic speech-to-text transcriptions

Below you can download the automatic speech-to-text transcriptions from the voice recordings. The format is a list of text chunks, each of which is a list of ten alternatives, each with its confidence.

Please note: the final caption text of Localized Narratives is given manually by the annotators. The automatic transcriptions below are only used to temporally align the manual transcription to the mouse traces. The timestamps used for this, though, were not stored, so the alignment process cannot be reproduced. To obtain timestamps, you would need to re-run Google's speech-to-text transcription (here is the code we used). Given that the API is constantly evolving, though, the transcription will likely not match the one stored below.
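The exact JSON layout of these transcription files is not spelled out on this page, so the helper below is only a hypothetical sketch: it assumes each transcription is a list of chunks, and each chunk a list of {'transcript', 'confidence'} alternatives. The field names are assumptions; adjust them to whatever the downloaded files actually contain.

def best_transcript(chunks):
  """Joins the highest-confidence alternative of each text chunk."""
  # `chunks`: list of chunks, each a list of (up to ten) alternatives.
  # The 'confidence' and 'transcript' keys are assumed, not documented.
  return ' '.join(
      max(chunk, key=lambda alt: alt['confidence'])['transcript']
      for chunk in chunks)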
Transcriptions are available for Open Images, COCO, Flickr30k, and ADE20k.
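If you do re-run the transcription to recover timestamps, note that transcription_example.py below already sets enable_word_time_offsets=True, so word-level timings can be read from the API response. A minimal sketch, assuming the google-cloud-speech response types used in that script (where start_time and end_time behave like timedelta durations):

def print_word_offsets(response):
  """Prints each recognized word with its start and end time, in seconds."""
  for result in response.results:
    top_alternative = result.alternatives[0]  # Highest confidence first.
    for word_info in top_alternative.words:
      print(f'{word_info.word}: '
            f'{word_info.start_time.total_seconds():.2f}s to '
            f'{word_info.end_time.total_seconds():.2f}s')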
--------------------------------------------------------------------------------
/localized_narratives.py:
--------------------------------------------------------------------------------
# python3
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Data Loader for Localized Narratives."""

import json
import os
import re
from typing import Dict, Generator, List, NamedTuple
import wget  # type: ignore


_ROOT_URL = 'https://storage.googleapis.com/localized-narratives'
_ANNOTATIONS_ROOT_URL = f'{_ROOT_URL}/annotations'
_RECORDINGS_ROOT_URL = f'{_ROOT_URL}/voice-recordings'

_ANNOTATION_FILES = {
    'open_images_train': [
        f'open_images_train_v6_localized_narratives-{i:05d}-of-00010.jsonl'
        for i in range(10)
    ],
    'open_images_val': ['open_images_validation_localized_narratives.jsonl'],
    'open_images_test': ['open_images_test_localized_narratives.jsonl'],
    'coco_train': [
        f'coco_train_localized_narratives-{i:05d}-of-00004.jsonl'
        for i in range(4)
    ],
    'coco_val': ['coco_val_localized_narratives.jsonl'],
    'flickr30k_train': ['flickr30k_train_localized_narratives.jsonl'],
    'flickr30k_val': ['flickr30k_val_localized_narratives.jsonl'],
    'flickr30k_test': ['flickr30k_test_localized_narratives.jsonl'],
    'ade20k_train': ['ade20k_train_localized_narratives.jsonl'],
    'ade20k_val': ['ade20k_validation_localized_narratives.jsonl']
}  # type: Dict[str, List[str]]


class TimedPoint(NamedTuple):
  x: float
  y: float
  t: float


class TimedUtterance(NamedTuple):
  utterance: str
  start_time: float
  end_time: float


class LocalizedNarrative(NamedTuple):
  """Represents a Localized Narrative annotation.

  Visit https://google.github.io/localized-narratives/index.html?file-formats=1
  for the documentation of each field.
  """
  dataset_id: str
  image_id: str
  annotator_id: int
  caption: str
  timed_caption: List[TimedUtterance]
  traces: List[List[TimedPoint]]
  voice_recording: str

  @property
  def voice_recording_url(self) -> str:
    """Returns the absolute path where to find the voice recording file."""
    # Fixes the voice recording path for Flickr30K and ADE20k.
    if 'Flic' in self.dataset_id or 'ADE' in self.dataset_id:
      split_id, image_id = re.search(r'(\w+)/\w+_([0-9]+)_[0-9]+\.',
                                     self.voice_recording).groups()
      image_id = image_id.zfill(16)
      voice_recording = (f'{split_id}/'
                         f'{split_id}_{image_id}_{self.annotator_id}.ogg')
    else:
      voice_recording = self.voice_recording

    return f'{_RECORDINGS_ROOT_URL}/{voice_recording}'

  def __repr__(self):
    truncated_caption = self.caption[:60] + '...' if len(
        self.caption) > 63 else self.caption
    truncated_timed_caption = self.timed_caption[0].__str__()
    truncated_traces = self.traces[0][0].__str__()
    return (f'{{\n'
            f' dataset_id: {self.dataset_id},\n'
            f' image_id: {self.image_id},\n'
            f' annotator_id: {self.annotator_id},\n'
            f' caption: {truncated_caption},\n'
            f' timed_caption: [{truncated_timed_caption}, ...],\n'
            f' traces: [[{truncated_traces}, ...], ...],\n'
            f' voice_recording: {self.voice_recording}\n'
            f'}}')


def _expected_files(dataset_and_split: str) -> Generator[str, None, None]:
  try:
    yield from _ANNOTATION_FILES[dataset_and_split]
  except KeyError:
    raise ValueError(
        f'Unknown value for `dataset_and_split`: {dataset_and_split}')


class DataLoader:
  """Data Loader for Localized Narratives."""

  def __init__(self, local_root_dir: str):
    """DataLoader constructor.

    Args:
      local_root_dir: Local directory where the annotation files can be
        downloaded to and read from.
    """
    self._local_root_dir = local_root_dir
    self._current_open_file = None

  def download_annotations(self, dataset_and_split: str):
    """Downloads the Localized Narratives annotations.

    Args:
      dataset_and_split: Name of the dataset and split to download.
        Possible values are the keys in _ANNOTATION_FILES.
    """
    os.makedirs(self._local_root_dir, exist_ok=True)

    for filename in _expected_files(dataset_and_split):
      self._download_one_file(filename)

  def load_annotations(
      self, dataset_and_split: str, max_num_annotations: int = int(1e30)
  ) -> Generator[LocalizedNarrative, None, None]:
    """Loads the Localized Narratives annotations from local files.

    Args:
      dataset_and_split: Name of the dataset and split to load. Possible values
        are the keys in _ANNOTATION_FILES.
      max_num_annotations: Maximum number of annotations to load.

    Yields:
      One Localized Narrative at a time.
    """
    num_loaded = 0
    for local_file in self._find_files(dataset_and_split):
      self._current_open_file = open(local_file, 'rb')
      for line in self._current_open_file:
        yield LocalizedNarrative(**json.loads(line))
        num_loaded += 1
        if num_loaded == max_num_annotations:
          self._current_open_file.close()
          return
      self._current_open_file.close()

  def _local_file(self, filename: str) -> str:
    return os.path.join(self._local_root_dir, filename)

  def _find_files(self, dataset_and_split: str) -> Generator[str, None, None]:
    for filename in _expected_files(dataset_and_split):
      if os.path.exists(self._local_file(filename)):
        yield self._local_file(filename)

  def _download_one_file(self, filename: str):
    if not os.path.exists(self._local_file(filename)):
      print(f'Downloading: {filename}')
      wget.download(f'{_ANNOTATIONS_ROOT_URL}/{filename}',
                    self._local_file(filename))
      print()
    else:
      print(f'Already downloaded: {filename}')
--------------------------------------------------------------------------------
/transcription_example.py:
--------------------------------------------------------------------------------
# python3
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example call to Google's speech-to-text API to transcribe Localized Narrative recordings.

Pre-requisites:
- Set up Google's API authentication:
  https://cloud.google.com/docs/authentication/getting-started
- Install dependencies:
  + pip install ffmpeg
  + pip install pydub
  + pip install google-cloud-speech

Comments:
- Google's speech-to-text API does not support the Vorbis encoding in which the
  Localized Narrative recordings were released. We therefore need to transcode
  them to Opus, which is supported. We do this in `convert_recording`.
- Transcription is limited to 60 seconds if the audio is loaded from a local
  file. For audio longer than 1 minute, we need to upload the file to a GCS
  bucket and load the audio using its URI:
  `audio = speech.RecognitionAudio(uri=gcs_uri)`.
"""
import io
import os

from google.cloud import speech
import pydub


def convert_recording(input_file, output_file):
  with open(input_file, 'rb') as f:
    recording = pydub.AudioSegment.from_file(f, codec='libvorbis')

  with open(output_file, 'wb') as f:
    recording.export(f, format='ogg', codec='libopus')


def speech_to_text(recording_file):
  # Loads from a local file. If longer than 60 seconds, upload to GCS and use
  # `audio = speech.RecognitionAudio(uri=gcs_uri)` instead.
  with io.open(recording_file, 'rb') as audio_file:
    content = audio_file.read()
    audio = speech.RecognitionAudio(content=content)

  config = speech.RecognitionConfig(
      encoding=speech.RecognitionConfig.AudioEncoding.OGG_OPUS,
      sample_rate_hertz=48000,
      audio_channel_count=2,
      max_alternatives=10,
      enable_word_time_offsets=True,
      language_code='en-IN')

  client = speech.SpeechClient()
  operation = client.long_running_recognize(config=config, audio=audio)
  return operation.result(timeout=90)


if __name__ == '__main__':

  # Input encoded in Vorbis in an OGG container.
  input_recording = '/Users/jponttuset/Downloads/coco_val_137576_93.ogg'
  basename, extension = os.path.splitext(input_recording)
  output_recording = f'{basename}_opus{extension}'

  # Re-encodes in Opus and saves to file.
  convert_recording(input_recording, output_recording)

  # Actual call to Google's speech-to-text API.
  result = speech_to_text(output_recording)
  print(result)
--------------------------------------------------------------------------------