├── .gitignore ├── DeclarationGeneration ├── README.md ├── finetuning_T5.py ├── run_predict.sh └── run_train.sh ├── README.md └── VinVL ├── Oscar ├── .gitignore ├── .gitmodules ├── CODE_OF_CONDUCT.md ├── DOWNLOAD.md ├── INSTALL.md ├── LICENSE ├── MODEL_ZOO.md ├── README.md ├── SECURITY.md ├── VinVL_DOWNLOAD.md ├── VinVL_MODEL_ZOO.md ├── docs │ ├── oscar.PNG │ ├── oscar_logo.png │ └── pretrain_corpus.PNG ├── oscar │ ├── __init__.py │ ├── datasets │ │ ├── __init__.py │ │ ├── build.py │ │ └── oscar_tsv.py │ ├── modeling │ │ ├── __init__.py │ │ ├── modeling_bert.py │ │ └── modeling_utils.py │ ├── run_captioning.py │ ├── run_gqa.py │ ├── run_gqa_prompt_itm.py │ ├── run_gqa_prompt_mlm.py │ ├── run_gqa_prompt_zero_few.py │ ├── run_nlvr.py │ ├── run_oscarplus_pretrain.py │ ├── run_retrieval.py │ ├── run_vqa.py │ ├── run_vqa_prompt_itm.py │ ├── run_vqa_prompt_mlm.py │ └── utils │ │ ├── __init__.py │ │ ├── caption_evaluate.py │ │ ├── cbs.py │ │ ├── cider │ │ └── pyciderevalcap │ │ │ ├── __init__.py │ │ │ ├── cider │ │ │ ├── __init__.py │ │ │ ├── cider.py │ │ │ └── cider_scorer.py │ │ │ └── ciderD │ │ │ ├── __init__.py │ │ │ ├── ciderD.py │ │ │ └── ciderD_scorer.py │ │ ├── logger.py │ │ ├── metric_logger.py │ │ ├── misc.py │ │ ├── task_utils.py │ │ ├── tsv_file.py │ │ └── tsv_file_ops.py ├── requirements.txt └── setup.py └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .idea -------------------------------------------------------------------------------- /DeclarationGeneration/README.md: -------------------------------------------------------------------------------- 1 | # README 2 | 3 | ___ 4 | 5 | This directory contains the files for declaration generation. [T5-small](https://arxiv.org/pdf/1910.10683.pdf) is 6 | exploited as the encoder-decoder model for training and evaluation. 7 | 8 | ## INSTALL 9 | 10 | + Follow the [documentation](https://huggingface.co/docs/transformers/model_doc/t5) for installation. 11 | 12 | ## Data 13 | 14 | Assume the root data dir is `[ROOT_DATA_DIR]`, then the declaration dataset for training and 15 | validation is placed in `[ROOT_DATA_DIR]/declaration/*`: 16 | ``` 17 | |- [ROOT_DATA_DIR] 18 | |- declaration/ 19 | |- question_to_declarative_train.json 20 | |- question_to_declarative_val.json 21 | ``` 22 | 23 | ## Model Training 24 | 25 | Run the script for training: 26 | ```bash 27 | bash run_train.sh 28 | ``` 29 | 30 | or the fine-tuned T5-small model (`model/declaration/checkpoint-480000`) can be 31 | downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA). The fine-tuned or downloaded checkpoint is placed in 32 | `data/model/declaration/checkpoint-480000`. 33 | 34 | ## Declaration Generation 35 | 36 | Once T5 is trained on the declaration dataset, the model can be used to generate 37 | declarative sentences for GQA and VQA datasets. Just follow the steps as bellow: 38 | 1. Transform the questions (from GQA and VQA v2.0 datasets) into the translation 39 | format (one sample per line), where `en_q` denotes the source question string 40 | and `en_a` denotes the target declarative sentence we want to generate. The file 41 | is named `source_file.txt`: 42 | ```text 43 | {"translation": {"en_q": "Is the sky dark?", "en_a": ""}} 44 | {"translation": {"en_q": "What is on the white wall?", "en_a": ""}} 45 | {"translation": {"en_q": "Is that pipe red?", "en_a": ""}} 46 | ... 47 | ``` 48 | 2. 
Assume the path of `source_file.txt` is `[SOURCE_FILE_DIR]/source_file.txt`, then 49 | run the script: 50 | ```bash 51 | bash run_predict.sh 52 | ``` 53 | Finally, there will be a `.txt` file in `output` dir, _i.e._, `generated_predictions.txt`. 54 | This file contains one sentence per line, representing the declarative 55 | sentence of the corresponding question in `source_file.txt`. 56 | The format of `generated_predictions.txt` is shown as follows: 57 | ```text 58 | [MASK], the sky [BE] dark. 59 | the [MASK] is on the wall. 60 | [MASK], that pipe [BE] red. 61 | ``` 62 | 3. We provide the pre-generated declaration files (from the questions of GQA and VQA v2.0 datasets) 63 | for easy-to-use. The files can be downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA), the files are arranged as 64 | follows: 65 | ``` 66 | |- [ROOT_DATA_DIR] 67 | |- declaration/ 68 | |- gqa/ 69 | |- gqa_all_submission_declaration.json 70 | |- gqa_all_train_declaration.json 71 | |- gqa_all_val_declaration.json 72 | |- gqa_bal_train_declaration.json 73 | |- gqa_bal_val_declaration.json 74 | |- vqa/ 75 | |- test2015_declarative.json 76 | |- test-dev2015_declarative.json 77 | |- train2014_declarative.json 78 | |- val2014_declarative.json 79 | |- vqa_vg_declarative.json 80 | ``` -------------------------------------------------------------------------------- /DeclarationGeneration/run_predict.sh: -------------------------------------------------------------------------------- 1 | python ./finetuning_T5.py \ 2 | --model_name_or_path "[ROOT_DATA_DIR]/model/declaration/checkpoint-480000" \ 3 | --do_predict \ 4 | --source_lang en_q \ 5 | --target_lang en_a \ 6 | --source_prefix "translate question to declarative sentence: " \ 7 | --dataset_config_name en-en \ 8 | --train_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_train.json" \ 9 | --validation_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_val.json" \ 10 | --test_file "[SOURCE_FILE_DIR]/source_file.txt" \ 11 | --output_dir "./output" \ 12 | --per_device_train_batch_size=4 \ 13 | --per_device_eval_batch_size=64 \ 14 | --overwrite_output_dir \ 15 | --predict_with_generate \ 16 | --max_source_length 50 \ 17 | --max_target_length 50 \ 18 | --generation_max_length 50 \ 19 | --eval_accumulation_steps 80000 -------------------------------------------------------------------------------- /DeclarationGeneration/run_train.sh: -------------------------------------------------------------------------------- 1 | python ./finetuning_T5.py \ 2 | --model_name_or_path "t5-small" \ 3 | --do_train \ 4 | --source_lang en_q \ 5 | --target_lang en_a \ 6 | --source_prefix "translate question to declarative sentence: " \ 7 | --dataset_config_name en-en \ 8 | --train_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_train.json" \ 9 | --validation_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_val.json" \ 10 | --output_dir "./output" \ 11 | --per_device_train_batch_size=4 \ 12 | --per_device_eval_batch_size=4 \ 13 | --overwrite_output_dir \ 14 | --predict_with_generate \ 15 | --max_source_length 50 \ 16 | --max_target_length 50 \ 17 | --generation_max_length 50 \ 18 | --eval_accumulation_steps 80000 -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Declaration-based Prompt Tuning for Visual Question Answering 2 | The implementation code of IJCAI 2022 paper: 3 | _Declaration-based Prompt 
Tuning for Visual Question Answering_. 4 | 5 | ## Requirements 6 | + Python 3 7 | + Pytorch>=1.7.1 8 | + pytorch-transformers==1.2.0 9 | 10 | ## Usage 11 | 12 | ### Declaration Generation 13 | 14 | Please follow [DeclarationGeneration](DeclarationGeneration/README.md) to set up the 15 | experiments for declaration generation. 16 | 17 | ### Visual Question Answering 18 | 19 | Please follow [VinVL](VinVL/README.md) to set up the experiments for visual question 20 | answering. 21 | 22 | ## Citation 23 | 24 | **Please kindly cite our paper if this paper and the code are helpful.** 25 | 26 | ``` 27 | @inproceedings{liu2022dpt, 28 | author={Liu, Yuhang and Wei, Wei and Peng, Daowan and Zhu, Feida}, 29 | title={Declaration-based Prompt Tuning for Visual Question Answering}, 30 | booktitle={Proceedings of the Thirty-first International Joint Conference on Artificial Intelligence, {IJCAI-22}}, 31 | year={2022} 32 | } 33 | ``` 34 | -------------------------------------------------------------------------------- /VinVL/Oscar/.gitignore: -------------------------------------------------------------------------------- 1 | # Initially taken from Github's Python gitignore file 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | 53 | # Translations 54 | *.mo 55 | *.pot 56 | 57 | # Django stuff: 58 | *.log 59 | local_settings.py 60 | db.sqlite3 61 | 62 | # Flask stuff: 63 | instance/ 64 | .webassets-cache 65 | 66 | # Scrapy stuff: 67 | .scrapy 68 | 69 | # Sphinx documentation 70 | docs/_build/ 71 | 72 | # PyBuilder 73 | target/ 74 | 75 | # Jupyter Notebook 76 | .ipynb_checkpoints 77 | 78 | # IPython 79 | profile_default/ 80 | ipython_config.py 81 | 82 | # pyenv 83 | .python-version 84 | 85 | # celery beat schedule file 86 | celerybeat-schedule 87 | 88 | # SageMath parsed files 89 | *.sage.py 90 | 91 | # Environments 92 | .env 93 | .venv 94 | env/ 95 | venv/ 96 | ENV/ 97 | env.bak/ 98 | venv.bak/ 99 | 100 | # Spyder project settings 101 | .spyderproject 102 | .spyproject 103 | 104 | # Rope project settings 105 | .ropeproject 106 | 107 | # mkdocs documentation 108 | /site 109 | 110 | # mypy 111 | .mypy_cache/ 112 | .dmypy.json 113 | dmypy.json 114 | 115 | # Pyre type checker 116 | .pyre/ 117 | 118 | # vscode 119 | .vscode 120 | 121 | # TF code 122 | tensorflow_code 123 | 124 | # Models 125 | models 126 | proc_data 127 | 128 | # examples 129 | runs 130 | examples/runs 131 | 132 | # pyCharm 133 | .idea/ 134 | 135 | # local folders 136 | data 137 | models 138 | output 139 | -------------------------------------------------------------------------------- /VinVL/Oscar/.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "transformers"] 
2 | path = transformers 3 | url = git@github.com:huggingface/transformers.git 4 | [submodule "coco_caption"] 5 | path = coco_caption 6 | url = git@github.com:LuoweiZhou/coco-caption.git 7 | -------------------------------------------------------------------------------- /VinVL/Oscar/CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Microsoft Open Source Code of Conduct 2 | 3 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 4 | 5 | Resources: 6 | 7 | - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) 8 | - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 9 | - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns 10 | -------------------------------------------------------------------------------- /VinVL/Oscar/DOWNLOAD.md: -------------------------------------------------------------------------------- 1 | # Download 2 | 3 | ## Datasets 4 | We provide the extracted image region features, object tags, and the original text annotations for each downstream tasks. 5 | ```bash 6 | wget https://biglmdiag.blob.core.windows.net/oscar/datasets/$TASK_NAME.zip 7 | unzip $TASK_NAME.zip -d $DATA_DIR 8 | ``` 9 | `TASK_NAME` could be `coco_caption`, `coco_ir`, `vqa`, `GQA`, `nlvr2`. 10 | 11 | ## Pre-trained Models 12 | We provide pre-trained *Oscar* models of Bert-base and Bert-large structures, with the name starting with `base` and `large`, respectively. 13 | ```bash 14 | wget https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/$MODEL_NAME.zip 15 | unzip $MODEL_NAME.zip -d $MODEL_DIR 16 | ``` 17 | `MODEL_NAME` could be `base-vg-labels`, `large-vg-labels`, `base-oid-labels`, `base-no-labels`. 18 | 19 | The models are trained with both image region features and object tags. The image region features are extracted by the Faster R-CNN with 20 | ResNet-101, using object and attribute annotations from [Visual Genome](http://visualgenome.org/). 21 | The object tags are from: 22 | 1) the same VisualGenome model, named as `-vg-labels`. Or, 23 | 2) the model trained on object annotations from [Open Images V5](https://storage.googleapis.com/openimages/web/index.html). named as `-oid-labels`. Or, 24 | 3) no object tags provied, serving as baseline, named as `-no-labels`. 25 | 26 | 27 | ### Note 28 | It is recommended to download large files with **AzCopy** for faster speed. 29 | AzCopy executable tools can be downloaded [here](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#download-azcopy). 30 | Decompress the tar file and put the executable in any path. 
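A quick way to confirm the executable works (a minimal sketch, assuming AzCopy was extracted to `path/to/` as in the examples below) is to print its version:
```bash
# verify the AzCopy binary runs before starting any large download
path/to/azcopy --version
```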
To download from 31 | any URL above, the command is: 32 | ```bash 33 | path/to/azcopy copy 34 | 35 | # for example, downloading coco_caption.zip 36 | path/to/azcopy copy https://biglmdiag.blob.core.windows.net/oscar/datasets/coco_caption.zip 37 | ``` 38 | 39 | -------------------------------------------------------------------------------- /VinVL/Oscar/INSTALL.md: -------------------------------------------------------------------------------- 1 | ## Installation 2 | ### Requirements 3 | - Python 3.7 4 | - Pytorch 1.2 5 | - torchvision 0.4.0 6 | - cuda 10.0 7 | 8 | ### Setup with Conda 9 | ```bash 10 | # create a new environment 11 | conda create --name oscar python=3.7 12 | conda activate oscar 13 | 14 | # install pytorch1.2 15 | conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch 16 | 17 | export INSTALL_DIR=$PWD 18 | 19 | # install apex 20 | cd $INSTALL_DIR 21 | git clone https://github.com/NVIDIA/apex.git 22 | cd apex 23 | python setup.py install --cuda_ext --cpp_ext 24 | 25 | # install oscar 26 | cd $INSTALL_DIR 27 | git clone --recursive git@github.com:microsoft/Oscar.git 28 | cd Oscar/coco_caption 29 | ./get_stanford_models.sh 30 | cd .. 31 | python setup.py build develop 32 | 33 | # install requirements 34 | pip install -r requirements.txt 35 | 36 | unset INSTALL_DIR 37 | ``` 38 | 39 | -------------------------------------------------------------------------------- /VinVL/Oscar/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) Microsoft Corporation. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE 22 | -------------------------------------------------------------------------------- /VinVL/Oscar/MODEL_ZOO.md: -------------------------------------------------------------------------------- 1 | ## Table of Contents 2 | - VQA 3 | - GQA 4 | - NLVR2 5 | - Image/Text Retrieval 6 | - Image Captioning on COCO 7 | 8 | 9 | ## Performance 10 | Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | 11 | --------|-----|-----|-----|-----|-----|-----|------|------|--------|--------|----------|---------| 12 | Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | 13 | SoTA_S |39.2 | 68.0|56.6 | 84.5|38.9 |29.2 |129.8 | 22.4 | 61.5 | 9.2 | 70.90 | 53.50 | 14 | SoTA_B |48.4 | 76.7|63.3 | 87.0|39.5 |29.3 |129.3 | 23.2 | 73.1 | 11.2 | 72.54 | 78.87 | 15 | SoTA_L |51.7 | 78.4|66.6 | 89.4| - | - | - | - | - | - | 73.40 | 79.50 | 16 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 17 | Oscar_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 78.8 | 11.7 | 73.44 | 78.44 | 18 | Oscar_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | 80.9 | 11.3 | 73.82 | 80.37 | 19 | gain | 5.8 | 4.4| 6.9 | 2.8| 2.2 | 1.3 | 10.7 | 1.3 | 7.8 | 0.5 | 0.42 | 0.87 | 20 | 21 | t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. 22 | 23 | For reference, we also release the training logs and output. 24 | 25 | 26 | ## VQA 27 | Script to finetune for Oscar base model. 28 | Base model is trained on train split and evaluated on the val split. Good for later comparison. 29 | 30 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/base_9m_ep107_1192k_eu1/application_1575931286052_40649/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/base_9m_ep107_1192k_eu1/application_1575931286052_40649/results/stdout.txt).
31 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/results.txt).
32 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/vqa_base_best.zip). 33 | ```bash 34 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 35 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir datasets/vqa/2k 36 | --model_type bert --model_name_or_path pretrained_models/base-vg-labels/ep_107_1192087 37 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 38 | 256 --per_gpu_train_batch_size 32 --learning_rate 5e-05 --num_train_epochs 25 39 | --output_dir results --label_file datasets/vqa/cache/trainval_ans2label.pkl 40 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 41 | 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce --img_feat_format pt 42 | --classifier linear --cls_hidden_scale 3 --txt_data_dir datasets/vqa/2k 43 | ``` 44 | 45 | Script to finetune for Oscar large model. 46 | Large model is trained on train+val split and evaluated on the val split, for reproduce the paper's best result. 47 | 48 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/ab128_img_large_rr1_ep20_590k_tv_done_good/exp_ab128_img_large_rr1_ep20_590k_tv_0.00003_128_50_dp_0.3_wd_0.05_bce_3linear_s88_abcd/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/ab128_img_large_rr1_ep20_590k_tv_done_good/exp_ab128_img_large_rr1_ep20_590k_tv_0.00003_128_50_dp_0.3_wd_0.05_bce_3linear_s88_abcd/stdout.txt).
49 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/results.txt).
50 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/vqa_large_best.zip). 51 | ```bash 52 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 53 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir datasets/vqa/2k 54 | --model_type bert --model_name_or_path pretrained_models/large-vg-labels/ep_20_590000 55 | --task_name vqa_text --do_train_val --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 56 | 256 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 25 57 | --label_file datasets/vqa/cache/trainval_ans2label.pkl --save_epoch 30 58 | --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 0.3 --weight_decay 59 | 0.05 --warmup_steps 0 --loss_type bce --save_after_epoch 15 --output_dir results --img_feat_format pt --classifier linear --cls_hidden_scale 3 --txt_data_dir datasets/vqa/2k 60 | ``` 61 | 62 | 63 | ## GQA 64 | Script to finetune for Oscar base model. 65 | 66 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/ab175_base_ep107_1192k_0.4true_taeb_done_25eps_good/exp_ab175_base_ep107_1192k_0.4true_taeb_b_48_0.00005_165_45_dp_0.3_abce/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/ab175_base_ep107_1192k_0.4true_taeb_done_25eps_good/exp_ab175_base_ep107_1192k_0.4true_taeb_b_48_0.00005_165_45_dp_0.3_abce/stdout.txt).
67 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/ab165_img45_1568928610179_62515_test_done_good/results.txt).
68 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/gqa_base_best.zip). 69 | ```bash 70 | python oscar/run_gqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 71 | 45 --data_dir datasets/GQA/0.4true --model_type bert --model_name_or_path pretrained_models/base-vg-labels/ep_107_1192087 72 | --task_name gqa --do_lower_case --max_seq_length 165 --per_gpu_eval_batch_size 73 | 256 --per_gpu_train_batch_size 48 --learning_rate 5e-05 --num_train_epochs 5 --output_dir 74 | results --label_file datasets/GQA/questions1.2/trainval_testdev_all_ans2label.pkl 75 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type all --eval_data_type 76 | bal --label2ans_file datasets/GQA/questions1.2/trainval_testdev_all_label2ans.pkl 77 | --loss_type xe --save_epoch 2 --seed 88 --evaluate_during_training --logging_steps 78 | 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 79 | ``` 80 | 81 | ## NLVR2 82 | Script to finetune for Oscar base model. 83 | 84 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/base/exp_rvln_base_ep107_1192k_wm1w_b72_0.00003_55_40_dp0.3_3mlp_wm10000_abcf_best/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/base/exp_rvln_base_ep107_1192k_wm1w_b72_0.00003_55_40_dp0.3_3mlp_wm10000_abcf_best/stdout.txt).
85 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/base/exp_nlvr_base_11123_testall_b24_0.00003_55_43_dp_0.3_mlp_abcj_best/stdout.txt). 86 | ```bash 87 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 88 | 40 --data_dir datasets/nlvr2/ft_corpus --model_type bert --model_name_or_path pretrained_models/base-vg-labels/ep_107_1192087 89 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 90 | 64 --per_gpu_train_batch_size 72 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 91 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 92 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 93 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 94 | 10000 --classifier mlp --cls_hidden_scale 3 --num_choice 2 --use_pair 95 | ``` 96 | 97 | Script to finetune for Oscar large model. 98 | 99 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/large/large_1583307153868_14140/exp_rvln_large_ep55_1618k_b24_0.00002_seq55_img40_dp0.3_2mlp_wm5000_abcj/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/large/large_1583307153868_14140/exp_rvln_large_ep55_1618k_b24_0.00002_seq55_img40_dp0.3_2mlp_wm5000_abcj/stdout.txt).
100 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/large/large_1583307153868_14140/exp_nlvr_large_1583307153868_14140_testall_b24_0.00003_55_43_dp_0.3_mlp_abck/stdout.txt). 101 | ```bash 102 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 103 | 40 --data_dir datasets/nlvr2/ft_corpus --model_type bert --model_name_or_path pretrained_models/large-vg-labels/ep_55_1617000 104 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 105 | 64 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 106 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 107 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 108 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 109 | 5000 --classifier mlp --cls_hidden_scale 2 --num_choice 2 --use_pair 110 | ``` 111 | 112 | 126 | 127 | ## Image Text Retrieval 128 | Script to finetune for Oscar base model (4 V100 with 16G mem): 129 | 130 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/base/eval_logs.json), [log.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/base/log.txt). 131 | Model checkpoint: [checkpoint.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/base/checkpoint.zip). 132 | 133 | ```bash 134 | python oscar/run_retrieval.py \ 135 | --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997 \ 136 | --do_train \ 137 | --do_lower_case \ 138 | --evaluate_during_training \ 139 | --num_captions_per_img_val 20 \ 140 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 141 | --per_gpu_train_batch_size 32 \ 142 | --learning_rate 0.00002 \ 143 | --num_train_epochs 30 \ 144 | --weight_decay 0.05 \ 145 | --save_steps 5000 \ 146 | --add_od_labels \ 147 | --od_label_type vg \ 148 | --max_seq_length 70 \ 149 | --output_dir output/ 150 | ``` 151 | 152 | Script to finetune for Oscar large model (8 V100 with 32G mem): 153 | 154 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/large/eval_logs.json), [log.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/large/log.txt). 155 | Model checkpoint: [checkpoint.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/large/checkpoint.zip). 156 | 157 | ```bash 158 | python oscar/run_retrieval.py \ 159 | --model_name_or_path pretrained_models/large-vg-labels/ep_7_816000 \ 160 | --do_train \ 161 | --do_lower_case \ 162 | --evaluate_during_training \ 163 | --num_captions_per_img_val 20 \ 164 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 165 | --per_gpu_train_batch_size 16 \ 166 | --learning_rate 0.00001 \ 167 | --num_train_epochs 30 \ 168 | --save_steps 5000 \ 169 | --add_od_labels \ 170 | --od_label_type vg \ 171 | --max_seq_length 70 \ 172 | --output_dir output/ 173 | ``` 174 | 175 | Script to inference on COCO 1K test set: 176 | ```bash 177 | python oscar/run_retrieval.py \ 178 | --do_test \ 179 | --do_eval \ 180 | --test_split test \ 181 | --num_captions_per_img_val 5 \ 182 | --eval_img_keys_file test_img_keys_1k.tsv \ 183 | --cross_image_eval \ 184 | --per_gpu_eval_batch_size 64 \ 185 | --eval_model_dir your_model_for_evaluation # could be base/large models. 
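# Note: --eval_img_keys_file selects the image keys to score (here the 1K test split);
# the 5K command below is identical except that it uses test_img_keys.tsv instead.
# --cross_image_eval enables cross-image evaluation; see oscar/run_retrieval.py for details.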
186 | ``` 187 | 188 | Script to inference on COCO 5K test set: 189 | ```bash 190 | python oscar/run_retrieval.py \ 191 | --do_test \ 192 | --do_eval \ 193 | --test_split test \ 194 | --num_captions_per_img_val 5 \ 195 | --eval_img_keys_file test_img_keys.tsv \ 196 | --cross_image_eval \ 197 | --per_gpu_eval_batch_size 64 \ 198 | --eval_model_dir your_model_for_evaluation # could be base/large models. 199 | ``` 200 | 201 | 202 | ## Image Captioning on COCO 203 | Script to finetune for Oscar base model (4 V100 with 16G mem): 204 | 205 | Training logs: [log.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/coco_caption/base/log.txt). 206 | Model checkpoint: [checkpoint.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/coco_caption/base/checkpoint.zip). 207 | 208 | 1) First train with cross-entropy loss: 209 | ```bash 210 | python oscar/run_captioning.py \ 211 | --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997 \ 212 | --do_train \ 213 | --do_lower_case \ 214 | --evaluate_during_training \ 215 | --add_od_labels \ 216 | --learning_rate 0.00003 \ 217 | --per_gpu_train_batch_size 64 \ 218 | --num_train_epochs 30 \ 219 | --save_steps 5000 \ 220 | --output_dir output/ 221 | ``` 222 | 2) Finetune with CIDEr optimization: 223 | ```bash 224 | python oscar/run_captioning.py \ 225 | --model_name_or_path your_checkpoint_from_cross_entropy \ 226 | --do_train \ 227 | --do_lower_case \ 228 | --evaluate_during_training \ 229 | --add_od_labels \ 230 | --learning_rate 0.000005 \ 231 | --per_gpu_train_batch_size 16 \ 232 | --num_train_epochs 5 \ 233 | --scst \ 234 | --save_steps 2000 \ 235 | --output_dir output/ 236 | ``` 237 | 238 | Script to finetune for Oscar large model (8 V100 with 32G mem): 239 | 1) First train with cross-entropy loss: 240 | ```bash 241 | python oscar/run_captioning.py \ 242 | --model_name_or_path pretrained_models/large-vg-labels/ep_7_816000 \ 243 | --do_train \ 244 | --do_lower_case \ 245 | --evaluate_during_training \ 246 | --add_od_labels \ 247 | --learning_rate 0.00001 \ 248 | --per_gpu_train_batch_size 32 \ 249 | --num_train_epochs 30 \ 250 | --save_steps 5000 \ 251 | --output_dir output/ 252 | ``` 253 | 2) Finetune with CIDEr optimization: 254 | ```bash 255 | python oscar/run_captioning.py \ 256 | --model_name_or_path your_checkpoint_from_cross_entropy \ 257 | --do_train \ 258 | --do_lower_case \ 259 | --evaluate_during_training \ 260 | --add_od_labels \ 261 | --learning_rate 0.000005 \ 262 | --per_gpu_train_batch_size 8 \ 263 | --num_train_epochs 5 \ 264 | --scst \ 265 | --save_steps 2000 \ 266 | --output_dir output/ 267 | ``` 268 | 269 | Script to inference on COCO test set: 270 | ```bash 271 | python oscar/run_captioning.py \ 272 | --do_test \ 273 | --do_eval \ 274 | --test_yaml test.yaml \ 275 | --per_gpu_eval_batch_size 64 \ 276 | --num_beams 5 \ 277 | --max_gen_length 20 \ 278 | --eval_model_dir your_model_for_evaluation # could be bert base/large. 279 | ``` 280 | -------------------------------------------------------------------------------- /VinVL/Oscar/README.md: -------------------------------------------------------------------------------- 1 | # Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks 2 | # VinVL: Revisiting Visual Representations in Vision-Language Models 3 | ## Updates 4 | 05/28/2020: Released finetuned models on downstream tasks, please check [MODEL_ZOO.md](MODEL_ZOO.md).
5 | 05/15/2020: Released pretrained models, datasets, and code for finetuning on downstream tasks.<br/>
6 | 01/13/2021: Our new work [VinVL](https://arxiv.org/abs/2101.00529) proposed OSCAR+, an improved version of OSCAR, and provided a better object-attribute detection model for extracting features for V+L tasks. VinVL achieved SOTA performance on all seven V+L tasks covered here. Please stay tuned for the model and code release.<br/>
7 | 03/08/2021: The Oscar+ pretraining code has been released; please check the last section of [VinVL_MODEL_ZOO.md](VinVL_MODEL_ZOO.md). All image features and model checkpoints used in VinVL are also released; please check [VinVL](https://github.com/pzzhang/VinVL) for details.<br/>
8 | 04/13/2021: Our [Scene Graph Benchmark Repo](https://github.com/microsoft/scene_graph_benchmark) has been released. You are welcome to use the code there to extract image features with the VinVL pretrained models.<br/>
9 | 10 | 11 | ## Introduction 12 | This repository contains source code necessary to reproduce the results presented in the paper [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/abs/2004.06165). 13 | We propose a new cross-modal pre-training method **Oscar** (Object-Semantics Aligned Pre-training). It leverages **object tags** detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks. For more on this project, see the [Microsoft Research Blog post](https://www.microsoft.com/en-us/research/blog/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language/). 14 | 15 | 16 | 17 | 18 | ## Performance 19 | Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA | 20 | --------|-----|-----|-----|-----|-----|-----|------|------|--------|--------|----------|---------|---------| 21 | Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std| 22 | SoTA_S |39.2 | 68.0|56.6 | 84.5|38.9 |29.2 |129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 | 23 | SoTA_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 86.58| 12.38 | 73.67 | 79.30 | - | 24 | SoTA_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | - | - | 74.93 | 81.47 | - | 25 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 26 | Oscar_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 78.8 | 11.7 | 73.44 | 78.36 | 61.62 | 27 | Oscar_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | 80.9 | 11.3 | 73.82 | 80.05 | - | 28 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 29 | VinVL_B |58.1 | 83.2|74.6 | 92.6|40.9 |30.9 |140.6 | 25.1 | 92.46| 13.07 | 76.12 | 83.08 | 64.65 | 30 | VinVL_L |58.8 | 83.5|75.4 | 92.9|41.0 |31.1 |140.9 | 25.2 | - | - | 76.62 | 83.98 | - | 31 | gain | 1.3 | 0.7| 1.9 | 0.6| -0.7| 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 | 32 | 33 | t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. 34 | 35 | 36 | ## Download 37 | We released pre-trained models, datasets, VinVL image features, and Oscar+ pretraining corpus for downstream tasks. 38 | Please check [VinVL_DOWNLOAD.md](VinVL_DOWNLOAD.md) for details. 39 | 40 | To download checkpoints for the Vanilla OSCAR, please check [DOWNLOAD.md](DOWNLOAD.md) for details. 41 | 42 | ## Installation 43 | Check [INSTALL.md](INSTALL.md) for installation instructions. 44 | 45 | ## Model Zoo 46 | Check [MODEL_ZOO.md](MODEL_ZOO.md) for scripts to run oscar downstream finetuning. 47 | 48 | Check [VinVL_MODEL_ZOO.md](VinVL_MODEL_ZOO.md) for scripts to run oscar+ pretraining and downstream finetuning. 
49 | 50 | ## Citations 51 | Please consider citing this paper if you use the code: 52 | ``` 53 | @article{li2020oscar, 54 | title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks}, 55 | author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng}, 56 | journal={ECCV 2020}, 57 | year={2020} 58 | } 59 | 60 | @article{zhang2021vinvl, 61 | title={VinVL: Making Visual Representations Matter in Vision-Language Models}, 62 | author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng}, 63 | journal={CVPR 2021}, 64 | year={2021} 65 | } 66 | ``` 67 | 68 | ## License 69 | Oscar is released under the MIT license. See [LICENSE](LICENSE) for details. 70 | 71 | -------------------------------------------------------------------------------- /VinVL/Oscar/SECURITY.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Security 4 | 5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). 6 | 7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below. 8 | 9 | ## Reporting Security Issues 10 | 11 | **Please do not report security vulnerabilities through public GitHub issues.** 12 | 13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report). 14 | 15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc). 16 | 17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). 18 | 19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: 20 | 21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) 22 | * Full paths of source file(s) related to the manifestation of the issue 23 | * The location of the affected source code (tag/branch/commit or direct URL) 24 | * Any special configuration required to reproduce the issue 25 | * Step-by-step instructions to reproduce the issue 26 | * Proof-of-concept or exploit code (if possible) 27 | * Impact of the issue, including how an attacker might exploit the issue 28 | 29 | This information will help us triage your report more quickly. 30 | 31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. 
Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs. 32 | 33 | ## Preferred Languages 34 | 35 | We prefer all communications to be in English. 36 | 37 | ## Policy 38 | 39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd). 40 | 41 | -------------------------------------------------------------------------------- /VinVL/Oscar/VinVL_DOWNLOAD.md: -------------------------------------------------------------------------------- 1 | # Download 2 | 3 | ## Datasets 4 | We provide the extracted image region features, object tags, and the original text annotations for each downstream tasks. 5 | ```bash 6 | path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/TASK_NAME' --recursive 7 | ``` 8 | `TASK_NAME` could be `coco_caption`, `nocaps`, `coco_ir`, `vqa`, `gqa`, `nlvr2`. 9 | 10 | ## Pre-trained Models 11 | We provide pre-trained *Oscar+* models of Bert-base and Bert-large structures, with the name starting with `base` and `large`, respectively. 12 | ```bash 13 | path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/TASK_NAME' --recursive 14 | ``` 15 | `TASK_NAME` could be `image_captioning` (including `nocaps`), `coco_ir`, `vqa`, `gqa`, `nlvr2`, `od_models`. 16 | 17 | The models are trained with both image region features and object tags. The image region features are extracted by the Faster R-CNN with 18 | ResNet-101, using object and attribute annotations from [Visual Genome](http://visualgenome.org/). 19 | The object tags are from: 20 | 1) the same VisualGenome model, named as `-vg-labels`. Or, 21 | 2) the model trained on object annotations from [Open Images V5](https://storage.googleapis.com/openimages/web/index.html). named as `-oid-labels`. Or, 22 | 3) no object tags provied, serving as baseline, named as `-no-labels`. 23 | 24 | ## Pre-exacted Image Features 25 | For ease-of-use, we make pretrained features available for all pretraining datasets and downstream tasks. 26 | Features are stored in tsv (tab-separated-values) format that can be used in [pretraining](oscar/datasets/oscar_tsv.py) and dowstream tasks like [COCO Image-Text Retrieval](oscar/run_retrieval.py). 27 | 28 | Notice that all the links below are links to a folder. We recommend using the following AzCopy command to download. 
29 | ``` 30 | path/to/azcopy copy --recursive 31 | ``` 32 | 33 | [COCO 2014 Train/Val Image Features (~50G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 34 | 35 | [COCO 2014 Test Image Features (~16G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/coco2014test/) 36 | 37 | [COCO 2015 Test Image Features (~32G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/coco2015test/) 38 | 39 | [GQA All Image Features (~62G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/gqa_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 40 | 41 | [NVLR2 Train/Del/Test Image Features (~28G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/nlvr2_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/) 42 | 43 | [Flickr30k All Image Features (~14G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/flickr30k_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 44 | 45 | [Google Conceptual Captions Image Features (Huge, ~960G, splitted into 12 chunks)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/googlecc_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/) 46 | 47 | [SBU Image Features (Huge, ~280G, splitted into 4 chunks)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/sbu_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 48 | 49 | [Open Images Detection Image Features (Huge, ~530G, splitted into 8 chunks)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/oi_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 50 | 51 | 52 | ## Oscar+ pretraining corpus 53 | 54 | 55 | [Small corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa.tsv) 56 | 57 | [Medium corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa_oi.tsv) 58 | 59 | [Large corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi.tsv) 60 | 61 | We have tried our best to make sure that there is no data contamination between pretraining corpus and test sets for downstream tasks. 62 | More specifically, we use two methods to achieve this. 63 | (1) We use the COCO Image ID of Visual Genome and Flickr30k images. 64 | (2) For COCO, Visual Genome and Flickr30k, we calucate the pair-wise l2 norm between two images after resizing them into the same size. 65 | 66 | 67 | ### Note 68 | It is recommended to download large files with **AzCopy** for faster speed. 69 | AzCopy executable tools can be downloaded [here](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#download-azcopy). 70 | Decompress the tar file and put the executable in any path. 
To download from 71 | any URL above, the command is: 72 | ```bash 73 | path/to/azcopy copy 74 | 75 | # for example, downloading coco_caption.zip 76 | path/to/azcopy copy https://biglmdiag.blob.core.windows.net/oscar/datasets/coco_caption.zip 77 | ``` 78 | 79 | -------------------------------------------------------------------------------- /VinVL/Oscar/VinVL_MODEL_ZOO.md: -------------------------------------------------------------------------------- 1 | ## Table of Contents 2 | - VQA 3 | - GQA 4 | - NLVR2 5 | - Image/Text Retrieval 6 | - Image Captioning on COCO 7 | - Oscarplus pretraining 8 | 9 | 10 | ## Performance 11 | Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA | 12 | --------|-----|-----|-----|-----|-----|-----|------|------|--------|--------|----------|---------|---------| 13 | Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std| 14 | SoTA_S |39.2 | 68.0|56.6 | 84.5|38.9 |29.2 |129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 | 15 | SoTA_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 86.58| 12.38 | 73.67 | 79.30 | 61.62 | 16 | SoTA_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | - | - | 74.93 | 81.47 | - | 17 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 18 | VinVL_B |58.1 | 83.2|74.6 | 92.6|40.9 |30.9 |140.4 | 25.1 | 92.46 (with [VIVO](https://arxiv.org/abs/2009.13682))| 13.07 (with [VIVO](https://arxiv.org/abs/2009.13682)) | 76.12 | 83.08 | 64.65 | 19 | VinVL_L |58.8 | 83.5|75.4 | 92.9|41.0 |31.1 |140.9 | 25.2 | - | - | 76.62 | 83.98 | - | 20 | gain | 1.3 | 0.7| 1.9 | 0.6| -0.7| 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 | 21 | 22 | t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. 23 | 24 | For reference, we also release the training logs and output. 25 | 26 | 27 | ## VQA 28 | Script to finetune for Oscar base model. 29 | Base model is trained on train split and evaluated on the val split. Good for later comparison. 30 | 33 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/base/test/results.txt).
34 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/base/best.zip). 35 | ```bash 36 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 37 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir vinvl/datasets/vqa 38 | --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 39 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 40 | 256 --per_gpu_train_batch_size 32 --learning_rate 5e-05 --num_train_epochs 25 41 | --output_dir results --label_file datasets/vqa/cache/trainval_ans2label.pkl 42 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 43 | 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce --img_feat_format pt 44 | --classifier linear --cls_hidden_scale 3 --txt_data_dir vinvl/datasets/vqa 45 | ``` 46 | 47 | Script to finetune for Oscar large model. 48 | Large model is trained on train+val split and evaluated on the val split, for reproduce the paper's best result. 49 | 50 | 53 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/large/test/results.txt).
54 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/large/best.zip). 55 | ```bash 56 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 57 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir vinvl/datasets/vqa 58 | --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/large/checkpoint-2000000 59 | --task_name vqa_text --do_train_val --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 60 | 256 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 25 61 | --label_file datasets/vqa/cache/trainval_ans2label.pkl --save_epoch 30 62 | --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 0.3 --weight_decay 63 | 0.05 --warmup_steps 0 --loss_type bce --save_after_epoch 15 --output_dir results --img_feat_format pt --classifier linear --cls_hidden_scale 3 --txt_data_dir vinvl/datasets/vqa 64 | ``` 65 | 66 | 67 | ## GQA 68 | Script to finetune for Oscar base model. 69 | 70 | 73 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/gqa/base/results.txt).
74 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/gqa/base/best.zip). 75 | ```bash 76 | python oscar/run_gqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 77 | 45 --data_dir vinvl/datasets/gqa --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 78 | --task_name gqa --do_lower_case --max_seq_length 165 --per_gpu_eval_batch_size 79 | 256 --per_gpu_train_batch_size 48 --learning_rate 5e-05 --num_train_epochs 5 --output_dir 80 | results --label_file vinvl/datasets/gqa/trainval_testdev_all_ans2label.pkl 81 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type all --eval_data_type 82 | bal --label2ans_file vinvl/datasets/gqa/trainval_testdev_all_label2ans.pkl 83 | --loss_type xe --save_epoch 2 --seed 88 --evaluate_during_training --logging_steps 84 | 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 85 | ``` 86 | 87 | ## NLVR2 88 | Script to finetune for Oscar base model. 89 | 90 | 93 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/base/rvln_base_oscar_v2_71.5_86236_test_done_best/exp_rvln_base_oscar_v2_71.5_86236_test_b24_0.00003_55_41_dp_0.3_mlp_abch/stdout.txt).
94 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/base/best.zip). 95 | ```bash 96 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 97 | 40 --data_dir vinvl/datasets/nlvr2 --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 98 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 99 | 64 --per_gpu_train_batch_size 72 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 100 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 101 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 102 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 103 | 10000 --classifier mlp --cls_hidden_scale 3 --num_choice 2 --use_pair 104 | ``` 105 | 106 | Script to finetune for Oscar large model. 107 | 108 | 111 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/large/rvln_oscar_v2_large_99617_test_done_best/exp_rvln_oscar_v2_large_99617_test_b24_0.00003_55_50_dp_0.3_mlp_abce/stdout.txt).
112 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/large/best.zip). 113 | ```bash 114 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 115 | 40 --data_dir vinvl/datasets/nlvr2 --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/large/checkpoint-2000000 116 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 117 | 64 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 118 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 119 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 120 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 121 | 5000 --classifier mlp --cls_hidden_scale 2 --num_choice 2 --use_pair 122 | ``` 123 | 124 | 138 | 139 | ## Image Text Retrieval 140 | Script to finetune for Oscarplus base model (8 V100 with 16G mem): 141 | 142 | Training logs: [train_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/train_logs/), 143 | 144 | Training logs: [test_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/test_logs/), 145 | 146 | Command [command](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/philly.yaml). 147 | 148 | Model checkpoint: [ckeckpoint](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/checkpoint-0132780/). 149 | 150 | ```bash 151 | python oscar/run_retrieval.py \ 152 | --model_name_or_path vinvl/coco_ir/base/checkpoint-1340000 \ 153 | --do_train \ 154 | --do_lower_case \ 155 | --evaluate_during_training \ 156 | --num_captions_per_img_val 20 \ 157 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 158 | --per_gpu_train_batch_size 16 \ 159 | --learning_rate 0.00002 \ 160 | --num_train_epochs 30 \ 161 | --weight_decay 0.05 \ 162 | --save_steps 5000 \ 163 | --add_od_labels \ 164 | --od_label_type vg \ 165 | --max_seq_length 70 \ 166 | --max_img_seq_length 70 \ 167 | --output_dir output/ 168 | ``` 169 | 170 | Script to finetune for Oscarplus large model (8 V100 with 32G mem): 171 | 172 | Training logs: [train_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/train_logs/), 173 | 174 | Training logs: [test_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/test_logs/), 175 | 176 | Command [command](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/philly.yaml). 177 | 178 | Model checkpoint: [ckeckpoint](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/checkpoint-0132780/). 
179 | 180 | ```bash 181 | python oscar/run_retrieval.py \ 182 | --model_name_or_path vinvl/coco_ir/base/checkpoint-0660000 \ 183 | --do_train \ 184 | --do_lower_case \ 185 | --evaluate_during_training \ 186 | --num_captions_per_img_val 20 \ 187 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 188 | --per_gpu_train_batch_size 16 \ 189 | --learning_rate 7.5e-06 \ 190 | --num_train_epochs 30 \ 191 | --save_steps 5000 \ 192 | --add_od_labels \ 193 | --od_label_type vg \ 194 | --max_seq_length 70 \ 195 | --max_img_seq_length 70 \ 196 | --output_dir output \ 197 | --img_feat_file vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/features.tsv 198 | ``` 199 | 200 | Script to inference on COCO 1K test set: 201 | ```bash 202 | python oscar/run_retrieval.py \ 203 | --do_test \ 204 | --do_eval \ 205 | --test_split test \ 206 | --num_captions_per_img_val 5 \ 207 | --eval_img_keys_file test_img_keys_1k.tsv \ 208 | --cross_image_eval \ 209 | --per_gpu_eval_batch_size 64 \ 210 | --img_feat_file vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/features.tsv \ 211 | --eval_model_dir your_model_for_evaluation # could be base/large models. 212 | ``` 213 | 214 | Script to inference on COCO 5K test set: 215 | ```bash 216 | python oscar/run_retrieval.py \ 217 | --do_test \ 218 | --do_eval \ 219 | --test_split test \ 220 | --num_captions_per_img_val 5 \ 221 | --eval_img_keys_file test_img_keys.tsv \ 222 | --cross_image_eval \ 223 | --per_gpu_eval_batch_size 64 \ 224 | --img_feat_file vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/features.tsv \ 225 | --eval_model_dir your_model_for_evaluation # could be base/large models. 226 | ``` 227 | 228 | 229 | ## Image Captioning on COCO 230 | Script to finetune for base model: 231 | 232 | Pretrained model checkpoint: [pretrained_base.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/pretrained_base.zip). 233 | Finetuned model checkpoint (w/ cross entropy): [coco_captioning_base_xe.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_base_xe.zip). 234 | Finetuned model checkpoint (w/ CIDEr optimization): [coco_captioning_base_scst.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_base_scst.zip). 
235 | 236 | 1) First train with cross-entropy loss (8 V100 with 16G mem): 237 | ```bash 238 | python oscar/run_captioning.py \ 239 | --model_name_or_path pretrained_models/image_captioning/pretrained_base \ 240 | --do_train \ 241 | --do_lower_case \ 242 | --add_od_labels \ 243 | --learning_rate 3e-5 \ 244 | --per_gpu_train_batch_size 64 \ 245 | --num_train_epochs 60 \ 246 | --tie_weights \ 247 | --freeze_embedding \ 248 | --label_smoothing 0.1 \ 249 | --drop_worst_ratio 0.2 \ 250 | --drop_worst_after 20000 \ 251 | --output_dir output/ 252 | ``` 253 | 2) Finetune with CIDEr optimization (8 V100 with 32G mem): 254 | ```bash 255 | python oscar/run_captioning.py \ 256 | --model_name_or_path your_checkpoint_from_cross_entropy \ 257 | --do_train \ 258 | --do_lower_case \ 259 | --add_od_labels \ 260 | --learning_rate 3e-6 \ 261 | --per_gpu_train_batch_size 16 \ 262 | --num_train_epochs 75 \ 263 | --tie_weights \ 264 | --freeze_embedding \ 265 | --scst \ 266 | --output_dir output/ 267 | ``` 268 | 269 | Script to finetune for large model: 270 | 271 | Pretrained model checkpoint: [pretrained_large.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/pretrained_large.zip). 272 | Finetuned model checkpoint (w/ cross entropy): [coco_captioning_large_xe.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_large_xe.zip). 273 | Finetuned model checkpoint (w/ CIDEr optimization): [coco_captioning_large_scst.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_large_scst.zip). 274 | 275 | 1) First train with cross-entropy loss (8 V100 with 32G mem): 276 | ```bash 277 | python oscar/run_captioning.py \ 278 | --model_name_or_path pretrained_models/image_captioning/pretrained_large \ 279 | --do_train \ 280 | --do_lower_case \ 281 | --add_od_labels \ 282 | --learning_rate 1e-5 \ 283 | --per_gpu_train_batch_size 64 \ 284 | --num_train_epochs 60 \ 285 | --tie_weights \ 286 | --freeze_embedding \ 287 | --label_smoothing 0.1 \ 288 | --drop_worst_ratio 0.2 \ 289 | --drop_worst_after 20000 \ 290 | --output_dir output/ 291 | ``` 292 | 2) Finetune with CIDEr optimization (8 V100 with 32G mem): 293 | ```bash 294 | python oscar/run_captioning.py \ 295 | --model_name_or_path your_checkpoint_from_cross_entropy \ 296 | --do_train \ 297 | --do_lower_case \ 298 | --add_od_labels \ 299 | --learning_rate 8e-7 \ 300 | --per_gpu_train_batch_size 6 \ 301 | --num_train_epochs 25 \ 302 | --tie_weights \ 303 | --freeze_embedding \ 304 | --scst \ 305 | --output_dir output/ 306 | ``` 307 | 308 | Script to inference on COCO test set: 309 | ```bash 310 | python oscar/run_captioning.py \ 311 | --do_test \ 312 | --do_eval \ 313 | --test_yaml test.yaml \ 314 | --per_gpu_eval_batch_size 64 \ 315 | --num_beams 5 \ 316 | --max_gen_length 20 \ 317 | --eval_model_dir your_model_for_evaluation # could be base or large models 318 | ``` 319 | 320 | ## Image Captioning on NoCaps 321 | Note that [NoCaps] (https://nocaps.org/) does not allow to use extra 322 | image-caption pairs for training except COCO. So the model is directly initialized 323 | from bert-base, and trained on COCO data. 324 | 325 | Script to train base model: 326 | 327 | Finetuned model checkpoint (w/ cross entropy): [nocaps_base_xe.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/nocaps_base_xe.zip). 
328 | Finetuned model checkpoint (w/ CIDEr optimization): [nocaps_base_scst.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/nocaps_base_scst.zip). 329 | 330 | 1) First train with cross-entropy loss (4 V100 with 16G mem): 331 | ```bash 332 | python oscar/run_captioning.py \ 333 | --model_name_or_path bert-base-uncased \ 334 | --do_train \ 335 | --do_lower_case \ 336 | --add_od_labels \ 337 | --learning_rate 0.0001 \ 338 | --per_gpu_train_batch_size 64 \ 339 | --num_train_epochs 30 \ 340 | --tie_weights \ 341 | --freeze_embedding \ 342 | --output_dir output/ 343 | ``` 344 | 2) Train with CIDEr optimization (8 V100 with 32G mem): 345 | ```bash 346 | python oscar/run_captioning.py \ 347 | --model_name_or_path your_checkpoint_from_cross_entropy \ 348 | --do_train \ 349 | --do_lower_case \ 350 | --add_od_labels \ 351 | --scheduler constant \ 352 | --learning_rate 5e-6 \ 353 | --per_gpu_train_batch_size 14 \ 354 | --num_train_epochs 50 \ 355 | --tie_weights \ 356 | --freeze_embedding \ 357 | --scst \ 358 | --output_dir output/ 359 | ``` 360 | 361 | Script to inference on NoCaps val set with Constrained Beam Search: 362 | ```bash 363 | python oscar/run_captioning.py \ 364 | --do_test \ 365 | --do_eval \ 366 | --data_dir datasets/nocaps \ 367 | --test_yaml val.yaml \ 368 | --per_gpu_eval_batch_size 2 \ 369 | --num_beams 5 \ 370 | --use_cbs \ 371 | --max_gen_length 20 \ 372 | --eval_model_dir your_model_for_evaluation 373 | ``` 374 | 375 | 383 | 384 | ## Oscarplus pretraining 385 | Table 16 below shows the statistics of image and text of the pre-training corpora. 386 | In our ablation study, we have corpora of three different sizes: [Small](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa_x152c4big2exp168.yaml), [Medium](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa_oi_x152c4big2exp168.yaml), [Large](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml). 387 | Notice that we make use of image tagging datasets OpenImages, by generating captions using OSCAR's image captioning model to form triplets of ``(generated caption, image tags, image features)'' for the OSCAR+ pre-training. 388 | 389 | 390 | Script to perform oscar+ pretraining with the [large corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml). 391 | ```bash 392 | python -m torch.distributed.launch --nproc_per_node=8 oscar/run_oscarplus_pretrain.py \ 393 | --use_b 1 \ 394 | --max_grad_norm 10.0 --gradient_accumulation_steps 1 \ 395 | --use_img_layernorm 1 \ 396 | --output_dir \ 397 | --bert_model bert --model_name_or_path bert-base-uncased \ 398 | --do_lower_case --learning_rate 5e-05 399 | --warmup_steps 0 --do_train --max_seq_length 35 --on_memory \ 400 | --max_img_seq_length 50 --img_feature_dim 2054 \ 401 | --drop_out 0.1 --train_batch_size 8 \ 402 | --ckpt_period 10000 --max_iters 2000000 --log_period 100 \ 403 | --data_dir --dataset_file coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml 404 | --textb_sample_mode 1 --texta_false_prob 0.25 405 | ``` 406 | 407 | 408 | One can perform the vanilla OSCAR pretraining by setting 409 | ```bash 410 | --textb_sample_mode 0 --texta_false_prob 0.0 411 | ``` 412 | 413 | One can also split the large pretraining corpus into two parts, i.e., coco_flickr30k_gqa + googlecc_sbu_oi, and use different textb_sample_modes for them. 
414 | To set textb_sample_mode=2 for coco_flickr30k_gqa has the potential to emphasize the QA-pairs in the small corpus. 415 | ```bash 416 | --data_dir --dataset_file coco_flickr30k_gqa_x152c4big2exp168.yaml 417 | --textb_sample_mode 2 --texta_false_prob 0.25 \ 418 | --extra_dataset_file googlecc_sbu_oi_x152c4big2exp168.yaml \ 419 | --extra_textb_sample_mode 1 --extra_loss_weight 0.5 420 | ``` -------------------------------------------------------------------------------- /VinVL/Oscar/docs/oscar.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CCIIPLab/DPT/b6b4835ce84cff0d594854d4a7d3c2768f87cd9e/VinVL/Oscar/docs/oscar.PNG -------------------------------------------------------------------------------- /VinVL/Oscar/docs/oscar_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CCIIPLab/DPT/b6b4835ce84cff0d594854d4a7d3c2768f87cd9e/VinVL/Oscar/docs/oscar_logo.png -------------------------------------------------------------------------------- /VinVL/Oscar/docs/pretrain_corpus.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CCIIPLab/DPT/b6b4835ce84cff0d594854d4a7d3c2768f87cd9e/VinVL/Oscar/docs/pretrain_corpus.PNG -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/datasets/build.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | import torch 4 | from oscar.utils.misc import get_world_size 5 | from .oscar_tsv import OscarTSVDataset 6 | from transformers.pytorch_transformers import BertTokenizer 7 | 8 | 9 | class BatchCollator(object): 10 | """ 11 | From a list of samples from the dataset, 12 | returns the images and targets. 13 | """ 14 | def __call__(self, batch): 15 | return list(zip(*batch)) 16 | 17 | 18 | def build_dataset(args): 19 | """ 20 | Arguments: 21 | args: configuration. 
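    Returns:
        list of OscarTSVDataset: the main pre-training dataset, plus an extra
        dataset appended when args.extra_dataset_file is set (the OSCAR+
        corpus-splitting setup described in the README above).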
22 | """ 23 | full_yaml_file = os.path.join(args.data_dir, args.dataset_file) 24 | assert os.path.isfile(full_yaml_file) 25 | 26 | tokenizer = BertTokenizer.from_pretrained( 27 | args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, 28 | do_lower_case=args.do_lower_case) 29 | 30 | cfg = dict( 31 | yaml_file=full_yaml_file, 32 | args=args, 33 | seq_len=args.max_seq_length, 34 | on_memory=args.on_memory, 35 | tokenizer=tokenizer, 36 | ) 37 | # make dataset from factory 38 | datasets = [OscarTSVDataset(**cfg)] 39 | if args.extra_dataset_file: 40 | full_yaml_file = os.path.join(args.data_dir, args.extra_dataset_file) 41 | assert os.path.isfile(full_yaml_file) 42 | cfg['yaml_file'] = full_yaml_file 43 | cfg['textb_sample_mode'] = args.extra_textb_sample_mode 44 | datasets.append(OscarTSVDataset(**cfg)) 45 | 46 | return datasets 47 | 48 | 49 | def make_data_sampler(dataset, shuffle, distributed): 50 | if distributed: 51 | return torch.utils.data.distributed.DistributedSampler( 52 | dataset, shuffle=shuffle 53 | ) 54 | if shuffle: 55 | sampler = torch.utils.data.sampler.RandomSampler(dataset) 56 | else: 57 | sampler = torch.utils.data.sampler.SequentialSampler(dataset) 58 | return sampler 59 | 60 | 61 | class IterationBasedBatchSampler(torch.utils.data.sampler.BatchSampler): 62 | """ 63 | Wraps a BatchSampler, resampling from it until 64 | a specified number of iterations have been sampled 65 | """ 66 | 67 | def __init__(self, batch_sampler, num_iterations, start_iter=0): 68 | self.batch_sampler = batch_sampler 69 | self.num_iterations = num_iterations 70 | self.start_iter = start_iter 71 | 72 | def __iter__(self): 73 | iteration = self.start_iter 74 | while iteration <= self.num_iterations: 75 | # if the underlying sampler has a set_epoch method, like 76 | # DistributedSampler, used for making each process see 77 | # a different split of the dataset, then set it 78 | if hasattr(self.batch_sampler.sampler, "set_epoch"): 79 | self.batch_sampler.sampler.set_epoch(iteration) 80 | for batch in self.batch_sampler: 81 | iteration += 1 82 | if iteration > self.num_iterations: 83 | break 84 | yield batch 85 | 86 | def __len__(self): 87 | return self.num_iterations 88 | 89 | 90 | def make_batch_data_sampler( 91 | sampler, images_per_batch, num_iters=None, 92 | start_iter=0 93 | ): 94 | batch_sampler = torch.utils.data.sampler.BatchSampler( 95 | sampler, images_per_batch, drop_last=False 96 | ) 97 | if num_iters is not None and num_iters >= 0: 98 | batch_sampler = IterationBasedBatchSampler( 99 | batch_sampler, num_iters, start_iter 100 | ) 101 | return batch_sampler 102 | 103 | 104 | def make_data_loader(args, is_distributed=False, arguments=None): 105 | num_gpus = get_world_size() 106 | # figure out start iteration 107 | if arguments is None: 108 | start_iter = 0 109 | else: 110 | start_iter = arguments['iteration'] 111 | # figure out the batchsize 112 | grad_accumulate_steps = 1 113 | if hasattr(args, 'gradient_accumulation_steps'): 114 | grad_accumulate_steps = args.gradient_accumulation_steps 115 | assert ( 116 | args.train_batch_size % grad_accumulate_steps == 0 117 | ), "train_batch_size ({}) must be divisible by the number " 118 | "of Gradient accumulation ({}) used."\ 119 | .format(args.train_batch_size, grad_accumulate_steps) 120 | images_per_batch = args.train_batch_size//grad_accumulate_steps 121 | assert ( 122 | images_per_batch % num_gpus == 0 123 | ), "SOLVER.IMS_PER_BATCH ({}) must be divisible by the number " 124 | "of GPUs ({}) used.".format(images_per_batch, num_gpus) 
125 | images_per_gpu = images_per_batch // num_gpus 126 | logger = logging.getLogger(__name__) 127 | logger.info("Train with {} images per GPU".format(images_per_gpu)) 128 | shuffle = True 129 | num_iters = args.max_iters * grad_accumulate_steps 130 | 131 | # build dataset 132 | datasets = build_dataset(args) 133 | 134 | data_loaders = [] 135 | for i, dataset in enumerate(datasets): 136 | sampler = make_data_sampler(dataset, shuffle, is_distributed) 137 | 138 | batch_sampler = make_batch_data_sampler( 139 | sampler, images_per_gpu, num_iters, start_iter 140 | ) 141 | num_workers = args.num_workers 142 | data_loader = torch.utils.data.DataLoader( 143 | dataset, 144 | num_workers=num_workers, 145 | batch_sampler=batch_sampler, 146 | collate_fn=BatchCollator(), 147 | pin_memory=True, 148 | ) 149 | data_loaders.append(data_loader) 150 | return data_loaders 151 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/modeling/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/run_oscarplus_pretrain.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division, print_function 2 | 3 | import argparse 4 | import datetime 5 | import json 6 | import logging 7 | import os 8 | import random 9 | import sys 10 | import time 11 | import math 12 | import shutil 13 | 14 | sys.path.insert(0, '.') 15 | 16 | import numpy as np 17 | import torch 18 | 19 | from oscar.modeling.modeling_bert import BertImgForPreTraining 20 | from transformers.pytorch_transformers import (WEIGHTS_NAME, BertConfig, 21 | BertTokenizer) 22 | 23 | from oscar.datasets.build import make_data_loader 24 | 25 | from transformers.pytorch_transformers import AdamW, WarmupLinearSchedule 26 | from oscar.utils.misc import mkdir, get_rank 27 | from oscar.utils.metric_logger import TensorboardLogger 28 | 29 | logger = logging.getLogger(__name__) 30 | 31 | ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig,)), ()) 32 | 33 | MODEL_CLASSES = { 34 | 'bert': (BertConfig, BertImgForPreTraining, BertTokenizer), 35 | } 36 | 37 | 38 | """ ****** Pretraining ****** """ 39 | 40 | 41 | def main(): 42 | parser = argparse.ArgumentParser() 43 | 44 | ## Required parameters 45 | parser.add_argument("--data_dir", default=None, type=str, required=False, 46 | help="The input data dir. 
" 47 | "Should contain the .yaml files for the task.") 48 | parser.add_argument("--dataset_file", default=None, type=str, required=True, 49 | help="The training dataset yaml file.") 50 | parser.add_argument("--extra_dataset_file", default=None, type=str, required=False, 51 | help="The extra training dataset yaml file.") 52 | parser.add_argument("--bert_model", default=None, type=str, required=True, 53 | help="Bert pre-trained model selected in the list: bert-base-uncased, " 54 | "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.") 55 | parser.add_argument("--output_dir", default=None, type=str, required=True, 56 | help="The output directory where the model checkpoints will be written.") 57 | 58 | # image chunks 59 | parser.add_argument("--chunk_start_id", default=-1, type=int, 60 | help="Image Chunk Start ID") 61 | parser.add_argument("--chunk_end_id", default=-1, type=int, 62 | help="Image Chunk End ID") 63 | 64 | ## Image parameters 65 | parser.add_argument("--max_img_seq_length", default=50, type=int, 66 | help="The maximum total input image sequence length.") 67 | parser.add_argument("--img_feature_dim", default=2054, type=int, 68 | help="The Image Feature Dimension.") 69 | parser.add_argument("--img_feature_type", default='faster_r-cnn', type=str, 70 | help="faster_r-cnn or mask_r-cnn") 71 | parser.add_argument("--use_layernorm", action='store_true', 72 | help="use_layernorm") 73 | 74 | parser.add_argument("--drop_out", default=0.1, type=float, 75 | help="Drop out for BERT.") 76 | 77 | parser.add_argument("--use_b", type=int, default=1, help="use_b") 78 | parser.add_argument("--textb_sample_mode", type=int, default=0, 79 | help="0: sample from both texta&textb, " 80 | "1: sample from textb, " 81 | "2: sample from QA answers") 82 | parser.add_argument("--extra_textb_sample_mode", type=int, default=1) 83 | parser.add_argument("--texta_false_prob", type=float, default=0.0, 84 | help="the probality that we sample wrong texta, should in [0.0, 0.5]") 85 | 86 | parser.add_argument("--model_name_or_path", default=None, type=str, 87 | required=True, 88 | help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join( 89 | ALL_MODELS)) 90 | parser.add_argument("--config_name", default="", type=str, 91 | help="Pretrained config name or path if not the same as model_name") 92 | parser.add_argument("--tokenizer_name", default="", type=str, 93 | help="Pretrained tokenizer name or path if not the same as model_name") 94 | parser.add_argument("--cache_dir", default="", type=str, 95 | help="Where do you want to store the pre-trained models downloaded from s3") 96 | 97 | parser.add_argument("--max_seq_length", default=35, type=int, 98 | help="The maximum total input sequence length after WordPiece tokenization. 
\n" 99 | "Sequences longer than this will be truncated, and sequences shorter than this will be padded.") 100 | parser.add_argument("--do_train", action='store_true', 101 | help="Whether to run training.") 102 | parser.add_argument("--learning_rate", default=5e-5, type=float, 103 | help="The initial learning rate for Adam.") 104 | parser.add_argument("--max_iters", default=2000000, type=int, 105 | help="Maximal number of training iterations.") 106 | parser.add_argument("--train_batch_size", default=1024, type=int, 107 | help="Batch size for training.") 108 | parser.add_argument("--num_workers", default=6, type=int, 109 | help="Number of workers for dataset.") 110 | parser.add_argument("--adam_epsilon", default=1e-8, type=float, 111 | help="Epsilon for Adam optimizer.") 112 | parser.add_argument("--optim", default='adamw', type=str, 113 | help="The optimizer used for Bert, [adamw, lamb], default: adamw") 114 | parser.add_argument("--max_grad_norm", default=-1.0, type=float, help="Max gradient norm.") 115 | parser.add_argument("--warmup_steps", default=0, type=int, 116 | help="Linear warmup over warmup_steps.") 117 | parser.add_argument("--no_cuda", action='store_true', 118 | help="Whether not to use CUDA when available") 119 | parser.add_argument("--on_memory", action='store_true', 120 | help="Whether to load train samples into memory or use disk") 121 | parser.add_argument("--do_lower_case", action='store_true', 122 | help="Whether to lower case the input text. True for uncased models, False for cased models.") 123 | parser.add_argument("--local_rank", type=int, default=-1, 124 | help="local_rank for distributed training on gpus") 125 | parser.add_argument('--seed', type=int, default=42, 126 | help="random seed for initialization") 127 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 128 | help="Number of updates steps to accumualte before performing a backward/update pass.") 129 | 130 | parser.add_argument("--from_scratch", action='store_true', 131 | help="train from scratch") 132 | parser.add_argument("--use_img_layernorm", type=int, default=0, 133 | help="Normalize image features with bertlayernorm") 134 | parser.add_argument("--img_layer_norm_eps", default=1e-12, type=float, 135 | help="The eps in image feature laynorm layer") 136 | # distributed 137 | parser.add_argument('--gpu_ids', type=str, default='-1') 138 | parser.add_argument("--mask_loss_for_unmatched", type=int, default=1, 139 | help="masked language model loss for unmatched triplets") 140 | parser.add_argument("--extra_loss_weight", type=float, default=0.0, 141 | help="the loss weight for the extra train data batch (should be in [0,1])") 142 | parser.add_argument( 143 | "--use_gtlabels", 144 | type=int, default=1, 145 | help="use groundtruth labels for text b or not" 146 | ) 147 | # logging 148 | parser.add_argument('--ckpt_period', type=int, default=10000, 149 | help="Period for saving checkpoint") 150 | parser.add_argument('--log_period', type=int, default=100, 151 | help="Period for saving logging info") 152 | args = parser.parse_args() 153 | 154 | if args.gpu_ids != '-1': 155 | os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_ids 156 | 157 | args.num_gpus = int( 158 | os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1 159 | args.distributed = args.num_gpus > 1 160 | 161 | if args.gpu_ids != '-1': 162 | os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_ids 163 | 164 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train: 165 | logger.info("Output Directory 
Exists.") 166 | 167 | # Setup CUDA, GPU & distributed training 168 | if args.local_rank == -1 or args.no_cuda: 169 | device = torch.device( 170 | "cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") 171 | args.n_gpu = torch.cuda.device_count() 172 | else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs 173 | torch.cuda.set_device(args.local_rank) 174 | device = torch.device("cuda", args.local_rank) 175 | torch.distributed.init_process_group( 176 | backend='nccl', init_method="env://" 177 | ) 178 | args.n_gpu = 1 179 | args.device = device 180 | 181 | # Setup logging 182 | logging.basicConfig( 183 | format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', 184 | datefmt='%m/%d/%Y %H:%M:%S', 185 | level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN) 186 | logger.warning( 187 | "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s", 188 | args.local_rank, device, args.n_gpu, bool(args.local_rank != -1) 189 | ) 190 | 191 | if args.gradient_accumulation_steps < 1: 192 | raise ValueError( 193 | "Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format( 194 | args.gradient_accumulation_steps)) 195 | 196 | random.seed(args.seed) 197 | np.random.seed(args.seed) 198 | torch.manual_seed(args.seed) 199 | if args.n_gpu > 0: 200 | torch.cuda.manual_seed_all(args.seed) 201 | 202 | if not args.do_train: 203 | raise ValueError( 204 | "Training is currently the only implemented execution option. Please set `do_train`.") 205 | 206 | if not os.path.exists(args.output_dir): 207 | mkdir(args.output_dir) 208 | 209 | last_checkpoint_dir = None 210 | arguments = {"iteration": 0} 211 | if os.path.exists(args.output_dir): 212 | save_file = os.path.join(args.output_dir, "last_checkpoint") 213 | try: 214 | with open(save_file, "r") as f: 215 | last_saved = f.read() 216 | last_saved = last_saved.strip() 217 | except IOError: 218 | # if file doesn't exist, maybe because it has just been 219 | # deleted by a separate process 220 | last_saved = "" 221 | if last_saved: 222 | folder_name = os.path.splitext(last_saved.split('/')[0])[0] # in the form of checkpoint-00001 or checkpoint-00001/pytorch_model.bin 223 | last_checkpoint_dir = os.path.join(args.output_dir, folder_name) 224 | arguments["iteration"] = int(folder_name.split('-')[-1]) 225 | assert os.path.isfile(os.path.join(last_checkpoint_dir, WEIGHTS_NAME)), "Last_checkpoint detected, but file not found!" 
226 | 227 | # model first 228 | if get_rank() != 0: 229 | torch.distributed.barrier() 230 | config_class, model_class, tokenizer_class = MODEL_CLASSES[args.bert_model] 231 | if last_checkpoint_dir is not None: # recovery 232 | args.model_name_or_path = last_checkpoint_dir 233 | logger.info(" -> Recovering model from {}".format(last_checkpoint_dir)) 234 | 235 | config = config_class.from_pretrained( 236 | args.config_name if args.config_name else args.model_name_or_path, 237 | ) 238 | config.img_layer_norm_eps = args.img_layer_norm_eps 239 | config.use_img_layernorm = args.use_img_layernorm 240 | 241 | # discrete code 242 | config.img_feature_dim = args.img_feature_dim 243 | config.img_feature_type = args.img_feature_type 244 | config.hidden_dropout_prob = args.drop_out 245 | if args.texta_false_prob < 0.5 and (args.texta_false_prob > 0 or not args.use_b): 246 | args.num_contrast_classes = 3 247 | else: 248 | args.num_contrast_classes = 2 249 | config.num_contrast_classes = args.num_contrast_classes 250 | 251 | # Prepare model 252 | # model = BertForPreTraining.from_pretrained(args.bert_model) 253 | load_num = 0 254 | while load_num < 10: 255 | try: 256 | model = BertImgForPreTraining.from_pretrained( 257 | args.model_name_or_path, 258 | from_tf=bool('.ckpt' in args.model_name_or_path), 259 | config=config) 260 | break 261 | except: 262 | load_num += 1 263 | 264 | # train from scratch 265 | if args.from_scratch: 266 | if last_checkpoint_dir is None: 267 | logger.info("Training from scratch ... ") 268 | model.apply(model.init_weights) 269 | total_params = sum(p.numel() for p in model.parameters()) 270 | logger.info( 271 | 'Total Parameters: {}'.format(total_params)) 272 | 273 | for key, val in vars(config).items(): 274 | setattr(args, key, val) 275 | 276 | if get_rank() == 0 and args.local_rank != -1: 277 | torch.distributed.barrier() 278 | 279 | model.to(args.device) 280 | 281 | logger.info("Training/evaluation parameters %s", args) 282 | 283 | tb_log_dir = os.path.join(args.output_dir, 'train_logs') 284 | meters = TensorboardLogger( 285 | log_dir=tb_log_dir, 286 | delimiter=" ", 287 | ) 288 | 289 | # Prepare optimizer 290 | param_optimizer = list(model.named_parameters()) 291 | no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] 292 | optimizer_grouped_parameters = [ 293 | {'params': [p for n, p in param_optimizer if 294 | not any(nd in n for nd in no_decay)], 295 | 'weight_decay': 0.01}, 296 | {'params': [p for n, p in param_optimizer if 297 | any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 298 | ] 299 | 300 | optimizer = AdamW(optimizer_grouped_parameters, 301 | lr=args.learning_rate, eps=args.adam_epsilon) 302 | scheduler = WarmupLinearSchedule(optimizer, 303 | warmup_steps=args.warmup_steps, 304 | t_total=args.max_iters) 305 | 306 | if arguments['iteration'] > 0 and os.path.isfile(os.path.join(last_checkpoint_dir, 'optimizer.pth')): # recovery 307 | logger.info( 308 | "Load BERT optimizer from {}".format(last_checkpoint_dir)) 309 | optimizer_to_load = torch.load( 310 | os.path.join(last_checkpoint_dir, 'optimizer.pth'), 311 | map_location=torch.device("cpu")) 312 | optimizer.load_state_dict(optimizer_to_load.pop("optimizer")) 313 | scheduler.load_state_dict(optimizer_to_load.pop("scheduler")) 314 | 315 | if args.distributed: 316 | model = torch.nn.parallel.DistributedDataParallel( 317 | model, device_ids=[args.local_rank], output_device=args.local_rank, 318 | find_unused_parameters=True) 319 | elif args.n_gpu > 1: 320 | model = torch.nn.DataParallel(model) 321 | 322 | # 
train_examples = None 323 | train_dataloaders = make_data_loader( 324 | args, is_distributed=args.distributed, arguments=arguments 325 | ) 326 | 327 | if isinstance(train_dataloaders, list): 328 | train_dataloader = train_dataloaders[0] 329 | else: 330 | train_dataloader = train_dataloaders 331 | train_dataloader_extra = [None] * len(train_dataloader) 332 | if isinstance(train_dataloaders, list) and len(train_dataloaders) > 1: 333 | logger.info("Having two train dataloaders!") 334 | train_dataloader_extra = train_dataloaders[1] 335 | tokenizer = train_dataloader.dataset.tokenizer 336 | 337 | # torch.backends.cudnn.benchmark = True 338 | 339 | max_iter = len(train_dataloader) 340 | start_iter = arguments["iteration"] 341 | logger.info("***** Running training *****") 342 | logger.info(" Num examples = {}".format(len(train_dataloader.dataset))) 343 | logger.info(" Instantaneous batch size = %d", 344 | args.train_batch_size // args.gradient_accumulation_steps) 345 | logger.info( 346 | " Total train batch size (w. parallel, distributed & accumulation) = %d", 347 | args.train_batch_size) 348 | logger.info(" Gradient Accumulation steps = %d", 349 | args.gradient_accumulation_steps) 350 | logger.info(" Total optimization steps = %d", 351 | max_iter // args.gradient_accumulation_steps) 352 | 353 | log_json = {} 354 | 355 | model.train() 356 | model.zero_grad() 357 | 358 | clock_started = False 359 | # Every args.ckpt_period, report train_score and save model 360 | tr_loss = 0 361 | nb_tr_examples, nb_tr_steps = 0, 0 362 | for step, (batch, batch_extra) in enumerate(zip(train_dataloader, train_dataloader_extra), start_iter): 363 | if not clock_started: 364 | start_training_time = time.time() 365 | end = time.time() 366 | clock_started = True 367 | 368 | def data_process(mini_batch): 369 | images, targets, qa_inds = \ 370 | mini_batch[0], mini_batch[1], mini_batch[2] 371 | targets_transposed = list(zip(*targets)) 372 | input_ids = torch.stack(targets_transposed[0]).to(args.device, non_blocking=True) 373 | input_mask = torch.stack(targets_transposed[1]).to(args.device, non_blocking=True) 374 | segment_ids = torch.stack(targets_transposed[2]).to(args.device, non_blocking=True) 375 | lm_label_ids = torch.stack(targets_transposed[3]).to(args.device, non_blocking=True) 376 | is_next = torch.stack(targets_transposed[4]).to(args.device, non_blocking=True) 377 | is_img_match = torch.stack(targets_transposed[5]).to(args.device, non_blocking=True) 378 | 379 | return images, input_ids, input_mask, segment_ids, lm_label_ids, is_next 380 | 381 | images1, input_ids1, input_mask1, segment_ids1, lm_label_ids1, is_next1 \ 382 | = data_process(batch) 383 | if batch_extra is not None: 384 | images2, input_ids2, input_mask2, segment_ids2, lm_label_ids2, is_next2 \ 385 | = data_process(batch_extra) 386 | 387 | data_time = time.time() - end 388 | 389 | def forward_backward(images, input_ids, input_mask, segment_ids, 390 | lm_label_ids, is_next, loss_weight=1.0): 391 | # feature as input 392 | image_features = torch.stack(images).to(args.device, non_blocking=True) 393 | 394 | outputs = model(input_ids, segment_ids, input_mask, 395 | lm_label_ids, is_next, img_feats=image_features) 396 | 397 | loss = loss_weight * outputs[0] 398 | 399 | if args.n_gpu > 1: 400 | loss = loss.mean() # mean() to average on multi-gpu. 
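            # Dividing by gradient_accumulation_steps keeps the accumulated gradient
            # equivalent to one larger batch: loss.backward() runs on every
            # mini-batch, while optimizer.step() only fires every
            # gradient_accumulation_steps iterations (see the update block below).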
401 | 402 | if args.gradient_accumulation_steps > 1: 403 | loss = loss / args.gradient_accumulation_steps 404 | loss.backward() 405 | 406 | return loss.item(), input_ids.size(0) 407 | 408 | start1 = time.time() 409 | loss1, nb_tr_example1 = forward_backward( 410 | images1, input_ids1, input_mask1, 411 | segment_ids1, lm_label_ids1, is_next1, 412 | loss_weight=1.0-args.extra_loss_weight 413 | ) 414 | tr_loss += loss1 415 | nb_tr_examples += nb_tr_example1 416 | compute_time1 = time.time() - start1 417 | 418 | loss2, nb_tr_example2 = 0.0, 0 419 | compute_time2 = 0.0 420 | if batch_extra is not None: 421 | start2 = time.time() 422 | loss2, nb_tr_example2 = forward_backward( 423 | images2, input_ids2, input_mask2, 424 | segment_ids2, lm_label_ids2, is_next2, 425 | loss_weight=args.extra_loss_weight 426 | ) 427 | tr_loss += loss2 428 | nb_tr_examples += nb_tr_example2 429 | compute_time2 = time.time() - start2 430 | 431 | nb_tr_steps += 1 432 | arguments["iteration"] = step + 1 433 | 434 | if (step + 1) % args.gradient_accumulation_steps == 0: 435 | # do gradient clipping 436 | if args.max_grad_norm > 0: 437 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 438 | # do the optimization steps 439 | optimizer.step() 440 | scheduler.step() # Update learning rate schedule 441 | optimizer.zero_grad() 442 | 443 | # measure elapsed time 444 | batch_time = time.time() - end 445 | end = time.time() 446 | metrics_to_log = { 447 | 'time_info': {'compute': batch_time, 'data': data_time, 448 | 'compute1': compute_time1, 449 | 'compute2': compute_time2}, 450 | 'batch_metrics': {'loss': loss1+loss2} 451 | } 452 | params_to_log = {'params': {'bert_lr': optimizer.param_groups[0]["lr"]}} 453 | meters.update_metrics(metrics_to_log) 454 | meters.update_params(params_to_log) 455 | 456 | if args.log_period > 0 and (step + 1) % args.log_period == 0: 457 | avg_time = meters.meters['time_info']['compute'].global_avg 458 | eta_seconds = avg_time * (max_iter - step - 1) 459 | eta_string = str( 460 | datetime.timedelta(seconds=int(eta_seconds))) 461 | logger.info( 462 | meters.delimiter.join( 463 | [ 464 | "eta: {eta}", 465 | "iter: {iter}", 466 | "max mem: {memory:.0f}", 467 | ] 468 | ).format( 469 | eta=eta_string, 470 | iter=step + 1, 471 | memory=torch.cuda.max_memory_allocated() / 1024.0 / 1024.0, 472 | ) + "\n " + meters.get_logs(step + 1) 473 | ) 474 | 475 | if (step + 1) == max_iter or (step + 1) % args.ckpt_period == 0: # Save a trained model 476 | log_json[step+1] = tr_loss 477 | train_metrics_total = torch.Tensor([tr_loss, nb_tr_examples, nb_tr_steps]).to(args.device) 478 | torch.distributed.all_reduce(train_metrics_total) 479 | # reset metrics 480 | tr_loss = 0 481 | nb_tr_examples, nb_tr_steps = 0, 0 482 | 483 | if get_rank() == 0: 484 | # report metrics 485 | train_score_gathered = train_metrics_total[0] / \ 486 | train_metrics_total[2] 487 | logger.info("PROGRESS: {}%".format( 488 | round(100 * (step + 1) / max_iter, 4))) 489 | logger.info( 490 | "EVALERR: {}%".format(train_score_gathered)) 491 | meters.update_metrics( 492 | { 493 | 'epoch_metrics': {'ex_cnt': train_metrics_total[1], 494 | 'loss': train_score_gathered} 495 | } 496 | ) 497 | with open(os.path.join(args.output_dir, 'loss_logs.json'), 498 | 'w') as fp: 499 | json.dump(log_json, fp) 500 | 501 | # save checkpoint 502 | output_dir = os.path.join(args.output_dir, 503 | 'checkpoint-{:07d}'.format( 504 | step + 1)) 505 | if not os.path.exists(output_dir): 506 | os.makedirs(output_dir) 507 | model_to_save = model.module if 
hasattr( 508 | model, 509 | 'module') else model # Take care of distributed/parallel training 510 | optimizer_to_save = { 511 | "optimizer": optimizer.state_dict(), 512 | "scheduler": scheduler.state_dict()} 513 | 514 | save_num = 0 515 | while save_num < 10: 516 | try: 517 | model_to_save.save_pretrained(output_dir) 518 | torch.save(args, os.path.join(output_dir, 519 | 'training_args.bin')) 520 | tokenizer.save_pretrained(output_dir) 521 | torch.save(optimizer_to_save, 522 | os.path.join(output_dir, 523 | 'optimizer.pth')) 524 | save_file = os.path.join(args.output_dir, "last_checkpoint") 525 | with open(save_file, "w") as f: 526 | f.write('checkpoint-{:07d}/pytorch_model.bin'.format(step + 1)) 527 | break 528 | except: 529 | save_num += 1 530 | logger.info( 531 | "Saving model checkpoint {0} to {1}".format( 532 | step + 1, output_dir)) 533 | 534 | if clock_started: 535 | total_training_time = time.time() - start_training_time 536 | else: 537 | total_training_time = 0.0 538 | total_time_str = str(datetime.timedelta(seconds=total_training_time)) 539 | logger.info( 540 | "Total training time: {} ({:.4f} s / it)".format( 541 | total_time_str, total_training_time / max_iter 542 | ) 543 | ) 544 | # close the tb logger 545 | meters.close() 546 | 547 | 548 | if __name__ == "__main__": 549 | main() 550 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/caption_evaluate.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 
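"""Caption evaluation utilities: metric computation on COCO-format predictions
(via the coco_caption COCO/COCOEvalCap tools), remote NoCaps evaluation through
the EvalAI submission CLI, and the SCST reward criterion used for CIDEr-based
sequence-level optimization."""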
2 | 3 | from collections import OrderedDict, defaultdict 4 | import json 5 | import numpy as np 6 | import os.path as op 7 | from pprint import pprint 8 | import torch 9 | import re 10 | import subprocess 11 | import tempfile 12 | import time 13 | from typing import Dict, Optional 14 | 15 | from coco_caption.pycocotools.coco import COCO 16 | from coco_caption.pycocoevalcap.eval import COCOEvalCap 17 | from .cider.pyciderevalcap.ciderD.ciderD import CiderD 18 | 19 | 20 | def evaluate_on_nocaps(split, predict_file, data_dir='data/nocaps/', evaluate_file=None): 21 | ''' 22 | NOTE: Put the auth file in folder ~/.evalai/ 23 | ''' 24 | if not evaluate_file: 25 | evaluate_file = op.splitext(predict_file)[0] + '.eval.json' 26 | if op.isfile(evaluate_file): 27 | print('{} already exists'.format(evaluate_file)) 28 | with open(evaluate_file, 'r') as fp: 29 | metrics = json.load(fp) 30 | return metrics 31 | 32 | image_info_file = op.join(data_dir, 33 | 'nocaps_{}_image_info.json'.format(split)) 34 | image_info = json.load(open(image_info_file)) 35 | open_image_id2id = {} 36 | for it in image_info['images']: 37 | open_image_id2id[it['open_images_id']] = it['id'] 38 | predictions = [] 39 | cap_id = 0 40 | with open(predict_file, 'r') as fp: 41 | for line in fp: 42 | p = line.strip().split('\t') 43 | predictions.append( 44 | {'image_id': open_image_id2id[p[0]], 45 | 'caption': json.loads(p[1])[0]['caption'], 46 | 'id': cap_id}) 47 | cap_id += 1 48 | if split == 'test': 49 | print('Are you sure to submit test split result at: {}'.format(predict_file)) 50 | import ipdb;ipdb.set_trace() 51 | nocapseval = NocapsEvaluator(phase=split) 52 | metrics = nocapseval.evaluate(predictions) 53 | pprint(metrics) 54 | with open(evaluate_file, 'w') as fp: 55 | json.dump(metrics, fp) 56 | return metrics 57 | 58 | 59 | def evaluate_on_coco_caption(res_file, label_file, outfile=None): 60 | """ 61 | res_tsv: TSV file, each row is [image_key, json format list of captions]. 62 | Each caption is a dict, with fields "caption", "conf". 63 | label_file: JSON file of ground truth captions in COCO format. 
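    outfile: optional path to dump the resulting metric dict as JSON;
        when omitted, the result is only printed.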
64 | """ 65 | assert label_file.endswith('.json') 66 | if res_file.endswith('.tsv'): 67 | res_file_coco = op.splitext(res_file)[0] + '_coco_format.json' 68 | convert_tsv_to_coco_format(res_file, res_file_coco) 69 | else: 70 | raise ValueError('unknown prediction result file format: {}'.format(res_file)) 71 | 72 | coco = COCO(label_file) 73 | cocoRes = coco.loadRes(res_file_coco) 74 | cocoEval = COCOEvalCap(coco, cocoRes, 'corpus') 75 | 76 | # evaluate on a subset of images by setting 77 | # cocoEval.params['image_id'] = cocoRes.getImgIds() 78 | # please remove this line when evaluating the full validation set 79 | cocoEval.params['image_id'] = cocoRes.getImgIds() 80 | 81 | # evaluate results 82 | # SPICE will take a few minutes the first time, but speeds up due to caching 83 | cocoEval.evaluate() 84 | result = cocoEval.eval 85 | if not outfile: 86 | print(result) 87 | else: 88 | with open(outfile, 'w') as fp: 89 | json.dump(result, fp, indent=4) 90 | return result 91 | 92 | 93 | def convert_tsv_to_coco_format(res_tsv, outfile, 94 | sep='\t', key_col=0, cap_col=1): 95 | results = [] 96 | with open(res_tsv) as fp: 97 | for line in fp: 98 | parts = line.strip().split(sep) 99 | key = parts[key_col] 100 | if cap_col < len(parts): 101 | caps = json.loads(parts[cap_col]) 102 | assert len(caps) == 1, 'cannot evaluate multiple captions per image' 103 | cap = caps[0].get('caption', '') 104 | else: 105 | # empty caption generated 106 | cap = "" 107 | results.append( 108 | {'image_id': key, 109 | 'caption': cap} 110 | ) 111 | with open(outfile, 'w') as fp: 112 | json.dump(results, fp) 113 | 114 | 115 | class ScstRewardCriterion(torch.nn.Module): 116 | CIDER_REWARD_WEIGHT = 1 117 | 118 | def __init__(self, cider_cached_tokens='corpus', baseline_type='greedy'): 119 | self.CiderD_scorer = CiderD(df=cider_cached_tokens) 120 | assert baseline_type in ['greedy', 'sample'] 121 | self.baseline_type = baseline_type 122 | self._cur_score = None 123 | super().__init__() 124 | 125 | def forward(self, gt_res, greedy_res, sample_res, sample_logprobs): 126 | batch_size = len(gt_res) 127 | sample_res_size = len(sample_res) 128 | seq_per_img = sample_res_size // batch_size 129 | 130 | gen_res = [] 131 | gen_res.extend(sample_res) 132 | gt_idx = [i // seq_per_img for i in range(sample_res_size)] 133 | if self.baseline_type == 'greedy': 134 | assert len(greedy_res) == batch_size 135 | gen_res.extend(greedy_res) 136 | gt_idx.extend([i for i in range(batch_size)]) 137 | 138 | scores = self._calculate_eval_scores(gen_res, gt_idx, gt_res) 139 | 140 | if self.baseline_type == 'greedy': 141 | baseline = scores[-batch_size:][:, np.newaxis] 142 | else: 143 | sc_ = scores.reshape(batch_size, seq_per_img) 144 | baseline = (sc_.sum(1, keepdims=True) - sc_) / (sc_.shape[1] - 1) 145 | 146 | # sample - baseline 147 | reward = scores[:sample_res_size].reshape(batch_size, seq_per_img) 148 | self._cur_score = reward.mean() 149 | reward = reward - baseline 150 | reward = reward.reshape(sample_res_size) 151 | 152 | reward = torch.as_tensor(reward, device=sample_logprobs.device, dtype=torch.float) 153 | loss = - sample_logprobs * reward 154 | loss = loss.mean() 155 | return loss 156 | 157 | def get_score(self): 158 | return self._cur_score 159 | 160 | def _calculate_eval_scores(self, gen_res, gt_idx, gt_res): 161 | ''' 162 | gen_res: generated captions, list of str 163 | gt_idx: list of int, of the same length as gen_res 164 | gt_res: ground truth captions, list of list of str. 
165 | gen_res[i] corresponds to gt_res[gt_idx[i]] 166 | Each image can have multiple ground truth captions 167 | ''' 168 | gen_res_size = len(gen_res) 169 | 170 | res = OrderedDict() 171 | for i in range(gen_res_size): 172 | res[i] = [self._wrap_sentence(gen_res[i])] 173 | 174 | gts = OrderedDict() 175 | gt_res_ = [ 176 | [self._wrap_sentence(gt_res[i][j]) for j in range(len(gt_res[i]))] 177 | for i in range(len(gt_res)) 178 | ] 179 | for i in range(gen_res_size): 180 | gts[i] = gt_res_[gt_idx[i]] 181 | 182 | res_ = [{'image_id':i, 'caption': res[i]} for i in range(len(res))] 183 | _, batch_cider_scores = self.CiderD_scorer.compute_score(gts, res_) 184 | scores = self.CIDER_REWARD_WEIGHT * batch_cider_scores 185 | return scores 186 | 187 | @classmethod 188 | def _wrap_sentence(self, s): 189 | # ensure the sentence ends with token 190 | # in order to keep consisitent with cider_cached_tokens 191 | r = s.strip() 192 | if r.endswith('.'): 193 | r = r[:-1] 194 | r += ' ' 195 | return r 196 | 197 | 198 | class NocapsEvaluator(object): 199 | r""" 200 | Code from https://github.com/nocaps-org/updown-baseline/blob/master/updown/utils/evalai.py 201 | 202 | A utility class to submit model predictions on nocaps splits to EvalAI, and retrieve model 203 | performance based on captioning metrics (such as CIDEr, SPICE). 204 | 205 | Extended Summary 206 | ---------------- 207 | This class and the training script together serve as a working example for "EvalAI in the 208 | loop", showing how evaluation can be done remotely on privately held splits. Annotations 209 | (captions) and evaluation-specific tools (e.g. `coco-caption `_) 210 | are not required locally. This enables users to select best checkpoint, perform early 211 | stopping, learning rate scheduling based on a metric, etc. without actually doing evaluation. 212 | 213 | Parameters 214 | ---------- 215 | phase: str, optional (default = "val") 216 | Which phase to evaluate on. One of "val" or "test". 217 | 218 | Notes 219 | ----- 220 | This class can be used for retrieving metrics on both, val and test splits. However, we 221 | recommend to avoid using it for test split (at least during training). Number of allowed 222 | submissions to test split on EvalAI are very less, and can exhaust in a few iterations! However, 223 | the number of submissions to val split are practically infinite. 224 | """ 225 | 226 | def __init__(self, phase: str = "val"): 227 | 228 | # Constants specific to EvalAI. 229 | self._challenge_id = 355 230 | self._phase_id = 742 if phase == "val" else 743 231 | 232 | def evaluate( 233 | self, predictions, iteration: Optional[int] = None 234 | ) -> Dict[str, Dict[str, float]]: 235 | r""" 236 | Take the model predictions (in COCO format), submit them to EvalAI, and retrieve model 237 | performance based on captioning metrics. 238 | 239 | Parameters 240 | ---------- 241 | predictions: List[Prediction] 242 | Model predictions in COCO format. They are a list of dicts with keys 243 | ``{"image_id": int, "caption": str}``. 244 | iteration: int, optional (default = None) 245 | Training iteration where the checkpoint was evaluated. 246 | 247 | Returns 248 | ------- 249 | Dict[str, Dict[str, float]] 250 | Model performance based on all captioning metrics. 
Nested dict structure:: 251 | 252 | { 253 | "B1": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-1 254 | "B2": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-2 255 | "B3": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-3 256 | "B4": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-4 257 | "METEOR": {"in-domain", "near-domain", "out-domain", "entire"}, 258 | "ROUGE-L": {"in-domain", "near-domain", "out-domain", "entire"}, 259 | "CIDEr": {"in-domain", "near-domain", "out-domain", "entire"}, 260 | "SPICE": {"in-domain", "near-domain", "out-domain", "entire"}, 261 | } 262 | 263 | """ 264 | # Save predictions as a json file first. 265 | _, predictions_filename = tempfile.mkstemp(suffix=".json", text=True) 266 | with open(predictions_filename, "w") as f: 267 | json.dump(predictions, f) 268 | 269 | submission_command = ( 270 | f"evalai challenge {self._challenge_id} phase {self._phase_id} " 271 | f"submit --file {predictions_filename}" 272 | ) 273 | 274 | submission_command_subprocess = subprocess.Popen( 275 | submission_command.split(), 276 | stdout=subprocess.PIPE, 277 | stdin=subprocess.PIPE, 278 | stderr=subprocess.STDOUT, 279 | ) 280 | 281 | # This terminal output will have submission ID we need to check. 282 | submission_command_stdout = submission_command_subprocess.communicate(input=b"N\n")[ 283 | 0 284 | ].decode("utf-8") 285 | 286 | submission_id_regex = re.search("evalai submission ([0-9]+)", submission_command_stdout) 287 | try: 288 | # Get an integer submission ID (as a string). 289 | submission_id = submission_id_regex.group(0).split()[-1] # type: ignore 290 | except: 291 | # Very unlikely, but submission may fail because of some glitch. Retry for that. 292 | return self.evaluate(predictions) 293 | 294 | if iteration is not None: 295 | print(f"Submitted predictions for iteration {iteration}, submission id: {submission_id}.") 296 | else: 297 | print(f"Submitted predictions, submission_id: {submission_id}") 298 | 299 | # Placeholder stdout for a pending submission. 300 | result_stdout: str = "The Submission is yet to be evaluated." 301 | num_tries: int = 0 302 | 303 | # Query every 10 seconds for result until it appears. 304 | while "CIDEr" not in result_stdout: 305 | 306 | time.sleep(10) 307 | result_stdout = subprocess.check_output( 308 | ["evalai", "submission", submission_id, "result"] 309 | ).decode("utf-8") 310 | num_tries += 1 311 | 312 | # Raise error if it takes more than 5 minutes. 313 | if num_tries == 30: 314 | raise ConnectionError("Unable to get results from EvalAI within 5 minutes!") 315 | 316 | # Convert result to json. 317 | metrics = json.loads(result_stdout, encoding="utf-8") 318 | 319 | # keys: {"in-domain", "near-domain", "out-domain", "entire"} 320 | # In each of these, keys: {"B1", "B2", "B3", "B4", "METEOR", "ROUGE-L", "CIDEr", "SPICE"} 321 | metrics = { 322 | "in-domain": metrics[0]["in-domain"], 323 | "near-domain": metrics[1]["near-domain"], 324 | "out-domain": metrics[2]["out-domain"], 325 | "entire": metrics[3]["entire"], 326 | } 327 | 328 | # Restructure the metrics dict for better tensorboard logging. 
329 | # keys: {"B1", "B2", "B3", "B4", "METEOR", "ROUGE-L", "CIDEr", "SPICE"} 330 | # In each of these, keys: keys: {"in-domain", "near-domain", "out-domain", "entire"} 331 | flipped_metrics: Dict[str, Dict[str, float]] = defaultdict(dict) 332 | for key, val in metrics.items(): 333 | for subkey, subval in val.items(): 334 | flipped_metrics[subkey][key] = subval 335 | 336 | return flipped_metrics 337 | 338 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/cider/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/cider/cider.py: -------------------------------------------------------------------------------- 1 | # Filename: cider.py 2 | # 3 | # 4 | # Description: Describes the class to compute the CIDEr 5 | # (Consensus-Based Image Description Evaluation) Metric 6 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 7 | # 8 | # Creation Date: Sun Feb 8 14:16:54 2015 9 | # 10 | # Authors: Ramakrishna Vedantam and 11 | # Tsung-Yi Lin 12 | from __future__ import absolute_import 13 | from __future__ import division 14 | from __future__ import print_function 15 | 16 | from .cider_scorer import CiderScorer 17 | 18 | 19 | class Cider: 20 | """ 21 | Main Class to compute the CIDEr metric 22 | 23 | """ 24 | def __init__(self, n=4, df="corpus"): 25 | """ 26 | Initialize the CIDEr scoring function 27 | : param n (int): n-gram size 28 | : param df (string): specifies where to get the IDF values from 29 | takes values 'corpus', 'coco-train' 30 | : return: None 31 | """ 32 | # set cider to sum over 1 to 4-grams 33 | self._n = n 34 | self._df = df 35 | self.cider_scorer = CiderScorer(n=self._n, df_mode=self._df) 36 | 37 | def compute_score(self, gts, res): 38 | """ 39 | Main function to compute CIDEr score 40 | : param gts (dict) : {image:tokenized reference sentence} 41 | : param res (dict) : {image:tokenized candidate sentence} 42 | : return: cider (float) : computed CIDEr score for the corpus 43 | """ 44 | 45 | # clear all the previous hypos and refs 46 | self.cider_scorer.clear() 47 | 48 | for res_id in res: 49 | 50 | hypo = res_id['caption'] 51 | ref = gts[res_id['image_id']] 52 | 53 | # Sanity check. 
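            # (each candidate entry must hold exactly one caption string and
            #  every image needs at least one reference caption)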
54 | assert(type(hypo) is list) 55 | assert(len(hypo) == 1) 56 | assert(type(ref) is list) 57 | assert(len(ref) > 0) 58 | self.cider_scorer += (hypo[0], ref) 59 | 60 | (score, scores) = self.cider_scorer.compute_score() 61 | 62 | return score, scores 63 | 64 | def method(self): 65 | return "CIDEr" 66 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/cider/cider_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import copy 9 | import six 10 | from six.moves import cPickle 11 | from collections import defaultdict 12 | import numpy as np 13 | import math 14 | import os 15 | 16 | def precook(s, n=4, out=False): 17 | """ 18 | Takes a string as input and returns an object that can be given to 19 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 20 | can take string arguments as well. 21 | :param s: string : sentence to be converted into ngrams 22 | :param n: int : number of ngrams for which representation is calculated 23 | :return: term frequency vector for occuring ngrams 24 | """ 25 | words = s.split() 26 | counts = defaultdict(int) 27 | for k in range(1,n+1): 28 | for i in range(len(words)-k+1): 29 | ngram = tuple(words[i:i+k]) 30 | counts[ngram] += 1 31 | return counts 32 | 33 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 34 | '''Takes a list of reference sentences for a single segment 35 | and returns an object that encapsulates everything that BLEU 36 | needs to know about them. 37 | :param refs: list of string : reference sentences for some image 38 | :param n: int : number of ngrams for which (ngram) representation is calculated 39 | :return: result (list of dict) 40 | ''' 41 | return [precook(ref, n) for ref in refs] 42 | 43 | def cook_test(test, n=4): 44 | '''Takes a test sentence and returns an object that 45 | encapsulates everything that BLEU needs to know about it. 46 | :param test: list of string : hypothesis sentence for some image 47 | :param n: int : number of ngrams for which (ngram) representation is calculated 48 | :return: result (dict) 49 | ''' 50 | return precook(test, n, True) 51 | 52 | class CiderScorer(object): 53 | """CIDEr scorer. 
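    Accumulates (hypothesis, reference list) pairs via cook_append/__iadd__ and
    computes corpus-level CIDEr from tf-idf weighted n-gram cosine similarity
    in compute_score.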
54 | """ 55 | 56 | def copy(self): 57 | ''' copy the refs.''' 58 | new = CiderScorer(n=self.n) 59 | new.ctest = copy.copy(self.ctest) 60 | new.crefs = copy.copy(self.crefs) 61 | return new 62 | 63 | def __init__(self, df_mode="corpus", test=None, refs=None, n=4, sigma=6.0): 64 | ''' singular instance ''' 65 | self.n = n 66 | self.sigma = sigma 67 | self.crefs = [] 68 | self.ctest = [] 69 | self.df_mode = df_mode 70 | self.ref_len = None 71 | if self.df_mode != "corpus": 72 | pkl_file = cPickle.load(open(os.path.join('data', df_mode + '.p'),'rb'), **(dict(encoding='latin1') if six.PY3 else {})) 73 | self.ref_len = np.log(float(pkl_file['ref_len'])) 74 | self.document_frequency = pkl_file['document_frequency'] 75 | self.cook_append(test, refs) 76 | 77 | def clear(self): 78 | self.crefs = [] 79 | self.ctest = [] 80 | 81 | def cook_append(self, test, refs): 82 | '''called by constructor and __iadd__ to avoid creating new instances.''' 83 | 84 | if refs is not None: 85 | self.crefs.append(cook_refs(refs)) 86 | if test is not None: 87 | self.ctest.append(cook_test(test)) ## N.B.: -1 88 | else: 89 | self.ctest.append(None) # lens of crefs and ctest have to match 90 | 91 | def size(self): 92 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 93 | return len(self.crefs) 94 | 95 | def __iadd__(self, other): 96 | '''add an instance (e.g., from another sentence).''' 97 | 98 | if type(other) is tuple: 99 | ## avoid creating new CiderScorer instances 100 | self.cook_append(other[0], other[1]) 101 | else: 102 | self.ctest.extend(other.ctest) 103 | self.crefs.extend(other.crefs) 104 | 105 | return self 106 | def compute_doc_freq(self): 107 | ''' 108 | Compute term frequency for reference data. 109 | This will be used to compute idf (inverse document frequency later) 110 | The term frequency is stored in the object 111 | :return: None 112 | ''' 113 | for refs in self.crefs: 114 | # refs, k ref captions of one image 115 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.items()]): 116 | self.document_frequency[ngram] += 1 117 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 118 | 119 | def compute_cider(self): 120 | def counts2vec(cnts): 121 | """ 122 | Function maps counts of ngram to vector of tfidf weights. 123 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 124 | The n-th entry of array denotes length of n-grams. 125 | :param cnts: 126 | :return: vec (array of dict), norm (array of float), length (int) 127 | """ 128 | vec = [defaultdict(float) for _ in range(self.n)] 129 | length = 0 130 | norm = [0.0 for _ in range(self.n)] 131 | for (ngram,term_freq) in cnts.items(): 132 | # give word count 1 if it doesn't appear in reference corpus 133 | df = np.log(max(1.0, self.document_frequency[ngram])) 134 | # ngram index 135 | n = len(ngram)-1 136 | # tf (term_freq) * idf (precomputed idf) for n-grams 137 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 138 | # compute norm for the vector. the norm will be used for 139 | # computing similarity 140 | norm[n] += pow(vec[n][ngram], 2) 141 | 142 | if n == 1: 143 | length += term_freq 144 | norm = [np.sqrt(n) for n in norm] 145 | return vec, norm, length 146 | 147 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 148 | ''' 149 | Compute the cosine similarity of two vectors. 
150 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 151 | :param vec_ref: array of dictionary for vector corresponding to reference 152 | :param norm_hyp: array of float for vector corresponding to hypothesis 153 | :param norm_ref: array of float for vector corresponding to reference 154 | :param length_hyp: int containing length of hypothesis 155 | :param length_ref: int containing length of reference 156 | :return: array of score for each n-grams cosine similarity 157 | ''' 158 | delta = float(length_hyp - length_ref) 159 | # measure consine similarity 160 | val = np.array([0.0 for _ in range(self.n)]) 161 | for n in range(self.n): 162 | # ngram 163 | for (ngram,count) in vec_hyp[n].items(): 164 | val[n] += vec_hyp[n][ngram] * vec_ref[n][ngram] 165 | 166 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 167 | val[n] /= (norm_hyp[n]*norm_ref[n]) 168 | 169 | assert(not math.isnan(val[n])) 170 | return val 171 | 172 | # compute log reference length 173 | if self.df_mode == "corpus": 174 | self.ref_len = np.log(float(len(self.crefs))) 175 | 176 | scores = [] 177 | for test, refs in zip(self.ctest, self.crefs): 178 | # compute vector for test captions 179 | vec, norm, length = counts2vec(test) 180 | # compute vector for ref captions 181 | score = np.array([0.0 for _ in range(self.n)]) 182 | for ref in refs: 183 | vec_ref, norm_ref, length_ref = counts2vec(ref) 184 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 185 | # change by vrama91 - mean of ngram scores, instead of sum 186 | score_avg = np.mean(score) 187 | # divide by number of references 188 | score_avg /= len(refs) 189 | # multiply score by 10 190 | score_avg *= 10.0 191 | # append score of an image to the score list 192 | scores.append(score_avg) 193 | return scores 194 | 195 | def compute_score(self, option=None, verbose=0): 196 | # compute idf 197 | if self.df_mode == "corpus": 198 | self.document_frequency = defaultdict(float) 199 | self.compute_doc_freq() 200 | # assert to check document frequency 201 | assert(len(self.ctest) >= max(self.document_frequency.values())) 202 | # import json for now and write the corresponding files 203 | # compute cider score 204 | score = self.compute_cider() 205 | # debug 206 | # print score 207 | return np.mean(np.array(score)), np.array(score) 208 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/ciderD/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/ciderD/ciderD.py: -------------------------------------------------------------------------------- 1 | # Filename: ciderD.py 2 | # 3 | # Description: Describes the class to compute the CIDEr-D (Consensus-Based Image Description Evaluation) Metric 4 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 5 | # 6 | # Creation Date: Sun Feb 8 14:16:54 2015 7 | # 8 | # Authors: Ramakrishna Vedantam and Tsung-Yi Lin 9 | from __future__ import absolute_import 10 | from __future__ import division 11 | from __future__ import print_function 12 | 13 | from .ciderD_scorer import CiderScorer 14 | import pdb 15 | 16 | class CiderD: 17 | """ 18 | Main Class to compute the CIDEr metric 19 | 20 | """ 21 | def __init__(self, n=4, sigma=6.0, df="corpus"): 22 | # set cider to sum over 1 to 4-grams 23 | self._n = n 24 | # set the 
standard deviation parameter for gaussian penalty 25 | self._sigma = sigma 26 | # set which where to compute document frequencies from 27 | self._df = df 28 | self.cider_scorer = CiderScorer(n=self._n, df_mode=self._df) 29 | 30 | def compute_score(self, gts, res): 31 | """ 32 | Main function to compute CIDEr score 33 | :param hypo_for_image (dict) : dictionary with key and value 34 | ref_for_image (dict) : dictionary with key and value 35 | :return: cider (float) : computed CIDEr score for the corpus 36 | """ 37 | 38 | # clear all the previous hypos and refs 39 | tmp_cider_scorer = self.cider_scorer.copy_empty() 40 | tmp_cider_scorer.clear() 41 | for res_id in res: 42 | 43 | hypo = res_id['caption'] 44 | ref = gts[res_id['image_id']] 45 | 46 | # Sanity check. 47 | assert(type(hypo) is list) 48 | assert(len(hypo) == 1) 49 | assert(type(ref) is list) 50 | assert(len(ref) > 0) 51 | tmp_cider_scorer += (hypo[0], ref) 52 | 53 | (score, scores) = tmp_cider_scorer.compute_score() 54 | 55 | return score, scores 56 | 57 | def method(self): 58 | return "CIDEr-D" 59 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/ciderD/ciderD_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import copy 9 | from collections import defaultdict 10 | import numpy as np 11 | import pdb 12 | import math 13 | import six 14 | from six.moves import cPickle 15 | import os 16 | 17 | def precook(s, n=4, out=False): 18 | """ 19 | Takes a string as input and returns an object that can be given to 20 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 21 | can take string arguments as well. 22 | :param s: string : sentence to be converted into ngrams 23 | :param n: int : number of ngrams for which representation is calculated 24 | :return: term frequency vector for occuring ngrams 25 | """ 26 | words = s.split() 27 | counts = defaultdict(int) 28 | for k in range(1,n+1): 29 | for i in range(len(words)-k+1): 30 | ngram = tuple(words[i:i+k]) 31 | counts[ngram] += 1 32 | return counts 33 | 34 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 35 | '''Takes a list of reference sentences for a single segment 36 | and returns an object that encapsulates everything that BLEU 37 | needs to know about them. 38 | :param refs: list of string : reference sentences for some image 39 | :param n: int : number of ngrams for which (ngram) representation is calculated 40 | :return: result (list of dict) 41 | ''' 42 | return [precook(ref, n) for ref in refs] 43 | 44 | def cook_test(test, n=4): 45 | '''Takes a test sentence and returns an object that 46 | encapsulates everything that BLEU needs to know about it. 47 | :param test: list of string : hypothesis sentence for some image 48 | :param n: int : number of ngrams for which (ngram) representation is calculated 49 | :return: result (dict) 50 | ''' 51 | return precook(test, n, True) 52 | 53 | class CiderScorer(object): 54 | """CIDEr scorer. 
55 | """ 56 | 57 | def copy(self): 58 | ''' copy the refs.''' 59 | new = CiderScorer(n=self.n) 60 | new.ctest = copy.copy(self.ctest) 61 | new.crefs = copy.copy(self.crefs) 62 | return new 63 | 64 | def copy_empty(self): 65 | new = CiderScorer(df_mode="corpus", n=self.n, sigma=self.sigma) 66 | new.df_mode = self.df_mode 67 | new.ref_len = self.ref_len 68 | new.document_frequency = self.document_frequency 69 | return new 70 | 71 | def __init__(self, df_mode="corpus", test=None, refs=None, n=4, sigma=6.0): 72 | ''' singular instance ''' 73 | self.n = n 74 | self.sigma = sigma 75 | self.crefs = [] 76 | self.ctest = [] 77 | self.df_mode = df_mode 78 | self.ref_len = None 79 | if self.df_mode != "corpus": 80 | pkl_file = cPickle.load(open(df_mode,'rb'), **(dict(encoding='latin1') if six.PY3 else {})) 81 | self.ref_len = np.log(float(pkl_file['ref_len'])) 82 | self.document_frequency = pkl_file['document_frequency'] 83 | else: 84 | self.document_frequency = None 85 | self.cook_append(test, refs) 86 | 87 | def clear(self): 88 | self.crefs = [] 89 | self.ctest = [] 90 | 91 | def cook_append(self, test, refs): 92 | '''called by constructor and __iadd__ to avoid creating new instances.''' 93 | 94 | if refs is not None: 95 | self.crefs.append(cook_refs(refs)) 96 | if test is not None: 97 | self.ctest.append(cook_test(test)) ## N.B.: -1 98 | else: 99 | self.ctest.append(None) # lens of crefs and ctest have to match 100 | 101 | def size(self): 102 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 103 | return len(self.crefs) 104 | 105 | def __iadd__(self, other): 106 | '''add an instance (e.g., from another sentence).''' 107 | 108 | if type(other) is tuple: 109 | ## avoid creating new CiderScorer instances 110 | self.cook_append(other[0], other[1]) 111 | else: 112 | self.ctest.extend(other.ctest) 113 | self.crefs.extend(other.crefs) 114 | 115 | return self 116 | def compute_doc_freq(self): 117 | ''' 118 | Compute term frequency for reference data. 119 | This will be used to compute idf (inverse document frequency later) 120 | The term frequency is stored in the object 121 | :return: None 122 | ''' 123 | for refs in self.crefs: 124 | # refs, k ref captions of one image 125 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.items()]): 126 | self.document_frequency[ngram] += 1 127 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 128 | 129 | def compute_cider(self): 130 | def counts2vec(cnts): 131 | """ 132 | Function maps counts of ngram to vector of tfidf weights. 133 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 134 | The n-th entry of array denotes length of n-grams. 135 | :param cnts: 136 | :return: vec (array of dict), norm (array of float), length (int) 137 | """ 138 | vec = [defaultdict(float) for _ in range(self.n)] 139 | length = 0 140 | norm = [0.0 for _ in range(self.n)] 141 | for (ngram,term_freq) in cnts.items(): 142 | # give word count 1 if it doesn't appear in reference corpus 143 | df = np.log(max(1.0, self.document_frequency[ngram])) 144 | # ngram index 145 | n = len(ngram)-1 146 | # tf (term_freq) * idf (precomputed idf) for n-grams 147 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 148 | # compute norm for the vector. 
the norm will be used for computing similarity 149 | norm[n] += pow(vec[n][ngram], 2) 150 | 151 | if n == 1: 152 | length += term_freq 153 | norm = [np.sqrt(n) for n in norm] 154 | return vec, norm, length 155 | 156 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 157 | ''' 158 | Compute the cosine similarity of two vectors. 159 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 160 | :param vec_ref: array of dictionary for vector corresponding to reference 161 | :param norm_hyp: array of float for vector corresponding to hypothesis 162 | :param norm_ref: array of float for vector corresponding to reference 163 | :param length_hyp: int containing length of hypothesis 164 | :param length_ref: int containing length of reference 165 | :return: array of score for each n-grams cosine similarity 166 | ''' 167 | delta = float(length_hyp - length_ref) 168 | # measure consine similarity 169 | val = np.array([0.0 for _ in range(self.n)]) 170 | for n in range(self.n): 171 | # ngram 172 | for (ngram,count) in vec_hyp[n].items(): 173 | # vrama91 : added clipping 174 | val[n] += min(vec_hyp[n][ngram], vec_ref[n][ngram]) * vec_ref[n][ngram] 175 | 176 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 177 | val[n] /= (norm_hyp[n]*norm_ref[n]) 178 | 179 | assert(not math.isnan(val[n])) 180 | # vrama91: added a length based gaussian penalty 181 | val[n] *= np.e**(-(delta**2)/(2*self.sigma**2)) 182 | return val 183 | 184 | # compute log reference length 185 | if self.df_mode == "corpus": 186 | self.ref_len = np.log(float(len(self.crefs))) 187 | #elif self.df_mode == "coco-val-df": 188 | # if coco option selected, use length of coco-val set 189 | # self.ref_len = np.log(float(40504)) 190 | 191 | scores = [] 192 | for test, refs in zip(self.ctest, self.crefs): 193 | # compute vector for test captions 194 | vec, norm, length = counts2vec(test) 195 | # compute vector for ref captions 196 | score = np.array([0.0 for _ in range(self.n)]) 197 | for ref in refs: 198 | vec_ref, norm_ref, length_ref = counts2vec(ref) 199 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 200 | # change by vrama91 - mean of ngram scores, instead of sum 201 | score_avg = np.mean(score) 202 | # divide by number of references 203 | score_avg /= len(refs) 204 | # multiply score by 10 205 | score_avg *= 10.0 206 | # append score of an image to the score list 207 | scores.append(score_avg) 208 | return scores 209 | 210 | def compute_score(self, option=None, verbose=0): 211 | # compute idf 212 | if self.df_mode == "corpus": 213 | self.document_frequency = defaultdict(float) 214 | self.compute_doc_freq() 215 | # assert to check document frequency 216 | assert(len(self.ctest) >= max(self.document_frequency.values())) 217 | # import json for now and write the corresponding files 218 | # compute cider score 219 | score = self.compute_cider() 220 | # debug 221 | # print score 222 | return np.mean(np.array(score)), np.array(score) 223 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/logger.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | import logging 4 | from logging import StreamHandler, Handler, getLevelName 5 | import os 6 | import sys 7 | 8 | 9 | # this class is a copy of logging.FileHandler except we end self.close() 10 | # at the end of each emit. 
While closing file and reopening file after each 11 | # write is not efficient, it allows us to see partial logs when writing to 12 | # fused Azure blobs, which is very convenient 13 | class FileHandler(StreamHandler): 14 | """ 15 | A handler class which writes formatted logging records to disk files. 16 | """ 17 | def __init__(self, filename, mode='a', encoding=None, delay=False): 18 | """ 19 | Open the specified file and use it as the stream for logging. 20 | """ 21 | # Issue #27493: add support for Path objects to be passed in 22 | filename = os.fspath(filename) 23 | #keep the absolute path, otherwise derived classes which use this 24 | #may come a cropper when the current directory changes 25 | self.baseFilename = os.path.abspath(filename) 26 | self.mode = mode 27 | self.encoding = encoding 28 | self.delay = delay 29 | if delay: 30 | #We don't open the stream, but we still need to call the 31 | #Handler constructor to set level, formatter, lock etc. 32 | Handler.__init__(self) 33 | self.stream = None 34 | else: 35 | StreamHandler.__init__(self, self._open()) 36 | 37 | def close(self): 38 | """ 39 | Closes the stream. 40 | """ 41 | self.acquire() 42 | try: 43 | try: 44 | if self.stream: 45 | try: 46 | self.flush() 47 | finally: 48 | stream = self.stream 49 | self.stream = None 50 | if hasattr(stream, "close"): 51 | stream.close() 52 | finally: 53 | # Issue #19523: call unconditionally to 54 | # prevent a handler leak when delay is set 55 | StreamHandler.close(self) 56 | finally: 57 | self.release() 58 | 59 | def _open(self): 60 | """ 61 | Open the current base file with the (original) mode and encoding. 62 | Return the resulting stream. 63 | """ 64 | return open(self.baseFilename, self.mode, encoding=self.encoding) 65 | 66 | def emit(self, record): 67 | """ 68 | Emit a record. 69 | 70 | If the stream was not opened because 'delay' was specified in the 71 | constructor, open it before calling the superclass's emit. 72 | """ 73 | if self.stream is None: 74 | self.stream = self._open() 75 | StreamHandler.emit(self, record) 76 | self.close() 77 | 78 | def __repr__(self): 79 | level = getLevelName(self.level) 80 | return '<%s %s (%s)>' % (self.__class__.__name__, self.baseFilename, level) 81 | 82 | 83 | def setup_logger(name, save_dir, distributed_rank, filename="log.txt"): 84 | logger = logging.getLogger(name) 85 | logger.setLevel(logging.DEBUG) 86 | # don't log results for the non-master process 87 | if distributed_rank > 0: 88 | return logger 89 | ch = logging.StreamHandler(stream=sys.stdout) 90 | ch.setLevel(logging.DEBUG) 91 | formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s") 92 | ch.setFormatter(formatter) 93 | logger.addHandler(ch) 94 | 95 | if save_dir: 96 | fh = FileHandler(os.path.join(save_dir, filename)) 97 | fh.setLevel(logging.DEBUG) 98 | fh.setFormatter(formatter) 99 | logger.addHandler(fh) 100 | 101 | return logger 102 | 103 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/metric_logger.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved. 2 | from collections import defaultdict 3 | from collections import deque 4 | import os 5 | 6 | import torch 7 | 8 | from .misc import is_main_process 9 | 10 | 11 | class SmoothedValue(object): 12 | """Track a series of values and provide access to smoothed values over a 13 | window or the global series average. 
14 | """ 15 | 16 | def __init__(self, window_size=20): 17 | self.deque = deque(maxlen=window_size) 18 | # self.series = [] 19 | self.total = 0.0 20 | self.count = 0 21 | 22 | def update(self, value): 23 | self.deque.append(value) 24 | # self.series.append(value) 25 | self.count += 1 26 | self.total += value 27 | 28 | @property 29 | def median(self): 30 | d = torch.tensor(list(self.deque)) 31 | return d.median().item() 32 | 33 | @property 34 | def avg(self): 35 | d = torch.tensor(list(self.deque)) 36 | return d.mean().item() 37 | 38 | @property 39 | def global_avg(self): 40 | return self.total / self.count 41 | 42 | @property 43 | def last_value(self): 44 | return self.deque[-1] 45 | 46 | 47 | class MetricLogger(object): 48 | def __init__(self, delimiter="\t"): 49 | self.meters = {} 50 | self.params = {} 51 | self.delimiter = delimiter 52 | 53 | def update_params(self, update_dict): 54 | for param_group, group_dict in update_dict.items(): 55 | if param_group not in self.params: 56 | self.params[param_group] = {} 57 | for param_name, param_value in group_dict.items(): 58 | # skipping parameters if they start with '_' 59 | if param_name.startswith('_'): 60 | continue 61 | if isinstance(param_value, torch.Tensor): 62 | param_value = param_value.item() 63 | assert isinstance(param_value, (float, int)) 64 | self.params[param_group][param_name] = param_value 65 | 66 | def update_metrics(self, update_dict): 67 | for metric_group, group_dict in update_dict.items(): 68 | if metric_group not in self.meters: 69 | self.meters[metric_group] = defaultdict(SmoothedValue) 70 | for metric_name, metric_value in group_dict.items(): 71 | # skipping metrics if they start with '_' 72 | if metric_name.startswith('_'): 73 | continue 74 | if isinstance(metric_value, torch.Tensor): 75 | metric_value = metric_value.item() 76 | assert isinstance(metric_value, (float, int)) 77 | self.meters[metric_group][metric_name].update(metric_value) 78 | 79 | def get_logs(self, iteration): 80 | return_str = [] 81 | if len(self.meters) > 0: 82 | offset_m = max([len(group_name) for group_name in self.meters.keys()]) 83 | else: 84 | offset_m = 0 85 | if len(self.params) > 0: 86 | offset_p = max([len(group_name) for group_name in self.params.keys()]) 87 | else: 88 | offset_p = 0 89 | offset = max(offset_m, offset_p) 90 | 91 | for group_name, values in sorted(self.meters.items(), 92 | key=lambda x: x[0]): 93 | loss_str = [] 94 | for name, meter in values.items(): 95 | loss_str.append("{}: {:.4f} ({:.4f})".format( 96 | name, meter.median, meter.global_avg, 97 | )) 98 | return_str.append( 99 | "{:{offset}s} - {}".format( 100 | group_name, self.delimiter.join(loss_str), offset=offset, 101 | ), 102 | ) 103 | for group_name, values in self.params.items(): 104 | loss_str = [] 105 | for name, param in values.items(): 106 | loss_str.append("{}: {:.6f}".format(name, param)) 107 | return_str.append( 108 | "{:{offset}s} - {}".format( 109 | group_name, self.delimiter.join(loss_str), offset=offset, 110 | ), 111 | ) 112 | return "\n ".join(return_str) 113 | 114 | 115 | class TensorboardLogger(MetricLogger): 116 | def __init__(self, 117 | log_dir, 118 | delimiter='\t'): 119 | super(TensorboardLogger, self).__init__(delimiter) 120 | try: 121 | from tensorboardX import SummaryWriter 122 | except ImportError: 123 | raise ImportError( 124 | 'To use tensorboard please install tensorboardX ' 125 | '[ pip install tensorboardx ].' 
126 | ) 127 | self.philly_tb_logger = None 128 | self.philly_tb_logger_avg = None 129 | self.philly_tb_logger_med = None 130 | if is_main_process(): 131 | self.tb_logger = SummaryWriter(log_dir) 132 | self.tb_logger_avg = SummaryWriter(os.path.join(log_dir, 'avg')) 133 | self.tb_logger_med = SummaryWriter(os.path.join(log_dir, 'med')) 134 | else: 135 | self.tb_logger = None 136 | self.tb_logger_avg = None 137 | self.tb_logger_med = None 138 | 139 | def get_logs(self, iteration): 140 | if self.tb_logger: 141 | for group_name, values in self.meters.items(): 142 | for name, meter in values.items(): 143 | self.tb_logger.add_scalar( 144 | '{}/{}'.format(group_name, name), 145 | meter.last_value, iteration, 146 | ) 147 | self.tb_logger_avg.add_scalar( 148 | '{}/{}'.format(group_name, name), 149 | meter.avg, iteration, 150 | ) 151 | self.tb_logger_med.add_scalar( 152 | '{}/{}'.format(group_name, name), 153 | meter.median, iteration, 154 | ) 155 | if self.philly_tb_logger: 156 | self.philly_tb_logger.add_scalar( 157 | '{}/{}'.format(group_name, name), 158 | meter.last_value, iteration, 159 | ) 160 | self.philly_tb_logger_avg.add_scalar( 161 | '{}/{}'.format(group_name, name), 162 | meter.avg, iteration, 163 | ) 164 | self.philly_tb_logger_med.add_scalar( 165 | '{}/{}'.format(group_name, name), 166 | meter.median, iteration, 167 | ) 168 | for group_name, values in self.params.items(): 169 | for name, param in values.items(): 170 | self.tb_logger.add_scalar( 171 | '{}/{}'.format(group_name, name), 172 | param, iteration, 173 | ) 174 | if self.philly_tb_logger: 175 | self.philly_tb_logger.add_scalar( 176 | '{}/{}'.format(group_name, name), 177 | param, iteration, 178 | ) 179 | return super(TensorboardLogger, self).get_logs(iteration) 180 | 181 | def close(self): 182 | if is_main_process(): 183 | self.tb_logger.close() 184 | self.tb_logger_avg.close() 185 | self.tb_logger_med.close() 186 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/misc.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | import errno 4 | import os 5 | import os.path as op 6 | import yaml 7 | import random 8 | import torch 9 | import numpy as np 10 | import torch.distributed as dist 11 | 12 | 13 | def mkdir(path): 14 | # if it is the current folder, skip. 
15 | if path == '': 16 | return 17 | try: 18 | os.makedirs(path) 19 | except OSError as e: 20 | if e.errno != errno.EEXIST: 21 | raise 22 | 23 | 24 | def set_seed(seed, n_gpu): 25 | random.seed(seed) 26 | np.random.seed(seed) 27 | torch.manual_seed(seed) 28 | if n_gpu > 0: 29 | torch.cuda.manual_seed_all(seed) 30 | 31 | 32 | def load_from_yaml_file(yaml_file): 33 | with open(yaml_file, 'r') as fp: 34 | return yaml.load(fp) 35 | 36 | 37 | def find_file_path_in_yaml(fname, root): 38 | if fname is not None: 39 | if op.isfile(fname): 40 | return fname 41 | elif op.isfile(op.join(root, fname)): 42 | return op.join(root, fname) 43 | else: 44 | raise FileNotFoundError( 45 | errno.ENOENT, os.strerror(errno.ENOENT), op.join(root, fname) 46 | ) 47 | 48 | 49 | def get_rank(): 50 | if not dist.is_available(): 51 | return 0 52 | if not dist.is_initialized(): 53 | return 0 54 | return dist.get_rank() 55 | 56 | 57 | def is_main_process(): 58 | return get_rank() == 0 59 | 60 | 61 | def get_world_size(): 62 | if not dist.is_available(): 63 | return 1 64 | if not dist.is_initialized(): 65 | return 1 66 | return dist.get_world_size() 67 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/task_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | from __future__ import absolute_import, division, print_function 4 | 5 | import csv, json 6 | import logging 7 | import os 8 | import sys 9 | from io import open 10 | import _pickle as cPickle 11 | import torch 12 | 13 | logger = logging.getLogger(__name__) 14 | 15 | 16 | class InputInstance(object): 17 | """A single training/test example for simple sequence classification.""" 18 | 19 | def __init__(self, guid, text_a, text_b=None, label=None, score=None, img_key=None, q_id=None): 20 | """Constructs a InputExample. 21 | 22 | Args: 23 | guid: Unique id for the example. 24 | text_a: string. The untokenized text of the first sequence. For single 25 | sequence tasks, only this sequence must be specified. 26 | text_b: (Optional) string. The untokenized text of the second sequence. 27 | Only must be specified for sequence pair tasks. 28 | label: (Optional) string. The label of the example. This should be 29 | specified for train and dev examples, but not for test examples. 
30 |         """
31 | 
32 |         self.guid = guid
33 |         self.text_a = text_a
34 |         self.text_b = text_b
35 |         self.label = label
36 |         self.score = score
37 |         self.img_key = img_key
38 |         self.q_id = q_id
39 | 
40 | 
41 | class InputFeat(object):
42 |     """A single set of features of data."""
43 | 
44 |     def __init__(self, input_ids, input_mask, segment_ids, label_id, score, img_feat):
45 |         self.input_ids = input_ids
46 |         self.input_mask = input_mask
47 |         self.segment_ids = segment_ids
48 |         self.label_id = label_id
49 |         self.score = score
50 |         self.img_feat = img_feat
51 | 
52 | 
53 | class DataProcessor(object):
54 |     """Base class for data converters for sequence classification data sets."""
55 | 
56 |     def get_train_examples(self, data_dir):
57 |         """Gets a collection of `InputExample`s for the train set."""
58 |         raise NotImplementedError()
59 | 
60 |     def get_dev_examples(self, data_dir):
61 |         """Gets a collection of `InputExample`s for the dev set."""
62 |         raise NotImplementedError()
63 | 
64 |     def get_labels(self):
65 |         """Gets the list of labels for this data set."""
66 |         raise NotImplementedError()
67 | 
68 |     @classmethod
69 |     def _read_tsv(cls, input_file, quotechar=None):
70 |         """Reads a tab separated value file."""
71 |         with open(input_file, "r", encoding="utf-8-sig") as f:
72 |             reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
73 |             lines = []
74 |             for line in reader:
75 |                 if sys.version_info[0] == 2:
76 |                     line = list(unicode(cell, 'utf-8') for cell in line)
77 |                 lines.append(line)
78 |             return lines
79 | 
80 | 
81 | class VQATextProcessor(DataProcessor):
82 |     """ Processor for the VQA Text data set. """
83 | 
84 |     def get_train_examples(self, data_dir, file_name='train2014_qla.json'):
85 |         """ See base class."""
86 | 
87 |         lines = json.load(open(os.path.join(data_dir, file_name)))
88 |         declars = open(os.path.join(data_dir, 'train2014_declarative.json'), "r").read().split("\n")
89 |         assert len(lines) == len(declars)
90 |         return self._create_examples(lines, declars, "train")
91 | 
92 |         #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train")
93 | 
94 |     def get_dev_examples(self, data_dir, file_name='val2014_qla.json'):
95 |         """ See base class."""
96 | 
97 |         lines = json.load(open(os.path.join(data_dir, file_name)))
98 |         declars = open(os.path.join(data_dir, 'val2014_declarative.json'), "r").read().split("\n")
99 |         assert len(lines) == len(declars)
100 |         return self._create_examples(lines, declars, "dev")
101 | 
102 |         #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev")
103 | 
104 |     def get_test_examples(self, data_dir, file_name='test2015_qla.json'):
105 |         """ See base class."""
106 |         # _create_examples expects one declaration per question; the declaration file name here is assumed, mirroring the train/dev loaders
107 |         declars = open(os.path.join(data_dir, 'test2015_declarative.json'), "r").read().split("\n")
108 |         lines = json.load(open(os.path.join(data_dir, file_name)))
109 |         return self._create_examples(lines, declars, "test")
110 |     def get_labels(self, label_file):
111 |         """ See base class."""
112 | 
113 |         ans2label = cPickle.load(open(label_file, 'rb'))
114 |         return list(range(len(ans2label)))
115 |         # return list(ans2label.values())
116 |         #return ["entailment", "not_entailment"]
117 | 
118 |     def _create_examples(self, lines, declars, set_type):
119 |         """Creates examples for the training and dev sets."""
120 | 
121 |         examples = []
122 |         for (i, (line, declar)) in enumerate(zip(lines, declars)):
123 |             if set_type!='test' and len(line['an']) == 0: continue
124 | 
125 |             guid = "%s-%s" % (set_type, str(i))
126 |             text_a = line['q']
127 |             text_b = line['o'].replace(';', ' ').strip() #line['o']
128 |             label = None if set_type.startswith('test') else line['an']
129
| score = None if set_type.startswith('test') else line['s'] 130 | img_key = line['img_id'] 131 | q_id = i # int(line['q_id']) if set_type.startswith('test') else 0 132 | 133 | exam = InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, 134 | q_id=q_id) 135 | exam.text_c = declar 136 | examples.append(exam) 137 | return examples 138 | 139 | class VQATextAProcessor(DataProcessor): 140 | """ Processor for the VQA Text data set. """ 141 | 142 | def get_train_examples(self, data_dir, file_name='train2014_qla.json'): 143 | """ See base class.""" 144 | 145 | lines = json.load(open(os.path.join(data_dir, file_name))) 146 | return self._create_examples(lines, "train") 147 | 148 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train") 149 | 150 | def get_dev_examples(self, data_dir, file_name='val2014_qla.json'): 151 | """ See base class.""" 152 | 153 | lines = json.load(open(os.path.join(data_dir, file_name))) 154 | return self._create_examples(lines, "dev") 155 | 156 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev") 157 | 158 | def get_test_examples(self, data_dir, file_name='test2015_qla.json'): 159 | """ See base class.""" 160 | 161 | lines = json.load(open(os.path.join(data_dir, file_name))) 162 | return self._create_examples(lines, "test") 163 | 164 | def get_labels(self, label_file): 165 | """ See base class.""" 166 | 167 | ans2label = cPickle.load(open(label_file, 'rb')) 168 | return list(ans2label.values()) 169 | 170 | def _create_examples(self, lines, set_type): 171 | """Creates examples for the training and dev sets.""" 172 | 173 | examples = [] 174 | for (i, line) in enumerate(lines): 175 | if set_type!='test' and len(line['an']) == 0: continue 176 | 177 | guid = "%s-%s" % (set_type, str(i)) 178 | text_a = line['q'] 179 | text_b = None # line['o'] # or None 180 | label = None if set_type.startswith('test') else line['an'] 181 | score = None if set_type.startswith('test') else line['s'] 182 | img_key = line['img_id'] 183 | q_id = int(line['q_id']) if set_type.startswith('test') else 0 184 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, q_id=q_id)) 185 | return examples 186 | 187 | class GQAProcessor(DataProcessor): 188 | """ Processor for the GQA data set. 
""" 189 | 190 | def get_train_examples(self, data_dir, file_name='train2014_qla.json'): 191 | """ See base class.""" 192 | 193 | lines = json.load(open(os.path.join(data_dir, file_name))) 194 | return self._create_examples(lines, "train") 195 | 196 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train") 197 | 198 | def get_dev_examples(self, data_dir, file_name='val2014_qla.json'): 199 | """ See base class.""" 200 | 201 | lines = json.load(open(os.path.join(data_dir, file_name))) 202 | return self._create_examples(lines, "dev") 203 | 204 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev") 205 | 206 | def get_test_examples(self, data_dir, file_name='test2015_qla.json'): 207 | """ See base class.""" 208 | 209 | lines = json.load(open(os.path.join(data_dir, file_name))) 210 | return self._create_examples(lines, "test") 211 | 212 | def get_labels(self, label_file='trainval_testdev_all_ans2label.pkl'): 213 | """ See base class.""" 214 | 215 | ans2label = cPickle.load(open(label_file, 'rb')) 216 | return list(ans2label.values()) 217 | 218 | def _create_examples(self, lines, set_type): 219 | """Creates examples for the training and dev sets.""" 220 | 221 | examples = [] 222 | for (i, line) in enumerate(lines): 223 | if set_type!='test' and len(line['an']) == 0: continue 224 | 225 | guid = "%s-%s" % (set_type, str(i)) 226 | text_a = line['q'] 227 | text_b = line['o'] # or None 228 | label = None if set_type.startswith('test') else line['an'] 229 | score = 0 230 | img_key = line['img_id'] 231 | q_id = int(line['q_id']) # if set_type.startswith('test') else 0 232 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, q_id=q_id)) 233 | return examples 234 | 235 | class NLVRProcessor(DataProcessor): 236 | """ Processor for the NLVR data set. """ 237 | 238 | def get_train_examples(self, data_dir, use_label_seq=True, file_name='nlvr2_train.json'): 239 | """ See base class.""" 240 | 241 | lines = json.load(open(os.path.join(data_dir, file_name))) 242 | return self._create_examples(lines, "train", use_label_seq) 243 | 244 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train") 245 | 246 | def get_dev_examples(self, data_dir, use_label_seq=True, file_name='nlvr2_dev.json'): 247 | """ See base class.""" 248 | 249 | lines = json.load(open(os.path.join(data_dir, file_name))) 250 | return self._create_examples(lines, "dev", use_label_seq) 251 | 252 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev") 253 | 254 | def get_test_examples(self, data_dir, use_label_seq=True, file_name='nlvr2_test1.json'): 255 | """ See base class.""" 256 | 257 | lines = json.load(open(os.path.join(data_dir, file_name))) 258 | return self._create_examples(lines, "test", use_label_seq) 259 | 260 | def get_labels(self, label_file=None): 261 | """ See base class.""" 262 | 263 | #ans2label = cPickle.load(open(label_file, 'rb')) 264 | #return list(ans2label.values()) 265 | return [0, 1] 266 | 267 | def _create_examples(self, lines, set_type, use_label_seq=True): 268 | """ Creates examples for the training and dev sets. 
""" 269 | 270 | examples = [] 271 | for (i, line) in enumerate(lines): 272 | guid = "%s-%s" % (set_type, str(i)) 273 | text_a = line['q'] 274 | text_b = line['o'] if use_label_seq else None 275 | label = line['label'] #None if set_type.startswith('test') else line['label'] 276 | score = 0 277 | img_key = line['img_id'] #[line['img_left'], line['img_left']] 278 | q_id = 0 #int(line['q_id']) if set_type.startswith('test') else 0 279 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, q_id=q_id)) 280 | return examples 281 | 282 | class VCR_Q_A_Processor(DataProcessor): 283 | """ Processor for the VCR (q -> a) (Det) data set. """ 284 | 285 | def get_train_examples(self, data_dir, file_name='vcr_train.json'): 286 | """ See base class.""" 287 | 288 | lines = json.load(open(os.path.join(data_dir, file_name))) 289 | return self._create_examples(lines, "train") 290 | 291 | def get_dev_examples(self, data_dir, file_name='vcr_val.json'): 292 | """ See base class.""" 293 | 294 | lines = json.load(open(os.path.join(data_dir, file_name))) 295 | return self._create_examples(lines, "dev") 296 | 297 | def get_test_examples(self, data_dir, file_name='vcr_test.json'): 298 | """ See base class.""" 299 | 300 | lines = json.load(open(os.path.join(data_dir, file_name))) 301 | return self._create_examples(lines, "test") 302 | 303 | def get_labels(self, label_file=None): 304 | """ See base class.""" 305 | 306 | #ans2label = cPickle.load(open(label_file, 'rb')) 307 | #return list(ans2label.values()) 308 | return [0, 1] 309 | 310 | def _create_examples(self, lines, set_type): 311 | """ Creates examples for the training and dev sets. """ 312 | 313 | examples = [] 314 | for (i, line) in enumerate(lines): 315 | #if set_type!='test': continue 316 | 317 | guid = "%s-%s" % (set_type, str(i)) 318 | text_a = line['q'] # question 319 | choices = line['choices'] 320 | label = None if set_type.startswith('test') else line['label'] 321 | img_key = line['img_id'] 322 | q_id = int(line['annot_id'].split('-')[-1]) #int(line['q_id']) if set_type.startswith('test') else 0 323 | score = line['objects'] if 'objects' in line else None 324 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=score, img_key=img_key, q_id=q_id)) 325 | return examples 326 | 327 | class VCR_QA_R_Processor(DataProcessor): 328 | """ Processor for the VCR (qa -> r) QA_R data set. """ 329 | 330 | def get_train_examples(self, data_dir, file_name='vcr_train.json'): 331 | """ See base class.""" 332 | 333 | lines = json.load(open(os.path.join(data_dir, file_name))) 334 | return self._create_examples(lines, "train") 335 | 336 | def get_dev_examples(self, data_dir, file_name='vcr_val.json'): 337 | """ See base class.""" 338 | 339 | lines = json.load(open(os.path.join(data_dir, file_name))) 340 | return self._create_examples(lines, "dev") 341 | 342 | def get_test_examples(self, data_dir, file_name='vcr_test.json'): 343 | """ See base class.""" 344 | 345 | lines = json.load(open(os.path.join(data_dir, file_name))) 346 | return self._create_examples(lines, "test") 347 | 348 | def get_labels(self, label_file=None): 349 | """ See base class.""" 350 | 351 | #ans2label = cPickle.load(open(label_file, 'rb')) 352 | #return list(ans2label.values()) 353 | return [0, 1] 354 | 355 | def _create_examples(self, lines, set_type): 356 | """ Creates examples for the training and dev sets. 
""" 357 | 358 | examples = [] 359 | for (i, line) in enumerate(lines): 360 | #if set_type!='test': continue 361 | 362 | guid = "%s-%s" % (set_type, str(i)) 363 | text_a = line['q'] + ' ' + line['choices'][line['label']] # question_choice 364 | choices = line['rational_choices'] # rational_choice 365 | label = None if set_type.startswith('test') else line['rational_label'] # rational_label 366 | img_key = line['img_id'] 367 | q_id = int(line['annot_id'].split('-')[-1]) #int(line['q_id']) if set_type.startswith('test') else 0 368 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=None, img_key=img_key, q_id=q_id)) 369 | return examples 370 | 371 | class VCR_QAR_Processor(DataProcessor): 372 | """ Processor for the VCR (q->a, qa->r) data set. """ 373 | 374 | def get_train_examples(self, data_dir, file_name='vcr_train.json'): 375 | """ See base class.""" 376 | 377 | lines = json.load(open(os.path.join(data_dir, file_name))) 378 | return self._create_examples(lines, "train") 379 | 380 | def get_dev_examples(self, data_dir, file_name='vcr_val.json'): 381 | """ See base class.""" 382 | 383 | lines = json.load(open(os.path.join(data_dir, file_name))) 384 | return self._create_examples(lines, "dev") 385 | 386 | def get_test_examples(self, data_dir, file_name='vcr_test.json'): 387 | """ See base class.""" 388 | 389 | lines = json.load(open(os.path.join(data_dir, file_name))) 390 | return self._create_examples(lines, "test") 391 | 392 | def get_labels(self, label_file=None): 393 | """ See base class.""" 394 | 395 | #ans2label = cPickle.load(open(label_file, 'rb')) 396 | #return list(ans2label.values()) 397 | return [0, 1] 398 | 399 | def _create_examples(self, lines, set_type): 400 | """ Creates examples for the training and dev sets. 
""" 401 | 402 | examples = [] 403 | for (i, line) in enumerate(lines): 404 | #if set_type!='test': continue 405 | 406 | guid = "%s-%s-q-a" % (set_type, str(i)) 407 | text_a = line['q'] # question 408 | choices = line['choices'] 409 | label = None if set_type.startswith('test') else line['label'] 410 | img_key = line['img_id'] 411 | q_id = int(line['annot_id'].split('-')[-1]) #int(line['q_id']) if set_type.startswith('test') else 0 412 | score = line['objects'] if 'objects' in line else None 413 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=score, img_key=img_key, q_id=q_id)) 414 | 415 | if set_type == 'train': # qa -> r 416 | guid = "%s-%s-qa-r" % (set_type, str(i)) 417 | text_a = line['q'] + ' ' + line['choices'][line['label']] # question_choice 418 | choices = line['rational_choices'] # rational_choice 419 | label = None if set_type.startswith('test') else line['rational_label'] # rational_label 420 | img_key = line['img_id'] 421 | q_id = int(line['annot_id'].split('-')[-1]) # int(line['q_id']) if set_type.startswith('test') else 0 422 | score = line['objects'] if 'objects' in line else None 423 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=score, img_key=img_key, q_id=q_id)) 424 | return examples 425 | 426 | 427 | def convert_examples_to_features_vqa(examples, img_feats, label_list, max_img_seq_length, max_seq_length, 428 | tokenizer, output_mode, 429 | cls_token_at_end=False, pad_on_left=False, 430 | cls_token='[CLS]', sep_token='[SEP]', pad_token=0, 431 | sequence_a_segment_id=0, sequence_b_segment_id=1, 432 | cls_token_segment_id=1, pad_token_segment_id=0, 433 | mask_padding_with_zero=True): 434 | """ Loads a data file into a list of `InputBatch`s 435 | `cls_token_at_end` define the location of the CLS token: 436 | - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP] 437 | - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS] 438 | `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet) 439 | """ 440 | 441 | label_map = {label:i for i, label in enumerate(label_list)} 442 | 443 | features = [] 444 | #debug: 445 | debug_size = 500 446 | 447 | for (ex_index, example) in enumerate(examples[0: ]): 448 | if len(example.label) == 0: continue 449 | if ex_index % 10000 == 0: 450 | logger.info("Writing example %d of %d" % (ex_index, len(examples))) 451 | 452 | tokens_a = tokenizer.tokenize(example.text_a) 453 | 454 | tokens_b = None 455 | if example.text_b: 456 | tokens_b = tokenizer.tokenize(example.text_b) 457 | # Modifies `tokens_a` and `tokens_b` in place so that the total 458 | # length is less than the specified length. 459 | # Account for [CLS], [SEP], [SEP] with "- 3" 460 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 461 | else: 462 | # Account for [CLS] and [SEP] with "- 2" 463 | if len(tokens_a) > max_seq_length - 2: 464 | tokens_a = tokens_a[:(max_seq_length - 2)] 465 | 466 | # The convention in BERT is: 467 | # (a) For sequence pairs: 468 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 469 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 470 | # (b) For single sequences: 471 | # tokens: [CLS] the dog is hairy . [SEP] 472 | # type_ids: 0 0 0 0 0 0 0 473 | # 474 | # Where "type_ids" are used to indicate whether this is the first 475 | # sequence or the second sequence. 
The embedding vectors for `type=0` and 476 | # `type=1` were learned during pre-training and are added to the wordpiece 477 | # embedding vector (and position vector). This is not *strictly* necessary 478 | # since the [SEP] token unambiguously separates the sequences, but it makes 479 | # it easier for the model to learn the concept of sequences. 480 | # 481 | # For classification tasks, the first vector (corresponding to [CLS]) is 482 | # used as as the "sentence vector". Note that this only makes sense because 483 | # the entire model is fine-tuned. 484 | tokens = tokens_a + [sep_token] 485 | segment_ids = [sequence_a_segment_id] * len(tokens) 486 | 487 | if tokens_b: 488 | tokens += tokens_b + [sep_token] 489 | segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1) 490 | 491 | if cls_token_at_end: 492 | tokens = tokens + [cls_token] 493 | segment_ids = segment_ids + [cls_token_segment_id] 494 | else: 495 | tokens = [cls_token] + tokens 496 | segment_ids = [cls_token_segment_id] + segment_ids 497 | 498 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 499 | 500 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 501 | # tokens are attended to. 502 | input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids) 503 | 504 | # Zero-pad up to the sequence length. 505 | padding_length = max_seq_length - len(input_ids) 506 | if pad_on_left: 507 | input_ids = ([pad_token] * padding_length) + input_ids 508 | input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask 509 | segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids 510 | else: 511 | input_ids = input_ids + ([pad_token] * padding_length) 512 | input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length) 513 | segment_ids = segment_ids + ([pad_token_segment_id] * padding_length) 514 | 515 | assert len(input_ids) == max_seq_length 516 | assert len(input_mask) == max_seq_length 517 | assert len(segment_ids) == max_seq_length 518 | 519 | # image features 520 | #img_feat = img_feats[example.img_key] # torch 521 | img_feat = img_feats.item().get(example.img_key) # numpy 522 | if img_feat.shape[0] > max_img_seq_length: 523 | img_feat = img_feat[0:max_img_seq_length, ] 524 | if max_img_seq_length > 0: 525 | input_mask = input_mask + [1 if mask_padding_with_zero else 0] * img_feat.shape[0] 526 | #segment_ids += [sequence_b_segment_id] * img_feat.shape[0] 527 | else: 528 | if max_img_seq_length > 0: 529 | input_mask = input_mask + [1 if mask_padding_with_zero else 0] * img_feat.shape[0] 530 | #segment_ids = segment_ids + [sequence_b_segment_id] * img_feat.shape[0] 531 | padding_matrix = torch.zeros((max_img_seq_length - img_feat.shape[0], img_feat.shape[1])) 532 | img_feat = torch.cat((img_feat, padding_matrix), 0) 533 | if max_img_seq_length > 0: 534 | input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_matrix.shape[0]) 535 | #segment_ids = segment_ids + [pad_token_segment_id] * padding_matrix.shape[0] 536 | 537 | if output_mode == "classification": 538 | label_id = [label_map[l] for l in example.label] 539 | score = example.score 540 | elif output_mode == "regression": 541 | label_id = float(example.label) 542 | else: 543 | raise KeyError(output_mode) 544 | 545 | if ex_index < 5: 546 | logger.info("*** Example ***") 547 | logger.info("guid: %s" % (example.guid)) 548 | logger.info("tokens: %s" % " ".join([str(x) for x in tokens])) 549 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 550 | 
logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 551 | logger.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 552 | logger.info("label: %s (id = %s)" % (example.label, label_id)) 553 | logger.info("score: %s (score = %s)" % (example.score, score)) 554 | 555 | features.append(InputFeat(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids, label_id=label_id, score=score, img_feat=img_feat)) 556 | return features 557 | 558 | 559 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 560 | """Truncates a sequence pair in place to the maximum length.""" 561 | 562 | # This is a simple heuristic which will always truncate the longer sequence 563 | # one token at a time. This makes more sense than truncating an equal percent 564 | # of tokens from each, since if one sequence is very short then each token 565 | # that's truncated likely contains more information than a longer sequence. 566 | while True: 567 | total_length = len(tokens_a) + len(tokens_b) 568 | if total_length <= max_length: 569 | break 570 | if len(tokens_a) > len(tokens_b): 571 | tokens_a.pop() 572 | else: 573 | tokens_b.pop() 574 | 575 | 576 | processors = { 577 | "vqa_text": VQATextProcessor, 578 | "vqa_text_a": VQATextAProcessor, 579 | "gqa": GQAProcessor, 580 | "nlvr": NLVRProcessor, 581 | "vcr_q_a": VCR_Q_A_Processor, 582 | "vcr_qa_r": VCR_QA_R_Processor, 583 | "vcr_qar": VCR_QAR_Processor, 584 | } 585 | 586 | output_modes = { 587 | "vqa_text": "classification", 588 | "vqa_text_a": "classification", 589 | "gqa": "classification", 590 | "nlvr": "classification", 591 | "vcr_q_a": "classification", 592 | "vcr_qa_r": "classification", 593 | "vcr_qar": "classification", 594 | } 595 | 596 | GLUE_TASKS_NUM_LABELS = { 597 | "vqa_text": 3129, 598 | "vqa_text_a": 3129, 599 | "gqa": 1853, 600 | "nlvr": 2, 601 | "vcr_q_a": 2, 602 | "vcr_qa_r": 2, 603 | "vcr_qar": 2, 604 | } -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/tsv_file.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | import logging 4 | import os 5 | import os.path as op 6 | 7 | 8 | def generate_lineidx_file(filein, idxout): 9 | idxout_tmp = idxout + '.tmp' 10 | with open(filein, 'r') as tsvin, open(idxout_tmp,'w') as tsvout: 11 | fsize = os.fstat(tsvin.fileno()).st_size 12 | fpos = 0 13 | while fpos!=fsize: 14 | tsvout.write(str(fpos)+"\n") 15 | tsvin.readline() 16 | fpos = tsvin.tell() 17 | os.rename(idxout_tmp, idxout) 18 | 19 | 20 | class TSVFile(object): 21 | def __init__(self, tsv_file, generate_lineidx=False): 22 | self.tsv_file = tsv_file 23 | self.lineidx = op.splitext(tsv_file)[0] + '.lineidx' 24 | self._fp = None 25 | self._lineidx = None 26 | # the process always keeps the process which opens the file. 27 | # If the pid is not equal to the currrent pid, we will re-open the file. 
28 | self.pid = None 29 | # generate lineidx if not exist 30 | if not op.isfile(self.lineidx) and generate_lineidx: 31 | generate_lineidx_file(self.tsv_file, self.lineidx) 32 | 33 | def __del__(self): 34 | if self._fp: 35 | self._fp.close() 36 | 37 | def __str__(self): 38 | return "TSVFile(tsv_file='{}')".format(self.tsv_file) 39 | 40 | def __repr__(self): 41 | return str(self) 42 | 43 | def num_rows(self): 44 | self._ensure_lineidx_loaded() 45 | return len(self._lineidx) 46 | 47 | def seek(self, idx): 48 | self._ensure_tsv_opened() 49 | self._ensure_lineidx_loaded() 50 | try: 51 | pos = self._lineidx[idx] 52 | except: 53 | logging.info('{}-{}'.format(self.tsv_file, idx)) 54 | raise 55 | self._fp.seek(pos) 56 | return [s.strip() for s in self._fp.readline().split('\t')] 57 | 58 | def seek_first_column(self, idx): 59 | self._ensure_tsv_opened() 60 | self._ensure_lineidx_loaded() 61 | pos = self._lineidx[idx] 62 | self._fp.seek(pos) 63 | return read_to_character(self._fp, '\t') 64 | 65 | def __getitem__(self, index): 66 | return self.seek(index) 67 | 68 | def __len__(self): 69 | return self.num_rows() 70 | 71 | def _ensure_lineidx_loaded(self): 72 | if self._lineidx is None: 73 | logging.info('loading lineidx: {}'.format(self.lineidx)) 74 | with open(self.lineidx, 'r') as fp: 75 | self._lineidx = [int(i.strip()) for i in fp.readlines()] 76 | 77 | def _ensure_tsv_opened(self): 78 | if self._fp is None: 79 | self._fp = open(self.tsv_file, 'r') 80 | self.pid = os.getpid() 81 | 82 | if self.pid != os.getpid(): 83 | logging.info('re-open {} because the process id changed'.format(self.tsv_file)) 84 | self._fp = open(self.tsv_file, 'r') 85 | self.pid = os.getpid() 86 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/tsv_file_ops.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 
2 | 3 | import logging 4 | import numpy as np 5 | import os 6 | import os.path as op 7 | import shutil 8 | from .misc import mkdir 9 | from .tsv_file import TSVFile 10 | 11 | 12 | def tsv_writer(values, tsv_file_name, sep='\t'): 13 | mkdir(os.path.dirname(tsv_file_name)) 14 | tsv_file_name_tmp = tsv_file_name + '.tmp' 15 | with open(tsv_file_name_tmp, 'wb') as fp: 16 | assert values is not None 17 | for value in values: 18 | assert value is not None 19 | v = sep.join(map(lambda v: v.decode() if type(v) == bytes else str(v), value)) + '\n' 20 | v = v.encode() 21 | fp.write(v) 22 | os.rename(tsv_file_name_tmp, tsv_file_name) 23 | 24 | 25 | def concat_files(ins, out): 26 | out_tmp = out + '.tmp' 27 | with open(out_tmp, 'wb') as fp_out: 28 | for i, f in enumerate(ins): 29 | with open(f, 'rb') as fp_in: 30 | shutil.copyfileobj(fp_in, fp_out, 1024*1024*10) 31 | os.rename(out_tmp, out) 32 | 33 | 34 | def concat_tsv_files(tsvs, out_tsv, generate_lineidx=False): 35 | concat_files(tsvs, out_tsv) 36 | if generate_lineidx: 37 | sizes = [os.stat(t).st_size for t in tsvs] 38 | sizes = np.cumsum(sizes) 39 | all_idx = [] 40 | for i, t in enumerate(tsvs): 41 | for idx in load_list_file(op.splitext(t)[0] + '.lineidx'): 42 | if i == 0: 43 | all_idx.append(idx) 44 | else: 45 | all_idx.append(str(int(idx) + sizes[i - 1])) 46 | with open(op.splitext(out_tsv)[0] + '.lineidx', 'w') as f: 47 | f.write('\n'.join(all_idx)) 48 | 49 | 50 | def load_list_file(fname): 51 | with open(fname, 'r') as fp: 52 | lines = fp.readlines() 53 | result = [line.strip() for line in lines] 54 | if len(result) > 0 and result[-1] == '': 55 | result = result[:-1] 56 | return result 57 | 58 | 59 | def reorder_tsv_keys(in_tsv_file, ordered_keys, out_tsv_file): 60 | tsv = TSVFile(in_tsv_file, generate_lineidx=True) 61 | keys = [tsv.seek(i)[0] for i in range(len(tsv))] 62 | key_to_idx = {key: i for i, key in enumerate(keys)} 63 | def gen_rows(): 64 | for key in ordered_keys: 65 | idx = key_to_idx[key] 66 | yield tsv.seek(idx) 67 | tsv_writer(gen_rows(), out_tsv_file) 68 | 69 | 70 | def delete_tsv_files(tsvs): 71 | for t in tsvs: 72 | if op.isfile(t): 73 | try_delete(t) 74 | line = op.splitext(t)[0] + '.lineidx' 75 | if op.isfile(line): 76 | try_delete(line) 77 | 78 | 79 | def try_once(func): 80 | def func_wrapper(*args, **kwargs): 81 | try: 82 | return func(*args, **kwargs) 83 | except Exception as e: 84 | logging.info('ignore error \n{}'.format(str(e))) 85 | return func_wrapper 86 | 87 | 88 | @try_once 89 | def try_delete(f): 90 | os.remove(f) 91 | 92 | 93 | -------------------------------------------------------------------------------- /VinVL/Oscar/requirements.txt: -------------------------------------------------------------------------------- 1 | tqdm 2 | pyyaml 3 | matplotlib 4 | requests 5 | scikit-image 6 | anytree 7 | regex 8 | boto3 9 | -------------------------------------------------------------------------------- /VinVL/Oscar/setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | from __future__ import print_function 4 | import os 5 | import sys 6 | import re 7 | import os.path as op 8 | from setuptools import find_packages, setup 9 | 10 | # change directory to this module path 11 | try: 12 | this_file = __file__ 13 | except NameError: 14 | this_file = sys.argv[0] 15 | this_file = os.path.abspath(this_file) 16 | if op.dirname(this_file): 17 | os.chdir(op.dirname(this_file)) 18 | script_dir = os.getcwd() 19 | 20 | def readme(fname): 21 | """Read text out of a file 
in the same directory as setup.py.
22 |     """
23 |     return open(op.join(script_dir, fname)).read()
24 | 
25 | 
26 | def find_version(fname):
27 |     version_file = readme(fname)
28 |     version_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]",
29 |                               version_file, re.M)
30 |     if version_match:
31 |         return version_match.group(1)
32 |     raise RuntimeError("Unable to find version string.")
33 | 
34 | 
35 | setup(
36 |     name="oscar",
37 |     version=find_version("oscar/__init__.py"),
38 |     url='https://github.com/xjli/Oscar',
39 |     description="Oscar for vision and language tasks",
40 |     long_description=readme('README.md'),
41 |     packages=find_packages(),
42 |     classifiers=[
43 |         'Intended Audience :: Developers',
44 |         "Programming Language :: Python",
45 |         'Topic :: Software Development',
46 |     ]
47 | )
48 | 
--------------------------------------------------------------------------------
/VinVL/README.md:
--------------------------------------------------------------------------------
1 | # README
2 | ___
3 | This directory contains the files for training and evaluating the vision-language
4 | pre-trained model, i.e., VinVL. Most of the files are copied from the VinVL
5 | [github repo](https://github.com/microsoft/Oscar/tree/vinvl). We mainly modified or
6 | added the following files:
7 | ```
8 | |- Oscar/
9 |     |- oscar/
10 |         |- run_gqa_prompt_mlm.py
11 |         |- run_gqa_prompt_itm.py
12 |         |- run_gqa_prompt_zero_few.py
13 |         |- run_vqa_prompt_mlm.py
14 |         |- run_vqa_prompt_itm.py
15 |         |- utils/
16 |             |- task_utils.py
17 | ```
18 | 
19 | ## INSTALL
20 | 
21 | + Refer to [INSTALL](https://github.com/microsoft/Oscar/tree/vinvl) for installation.
22 | 
23 | ## Data Preparation
24 | 
25 | ### GQA
26 | Please follow the steps below to configure the data:
27 | 1. Refer to [DOWNLOAD](https://github.com/microsoft/Oscar/blob/vinvl/VinVL_DOWNLOAD.md)
28 | to download the pre-processed GQA dataset.
29 | The downloaded data should contain the following files:
30 | ```
31 | |- [DATA_ROOT]/gqa/
32 |     |- gqa_bal_qla_train.json
33 |     |- gqa_bal_qla_val.json
34 |     |- gqa_all_qla_train.json
35 |     |- gqa_all_qla_val.json
36 |     |- gqa_all_qla_submission.json
37 |     ...
38 | ```
39 | 2. Download the corresponding declaration files and put them in the `gqa/` directory.
40 | The declaration files are downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/vinvl/gqa/*_declarative.json`).
41 | These files contain one declarative sentence per line, which can be used for later data loading
42 | and processing. Please put these `*_declarative.json` files into the `gqa/` directory,
43 | resulting in the following directory tree:
44 | ```text
45 | |- [DATA_ROOT]/gqa/
46 |     |- gqa_bal_qla_train.json
47 |     |- gqa_bal_qla_val.json
48 |     |- gqa_all_qla_train.json
49 |     |- gqa_all_qla_val.json
50 |     |- gqa_all_qla_submission.json
51 |     |- gqa_bal_qla_train_declarative.json         # newly added
52 |     |- gqa_bal_qla_val_declarative.json           # newly added
53 |     |- gqa_all_qla_train_declarative.json         # newly added
54 |     |- gqa_all_qla_val_declarative.json           # newly added
55 |     |- gqa_all_qla_submission_declarative.json    # newly added
56 |     ...
57 | ```
58 | 
59 | ### VQA v2.0
60 | 
61 | Please follow the steps below to configure the data:
62 | 1. Refer to [DOWNLOAD](https://github.com/microsoft/Oscar/blob/vinvl/VinVL_DOWNLOAD.md)
63 | to download the pre-processed VQA v2.0 dataset.
64 | The downloaded data should contain the following files:
65 | ```
66 | |- [DATA_ROOT]/vqa/
67 |     |- train2014_qla_mrcnn.json
68 |     |- val2014_qla_mrcnn.json
69 |     ...
70 | ```
71 | 2.
Download the corresponding declaration files and put them in the `vqa/` directory.
72 | The declaration files are downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/vinvl/vqa/*_declarative.json`).
73 | Please put these `*_declarative.json` files into the `vqa/` directory,
74 | resulting in the following directory tree:
75 | ```text
76 | |- [DATA_ROOT]/vqa/
77 |     |- train2014_qla_mrcnn.json
78 |     |- val2014_qla_mrcnn.json
79 |     |- train2014_declarative.json    # newly added
80 |     |- val2014_declarative.json      # newly added
81 |     ...
82 | ```
83 | 
84 | ## Pre-trained Model
85 | 
86 | Please refer to [DOWNLOAD](https://github.com/microsoft/Oscar/blob/vinvl/VinVL_DOWNLOAD.md)
87 | to download the pre-trained VinVL base model (`checkpoint-2000000`). We also provide the
88 | model checkpoint in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/checkpoint-2000000`). Assume that
89 | `checkpoint-2000000` is placed in the directory `[MODEL_ROOT]`, resulting in `[MODEL_ROOT]/checkpoint-2000000/`.
90 | 
91 | ## Training and Validation
92 | 
93 | ### GQA
94 | Please follow the steps below to reproduce the results (we take the balanced
95 | split as an example).
96 | 
97 | We first utilize the adapted **masked language model (MLM) task** for GQA fine-tuning:
98 | 
99 | 1. **Training(MLM)**: Run the following code to train VinVL-DPT(MLM) on the balanced split:
100 | ```bash
101 | python oscar/run_gqa_prompt_mlm.py \
102 |     -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
103 |     --data_dir [DATA_ROOT]/gqa/ \
104 |     --model_type bert \
105 |     --model_name_or_path [MODEL_ROOT]/checkpoint-2000000/ \
106 |     --task_name gqa --do_lower_case --max_seq_length 165 \
107 |     --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
108 |     --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
109 |     --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
110 |     --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
111 |     --eval_data_type bal \
112 |     --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
113 |     --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
114 |     --logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
115 |     --gradient_accumulation_steps 2
116 | ```
117 | If successful, the _overall_ accuracy should reach ~62.7%. We also
118 | provide the fine-tuned model in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vinvl_bal_mlm`).
119 | 2. **Validation(MLM)**: Evaluate on the GQA validation set using the fine-tuned model we
120 | provide (or the model in the output dir `gqa_mlm`):
121 | ```bash
122 | python oscar/run_gqa_prompt_mlm.py \
123 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
124 | --data_dir [DATA_ROOT]/gqa/ \
125 | --model_type bert \
126 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm \
127 | --task_name gqa --do_lower_case --max_seq_length 165 \
128 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
129 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
130 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
131 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
132 | --eval_data_type bal \
133 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
134 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
135 | --logging_steps 4000 --drop_out 0.3 --do_val --weight_decay 0.05 --warmup_steps 0 \
136 | --gradient_accumulation_steps 2
137 | ```
138 | Note that the `--model_name_or_path` and `--do_val` arguments have been changed compared to
139 | the training stage.
140 | 3. **Testing and Submission(MLM)**: Test the fine-tuned model and submit the result file
141 | to the [online evaluation website](https://eval.ai/web/challenges/challenge-page/225/leaderboard/733).
142 | Run the following code:
143 | ```bash
144 | python oscar/run_gqa_prompt_mlm.py \
145 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
146 | --data_dir [DATA_ROOT]/gqa/ \
147 | --model_type bert \
148 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm \
149 | --task_name gqa --do_lower_case --max_seq_length 165 \
150 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
151 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
152 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
153 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
154 | --eval_data_type bal \
155 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
156 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
157 | --logging_steps 4000 --drop_out 0.3 --do_test --weight_decay 0.05 --warmup_steps 0 \
158 | --gradient_accumulation_steps 2
159 | ```
160 | Note that the `--do_test` argument has been changed compared to
161 | the validation stage.
162 | 
163 | Then, we apply the adapted **image-text matching (ITM) task** to the visual question answering problem
164 | (a summary sketch of the two-stage inference is given at the end of this GQA subsection). To achieve this, we need the top-k candidate answers predicted by the MLM task. Specifically,
165 | we pre-generate the prediction results of the MLM task:
166 | + Pre-generate top-k results for training and validation.
167 | ```bash
168 | python oscar/run_gqa_prompt_mlm.py \
169 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
170 | --data_dir [DATA_ROOT]/gqa/ \
171 | --model_type bert \
172 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
173 | --task_name gqa --do_lower_case --max_seq_length 165 \
174 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
175 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
176 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
177 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
178 | --eval_data_type bal \
179 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
180 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
181 | --logging_steps 4000 --drop_out 0.3 --do_train --do_generate --weight_decay 0.05 --warmup_steps 0 \
182 | --gradient_accumulation_steps 2
183 | ```
184 | + Pre-generate top-k results for submission.
185 | ```bash
186 | python oscar/run_gqa_prompt_mlm.py \
187 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
188 | --data_dir [DATA_ROOT]/gqa/ \
189 | --model_type bert \
190 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
191 | --task_name gqa --do_lower_case --max_seq_length 165 \
192 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
193 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
194 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
195 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
196 | --eval_data_type bal \
197 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
198 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
199 | --logging_steps 4000 --drop_out 0.3 --do_test --do_generate --weight_decay 0.05 --warmup_steps 0 \
200 | --gradient_accumulation_steps 2
201 | ```
202 | 
203 | 
204 | Note that the `--do_generate` argument has been added. In this way, there will be three result files
205 | saved in `model_name_or_path`, i.e., `stage1.pkl`, `stage1_eval.pkl`, and `stage1_submission.pkl`. The files have the following
206 | data format:
207 | ```text
208 | {
209 |     "[QID]": (np.ndarray([topk, ], np.int16),    # Topk answer indices
210 |               np.ndarray([topk, ], np.float16),), # Topk answer scores
211 |     ...
212 | }
213 | ```
214 | > We also provide the result files in the fine-tuned checkpoint `vinvl_bal_mlm`.
215 | 
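As a sanity check before the ITM stage, these result files can be inspected with a few lines of Python. This is only a sketch: it assumes the pickles load directly with `pickle.load` and that the `trainval_testdev_all_label2ans.pkl` object maps integer answer indices to answer strings; adapt the paths and the lookup to the actual objects.

```python
import pickle

# Example paths; point them at your own copies of the files.
with open("data/model/vinvl/vinvl_bal_mlm/stage1_eval.pkl", "rb") as f:
    topk_results = pickle.load(f)   # assumed: {qid: (top-k answer indices, top-k answer scores)}

with open("[DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl", "rb") as f:
    label2ans = pickle.load(f)      # assumed: integer answer index -> answer string

# Print the top-k candidates of one question.
qid, (indices, scores) = next(iter(topk_results.items()))
for idx, score in zip(indices, scores):
    print(qid, label2ans[int(idx)], float(score))
```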
216 | 4. **Training(ITM)**: Equipped with the pre-generated top-k answers, we can apply ITM by running the following
217 | code:
218 | ```bash
219 | python oscar/run_gqa_prompt_itm.py \
220 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
221 | --data_dir [DATA_ROOT]/gqa/ \
222 | --model_type bert \
223 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
224 | --task_name gqa --do_lower_case --max_seq_length 165 \
225 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
226 | --learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
227 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
228 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
229 | --eval_data_type bal \
230 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
231 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
232 | --logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
233 | --gradient_accumulation_steps 2
234 | ```
235 | Note that we need to load the checkpoint from the MLM task. We also provide the checkpoint in
236 | [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vinvl_bal_itm/`).
237 | 5. **Validation(ITM)**: Once the model is fine-tuned via ITM, we can validate it
238 | with the following code:
239 | ```bash
240 | python oscar/run_gqa_prompt_itm.py \
241 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
242 | --data_dir [DATA_ROOT]/gqa/ \
243 | --model_type bert \
244 | --model_name_or_path data/model/vinvl/vinvl_bal_itm/ \
245 | --task_name gqa --do_lower_case --max_seq_length 165 \
246 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
247 | --learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
248 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
249 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
250 | --eval_data_type bal \
251 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
252 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
253 | --logging_steps 4000 --drop_out 0.3 --do_val --weight_decay 0.05 --warmup_steps 0 \
254 | --gradient_accumulation_steps 2
255 | ```
256 | Note that the pre-generated result files, i.e., `stage1.pkl`, `stage1_eval.pkl`, and `stage1_submission.pkl`,
257 | should be copied to `data/model/vinvl/vinvl_bal_itm/` so that the code has access to the
258 | MLM results.
259 | 6. **Testing and Submission(ITM)**: (Please make sure that `stage1_submission.pkl` has been
260 | pre-generated or downloaded, and placed in `model_name_or_path`.) Run the following code for testing:
261 | ```bash
262 | python oscar/run_gqa_prompt_itm.py \
263 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
264 | --data_dir [DATA_ROOT]/gqa/ \
265 | --model_type bert \
266 | --model_name_or_path data/model/vinvl/vinvl_bal_itm/ \
267 | --task_name gqa --do_lower_case --max_seq_length 165 \
268 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
269 | --learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
270 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
271 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
272 | --eval_data_type bal \
273 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
274 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
275 | --logging_steps 4000 --drop_out 0.3 --do_test --weight_decay 0.05 --warmup_steps 0 \
276 | --gradient_accumulation_steps 2
277 | ```
278 | 
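Putting the GQA pipeline together: the MLM stage proposes top-k candidate answers, and the ITM stage re-scores them. The sketch below summarizes that two-stage inference with placeholder functions (`mlm_topk`, `itm_score`); selecting the single highest-scoring ITM candidate is an assumption on our part, and the released scripts may combine the two scores differently.

```python
from typing import Callable, List, Tuple

def two_stage_answer(
    qid: str,
    mlm_topk: Callable[[str], List[Tuple[int, float]]],  # placeholder: top-k (answer index, MLM score) pairs, e.g. read from stage1*.pkl
    itm_score: Callable[[str, int], float],               # placeholder: ITM score of one candidate answer for this question/image
) -> int:
    """Sketch of MLM -> ITM inference: MLM proposes candidates, ITM reranks them."""
    candidates = mlm_topk(qid)
    # Assumption: the final prediction is the candidate with the highest ITM score.
    best_answer_idx, _ = max(
        ((ans_idx, itm_score(qid, ans_idx)) for ans_idx, _ in candidates),
        key=lambda pair: pair[1],
    )
    return best_answer_idx
```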
279 | ### VQA v2.0
280 | 
281 | Please follow the steps below to reproduce the results on VQA v2.0.
282 | 
283 | We first utilize the **masked language model (MLM) task** to fine-tune the model:
284 | 1. **Training(MLM)**: Run the following code to train VinVL-DPT(MLM):
285 | ```bash
286 | python oscar/run_vqa_prompt_mlm.py -j 4 \
287 | --img_feature_dim 2054 --max_img_seq_length 50 \
288 | --data_label_type mask --img_feature_type faster_r-cnn \
289 | --data_dir [DATA_ROOT]/vqa --model_type bert \
290 | --model_name_or_path [MODEL_ROOT]/checkpoint-2000000 \
291 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 158 \
292 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
293 | --learning_rate 5e-05 --num_train_epochs 25 \
294 | --output_dir vqa_mlm --label_file [DATA_ROOT]/vqatrainval_ans2label.pkl \
295 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
296 | --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
297 | --img_feat_format pt --classifier linear --cls_hidden_scale 3 \
298 | --txt_data_dir [DATA_ROOT]/vqa
299 | ```
300 | We also provide the checkpoint in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vqa_mlm/`).
301 | Then, we pre-generate the top-k results of the MLM task via the following code:
302 | ```bash
303 | python oscar/run_vqa_prompt_mlm.py -j 4 \
304 | --img_feature_dim 2054 --max_img_seq_length 50 \
305 | --data_label_type mask --img_feature_type faster_r-cnn \
306 | --data_dir [DATA_ROOT]/vqa --model_type bert \
307 | --model_name_or_path data/model/vinvl/vqa_mlm/ \
308 | --task_name vqa_text --do_train --do_generate --do_lower_case --max_seq_length 158 \
309 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
310 | --learning_rate 5e-05 --num_train_epochs 25 \
311 | --output_dir vqa_mlm --label_file [DATA_ROOT]/vqatrainval_ans2label.pkl \
312 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
313 | --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
314 | --img_feat_format pt --classifier linear --cls_hidden_scale 3 \
315 | --txt_data_dir [DATA_ROOT]/vqa
316 | ```
317 | Note that the `--model_name_or_path` and `--do_generate` arguments have been changed. In this way,
318 | two result files are generated and saved in `model_name_or_path`, i.e., `stage1.pkl` and
319 | `stage1_eval.pkl`.
320 | 2. **Training(ITM)**: Run the following code to train the image-text matching (ITM) task for VQA:
321 | ```bash
322 | python oscar/run_vqa_prompt_itm.py -j 4 \
323 | --img_feature_dim 2054 --max_img_seq_length 50 \
324 | --data_label_type mask --img_feature_type faster_r-cnn \
325 | --data_dir [DATA_ROOT]/vqa --model_type bert \
326 | --model_name_or_path data/model/vinvl/vqa_mlm/ \
327 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 158 \
328 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
329 | --learning_rate 5e-05 --num_train_epochs 6 \
330 | --output_dir vqa_itm --label_file [DATA_ROOT]/vqatrainval_ans2label.pkl \
331 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
332 | --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
333 | --img_feat_format pt --classifier linear --cls_hidden_scale 3 \
334 | --txt_data_dir [DATA_ROOT]/vqa
335 | ```
336 | We also provide the fine-tuned checkpoint in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vqa_itm/`).
337 | 
338 | ## Zero-shot and Few-shot Learning
339 | 
340 | In the zero-shot and few-shot settings, zero or only a few samples (1~128) are used to fine-tune the
341 | model. Run the following code to sample a `[K]`-shot training subset for fine-tuning and evaluate on the
342 | whole validation set.
343 | ```bash
344 | python oscar/run_gqa_prompt_zero_few.py \
345 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
346 | --data_dir [DATA_ROOT]/gqa/ \
347 | --model_type bert \
348 | --model_name_or_path [MODEL_ROOT]/checkpoint-2000000/ \
349 | --task_name gqa --do_lower_case --max_seq_length 165 \
350 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 1 \
351 | --learning_rate 5e-05 --num_train_epochs 25 --output_dir gqa_subset \
352 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
353 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
354 | --eval_data_type bal \
355 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
356 | --loss_type xe --save_epoch 10 --seed 88 --evaluate_during_training \
357 | --logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
358 | --gradient_accumulation_steps 1 \
359 | --num_examples [K] --subset_seed 0
360 | ```
361 | 
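For reference, `--num_examples [K]` and `--subset_seed` control how the few-shot subset is drawn from the training data. The repository's exact sampling logic may differ; the snippet below is only a sketch of one deterministic way to pick `K` examples with a fixed seed.

```python
import random
from typing import List

def sample_k_shot(train_examples: List[dict], num_examples: int, subset_seed: int = 0) -> List[dict]:
    """Deterministically draw a K-shot training subset (K = 0 corresponds to the zero-shot setting)."""
    if num_examples <= 0:
        return []  # zero-shot: no fine-tuning data at all
    rng = random.Random(subset_seed)
    return rng.sample(train_examples, min(num_examples, len(train_examples)))
```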
362 | ## Online Results
363 | 
364 | + VinVL Baseline trained on the balanced split:
365 |   + [testdev](https://evalai.s3.amazonaws.com/media/submission_files/submission_176247/bd8113ca-6e55-4559-a508-9ecedbd4a49c.json)
366 |   + [teststd](https://evalai.s3.amazonaws.com/media/submission_files/submission_176248/55c074d6-f381-4458-8413-69333596d8f7.json)
367 | + VinVL-DPT trained on the balanced split:
368 |   + [testdev](https://evalai.s3.amazonaws.com/media/submission_files/submission_176397/6150d2b9-09cf-4ac9-aecc-ae859be3fc79.json)
369 |   + [teststd](https://evalai.s3.amazonaws.com/media/submission_files/submission_176400/e6f063d1-0df9-4c82-9b4c-c1ed130857bc.json)
370 | + VinVL-DPT trained on the all split:
371 |   + [testdev](https://evalai.s3.amazonaws.com/media/submission_files/submission_176134/ec622e48-6b78-47a4-a939-13c466d0622c.json)
372 |   + [teststd](https://evalai.s3.amazonaws.com/media/submission_files/submission_176088/2eee503a-a756-4752-b48e-758a85ebd35b.json)
--------------------------------------------------------------------------------