├── .gitignore ├── DeclarationGeneration ├── README.md ├── finetuning_T5.py ├── run_predict.sh └── run_train.sh ├── README.md └── VinVL ├── Oscar ├── .gitignore ├── .gitmodules ├── CODE_OF_CONDUCT.md ├── DOWNLOAD.md ├── INSTALL.md ├── LICENSE ├── MODEL_ZOO.md ├── README.md ├── SECURITY.md ├── VinVL_DOWNLOAD.md ├── VinVL_MODEL_ZOO.md ├── docs │ ├── oscar.PNG │ ├── oscar_logo.png │ └── pretrain_corpus.PNG ├── oscar │ ├── __init__.py │ ├── datasets │ │ ├── __init__.py │ │ ├── build.py │ │ └── oscar_tsv.py │ ├── modeling │ │ ├── __init__.py │ │ ├── modeling_bert.py │ │ └── modeling_utils.py │ ├── run_captioning.py │ ├── run_gqa.py │ ├── run_gqa_prompt_itm.py │ ├── run_gqa_prompt_mlm.py │ ├── run_gqa_prompt_zero_few.py │ ├── run_nlvr.py │ ├── run_oscarplus_pretrain.py │ ├── run_retrieval.py │ ├── run_vqa.py │ ├── run_vqa_prompt_itm.py │ ├── run_vqa_prompt_mlm.py │ └── utils │ │ ├── __init__.py │ │ ├── caption_evaluate.py │ │ ├── cbs.py │ │ ├── cider │ │ └── pyciderevalcap │ │ │ ├── __init__.py │ │ │ ├── cider │ │ │ ├── __init__.py │ │ │ ├── cider.py │ │ │ └── cider_scorer.py │ │ │ └── ciderD │ │ │ ├── __init__.py │ │ │ ├── ciderD.py │ │ │ └── ciderD_scorer.py │ │ ├── logger.py │ │ ├── metric_logger.py │ │ ├── misc.py │ │ ├── task_utils.py │ │ ├── tsv_file.py │ │ └── tsv_file_ops.py ├── requirements.txt └── setup.py └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .idea -------------------------------------------------------------------------------- /DeclarationGeneration/README.md: -------------------------------------------------------------------------------- 1 | # README 2 | 3 | ___ 4 | 5 | This directory contains the files for declaration generation. [T5-small](https://arxiv.org/pdf/1910.10683.pdf) is 6 | exploited as the encoder-decoder model for training and evaluation. 7 | 8 | ## INSTALL 9 | 10 | + Follow the [documentation](https://huggingface.co/docs/transformers/model_doc/t5) for installation. 11 | 12 | ## Data 13 | 14 | Assume the root data dir is `[ROOT_DATA_DIR]`, then the declaration dataset for training and 15 | validation is placed in `[ROOT_DATA_DIR]/declaration/*`: 16 | ``` 17 | |- [ROOT_DATA_DIR] 18 | |- declaration/ 19 | |- question_to_declarative_train.json 20 | |- question_to_declarative_val.json 21 | ``` 22 | 23 | ## Model Training 24 | 25 | Run the script for training: 26 | ```bash 27 | bash run_train.sh 28 | ``` 29 | 30 | or the fine-tuned T5-small model (`model/declaration/checkpoint-480000`) can be 31 | downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA). The fine-tuned or downloaded checkpoint is placed in 32 | `data/model/declaration/checkpoint-480000`. 33 | 34 | ## Declaration Generation 35 | 36 | Once T5 is trained on the declaration dataset, the model can be used to generate 37 | declarative sentences for GQA and VQA datasets. Just follow the steps as bellow: 38 | 1. Transform the questions (from GQA and VQA v2.0 datasets) into the translation 39 | format (one sample per line), where `en_q` denotes the source question string 40 | and `en_a` denotes the target declarative sentence we want to generate. The file 41 | is named `source_file.txt`: 42 | ```text 43 | {"translation": {"en_q": "Is the sky dark?", "en_a": ""}} 44 | {"translation": {"en_q": "What is on the white wall?", "en_a": ""}} 45 | {"translation": {"en_q": "Is that pipe red?", "en_a": ""}} 46 | ... 47 | ``` 48 | 2. 
Assume the path of `source_file.txt` is `[SOURCE_FILE_DIR]/source_file.txt`, then 49 | run the script: 50 | ```bash 51 | bash run_predict.sh 52 | ``` 53 | Finally, there will be a `.txt` file in `output` dir, _i.e._, `generated_predictions.txt`. 54 | This file contains one sentence per line, representing the declarative 55 | sentence of the corresponding question in `source_file.txt`. 56 | The format of `generated_predictions.txt` is shown as follows: 57 | ```text 58 | [MASK], the sky [BE] dark. 59 | the [MASK] is on the wall. 60 | [MASK], that pipe [BE] red. 61 | ``` 62 | 3. We provide the pre-generated declaration files (from the questions of GQA and VQA v2.0 datasets) 63 | for easy-to-use. The files can be downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA), the files are arranged as 64 | follows: 65 | ``` 66 | |- [ROOT_DATA_DIR] 67 | |- declaration/ 68 | |- gqa/ 69 | |- gqa_all_submission_declaration.json 70 | |- gqa_all_train_declaration.json 71 | |- gqa_all_val_declaration.json 72 | |- gqa_bal_train_declaration.json 73 | |- gqa_bal_val_declaration.json 74 | |- vqa/ 75 | |- test2015_declarative.json 76 | |- test-dev2015_declarative.json 77 | |- train2014_declarative.json 78 | |- val2014_declarative.json 79 | |- vqa_vg_declarative.json 80 | ``` -------------------------------------------------------------------------------- /DeclarationGeneration/run_predict.sh: -------------------------------------------------------------------------------- 1 | python ./finetuning_T5.py \ 2 | --model_name_or_path "[ROOT_DATA_DIR]/model/declaration/checkpoint-480000" \ 3 | --do_predict \ 4 | --source_lang en_q \ 5 | --target_lang en_a \ 6 | --source_prefix "translate question to declarative sentence: " \ 7 | --dataset_config_name en-en \ 8 | --train_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_train.json" \ 9 | --validation_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_val.json" \ 10 | --test_file "[SOURCE_FILE_DIR]/source_file.txt" \ 11 | --output_dir "./output" \ 12 | --per_device_train_batch_size=4 \ 13 | --per_device_eval_batch_size=64 \ 14 | --overwrite_output_dir \ 15 | --predict_with_generate \ 16 | --max_source_length 50 \ 17 | --max_target_length 50 \ 18 | --generation_max_length 50 \ 19 | --eval_accumulation_steps 80000 -------------------------------------------------------------------------------- /DeclarationGeneration/run_train.sh: -------------------------------------------------------------------------------- 1 | python ./finetuning_T5.py \ 2 | --model_name_or_path "t5-small" \ 3 | --do_train \ 4 | --source_lang en_q \ 5 | --target_lang en_a \ 6 | --source_prefix "translate question to declarative sentence: " \ 7 | --dataset_config_name en-en \ 8 | --train_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_train.json" \ 9 | --validation_file "[ROOT_DATA_DIR]/declaration/question_to_declarative_val.json" \ 10 | --output_dir "./output" \ 11 | --per_device_train_batch_size=4 \ 12 | --per_device_eval_batch_size=4 \ 13 | --overwrite_output_dir \ 14 | --predict_with_generate \ 15 | --max_source_length 50 \ 16 | --max_target_length 50 \ 17 | --generation_max_length 50 \ 18 | --eval_accumulation_steps 80000 -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Declaration-based Prompt Tuning for Visual Question Answering 2 | The implementation code of IJCAI 2022 paper: 3 | _Declaration-based Prompt 
Tuning for Visual Question Answering_. 4 | 5 | ## Requirements 6 | + Python 3 7 | + Pytorch>=1.7.1 8 | + pytorch-transformers==1.2.0 9 | 10 | ## Usage 11 | 12 | ### Declaration Generation 13 | 14 | Please follow [DeclarationGeneration](DeclarationGeneration/README.md) to set up the 15 | experiments for declaration generation. 16 | 17 | ### Visual Question Answering 18 | 19 | Please follow [VinVL](VinVL/README.md) to set up the experiments for visual question 20 | answering. 21 | 22 | ## Citation 23 | 24 | **Please kindly cite our paper if this paper and the code are helpful.** 25 | 26 | ``` 27 | @inproceedings{liu2022dpt, 28 | author={Liu, Yuhang and Wei, Wei and Peng, Daowan and Zhu, Feida}, 29 | title={Declaration-based Prompt Tuning for Visual Question Answering}, 30 | booktitle={Proceedings of the Thirty-first International Joint Conference on Artificial Intelligence, {IJCAI-22}}, 31 | year={2022} 32 | } 33 | ``` 34 | -------------------------------------------------------------------------------- /VinVL/Oscar/.gitignore: -------------------------------------------------------------------------------- 1 | # Initially taken from Github's Python gitignore file 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | 53 | # Translations 54 | *.mo 55 | *.pot 56 | 57 | # Django stuff: 58 | *.log 59 | local_settings.py 60 | db.sqlite3 61 | 62 | # Flask stuff: 63 | instance/ 64 | .webassets-cache 65 | 66 | # Scrapy stuff: 67 | .scrapy 68 | 69 | # Sphinx documentation 70 | docs/_build/ 71 | 72 | # PyBuilder 73 | target/ 74 | 75 | # Jupyter Notebook 76 | .ipynb_checkpoints 77 | 78 | # IPython 79 | profile_default/ 80 | ipython_config.py 81 | 82 | # pyenv 83 | .python-version 84 | 85 | # celery beat schedule file 86 | celerybeat-schedule 87 | 88 | # SageMath parsed files 89 | *.sage.py 90 | 91 | # Environments 92 | .env 93 | .venv 94 | env/ 95 | venv/ 96 | ENV/ 97 | env.bak/ 98 | venv.bak/ 99 | 100 | # Spyder project settings 101 | .spyderproject 102 | .spyproject 103 | 104 | # Rope project settings 105 | .ropeproject 106 | 107 | # mkdocs documentation 108 | /site 109 | 110 | # mypy 111 | .mypy_cache/ 112 | .dmypy.json 113 | dmypy.json 114 | 115 | # Pyre type checker 116 | .pyre/ 117 | 118 | # vscode 119 | .vscode 120 | 121 | # TF code 122 | tensorflow_code 123 | 124 | # Models 125 | models 126 | proc_data 127 | 128 | # examples 129 | runs 130 | examples/runs 131 | 132 | # pyCharm 133 | .idea/ 134 | 135 | # local folders 136 | data 137 | models 138 | output 139 | -------------------------------------------------------------------------------- /VinVL/Oscar/.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "transformers"] 
2 | path = transformers 3 | url = git@github.com:huggingface/transformers.git 4 | [submodule "coco_caption"] 5 | path = coco_caption 6 | url = git@github.com:LuoweiZhou/coco-caption.git 7 | -------------------------------------------------------------------------------- /VinVL/Oscar/CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Microsoft Open Source Code of Conduct 2 | 3 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 4 | 5 | Resources: 6 | 7 | - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) 8 | - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 9 | - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns 10 | -------------------------------------------------------------------------------- /VinVL/Oscar/DOWNLOAD.md: -------------------------------------------------------------------------------- 1 | # Download 2 | 3 | ## Datasets 4 | We provide the extracted image region features, object tags, and the original text annotations for each downstream tasks. 5 | ```bash 6 | wget https://biglmdiag.blob.core.windows.net/oscar/datasets/$TASK_NAME.zip 7 | unzip $TASK_NAME.zip -d $DATA_DIR 8 | ``` 9 | `TASK_NAME` could be `coco_caption`, `coco_ir`, `vqa`, `GQA`, `nlvr2`. 10 | 11 | ## Pre-trained Models 12 | We provide pre-trained *Oscar* models of Bert-base and Bert-large structures, with the name starting with `base` and `large`, respectively. 13 | ```bash 14 | wget https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/$MODEL_NAME.zip 15 | unzip $MODEL_NAME.zip -d $MODEL_DIR 16 | ``` 17 | `MODEL_NAME` could be `base-vg-labels`, `large-vg-labels`, `base-oid-labels`, `base-no-labels`. 18 | 19 | The models are trained with both image region features and object tags. The image region features are extracted by the Faster R-CNN with 20 | ResNet-101, using object and attribute annotations from [Visual Genome](http://visualgenome.org/). 21 | The object tags are from: 22 | 1) the same VisualGenome model, named as `-vg-labels`. Or, 23 | 2) the model trained on object annotations from [Open Images V5](https://storage.googleapis.com/openimages/web/index.html). named as `-oid-labels`. Or, 24 | 3) no object tags provied, serving as baseline, named as `-no-labels`. 25 | 26 | 27 | ### Note 28 | It is recommended to download large files with **AzCopy** for faster speed. 29 | AzCopy executable tools can be downloaded [here](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#download-azcopy). 30 | Decompress the tar file and put the executable in any path. 
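A quick way to confirm the executable works (a minimal sketch, assuming AzCopy was extracted to `path/to/` as in the examples below) is to print its version:
```bash
# verify the AzCopy binary runs before starting any large download
path/to/azcopy --version
```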
To download from 31 | any URL above, the command is: 32 | ```bash 33 | path/to/azcopy copy 34 | 35 | # for example, downloading coco_caption.zip 36 | path/to/azcopy copy https://biglmdiag.blob.core.windows.net/oscar/datasets/coco_caption.zip 37 | ``` 38 | 39 | -------------------------------------------------------------------------------- /VinVL/Oscar/INSTALL.md: -------------------------------------------------------------------------------- 1 | ## Installation 2 | ### Requirements 3 | - Python 3.7 4 | - Pytorch 1.2 5 | - torchvision 0.4.0 6 | - cuda 10.0 7 | 8 | ### Setup with Conda 9 | ```bash 10 | # create a new environment 11 | conda create --name oscar python=3.7 12 | conda activate oscar 13 | 14 | # install pytorch1.2 15 | conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch 16 | 17 | export INSTALL_DIR=$PWD 18 | 19 | # install apex 20 | cd $INSTALL_DIR 21 | git clone https://github.com/NVIDIA/apex.git 22 | cd apex 23 | python setup.py install --cuda_ext --cpp_ext 24 | 25 | # install oscar 26 | cd $INSTALL_DIR 27 | git clone --recursive git@github.com:microsoft/Oscar.git 28 | cd Oscar/coco_caption 29 | ./get_stanford_models.sh 30 | cd .. 31 | python setup.py build develop 32 | 33 | # install requirements 34 | pip install -r requirements.txt 35 | 36 | unset INSTALL_DIR 37 | ``` 38 | 39 | -------------------------------------------------------------------------------- /VinVL/Oscar/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) Microsoft Corporation. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE 22 | -------------------------------------------------------------------------------- /VinVL/Oscar/MODEL_ZOO.md: -------------------------------------------------------------------------------- 1 | ## Table of Contents 2 | - VQA 3 | - GQA 4 | - NLVR2 5 | - Image/Text Retrieval 6 | - Image Captioning on COCO 7 | 8 | 9 | ## Performance 10 | Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | 11 | --------|-----|-----|-----|-----|-----|-----|------|------|--------|--------|----------|---------| 12 | Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | 13 | SoTA_S |39.2 | 68.0|56.6 | 84.5|38.9 |29.2 |129.8 | 22.4 | 61.5 | 9.2 | 70.90 | 53.50 | 14 | SoTA_B |48.4 | 76.7|63.3 | 87.0|39.5 |29.3 |129.3 | 23.2 | 73.1 | 11.2 | 72.54 | 78.87 | 15 | SoTA_L |51.7 | 78.4|66.6 | 89.4| - | - | - | - | - | - | 73.40 | 79.50 | 16 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 17 | Oscar_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 78.8 | 11.7 | 73.44 | 78.44 | 18 | Oscar_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | 80.9 | 11.3 | 73.82 | 80.37 | 19 | gain | 5.8 | 4.4| 6.9 | 2.8| 2.2 | 1.3 | 10.7 | 1.3 | 7.8 | 0.5 | 0.42 | 0.87 | 20 | 21 | t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. 22 | 23 | For reference, we also release the training logs and output. 24 | 25 | 26 | ## VQA 27 | Script to finetune for Oscar base model. 28 | Base model is trained on train split and evaluated on the val split. Good for later comparison. 29 | 30 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/base_9m_ep107_1192k_eu1/application_1575931286052_40649/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/base_9m_ep107_1192k_eu1/application_1575931286052_40649/results/stdout.txt).
31 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/results.txt).
32 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/base/vqa_base_best.zip). 33 | ```bash 34 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 35 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir datasets/vqa/2k 36 | --model_type bert --model_name_or_path pretrained_models/base-vg-labels/ep_107_1192087 37 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 38 | 256 --per_gpu_train_batch_size 32 --learning_rate 5e-05 --num_train_epochs 25 39 | --output_dir results --label_file datasets/vqa/cache/trainval_ans2label.pkl 40 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 41 | 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce --img_feat_format pt 42 | --classifier linear --cls_hidden_scale 3 --txt_data_dir datasets/vqa/2k 43 | ``` 44 | 45 | Script to finetune for Oscar large model. 46 | Large model is trained on train+val split and evaluated on the val split, for reproduce the paper's best result. 47 | 48 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/ab128_img_large_rr1_ep20_590k_tv_done_good/exp_ab128_img_large_rr1_ep20_590k_tv_0.00003_128_50_dp_0.3_wd_0.05_bce_3linear_s88_abcd/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/ab128_img_large_rr1_ep20_590k_tv_done_good/exp_ab128_img_large_rr1_ep20_590k_tv_0.00003_128_50_dp_0.3_wd_0.05_bce_3linear_s88_abcd/stdout.txt).
49 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/results.txt).
50 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/vqa/large/vqa_large_best.zip). 51 | ```bash 52 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 53 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir datasets/vqa/2k 54 | --model_type bert --model_name_or_path pretrained_models/large-vg-labels/ep_20_590000 55 | --task_name vqa_text --do_train_val --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 56 | 256 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 25 57 | --label_file datasets/vqa/cache/trainval_ans2label.pkl --save_epoch 30 58 | --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 0.3 --weight_decay 59 | 0.05 --warmup_steps 0 --loss_type bce --save_after_epoch 15 --output_dir results --img_feat_format pt --classifier linear --cls_hidden_scale 3 --txt_data_dir datasets/vqa/2k 60 | ``` 61 | 62 | 63 | ## GQA 64 | Script to finetune for Oscar base model. 65 | 66 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/ab175_base_ep107_1192k_0.4true_taeb_done_25eps_good/exp_ab175_base_ep107_1192k_0.4true_taeb_b_48_0.00005_165_45_dp_0.3_abce/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/ab175_base_ep107_1192k_0.4true_taeb_done_25eps_good/exp_ab175_base_ep107_1192k_0.4true_taeb_b_48_0.00005_165_45_dp_0.3_abce/stdout.txt).
67 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/ab165_img45_1568928610179_62515_test_done_good/results.txt).
68 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/gqa/base/gqa_base_best.zip). 69 | ```bash 70 | python oscar/run_gqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 71 | 45 --data_dir datasets/GQA/0.4true --model_type bert --model_name_or_path pretrained_models/base-vg-labels/ep_107_1192087 72 | --task_name gqa --do_lower_case --max_seq_length 165 --per_gpu_eval_batch_size 73 | 256 --per_gpu_train_batch_size 48 --learning_rate 5e-05 --num_train_epochs 5 --output_dir 74 | results --label_file datasets/GQA/questions1.2/trainval_testdev_all_ans2label.pkl 75 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type all --eval_data_type 76 | bal --label2ans_file datasets/GQA/questions1.2/trainval_testdev_all_label2ans.pkl 77 | --loss_type xe --save_epoch 2 --seed 88 --evaluate_during_training --logging_steps 78 | 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 79 | ``` 80 | 81 | ## NLVR2 82 | Script to finetune for Oscar base model. 83 | 84 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/base/exp_rvln_base_ep107_1192k_wm1w_b72_0.00003_55_40_dp0.3_3mlp_wm10000_abcf_best/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/base/exp_rvln_base_ep107_1192k_wm1w_b72_0.00003_55_40_dp0.3_3mlp_wm10000_abcf_best/stdout.txt).
85 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/base/exp_nlvr_base_11123_testall_b24_0.00003_55_43_dp_0.3_mlp_abcj_best/stdout.txt). 86 | ```bash 87 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 88 | 40 --data_dir datasets/nlvr2/ft_corpus --model_type bert --model_name_or_path pretrained_models/base-vg-labels/ep_107_1192087 89 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 90 | 64 --per_gpu_train_batch_size 72 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 91 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 92 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 93 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 94 | 10000 --classifier mlp --cls_hidden_scale 3 --num_choice 2 --use_pair 95 | ``` 96 | 97 | Script to finetune for Oscar large model. 98 | 99 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/large/large_1583307153868_14140/exp_rvln_large_ep55_1618k_b24_0.00002_seq55_img40_dp0.3_2mlp_wm5000_abcj/results/eval_logs.json), [output.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/large/large_1583307153868_14140/exp_rvln_large_ep55_1618k_b24_0.00002_seq55_img40_dp0.3_2mlp_wm5000_abcj/stdout.txt).
100 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/nlvr2/large/large_1583307153868_14140/exp_nlvr_large_1583307153868_14140_testall_b24_0.00003_55_43_dp_0.3_mlp_abck/stdout.txt). 101 | ```bash 102 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 103 | 40 --data_dir datasets/nlvr2/ft_corpus --model_type bert --model_name_or_path pretrained_models/large-vg-labels/ep_55_1617000 104 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 105 | 64 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 106 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 107 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 108 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 109 | 5000 --classifier mlp --cls_hidden_scale 2 --num_choice 2 --use_pair 110 | ``` 111 | 112 | 126 | 127 | ## Image Text Retrieval 128 | Script to finetune for Oscar base model (4 V100 with 16G mem): 129 | 130 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/base/eval_logs.json), [log.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/base/log.txt). 131 | Model checkpoint: [checkpoint.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/base/checkpoint.zip). 132 | 133 | ```bash 134 | python oscar/run_retrieval.py \ 135 | --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997 \ 136 | --do_train \ 137 | --do_lower_case \ 138 | --evaluate_during_training \ 139 | --num_captions_per_img_val 20 \ 140 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 141 | --per_gpu_train_batch_size 32 \ 142 | --learning_rate 0.00002 \ 143 | --num_train_epochs 30 \ 144 | --weight_decay 0.05 \ 145 | --save_steps 5000 \ 146 | --add_od_labels \ 147 | --od_label_type vg \ 148 | --max_seq_length 70 \ 149 | --output_dir output/ 150 | ``` 151 | 152 | Script to finetune for Oscar large model (8 V100 with 32G mem): 153 | 154 | Training logs: [eval_logs.json](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/large/eval_logs.json), [log.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/large/log.txt). 155 | Model checkpoint: [checkpoint.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/retrieval/large/checkpoint.zip). 156 | 157 | ```bash 158 | python oscar/run_retrieval.py \ 159 | --model_name_or_path pretrained_models/large-vg-labels/ep_7_816000 \ 160 | --do_train \ 161 | --do_lower_case \ 162 | --evaluate_during_training \ 163 | --num_captions_per_img_val 20 \ 164 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 165 | --per_gpu_train_batch_size 16 \ 166 | --learning_rate 0.00001 \ 167 | --num_train_epochs 30 \ 168 | --save_steps 5000 \ 169 | --add_od_labels \ 170 | --od_label_type vg \ 171 | --max_seq_length 70 \ 172 | --output_dir output/ 173 | ``` 174 | 175 | Script to inference on COCO 1K test set: 176 | ```bash 177 | python oscar/run_retrieval.py \ 178 | --do_test \ 179 | --do_eval \ 180 | --test_split test \ 181 | --num_captions_per_img_val 5 \ 182 | --eval_img_keys_file test_img_keys_1k.tsv \ 183 | --cross_image_eval \ 184 | --per_gpu_eval_batch_size 64 \ 185 | --eval_model_dir your_model_for_evaluation # could be base/large models. 
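# Note: --eval_img_keys_file selects the image keys to score (here the 1K test split);
# the 5K command below is identical except that it uses test_img_keys.tsv instead.
# --cross_image_eval enables cross-image evaluation; see oscar/run_retrieval.py for details.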
186 | ``` 187 | 188 | Script to inference on COCO 5K test set: 189 | ```bash 190 | python oscar/run_retrieval.py \ 191 | --do_test \ 192 | --do_eval \ 193 | --test_split test \ 194 | --num_captions_per_img_val 5 \ 195 | --eval_img_keys_file test_img_keys.tsv \ 196 | --cross_image_eval \ 197 | --per_gpu_eval_batch_size 64 \ 198 | --eval_model_dir your_model_for_evaluation # could be base/large models. 199 | ``` 200 | 201 | 202 | ## Image Captioning on COCO 203 | Script to finetune for Oscar base model (4 V100 with 16G mem): 204 | 205 | Training logs: [log.txt](https://biglmdiag.blob.core.windows.net/oscar/exp/coco_caption/base/log.txt). 206 | Model checkpoint: [checkpoint.zip](https://biglmdiag.blob.core.windows.net/oscar/exp/coco_caption/base/checkpoint.zip). 207 | 208 | 1) First train with cross-entropy loss: 209 | ```bash 210 | python oscar/run_captioning.py \ 211 | --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997 \ 212 | --do_train \ 213 | --do_lower_case \ 214 | --evaluate_during_training \ 215 | --add_od_labels \ 216 | --learning_rate 0.00003 \ 217 | --per_gpu_train_batch_size 64 \ 218 | --num_train_epochs 30 \ 219 | --save_steps 5000 \ 220 | --output_dir output/ 221 | ``` 222 | 2) Finetune with CIDEr optimization: 223 | ```bash 224 | python oscar/run_captioning.py \ 225 | --model_name_or_path your_checkpoint_from_cross_entropy \ 226 | --do_train \ 227 | --do_lower_case \ 228 | --evaluate_during_training \ 229 | --add_od_labels \ 230 | --learning_rate 0.000005 \ 231 | --per_gpu_train_batch_size 16 \ 232 | --num_train_epochs 5 \ 233 | --scst \ 234 | --save_steps 2000 \ 235 | --output_dir output/ 236 | ``` 237 | 238 | Script to finetune for Oscar large model (8 V100 with 32G mem): 239 | 1) First train with cross-entropy loss: 240 | ```bash 241 | python oscar/run_captioning.py \ 242 | --model_name_or_path pretrained_models/large-vg-labels/ep_7_816000 \ 243 | --do_train \ 244 | --do_lower_case \ 245 | --evaluate_during_training \ 246 | --add_od_labels \ 247 | --learning_rate 0.00001 \ 248 | --per_gpu_train_batch_size 32 \ 249 | --num_train_epochs 30 \ 250 | --save_steps 5000 \ 251 | --output_dir output/ 252 | ``` 253 | 2) Finetune with CIDEr optimization: 254 | ```bash 255 | python oscar/run_captioning.py \ 256 | --model_name_or_path your_checkpoint_from_cross_entropy \ 257 | --do_train \ 258 | --do_lower_case \ 259 | --evaluate_during_training \ 260 | --add_od_labels \ 261 | --learning_rate 0.000005 \ 262 | --per_gpu_train_batch_size 8 \ 263 | --num_train_epochs 5 \ 264 | --scst \ 265 | --save_steps 2000 \ 266 | --output_dir output/ 267 | ``` 268 | 269 | Script to inference on COCO test set: 270 | ```bash 271 | python oscar/run_captioning.py \ 272 | --do_test \ 273 | --do_eval \ 274 | --test_yaml test.yaml \ 275 | --per_gpu_eval_batch_size 64 \ 276 | --num_beams 5 \ 277 | --max_gen_length 20 \ 278 | --eval_model_dir your_model_for_evaluation # could be bert base/large. 279 | ``` 280 | -------------------------------------------------------------------------------- /VinVL/Oscar/README.md: -------------------------------------------------------------------------------- 1 | # Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks 2 | # VinVL: Revisiting Visual Representations in Vision-Language Models 3 | ## Updates 4 | 05/28/2020: Released finetuned models on downstream tasks, please check [MODEL_ZOO.md](MODEL_ZOO.md).
5 | 05/15/2020: Released pretrained models, datasets, and code for finetuning on downstream tasks.<br/>
6 | 01/13/2021: Our new work [VinVL](https://arxiv.org/abs/2101.00529) proposed OSCAR+, an improved version of OSCAR, and provided a better object-attribute detection model for extracting features for V+L tasks. VinVL achieved SOTA performance on all seven V+L tasks covered here. Please stay tuned for the model and code release.<br/>
7 | 03/08/2021: The Oscar+ pretraining code has been released; please check the last section of [VinVL_MODEL_ZOO.md](VinVL_MODEL_ZOO.md). All image features and model checkpoints used in VinVL are also released; please check [VinVL](https://github.com/pzzhang/VinVL) for details.<br/>
8 | 04/13/2021: Our [Scene Graph Benchmark Repo](https://github.com/microsoft/scene_graph_benchmark) has been released. You are welcome to use the code there to extract image features with the VinVL pretrained models.<br/>
9 | 10 | 11 | ## Introduction 12 | This repository contains source code necessary to reproduce the results presented in the paper [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/abs/2004.06165). 13 | We propose a new cross-modal pre-training method **Oscar** (Object-Semantics Aligned Pre-training). It leverages **object tags** detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks. For more on this project, see the [Microsoft Research Blog post](https://www.microsoft.com/en-us/research/blog/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language/). 14 | 15 | 16 | 17 | 18 | ## Performance 19 | Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA | 20 | --------|-----|-----|-----|-----|-----|-----|------|------|--------|--------|----------|---------|---------| 21 | Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std| 22 | SoTA_S |39.2 | 68.0|56.6 | 84.5|38.9 |29.2 |129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 | 23 | SoTA_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 86.58| 12.38 | 73.67 | 79.30 | - | 24 | SoTA_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | - | - | 74.93 | 81.47 | - | 25 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 26 | Oscar_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 78.8 | 11.7 | 73.44 | 78.36 | 61.62 | 27 | Oscar_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | 80.9 | 11.3 | 73.82 | 80.05 | - | 28 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 29 | VinVL_B |58.1 | 83.2|74.6 | 92.6|40.9 |30.9 |140.6 | 25.1 | 92.46| 13.07 | 76.12 | 83.08 | 64.65 | 30 | VinVL_L |58.8 | 83.5|75.4 | 92.9|41.0 |31.1 |140.9 | 25.2 | - | - | 76.62 | 83.98 | - | 31 | gain | 1.3 | 0.7| 1.9 | 0.6| -0.7| 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 | 32 | 33 | t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. 34 | 35 | 36 | ## Download 37 | We released pre-trained models, datasets, VinVL image features, and Oscar+ pretraining corpus for downstream tasks. 38 | Please check [VinVL_DOWNLOAD.md](VinVL_DOWNLOAD.md) for details. 39 | 40 | To download checkpoints for the Vanilla OSCAR, please check [DOWNLOAD.md](DOWNLOAD.md) for details. 41 | 42 | ## Installation 43 | Check [INSTALL.md](INSTALL.md) for installation instructions. 44 | 45 | ## Model Zoo 46 | Check [MODEL_ZOO.md](MODEL_ZOO.md) for scripts to run oscar downstream finetuning. 47 | 48 | Check [VinVL_MODEL_ZOO.md](VinVL_MODEL_ZOO.md) for scripts to run oscar+ pretraining and downstream finetuning. 
49 | 50 | ## Citations 51 | Please consider citing this paper if you use the code: 52 | ``` 53 | @article{li2020oscar, 54 | title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks}, 55 | author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng}, 56 | journal={ECCV 2020}, 57 | year={2020} 58 | } 59 | 60 | @article{zhang2021vinvl, 61 | title={VinVL: Making Visual Representations Matter in Vision-Language Models}, 62 | author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng}, 63 | journal={CVPR 2021}, 64 | year={2021} 65 | } 66 | ``` 67 | 68 | ## License 69 | Oscar is released under the MIT license. See [LICENSE](LICENSE) for details. 70 | 71 | -------------------------------------------------------------------------------- /VinVL/Oscar/SECURITY.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Security 4 | 5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). 6 | 7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below. 8 | 9 | ## Reporting Security Issues 10 | 11 | **Please do not report security vulnerabilities through public GitHub issues.** 12 | 13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report). 14 | 15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc). 16 | 17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). 18 | 19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: 20 | 21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) 22 | * Full paths of source file(s) related to the manifestation of the issue 23 | * The location of the affected source code (tag/branch/commit or direct URL) 24 | * Any special configuration required to reproduce the issue 25 | * Step-by-step instructions to reproduce the issue 26 | * Proof-of-concept or exploit code (if possible) 27 | * Impact of the issue, including how an attacker might exploit the issue 28 | 29 | This information will help us triage your report more quickly. 30 | 31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. 
Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs. 32 | 33 | ## Preferred Languages 34 | 35 | We prefer all communications to be in English. 36 | 37 | ## Policy 38 | 39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd). 40 | 41 | -------------------------------------------------------------------------------- /VinVL/Oscar/VinVL_DOWNLOAD.md: -------------------------------------------------------------------------------- 1 | # Download 2 | 3 | ## Datasets 4 | We provide the extracted image region features, object tags, and the original text annotations for each downstream tasks. 5 | ```bash 6 | path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/TASK_NAME' --recursive 7 | ``` 8 | `TASK_NAME` could be `coco_caption`, `nocaps`, `coco_ir`, `vqa`, `gqa`, `nlvr2`. 9 | 10 | ## Pre-trained Models 11 | We provide pre-trained *Oscar+* models of Bert-base and Bert-large structures, with the name starting with `base` and `large`, respectively. 12 | ```bash 13 | path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/TASK_NAME' --recursive 14 | ``` 15 | `TASK_NAME` could be `image_captioning` (including `nocaps`), `coco_ir`, `vqa`, `gqa`, `nlvr2`, `od_models`. 16 | 17 | The models are trained with both image region features and object tags. The image region features are extracted by the Faster R-CNN with 18 | ResNet-101, using object and attribute annotations from [Visual Genome](http://visualgenome.org/). 19 | The object tags are from: 20 | 1) the same VisualGenome model, named as `-vg-labels`. Or, 21 | 2) the model trained on object annotations from [Open Images V5](https://storage.googleapis.com/openimages/web/index.html). named as `-oid-labels`. Or, 22 | 3) no object tags provied, serving as baseline, named as `-no-labels`. 23 | 24 | ## Pre-exacted Image Features 25 | For ease-of-use, we make pretrained features available for all pretraining datasets and downstream tasks. 26 | Features are stored in tsv (tab-separated-values) format that can be used in [pretraining](oscar/datasets/oscar_tsv.py) and dowstream tasks like [COCO Image-Text Retrieval](oscar/run_retrieval.py). 27 | 28 | Notice that all the links below are links to a folder. We recommend using the following AzCopy command to download. 
29 | ``` 30 | path/to/azcopy copy --recursive 31 | ``` 32 | 33 | [COCO 2014 Train/Val Image Features (~50G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 34 | 35 | [COCO 2014 Test Image Features (~16G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/coco2014test/) 36 | 37 | [COCO 2015 Test Image Features (~32G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/coco2015test/) 38 | 39 | [GQA All Image Features (~62G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/gqa_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 40 | 41 | [NVLR2 Train/Del/Test Image Features (~28G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/nlvr2_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/) 42 | 43 | [Flickr30k All Image Features (~14G)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/flickr30k_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 44 | 45 | [Google Conceptual Captions Image Features (Huge, ~960G, splitted into 12 chunks)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/googlecc_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/) 46 | 47 | [SBU Image Features (Huge, ~280G, splitted into 4 chunks)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/sbu_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 48 | 49 | [Open Images Detection Image Features (Huge, ~530G, splitted into 8 chunks)](https://biglmdiag.blob.core.windows.net/vinvl/image_features/oi_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/) 50 | 51 | 52 | ## Oscar+ pretraining corpus 53 | 54 | 55 | [Small corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa.tsv) 56 | 57 | [Medium corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa_oi.tsv) 58 | 59 | [Large corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi.tsv) 60 | 61 | We have tried our best to make sure that there is no data contamination between pretraining corpus and test sets for downstream tasks. 62 | More specifically, we use two methods to achieve this. 63 | (1) We use the COCO Image ID of Visual Genome and Flickr30k images. 64 | (2) For COCO, Visual Genome and Flickr30k, we calucate the pair-wise l2 norm between two images after resizing them into the same size. 65 | 66 | 67 | ### Note 68 | It is recommended to download large files with **AzCopy** for faster speed. 69 | AzCopy executable tools can be downloaded [here](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#download-azcopy). 70 | Decompress the tar file and put the executable in any path. 
To download from 71 | any URL above, the command is: 72 | ```bash 73 | path/to/azcopy copy 74 | 75 | # for example, downloading coco_caption.zip 76 | path/to/azcopy copy https://biglmdiag.blob.core.windows.net/oscar/datasets/coco_caption.zip 77 | ``` 78 | 79 | -------------------------------------------------------------------------------- /VinVL/Oscar/VinVL_MODEL_ZOO.md: -------------------------------------------------------------------------------- 1 | ## Table of Contents 2 | - VQA 3 | - GQA 4 | - NLVR2 5 | - Image/Text Retrieval 6 | - Image Captioning on COCO 7 | - Oscarplus pretraining 8 | 9 | 10 | ## Performance 11 | Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA | 12 | --------|-----|-----|-----|-----|-----|-----|------|------|--------|--------|----------|---------|---------| 13 | Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std| 14 | SoTA_S |39.2 | 68.0|56.6 | 84.5|38.9 |29.2 |129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 | 15 | SoTA_B |54.0 | 80.8|70.0 | 91.1|40.5 |29.7 |137.6 | 22.8 | 86.58| 12.38 | 73.67 | 79.30 | 61.62 | 16 | SoTA_L |57.5 | 82.8|73.5 | 92.2|41.7 |30.6 |140.0 | 24.5 | - | - | 74.93 | 81.47 | - | 17 | ----- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- | 18 | VinVL_B |58.1 | 83.2|74.6 | 92.6|40.9 |30.9 |140.4 | 25.1 | 92.46 (with [VIVO](https://arxiv.org/abs/2009.13682))| 13.07 (with [VIVO](https://arxiv.org/abs/2009.13682)) | 76.12 | 83.08 | 64.65 | 19 | VinVL_L |58.8 | 83.5|75.4 | 92.9|41.0 |31.1 |140.9 | 25.2 | - | - | 76.62 | 83.98 | - | 20 | gain | 1.3 | 0.7| 1.9 | 0.6| -0.7| 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 | 21 | 22 | t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. 23 | 24 | For reference, we also release the training logs and output. 25 | 26 | 27 | ## VQA 28 | Script to finetune for Oscar base model. 29 | Base model is trained on train split and evaluated on the val split. Good for later comparison. 30 | 33 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/base/test/results.txt).
34 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/base/best.zip). 35 | ```bash 36 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 37 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir vinvl/datasets/vqa 38 | --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 39 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 40 | 256 --per_gpu_train_batch_size 32 --learning_rate 5e-05 --num_train_epochs 25 41 | --output_dir results --label_file datasets/vqa/cache/trainval_ans2label.pkl 42 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 43 | 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce --img_feat_format pt 44 | --classifier linear --cls_hidden_scale 3 --txt_data_dir vinvl/datasets/vqa 45 | ``` 46 | 47 | Script to finetune for Oscar large model. 48 | Large model is trained on train+val split and evaluated on the val split, for reproduce the paper's best result. 49 | 50 | 53 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/large/test/results.txt).
54 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/vqa/large/best.zip). 55 | ```bash 56 | python oscar/run_vqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 57 | 50 --data_label_type mask --img_feature_type faster_r-cnn --data_dir vinvl/datasets/vqa 58 | --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/large/checkpoint-2000000 59 | --task_name vqa_text --do_train_val --do_lower_case --max_seq_length 128 --per_gpu_eval_batch_size 60 | 256 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 25 61 | --label_file datasets/vqa/cache/trainval_ans2label.pkl --save_epoch 30 62 | --seed 88 --evaluate_during_training --logging_steps 4000 --drop_out 0.3 --weight_decay 63 | 0.05 --warmup_steps 0 --loss_type bce --save_after_epoch 15 --output_dir results --img_feat_format pt --classifier linear --cls_hidden_scale 3 --txt_data_dir vinvl/datasets/vqa 64 | ``` 65 | 66 | 67 | ## GQA 68 | Script to finetune for Oscar base model. 69 | 70 | 73 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/gqa/base/results.txt).
74 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/gqa/base/best.zip). 75 | ```bash 76 | python oscar/run_gqa.py -j 4 --img_feature_dim 2054 --max_img_seq_length 77 | 45 --data_dir vinvl/datasets/gqa --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 78 | --task_name gqa --do_lower_case --max_seq_length 165 --per_gpu_eval_batch_size 79 | 256 --per_gpu_train_batch_size 48 --learning_rate 5e-05 --num_train_epochs 5 --output_dir 80 | results --label_file vinvl/datasets/gqa/trainval_testdev_all_ans2label.pkl 81 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type all --eval_data_type 82 | bal --label2ans_file vinvl/datasets/gqa/trainval_testdev_all_label2ans.pkl 83 | --loss_type xe --save_epoch 2 --seed 88 --evaluate_during_training --logging_steps 84 | 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 85 | ``` 86 | 87 | ## NLVR2 88 | Script to finetune for Oscar base model. 89 | 90 | 93 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/base/rvln_base_oscar_v2_71.5_86236_test_done_best/exp_rvln_base_oscar_v2_71.5_86236_test_b24_0.00003_55_41_dp_0.3_mlp_abch/stdout.txt).
94 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/base/best.zip). 95 | ```bash 96 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 97 | 40 --data_dir vinvl/datasets/nlvr2 --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 98 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 99 | 64 --per_gpu_train_batch_size 72 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 100 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 101 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 102 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 103 | 10000 --classifier mlp --cls_hidden_scale 3 --num_choice 2 --use_pair 104 | ``` 105 | 106 | Script to finetune for Oscar large model. 107 | 108 | 111 | Final server results: [results.txt](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/large/rvln_oscar_v2_large_99617_test_done_best/exp_rvln_oscar_v2_large_99617_test_b24_0.00003_55_50_dp_0.3_mlp_abce/stdout.txt).
112 | Model checkpoint: [.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/nlvr2/large/best.zip). 113 | ```bash 114 | python oscar/run_nlvr.py -j 4 --img_feature_dim 2054 --max_img_seq_length 115 | 40 --data_dir vinvl/datasets/nlvr2 --model_type bert --model_name_or_path vinvl/model_ckpts/vqa/large/checkpoint-2000000 116 | --task_name nlvr --do_lower_case --max_seq_length 55 --per_gpu_eval_batch_size 117 | 64 --per_gpu_train_batch_size 24 --learning_rate 3e-05 --num_train_epochs 20 --output_dir 118 | results --img_feature_type faster_r-cnn --data_label_type all --train_data_type 119 | all --eval_data_type all --loss_type xe --save_epoch -1 --seed 88 --evaluate_during_training 120 | --logging_steps -1 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 121 | 5000 --classifier mlp --cls_hidden_scale 2 --num_choice 2 --use_pair 122 | ``` 123 | 124 | 138 | 139 | ## Image Text Retrieval 140 | Script to finetune for Oscarplus base model (8 V100 with 16G mem): 141 | 142 | Training logs: [train_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/train_logs/), 143 | 144 | Training logs: [test_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/test_logs/), 145 | 146 | Command [command](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/philly.yaml). 147 | 148 | Model checkpoint: [ckeckpoint](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/checkpoint-0132780/). 149 | 150 | ```bash 151 | python oscar/run_retrieval.py \ 152 | --model_name_or_path vinvl/coco_ir/base/checkpoint-1340000 \ 153 | --do_train \ 154 | --do_lower_case \ 155 | --evaluate_during_training \ 156 | --num_captions_per_img_val 20 \ 157 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 158 | --per_gpu_train_batch_size 16 \ 159 | --learning_rate 0.00002 \ 160 | --num_train_epochs 30 \ 161 | --weight_decay 0.05 \ 162 | --save_steps 5000 \ 163 | --add_od_labels \ 164 | --od_label_type vg \ 165 | --max_seq_length 70 \ 166 | --max_img_seq_length 70 \ 167 | --output_dir output/ 168 | ``` 169 | 170 | Script to finetune for Oscarplus large model (8 V100 with 32G mem): 171 | 172 | Training logs: [train_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/train_logs/), 173 | 174 | Training logs: [test_logs](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/test_logs/), 175 | 176 | Command [command](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/philly.yaml). 177 | 178 | Model checkpoint: [ckeckpoint](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/large/checkpoint-0132780/). 
179 | 180 | ```bash 181 | python oscar/run_retrieval.py \ 182 | --model_name_or_path vinvl/coco_ir/base/checkpoint-0660000 \ 183 | --do_train \ 184 | --do_lower_case \ 185 | --evaluate_during_training \ 186 | --num_captions_per_img_val 20 \ 187 | --eval_caption_index_file minival_caption_indexs_top20.pt \ 188 | --per_gpu_train_batch_size 16 \ 189 | --learning_rate 7.5e-06 \ 190 | --num_train_epochs 30 \ 191 | --save_steps 5000 \ 192 | --add_od_labels \ 193 | --od_label_type vg \ 194 | --max_seq_length 70 \ 195 | --max_img_seq_length 70 \ 196 | --output_dir output \ 197 | --img_feat_file vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/features.tsv 198 | ``` 199 | 200 | Script to inference on COCO 1K test set: 201 | ```bash 202 | python oscar/run_retrieval.py \ 203 | --do_test \ 204 | --do_eval \ 205 | --test_split test \ 206 | --num_captions_per_img_val 5 \ 207 | --eval_img_keys_file test_img_keys_1k.tsv \ 208 | --cross_image_eval \ 209 | --per_gpu_eval_batch_size 64 \ 210 | --img_feat_file vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/features.tsv \ 211 | --eval_model_dir your_model_for_evaluation # could be base/large models. 212 | ``` 213 | 214 | Script to inference on COCO 5K test set: 215 | ```bash 216 | python oscar/run_retrieval.py \ 217 | --do_test \ 218 | --do_eval \ 219 | --test_split test \ 220 | --num_captions_per_img_val 5 \ 221 | --eval_img_keys_file test_img_keys.tsv \ 222 | --cross_image_eval \ 223 | --per_gpu_eval_batch_size 64 \ 224 | --img_feat_file vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/features.tsv \ 225 | --eval_model_dir your_model_for_evaluation # could be base/large models. 226 | ``` 227 | 228 | 229 | ## Image Captioning on COCO 230 | Script to finetune for base model: 231 | 232 | Pretrained model checkpoint: [pretrained_base.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/pretrained_base.zip). 233 | Finetuned model checkpoint (w/ cross entropy): [coco_captioning_base_xe.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_base_xe.zip). 234 | Finetuned model checkpoint (w/ CIDEr optimization): [coco_captioning_base_scst.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_base_scst.zip). 
235 | 236 | 1) First train with cross-entropy loss (8 V100 with 16G mem): 237 | ```bash 238 | python oscar/run_captioning.py \ 239 | --model_name_or_path pretrained_models/image_captioning/pretrained_base \ 240 | --do_train \ 241 | --do_lower_case \ 242 | --add_od_labels \ 243 | --learning_rate 3e-5 \ 244 | --per_gpu_train_batch_size 64 \ 245 | --num_train_epochs 60 \ 246 | --tie_weights \ 247 | --freeze_embedding \ 248 | --label_smoothing 0.1 \ 249 | --drop_worst_ratio 0.2 \ 250 | --drop_worst_after 20000 \ 251 | --output_dir output/ 252 | ``` 253 | 2) Finetune with CIDEr optimization (8 V100 with 32G mem): 254 | ```bash 255 | python oscar/run_captioning.py \ 256 | --model_name_or_path your_checkpoint_from_cross_entropy \ 257 | --do_train \ 258 | --do_lower_case \ 259 | --add_od_labels \ 260 | --learning_rate 3e-6 \ 261 | --per_gpu_train_batch_size 16 \ 262 | --num_train_epochs 75 \ 263 | --tie_weights \ 264 | --freeze_embedding \ 265 | --scst \ 266 | --output_dir output/ 267 | ``` 268 | 269 | Script to finetune for large model: 270 | 271 | Pretrained model checkpoint: [pretrained_large.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/pretrained_large.zip). 272 | Finetuned model checkpoint (w/ cross entropy): [coco_captioning_large_xe.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_large_xe.zip). 273 | Finetuned model checkpoint (w/ CIDEr optimization): [coco_captioning_large_scst.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/coco_captioning_large_scst.zip). 274 | 275 | 1) First train with cross-entropy loss (8 V100 with 32G mem): 276 | ```bash 277 | python oscar/run_captioning.py \ 278 | --model_name_or_path pretrained_models/image_captioning/pretrained_large \ 279 | --do_train \ 280 | --do_lower_case \ 281 | --add_od_labels \ 282 | --learning_rate 1e-5 \ 283 | --per_gpu_train_batch_size 64 \ 284 | --num_train_epochs 60 \ 285 | --tie_weights \ 286 | --freeze_embedding \ 287 | --label_smoothing 0.1 \ 288 | --drop_worst_ratio 0.2 \ 289 | --drop_worst_after 20000 \ 290 | --output_dir output/ 291 | ``` 292 | 2) Finetune with CIDEr optimization (8 V100 with 32G mem): 293 | ```bash 294 | python oscar/run_captioning.py \ 295 | --model_name_or_path your_checkpoint_from_cross_entropy \ 296 | --do_train \ 297 | --do_lower_case \ 298 | --add_od_labels \ 299 | --learning_rate 8e-7 \ 300 | --per_gpu_train_batch_size 6 \ 301 | --num_train_epochs 25 \ 302 | --tie_weights \ 303 | --freeze_embedding \ 304 | --scst \ 305 | --output_dir output/ 306 | ``` 307 | 308 | Script to inference on COCO test set: 309 | ```bash 310 | python oscar/run_captioning.py \ 311 | --do_test \ 312 | --do_eval \ 313 | --test_yaml test.yaml \ 314 | --per_gpu_eval_batch_size 64 \ 315 | --num_beams 5 \ 316 | --max_gen_length 20 \ 317 | --eval_model_dir your_model_for_evaluation # could be base or large models 318 | ``` 319 | 320 | ## Image Captioning on NoCaps 321 | Note that [NoCaps] (https://nocaps.org/) does not allow to use extra 322 | image-caption pairs for training except COCO. So the model is directly initialized 323 | from bert-base, and trained on COCO data. 324 | 325 | Script to train base model: 326 | 327 | Finetuned model checkpoint (w/ cross entropy): [nocaps_base_xe.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/nocaps_base_xe.zip). 
328 | Finetuned model checkpoint (w/ CIDEr optimization): [nocaps_base_scst.zip](https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/image_captioning/nocaps_base_scst.zip). 329 | 330 | 1) First train with cross-entropy loss (4 V100 with 16G mem): 331 | ```bash 332 | python oscar/run_captioning.py \ 333 | --model_name_or_path bert-base-uncased \ 334 | --do_train \ 335 | --do_lower_case \ 336 | --add_od_labels \ 337 | --learning_rate 0.0001 \ 338 | --per_gpu_train_batch_size 64 \ 339 | --num_train_epochs 30 \ 340 | --tie_weights \ 341 | --freeze_embedding \ 342 | --output_dir output/ 343 | ``` 344 | 2) Train with CIDEr optimization (8 V100 with 32G mem): 345 | ```bash 346 | python oscar/run_captioning.py \ 347 | --model_name_or_path your_checkpoint_from_cross_entropy \ 348 | --do_train \ 349 | --do_lower_case \ 350 | --add_od_labels \ 351 | --scheduler constant \ 352 | --learning_rate 5e-6 \ 353 | --per_gpu_train_batch_size 14 \ 354 | --num_train_epochs 50 \ 355 | --tie_weights \ 356 | --freeze_embedding \ 357 | --scst \ 358 | --output_dir output/ 359 | ``` 360 | 361 | Script to inference on NoCaps val set with Constrained Beam Search: 362 | ```bash 363 | python oscar/run_captioning.py \ 364 | --do_test \ 365 | --do_eval \ 366 | --data_dir datasets/nocaps \ 367 | --test_yaml val.yaml \ 368 | --per_gpu_eval_batch_size 2 \ 369 | --num_beams 5 \ 370 | --use_cbs \ 371 | --max_gen_length 20 \ 372 | --eval_model_dir your_model_for_evaluation 373 | ``` 374 | 375 | 383 | 384 | ## Oscarplus pretraining 385 | Table 16 below shows the statistics of image and text of the pre-training corpora. 386 | In our ablation study, we have corpora of three different sizes: [Small](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa_x152c4big2exp168.yaml), [Medium](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_gqa_oi_x152c4big2exp168.yaml), [Large](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml). 387 | Notice that we make use of image tagging datasets OpenImages, by generating captions using OSCAR's image captioning model to form triplets of ``(generated caption, image tags, image features)'' for the OSCAR+ pre-training. 388 | 389 | 390 | Script to perform oscar+ pretraining with the [large corpus](https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml). 391 | ```bash 392 | python -m torch.distributed.launch --nproc_per_node=8 oscar/run_oscarplus_pretrain.py \ 393 | --use_b 1 \ 394 | --max_grad_norm 10.0 --gradient_accumulation_steps 1 \ 395 | --use_img_layernorm 1 \ 396 | --output_dir \ 397 | --bert_model bert --model_name_or_path bert-base-uncased \ 398 | --do_lower_case --learning_rate 5e-05 399 | --warmup_steps 0 --do_train --max_seq_length 35 --on_memory \ 400 | --max_img_seq_length 50 --img_feature_dim 2054 \ 401 | --drop_out 0.1 --train_batch_size 8 \ 402 | --ckpt_period 10000 --max_iters 2000000 --log_period 100 \ 403 | --data_dir --dataset_file coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml 404 | --textb_sample_mode 1 --texta_false_prob 0.25 405 | ``` 406 | 407 | 408 | One can perform the vanilla OSCAR pretraining by setting 409 | ```bash 410 | --textb_sample_mode 0 --texta_false_prob 0.0 411 | ``` 412 | 413 | One can also split the large pretraining corpus into two parts, i.e., coco_flickr30k_gqa + googlecc_sbu_oi, and use different textb_sample_modes for them. 
414 | To set textb_sample_mode=2 for coco_flickr30k_gqa has the potential to emphasize the QA-pairs in the small corpus. 415 | ```bash 416 | --data_dir --dataset_file coco_flickr30k_gqa_x152c4big2exp168.yaml 417 | --textb_sample_mode 2 --texta_false_prob 0.25 \ 418 | --extra_dataset_file googlecc_sbu_oi_x152c4big2exp168.yaml \ 419 | --extra_textb_sample_mode 1 --extra_loss_weight 0.5 420 | ``` -------------------------------------------------------------------------------- /VinVL/Oscar/docs/oscar.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CCIIPLab/DPT/b6b4835ce84cff0d594854d4a7d3c2768f87cd9e/VinVL/Oscar/docs/oscar.PNG -------------------------------------------------------------------------------- /VinVL/Oscar/docs/oscar_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CCIIPLab/DPT/b6b4835ce84cff0d594854d4a7d3c2768f87cd9e/VinVL/Oscar/docs/oscar_logo.png -------------------------------------------------------------------------------- /VinVL/Oscar/docs/pretrain_corpus.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CCIIPLab/DPT/b6b4835ce84cff0d594854d4a7d3c2768f87cd9e/VinVL/Oscar/docs/pretrain_corpus.PNG -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/datasets/build.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | import torch 4 | from oscar.utils.misc import get_world_size 5 | from .oscar_tsv import OscarTSVDataset 6 | from transformers.pytorch_transformers import BertTokenizer 7 | 8 | 9 | class BatchCollator(object): 10 | """ 11 | From a list of samples from the dataset, 12 | returns the images and targets. 13 | """ 14 | def __call__(self, batch): 15 | return list(zip(*batch)) 16 | 17 | 18 | def build_dataset(args): 19 | """ 20 | Arguments: 21 | args: configuration. 
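    Returns:
        list of OscarTSVDataset: the main pre-training dataset, plus an extra
        dataset appended when args.extra_dataset_file is set (the OSCAR+
        corpus-splitting setup described in the README above).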
22 | """ 23 | full_yaml_file = os.path.join(args.data_dir, args.dataset_file) 24 | assert os.path.isfile(full_yaml_file) 25 | 26 | tokenizer = BertTokenizer.from_pretrained( 27 | args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, 28 | do_lower_case=args.do_lower_case) 29 | 30 | cfg = dict( 31 | yaml_file=full_yaml_file, 32 | args=args, 33 | seq_len=args.max_seq_length, 34 | on_memory=args.on_memory, 35 | tokenizer=tokenizer, 36 | ) 37 | # make dataset from factory 38 | datasets = [OscarTSVDataset(**cfg)] 39 | if args.extra_dataset_file: 40 | full_yaml_file = os.path.join(args.data_dir, args.extra_dataset_file) 41 | assert os.path.isfile(full_yaml_file) 42 | cfg['yaml_file'] = full_yaml_file 43 | cfg['textb_sample_mode'] = args.extra_textb_sample_mode 44 | datasets.append(OscarTSVDataset(**cfg)) 45 | 46 | return datasets 47 | 48 | 49 | def make_data_sampler(dataset, shuffle, distributed): 50 | if distributed: 51 | return torch.utils.data.distributed.DistributedSampler( 52 | dataset, shuffle=shuffle 53 | ) 54 | if shuffle: 55 | sampler = torch.utils.data.sampler.RandomSampler(dataset) 56 | else: 57 | sampler = torch.utils.data.sampler.SequentialSampler(dataset) 58 | return sampler 59 | 60 | 61 | class IterationBasedBatchSampler(torch.utils.data.sampler.BatchSampler): 62 | """ 63 | Wraps a BatchSampler, resampling from it until 64 | a specified number of iterations have been sampled 65 | """ 66 | 67 | def __init__(self, batch_sampler, num_iterations, start_iter=0): 68 | self.batch_sampler = batch_sampler 69 | self.num_iterations = num_iterations 70 | self.start_iter = start_iter 71 | 72 | def __iter__(self): 73 | iteration = self.start_iter 74 | while iteration <= self.num_iterations: 75 | # if the underlying sampler has a set_epoch method, like 76 | # DistributedSampler, used for making each process see 77 | # a different split of the dataset, then set it 78 | if hasattr(self.batch_sampler.sampler, "set_epoch"): 79 | self.batch_sampler.sampler.set_epoch(iteration) 80 | for batch in self.batch_sampler: 81 | iteration += 1 82 | if iteration > self.num_iterations: 83 | break 84 | yield batch 85 | 86 | def __len__(self): 87 | return self.num_iterations 88 | 89 | 90 | def make_batch_data_sampler( 91 | sampler, images_per_batch, num_iters=None, 92 | start_iter=0 93 | ): 94 | batch_sampler = torch.utils.data.sampler.BatchSampler( 95 | sampler, images_per_batch, drop_last=False 96 | ) 97 | if num_iters is not None and num_iters >= 0: 98 | batch_sampler = IterationBasedBatchSampler( 99 | batch_sampler, num_iters, start_iter 100 | ) 101 | return batch_sampler 102 | 103 | 104 | def make_data_loader(args, is_distributed=False, arguments=None): 105 | num_gpus = get_world_size() 106 | # figure out start iteration 107 | if arguments is None: 108 | start_iter = 0 109 | else: 110 | start_iter = arguments['iteration'] 111 | # figure out the batchsize 112 | grad_accumulate_steps = 1 113 | if hasattr(args, 'gradient_accumulation_steps'): 114 | grad_accumulate_steps = args.gradient_accumulation_steps 115 | assert ( 116 | args.train_batch_size % grad_accumulate_steps == 0 117 | ), "train_batch_size ({}) must be divisible by the number " 118 | "of Gradient accumulation ({}) used."\ 119 | .format(args.train_batch_size, grad_accumulate_steps) 120 | images_per_batch = args.train_batch_size//grad_accumulate_steps 121 | assert ( 122 | images_per_batch % num_gpus == 0 123 | ), "SOLVER.IMS_PER_BATCH ({}) must be divisible by the number " 124 | "of GPUs ({}) used.".format(images_per_batch, num_gpus) 
125 | images_per_gpu = images_per_batch // num_gpus 126 | logger = logging.getLogger(__name__) 127 | logger.info("Train with {} images per GPU".format(images_per_gpu)) 128 | shuffle = True 129 | num_iters = args.max_iters * grad_accumulate_steps 130 | 131 | # build dataset 132 | datasets = build_dataset(args) 133 | 134 | data_loaders = [] 135 | for i, dataset in enumerate(datasets): 136 | sampler = make_data_sampler(dataset, shuffle, is_distributed) 137 | 138 | batch_sampler = make_batch_data_sampler( 139 | sampler, images_per_gpu, num_iters, start_iter 140 | ) 141 | num_workers = args.num_workers 142 | data_loader = torch.utils.data.DataLoader( 143 | dataset, 144 | num_workers=num_workers, 145 | batch_sampler=batch_sampler, 146 | collate_fn=BatchCollator(), 147 | pin_memory=True, 148 | ) 149 | data_loaders.append(data_loader) 150 | return data_loaders 151 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/modeling/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/run_oscarplus_pretrain.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division, print_function 2 | 3 | import argparse 4 | import datetime 5 | import json 6 | import logging 7 | import os 8 | import random 9 | import sys 10 | import time 11 | import math 12 | import shutil 13 | 14 | sys.path.insert(0, '.') 15 | 16 | import numpy as np 17 | import torch 18 | 19 | from oscar.modeling.modeling_bert import BertImgForPreTraining 20 | from transformers.pytorch_transformers import (WEIGHTS_NAME, BertConfig, 21 | BertTokenizer) 22 | 23 | from oscar.datasets.build import make_data_loader 24 | 25 | from transformers.pytorch_transformers import AdamW, WarmupLinearSchedule 26 | from oscar.utils.misc import mkdir, get_rank 27 | from oscar.utils.metric_logger import TensorboardLogger 28 | 29 | logger = logging.getLogger(__name__) 30 | 31 | ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig,)), ()) 32 | 33 | MODEL_CLASSES = { 34 | 'bert': (BertConfig, BertImgForPreTraining, BertTokenizer), 35 | } 36 | 37 | 38 | """ ****** Pretraining ****** """ 39 | 40 | 41 | def main(): 42 | parser = argparse.ArgumentParser() 43 | 44 | ## Required parameters 45 | parser.add_argument("--data_dir", default=None, type=str, required=False, 46 | help="The input data dir. 
" 47 | "Should contain the .yaml files for the task.") 48 | parser.add_argument("--dataset_file", default=None, type=str, required=True, 49 | help="The training dataset yaml file.") 50 | parser.add_argument("--extra_dataset_file", default=None, type=str, required=False, 51 | help="The extra training dataset yaml file.") 52 | parser.add_argument("--bert_model", default=None, type=str, required=True, 53 | help="Bert pre-trained model selected in the list: bert-base-uncased, " 54 | "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.") 55 | parser.add_argument("--output_dir", default=None, type=str, required=True, 56 | help="The output directory where the model checkpoints will be written.") 57 | 58 | # image chunks 59 | parser.add_argument("--chunk_start_id", default=-1, type=int, 60 | help="Image Chunk Start ID") 61 | parser.add_argument("--chunk_end_id", default=-1, type=int, 62 | help="Image Chunk End ID") 63 | 64 | ## Image parameters 65 | parser.add_argument("--max_img_seq_length", default=50, type=int, 66 | help="The maximum total input image sequence length.") 67 | parser.add_argument("--img_feature_dim", default=2054, type=int, 68 | help="The Image Feature Dimension.") 69 | parser.add_argument("--img_feature_type", default='faster_r-cnn', type=str, 70 | help="faster_r-cnn or mask_r-cnn") 71 | parser.add_argument("--use_layernorm", action='store_true', 72 | help="use_layernorm") 73 | 74 | parser.add_argument("--drop_out", default=0.1, type=float, 75 | help="Drop out for BERT.") 76 | 77 | parser.add_argument("--use_b", type=int, default=1, help="use_b") 78 | parser.add_argument("--textb_sample_mode", type=int, default=0, 79 | help="0: sample from both texta&textb, " 80 | "1: sample from textb, " 81 | "2: sample from QA answers") 82 | parser.add_argument("--extra_textb_sample_mode", type=int, default=1) 83 | parser.add_argument("--texta_false_prob", type=float, default=0.0, 84 | help="the probality that we sample wrong texta, should in [0.0, 0.5]") 85 | 86 | parser.add_argument("--model_name_or_path", default=None, type=str, 87 | required=True, 88 | help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join( 89 | ALL_MODELS)) 90 | parser.add_argument("--config_name", default="", type=str, 91 | help="Pretrained config name or path if not the same as model_name") 92 | parser.add_argument("--tokenizer_name", default="", type=str, 93 | help="Pretrained tokenizer name or path if not the same as model_name") 94 | parser.add_argument("--cache_dir", default="", type=str, 95 | help="Where do you want to store the pre-trained models downloaded from s3") 96 | 97 | parser.add_argument("--max_seq_length", default=35, type=int, 98 | help="The maximum total input sequence length after WordPiece tokenization. 
\n" 99 | "Sequences longer than this will be truncated, and sequences shorter than this will be padded.") 100 | parser.add_argument("--do_train", action='store_true', 101 | help="Whether to run training.") 102 | parser.add_argument("--learning_rate", default=5e-5, type=float, 103 | help="The initial learning rate for Adam.") 104 | parser.add_argument("--max_iters", default=2000000, type=int, 105 | help="Maximal number of training iterations.") 106 | parser.add_argument("--train_batch_size", default=1024, type=int, 107 | help="Batch size for training.") 108 | parser.add_argument("--num_workers", default=6, type=int, 109 | help="Number of workers for dataset.") 110 | parser.add_argument("--adam_epsilon", default=1e-8, type=float, 111 | help="Epsilon for Adam optimizer.") 112 | parser.add_argument("--optim", default='adamw', type=str, 113 | help="The optimizer used for Bert, [adamw, lamb], default: adamw") 114 | parser.add_argument("--max_grad_norm", default=-1.0, type=float, help="Max gradient norm.") 115 | parser.add_argument("--warmup_steps", default=0, type=int, 116 | help="Linear warmup over warmup_steps.") 117 | parser.add_argument("--no_cuda", action='store_true', 118 | help="Whether not to use CUDA when available") 119 | parser.add_argument("--on_memory", action='store_true', 120 | help="Whether to load train samples into memory or use disk") 121 | parser.add_argument("--do_lower_case", action='store_true', 122 | help="Whether to lower case the input text. True for uncased models, False for cased models.") 123 | parser.add_argument("--local_rank", type=int, default=-1, 124 | help="local_rank for distributed training on gpus") 125 | parser.add_argument('--seed', type=int, default=42, 126 | help="random seed for initialization") 127 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 128 | help="Number of updates steps to accumualte before performing a backward/update pass.") 129 | 130 | parser.add_argument("--from_scratch", action='store_true', 131 | help="train from scratch") 132 | parser.add_argument("--use_img_layernorm", type=int, default=0, 133 | help="Normalize image features with bertlayernorm") 134 | parser.add_argument("--img_layer_norm_eps", default=1e-12, type=float, 135 | help="The eps in image feature laynorm layer") 136 | # distributed 137 | parser.add_argument('--gpu_ids', type=str, default='-1') 138 | parser.add_argument("--mask_loss_for_unmatched", type=int, default=1, 139 | help="masked language model loss for unmatched triplets") 140 | parser.add_argument("--extra_loss_weight", type=float, default=0.0, 141 | help="the loss weight for the extra train data batch (should be in [0,1])") 142 | parser.add_argument( 143 | "--use_gtlabels", 144 | type=int, default=1, 145 | help="use groundtruth labels for text b or not" 146 | ) 147 | # logging 148 | parser.add_argument('--ckpt_period', type=int, default=10000, 149 | help="Period for saving checkpoint") 150 | parser.add_argument('--log_period', type=int, default=100, 151 | help="Period for saving logging info") 152 | args = parser.parse_args() 153 | 154 | if args.gpu_ids != '-1': 155 | os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_ids 156 | 157 | args.num_gpus = int( 158 | os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1 159 | args.distributed = args.num_gpus > 1 160 | 161 | if args.gpu_ids != '-1': 162 | os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_ids 163 | 164 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train: 165 | logger.info("Output Directory 
Exists.") 166 | 167 | # Setup CUDA, GPU & distributed training 168 | if args.local_rank == -1 or args.no_cuda: 169 | device = torch.device( 170 | "cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") 171 | args.n_gpu = torch.cuda.device_count() 172 | else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs 173 | torch.cuda.set_device(args.local_rank) 174 | device = torch.device("cuda", args.local_rank) 175 | torch.distributed.init_process_group( 176 | backend='nccl', init_method="env://" 177 | ) 178 | args.n_gpu = 1 179 | args.device = device 180 | 181 | # Setup logging 182 | logging.basicConfig( 183 | format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', 184 | datefmt='%m/%d/%Y %H:%M:%S', 185 | level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN) 186 | logger.warning( 187 | "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s", 188 | args.local_rank, device, args.n_gpu, bool(args.local_rank != -1) 189 | ) 190 | 191 | if args.gradient_accumulation_steps < 1: 192 | raise ValueError( 193 | "Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format( 194 | args.gradient_accumulation_steps)) 195 | 196 | random.seed(args.seed) 197 | np.random.seed(args.seed) 198 | torch.manual_seed(args.seed) 199 | if args.n_gpu > 0: 200 | torch.cuda.manual_seed_all(args.seed) 201 | 202 | if not args.do_train: 203 | raise ValueError( 204 | "Training is currently the only implemented execution option. Please set `do_train`.") 205 | 206 | if not os.path.exists(args.output_dir): 207 | mkdir(args.output_dir) 208 | 209 | last_checkpoint_dir = None 210 | arguments = {"iteration": 0} 211 | if os.path.exists(args.output_dir): 212 | save_file = os.path.join(args.output_dir, "last_checkpoint") 213 | try: 214 | with open(save_file, "r") as f: 215 | last_saved = f.read() 216 | last_saved = last_saved.strip() 217 | except IOError: 218 | # if file doesn't exist, maybe because it has just been 219 | # deleted by a separate process 220 | last_saved = "" 221 | if last_saved: 222 | folder_name = os.path.splitext(last_saved.split('/')[0])[0] # in the form of checkpoint-00001 or checkpoint-00001/pytorch_model.bin 223 | last_checkpoint_dir = os.path.join(args.output_dir, folder_name) 224 | arguments["iteration"] = int(folder_name.split('-')[-1]) 225 | assert os.path.isfile(os.path.join(last_checkpoint_dir, WEIGHTS_NAME)), "Last_checkpoint detected, but file not found!" 
226 | 227 | # model first 228 | if get_rank() != 0: 229 | torch.distributed.barrier() 230 | config_class, model_class, tokenizer_class = MODEL_CLASSES[args.bert_model] 231 | if last_checkpoint_dir is not None: # recovery 232 | args.model_name_or_path = last_checkpoint_dir 233 | logger.info(" -> Recovering model from {}".format(last_checkpoint_dir)) 234 | 235 | config = config_class.from_pretrained( 236 | args.config_name if args.config_name else args.model_name_or_path, 237 | ) 238 | config.img_layer_norm_eps = args.img_layer_norm_eps 239 | config.use_img_layernorm = args.use_img_layernorm 240 | 241 | # discrete code 242 | config.img_feature_dim = args.img_feature_dim 243 | config.img_feature_type = args.img_feature_type 244 | config.hidden_dropout_prob = args.drop_out 245 | if args.texta_false_prob < 0.5 and (args.texta_false_prob > 0 or not args.use_b): 246 | args.num_contrast_classes = 3 247 | else: 248 | args.num_contrast_classes = 2 249 | config.num_contrast_classes = args.num_contrast_classes 250 | 251 | # Prepare model 252 | # model = BertForPreTraining.from_pretrained(args.bert_model) 253 | load_num = 0 254 | while load_num < 10: 255 | try: 256 | model = BertImgForPreTraining.from_pretrained( 257 | args.model_name_or_path, 258 | from_tf=bool('.ckpt' in args.model_name_or_path), 259 | config=config) 260 | break 261 | except: 262 | load_num += 1 263 | 264 | # train from scratch 265 | if args.from_scratch: 266 | if last_checkpoint_dir is None: 267 | logger.info("Training from scratch ... ") 268 | model.apply(model.init_weights) 269 | total_params = sum(p.numel() for p in model.parameters()) 270 | logger.info( 271 | 'Total Parameters: {}'.format(total_params)) 272 | 273 | for key, val in vars(config).items(): 274 | setattr(args, key, val) 275 | 276 | if get_rank() == 0 and args.local_rank != -1: 277 | torch.distributed.barrier() 278 | 279 | model.to(args.device) 280 | 281 | logger.info("Training/evaluation parameters %s", args) 282 | 283 | tb_log_dir = os.path.join(args.output_dir, 'train_logs') 284 | meters = TensorboardLogger( 285 | log_dir=tb_log_dir, 286 | delimiter=" ", 287 | ) 288 | 289 | # Prepare optimizer 290 | param_optimizer = list(model.named_parameters()) 291 | no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] 292 | optimizer_grouped_parameters = [ 293 | {'params': [p for n, p in param_optimizer if 294 | not any(nd in n for nd in no_decay)], 295 | 'weight_decay': 0.01}, 296 | {'params': [p for n, p in param_optimizer if 297 | any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 298 | ] 299 | 300 | optimizer = AdamW(optimizer_grouped_parameters, 301 | lr=args.learning_rate, eps=args.adam_epsilon) 302 | scheduler = WarmupLinearSchedule(optimizer, 303 | warmup_steps=args.warmup_steps, 304 | t_total=args.max_iters) 305 | 306 | if arguments['iteration'] > 0 and os.path.isfile(os.path.join(last_checkpoint_dir, 'optimizer.pth')): # recovery 307 | logger.info( 308 | "Load BERT optimizer from {}".format(last_checkpoint_dir)) 309 | optimizer_to_load = torch.load( 310 | os.path.join(last_checkpoint_dir, 'optimizer.pth'), 311 | map_location=torch.device("cpu")) 312 | optimizer.load_state_dict(optimizer_to_load.pop("optimizer")) 313 | scheduler.load_state_dict(optimizer_to_load.pop("scheduler")) 314 | 315 | if args.distributed: 316 | model = torch.nn.parallel.DistributedDataParallel( 317 | model, device_ids=[args.local_rank], output_device=args.local_rank, 318 | find_unused_parameters=True) 319 | elif args.n_gpu > 1: 320 | model = torch.nn.DataParallel(model) 321 | 322 | # 
train_examples = None 323 | train_dataloaders = make_data_loader( 324 | args, is_distributed=args.distributed, arguments=arguments 325 | ) 326 | 327 | if isinstance(train_dataloaders, list): 328 | train_dataloader = train_dataloaders[0] 329 | else: 330 | train_dataloader = train_dataloaders 331 | train_dataloader_extra = [None] * len(train_dataloader) 332 | if isinstance(train_dataloaders, list) and len(train_dataloaders) > 1: 333 | logger.info("Having two train dataloaders!") 334 | train_dataloader_extra = train_dataloaders[1] 335 | tokenizer = train_dataloader.dataset.tokenizer 336 | 337 | # torch.backends.cudnn.benchmark = True 338 | 339 | max_iter = len(train_dataloader) 340 | start_iter = arguments["iteration"] 341 | logger.info("***** Running training *****") 342 | logger.info(" Num examples = {}".format(len(train_dataloader.dataset))) 343 | logger.info(" Instantaneous batch size = %d", 344 | args.train_batch_size // args.gradient_accumulation_steps) 345 | logger.info( 346 | " Total train batch size (w. parallel, distributed & accumulation) = %d", 347 | args.train_batch_size) 348 | logger.info(" Gradient Accumulation steps = %d", 349 | args.gradient_accumulation_steps) 350 | logger.info(" Total optimization steps = %d", 351 | max_iter // args.gradient_accumulation_steps) 352 | 353 | log_json = {} 354 | 355 | model.train() 356 | model.zero_grad() 357 | 358 | clock_started = False 359 | # Every args.ckpt_period, report train_score and save model 360 | tr_loss = 0 361 | nb_tr_examples, nb_tr_steps = 0, 0 362 | for step, (batch, batch_extra) in enumerate(zip(train_dataloader, train_dataloader_extra), start_iter): 363 | if not clock_started: 364 | start_training_time = time.time() 365 | end = time.time() 366 | clock_started = True 367 | 368 | def data_process(mini_batch): 369 | images, targets, qa_inds = \ 370 | mini_batch[0], mini_batch[1], mini_batch[2] 371 | targets_transposed = list(zip(*targets)) 372 | input_ids = torch.stack(targets_transposed[0]).to(args.device, non_blocking=True) 373 | input_mask = torch.stack(targets_transposed[1]).to(args.device, non_blocking=True) 374 | segment_ids = torch.stack(targets_transposed[2]).to(args.device, non_blocking=True) 375 | lm_label_ids = torch.stack(targets_transposed[3]).to(args.device, non_blocking=True) 376 | is_next = torch.stack(targets_transposed[4]).to(args.device, non_blocking=True) 377 | is_img_match = torch.stack(targets_transposed[5]).to(args.device, non_blocking=True) 378 | 379 | return images, input_ids, input_mask, segment_ids, lm_label_ids, is_next 380 | 381 | images1, input_ids1, input_mask1, segment_ids1, lm_label_ids1, is_next1 \ 382 | = data_process(batch) 383 | if batch_extra is not None: 384 | images2, input_ids2, input_mask2, segment_ids2, lm_label_ids2, is_next2 \ 385 | = data_process(batch_extra) 386 | 387 | data_time = time.time() - end 388 | 389 | def forward_backward(images, input_ids, input_mask, segment_ids, 390 | lm_label_ids, is_next, loss_weight=1.0): 391 | # feature as input 392 | image_features = torch.stack(images).to(args.device, non_blocking=True) 393 | 394 | outputs = model(input_ids, segment_ids, input_mask, 395 | lm_label_ids, is_next, img_feats=image_features) 396 | 397 | loss = loss_weight * outputs[0] 398 | 399 | if args.n_gpu > 1: 400 | loss = loss.mean() # mean() to average on multi-gpu. 
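            # Dividing by gradient_accumulation_steps keeps the accumulated gradient
            # equivalent to one larger batch: loss.backward() runs on every
            # mini-batch, while optimizer.step() only fires every
            # gradient_accumulation_steps iterations (see the update block below).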
401 | 402 | if args.gradient_accumulation_steps > 1: 403 | loss = loss / args.gradient_accumulation_steps 404 | loss.backward() 405 | 406 | return loss.item(), input_ids.size(0) 407 | 408 | start1 = time.time() 409 | loss1, nb_tr_example1 = forward_backward( 410 | images1, input_ids1, input_mask1, 411 | segment_ids1, lm_label_ids1, is_next1, 412 | loss_weight=1.0-args.extra_loss_weight 413 | ) 414 | tr_loss += loss1 415 | nb_tr_examples += nb_tr_example1 416 | compute_time1 = time.time() - start1 417 | 418 | loss2, nb_tr_example2 = 0.0, 0 419 | compute_time2 = 0.0 420 | if batch_extra is not None: 421 | start2 = time.time() 422 | loss2, nb_tr_example2 = forward_backward( 423 | images2, input_ids2, input_mask2, 424 | segment_ids2, lm_label_ids2, is_next2, 425 | loss_weight=args.extra_loss_weight 426 | ) 427 | tr_loss += loss2 428 | nb_tr_examples += nb_tr_example2 429 | compute_time2 = time.time() - start2 430 | 431 | nb_tr_steps += 1 432 | arguments["iteration"] = step + 1 433 | 434 | if (step + 1) % args.gradient_accumulation_steps == 0: 435 | # do gradient clipping 436 | if args.max_grad_norm > 0: 437 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 438 | # do the optimization steps 439 | optimizer.step() 440 | scheduler.step() # Update learning rate schedule 441 | optimizer.zero_grad() 442 | 443 | # measure elapsed time 444 | batch_time = time.time() - end 445 | end = time.time() 446 | metrics_to_log = { 447 | 'time_info': {'compute': batch_time, 'data': data_time, 448 | 'compute1': compute_time1, 449 | 'compute2': compute_time2}, 450 | 'batch_metrics': {'loss': loss1+loss2} 451 | } 452 | params_to_log = {'params': {'bert_lr': optimizer.param_groups[0]["lr"]}} 453 | meters.update_metrics(metrics_to_log) 454 | meters.update_params(params_to_log) 455 | 456 | if args.log_period > 0 and (step + 1) % args.log_period == 0: 457 | avg_time = meters.meters['time_info']['compute'].global_avg 458 | eta_seconds = avg_time * (max_iter - step - 1) 459 | eta_string = str( 460 | datetime.timedelta(seconds=int(eta_seconds))) 461 | logger.info( 462 | meters.delimiter.join( 463 | [ 464 | "eta: {eta}", 465 | "iter: {iter}", 466 | "max mem: {memory:.0f}", 467 | ] 468 | ).format( 469 | eta=eta_string, 470 | iter=step + 1, 471 | memory=torch.cuda.max_memory_allocated() / 1024.0 / 1024.0, 472 | ) + "\n " + meters.get_logs(step + 1) 473 | ) 474 | 475 | if (step + 1) == max_iter or (step + 1) % args.ckpt_period == 0: # Save a trained model 476 | log_json[step+1] = tr_loss 477 | train_metrics_total = torch.Tensor([tr_loss, nb_tr_examples, nb_tr_steps]).to(args.device) 478 | torch.distributed.all_reduce(train_metrics_total) 479 | # reset metrics 480 | tr_loss = 0 481 | nb_tr_examples, nb_tr_steps = 0, 0 482 | 483 | if get_rank() == 0: 484 | # report metrics 485 | train_score_gathered = train_metrics_total[0] / \ 486 | train_metrics_total[2] 487 | logger.info("PROGRESS: {}%".format( 488 | round(100 * (step + 1) / max_iter, 4))) 489 | logger.info( 490 | "EVALERR: {}%".format(train_score_gathered)) 491 | meters.update_metrics( 492 | { 493 | 'epoch_metrics': {'ex_cnt': train_metrics_total[1], 494 | 'loss': train_score_gathered} 495 | } 496 | ) 497 | with open(os.path.join(args.output_dir, 'loss_logs.json'), 498 | 'w') as fp: 499 | json.dump(log_json, fp) 500 | 501 | # save checkpoint 502 | output_dir = os.path.join(args.output_dir, 503 | 'checkpoint-{:07d}'.format( 504 | step + 1)) 505 | if not os.path.exists(output_dir): 506 | os.makedirs(output_dir) 507 | model_to_save = model.module if 
hasattr( 508 | model, 509 | 'module') else model # Take care of distributed/parallel training 510 | optimizer_to_save = { 511 | "optimizer": optimizer.state_dict(), 512 | "scheduler": scheduler.state_dict()} 513 | 514 | save_num = 0 515 | while save_num < 10: 516 | try: 517 | model_to_save.save_pretrained(output_dir) 518 | torch.save(args, os.path.join(output_dir, 519 | 'training_args.bin')) 520 | tokenizer.save_pretrained(output_dir) 521 | torch.save(optimizer_to_save, 522 | os.path.join(output_dir, 523 | 'optimizer.pth')) 524 | save_file = os.path.join(args.output_dir, "last_checkpoint") 525 | with open(save_file, "w") as f: 526 | f.write('checkpoint-{:07d}/pytorch_model.bin'.format(step + 1)) 527 | break 528 | except: 529 | save_num += 1 530 | logger.info( 531 | "Saving model checkpoint {0} to {1}".format( 532 | step + 1, output_dir)) 533 | 534 | if clock_started: 535 | total_training_time = time.time() - start_training_time 536 | else: 537 | total_training_time = 0.0 538 | total_time_str = str(datetime.timedelta(seconds=total_training_time)) 539 | logger.info( 540 | "Total training time: {} ({:.4f} s / it)".format( 541 | total_time_str, total_training_time / max_iter 542 | ) 543 | ) 544 | # close the tb logger 545 | meters.close() 546 | 547 | 548 | if __name__ == "__main__": 549 | main() 550 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.1.0" 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/caption_evaluate.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 
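"""Caption evaluation utilities: metric computation on COCO-format predictions
(via the coco_caption COCO/COCOEvalCap tools), remote NoCaps evaluation through
the EvalAI submission CLI, and the SCST reward criterion used for CIDEr-based
sequence-level optimization."""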
2 | 3 | from collections import OrderedDict, defaultdict 4 | import json 5 | import numpy as np 6 | import os.path as op 7 | from pprint import pprint 8 | import torch 9 | import re 10 | import subprocess 11 | import tempfile 12 | import time 13 | from typing import Dict, Optional 14 | 15 | from coco_caption.pycocotools.coco import COCO 16 | from coco_caption.pycocoevalcap.eval import COCOEvalCap 17 | from .cider.pyciderevalcap.ciderD.ciderD import CiderD 18 | 19 | 20 | def evaluate_on_nocaps(split, predict_file, data_dir='data/nocaps/', evaluate_file=None): 21 | ''' 22 | NOTE: Put the auth file in folder ~/.evalai/ 23 | ''' 24 | if not evaluate_file: 25 | evaluate_file = op.splitext(predict_file)[0] + '.eval.json' 26 | if op.isfile(evaluate_file): 27 | print('{} already exists'.format(evaluate_file)) 28 | with open(evaluate_file, 'r') as fp: 29 | metrics = json.load(fp) 30 | return metrics 31 | 32 | image_info_file = op.join(data_dir, 33 | 'nocaps_{}_image_info.json'.format(split)) 34 | image_info = json.load(open(image_info_file)) 35 | open_image_id2id = {} 36 | for it in image_info['images']: 37 | open_image_id2id[it['open_images_id']] = it['id'] 38 | predictions = [] 39 | cap_id = 0 40 | with open(predict_file, 'r') as fp: 41 | for line in fp: 42 | p = line.strip().split('\t') 43 | predictions.append( 44 | {'image_id': open_image_id2id[p[0]], 45 | 'caption': json.loads(p[1])[0]['caption'], 46 | 'id': cap_id}) 47 | cap_id += 1 48 | if split == 'test': 49 | print('Are you sure to submit test split result at: {}'.format(predict_file)) 50 | import ipdb;ipdb.set_trace() 51 | nocapseval = NocapsEvaluator(phase=split) 52 | metrics = nocapseval.evaluate(predictions) 53 | pprint(metrics) 54 | with open(evaluate_file, 'w') as fp: 55 | json.dump(metrics, fp) 56 | return metrics 57 | 58 | 59 | def evaluate_on_coco_caption(res_file, label_file, outfile=None): 60 | """ 61 | res_tsv: TSV file, each row is [image_key, json format list of captions]. 62 | Each caption is a dict, with fields "caption", "conf". 63 | label_file: JSON file of ground truth captions in COCO format. 
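    outfile: optional path to dump the resulting metric dict as JSON;
        when omitted, the result is only printed.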
64 | """ 65 | assert label_file.endswith('.json') 66 | if res_file.endswith('.tsv'): 67 | res_file_coco = op.splitext(res_file)[0] + '_coco_format.json' 68 | convert_tsv_to_coco_format(res_file, res_file_coco) 69 | else: 70 | raise ValueError('unknown prediction result file format: {}'.format(res_file)) 71 | 72 | coco = COCO(label_file) 73 | cocoRes = coco.loadRes(res_file_coco) 74 | cocoEval = COCOEvalCap(coco, cocoRes, 'corpus') 75 | 76 | # evaluate on a subset of images by setting 77 | # cocoEval.params['image_id'] = cocoRes.getImgIds() 78 | # please remove this line when evaluating the full validation set 79 | cocoEval.params['image_id'] = cocoRes.getImgIds() 80 | 81 | # evaluate results 82 | # SPICE will take a few minutes the first time, but speeds up due to caching 83 | cocoEval.evaluate() 84 | result = cocoEval.eval 85 | if not outfile: 86 | print(result) 87 | else: 88 | with open(outfile, 'w') as fp: 89 | json.dump(result, fp, indent=4) 90 | return result 91 | 92 | 93 | def convert_tsv_to_coco_format(res_tsv, outfile, 94 | sep='\t', key_col=0, cap_col=1): 95 | results = [] 96 | with open(res_tsv) as fp: 97 | for line in fp: 98 | parts = line.strip().split(sep) 99 | key = parts[key_col] 100 | if cap_col < len(parts): 101 | caps = json.loads(parts[cap_col]) 102 | assert len(caps) == 1, 'cannot evaluate multiple captions per image' 103 | cap = caps[0].get('caption', '') 104 | else: 105 | # empty caption generated 106 | cap = "" 107 | results.append( 108 | {'image_id': key, 109 | 'caption': cap} 110 | ) 111 | with open(outfile, 'w') as fp: 112 | json.dump(results, fp) 113 | 114 | 115 | class ScstRewardCriterion(torch.nn.Module): 116 | CIDER_REWARD_WEIGHT = 1 117 | 118 | def __init__(self, cider_cached_tokens='corpus', baseline_type='greedy'): 119 | self.CiderD_scorer = CiderD(df=cider_cached_tokens) 120 | assert baseline_type in ['greedy', 'sample'] 121 | self.baseline_type = baseline_type 122 | self._cur_score = None 123 | super().__init__() 124 | 125 | def forward(self, gt_res, greedy_res, sample_res, sample_logprobs): 126 | batch_size = len(gt_res) 127 | sample_res_size = len(sample_res) 128 | seq_per_img = sample_res_size // batch_size 129 | 130 | gen_res = [] 131 | gen_res.extend(sample_res) 132 | gt_idx = [i // seq_per_img for i in range(sample_res_size)] 133 | if self.baseline_type == 'greedy': 134 | assert len(greedy_res) == batch_size 135 | gen_res.extend(greedy_res) 136 | gt_idx.extend([i for i in range(batch_size)]) 137 | 138 | scores = self._calculate_eval_scores(gen_res, gt_idx, gt_res) 139 | 140 | if self.baseline_type == 'greedy': 141 | baseline = scores[-batch_size:][:, np.newaxis] 142 | else: 143 | sc_ = scores.reshape(batch_size, seq_per_img) 144 | baseline = (sc_.sum(1, keepdims=True) - sc_) / (sc_.shape[1] - 1) 145 | 146 | # sample - baseline 147 | reward = scores[:sample_res_size].reshape(batch_size, seq_per_img) 148 | self._cur_score = reward.mean() 149 | reward = reward - baseline 150 | reward = reward.reshape(sample_res_size) 151 | 152 | reward = torch.as_tensor(reward, device=sample_logprobs.device, dtype=torch.float) 153 | loss = - sample_logprobs * reward 154 | loss = loss.mean() 155 | return loss 156 | 157 | def get_score(self): 158 | return self._cur_score 159 | 160 | def _calculate_eval_scores(self, gen_res, gt_idx, gt_res): 161 | ''' 162 | gen_res: generated captions, list of str 163 | gt_idx: list of int, of the same length as gen_res 164 | gt_res: ground truth captions, list of list of str. 
165 | gen_res[i] corresponds to gt_res[gt_idx[i]] 166 | Each image can have multiple ground truth captions 167 | ''' 168 | gen_res_size = len(gen_res) 169 | 170 | res = OrderedDict() 171 | for i in range(gen_res_size): 172 | res[i] = [self._wrap_sentence(gen_res[i])] 173 | 174 | gts = OrderedDict() 175 | gt_res_ = [ 176 | [self._wrap_sentence(gt_res[i][j]) for j in range(len(gt_res[i]))] 177 | for i in range(len(gt_res)) 178 | ] 179 | for i in range(gen_res_size): 180 | gts[i] = gt_res_[gt_idx[i]] 181 | 182 | res_ = [{'image_id':i, 'caption': res[i]} for i in range(len(res))] 183 | _, batch_cider_scores = self.CiderD_scorer.compute_score(gts, res_) 184 | scores = self.CIDER_REWARD_WEIGHT * batch_cider_scores 185 | return scores 186 | 187 | @classmethod 188 | def _wrap_sentence(self, s): 189 | # ensure the sentence ends with token 190 | # in order to keep consisitent with cider_cached_tokens 191 | r = s.strip() 192 | if r.endswith('.'): 193 | r = r[:-1] 194 | r += ' ' 195 | return r 196 | 197 | 198 | class NocapsEvaluator(object): 199 | r""" 200 | Code from https://github.com/nocaps-org/updown-baseline/blob/master/updown/utils/evalai.py 201 | 202 | A utility class to submit model predictions on nocaps splits to EvalAI, and retrieve model 203 | performance based on captioning metrics (such as CIDEr, SPICE). 204 | 205 | Extended Summary 206 | ---------------- 207 | This class and the training script together serve as a working example for "EvalAI in the 208 | loop", showing how evaluation can be done remotely on privately held splits. Annotations 209 | (captions) and evaluation-specific tools (e.g. `coco-caption `_) 210 | are not required locally. This enables users to select best checkpoint, perform early 211 | stopping, learning rate scheduling based on a metric, etc. without actually doing evaluation. 212 | 213 | Parameters 214 | ---------- 215 | phase: str, optional (default = "val") 216 | Which phase to evaluate on. One of "val" or "test". 217 | 218 | Notes 219 | ----- 220 | This class can be used for retrieving metrics on both, val and test splits. However, we 221 | recommend to avoid using it for test split (at least during training). Number of allowed 222 | submissions to test split on EvalAI are very less, and can exhaust in a few iterations! However, 223 | the number of submissions to val split are practically infinite. 224 | """ 225 | 226 | def __init__(self, phase: str = "val"): 227 | 228 | # Constants specific to EvalAI. 229 | self._challenge_id = 355 230 | self._phase_id = 742 if phase == "val" else 743 231 | 232 | def evaluate( 233 | self, predictions, iteration: Optional[int] = None 234 | ) -> Dict[str, Dict[str, float]]: 235 | r""" 236 | Take the model predictions (in COCO format), submit them to EvalAI, and retrieve model 237 | performance based on captioning metrics. 238 | 239 | Parameters 240 | ---------- 241 | predictions: List[Prediction] 242 | Model predictions in COCO format. They are a list of dicts with keys 243 | ``{"image_id": int, "caption": str}``. 244 | iteration: int, optional (default = None) 245 | Training iteration where the checkpoint was evaluated. 246 | 247 | Returns 248 | ------- 249 | Dict[str, Dict[str, float]] 250 | Model performance based on all captioning metrics. 
Nested dict structure:: 251 | 252 | { 253 | "B1": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-1 254 | "B2": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-2 255 | "B3": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-3 256 | "B4": {"in-domain", "near-domain", "out-domain", "entire"}, # BLEU-4 257 | "METEOR": {"in-domain", "near-domain", "out-domain", "entire"}, 258 | "ROUGE-L": {"in-domain", "near-domain", "out-domain", "entire"}, 259 | "CIDEr": {"in-domain", "near-domain", "out-domain", "entire"}, 260 | "SPICE": {"in-domain", "near-domain", "out-domain", "entire"}, 261 | } 262 | 263 | """ 264 | # Save predictions as a json file first. 265 | _, predictions_filename = tempfile.mkstemp(suffix=".json", text=True) 266 | with open(predictions_filename, "w") as f: 267 | json.dump(predictions, f) 268 | 269 | submission_command = ( 270 | f"evalai challenge {self._challenge_id} phase {self._phase_id} " 271 | f"submit --file {predictions_filename}" 272 | ) 273 | 274 | submission_command_subprocess = subprocess.Popen( 275 | submission_command.split(), 276 | stdout=subprocess.PIPE, 277 | stdin=subprocess.PIPE, 278 | stderr=subprocess.STDOUT, 279 | ) 280 | 281 | # This terminal output will have submission ID we need to check. 282 | submission_command_stdout = submission_command_subprocess.communicate(input=b"N\n")[ 283 | 0 284 | ].decode("utf-8") 285 | 286 | submission_id_regex = re.search("evalai submission ([0-9]+)", submission_command_stdout) 287 | try: 288 | # Get an integer submission ID (as a string). 289 | submission_id = submission_id_regex.group(0).split()[-1] # type: ignore 290 | except: 291 | # Very unlikely, but submission may fail because of some glitch. Retry for that. 292 | return self.evaluate(predictions) 293 | 294 | if iteration is not None: 295 | print(f"Submitted predictions for iteration {iteration}, submission id: {submission_id}.") 296 | else: 297 | print(f"Submitted predictions, submission_id: {submission_id}") 298 | 299 | # Placeholder stdout for a pending submission. 300 | result_stdout: str = "The Submission is yet to be evaluated." 301 | num_tries: int = 0 302 | 303 | # Query every 10 seconds for result until it appears. 304 | while "CIDEr" not in result_stdout: 305 | 306 | time.sleep(10) 307 | result_stdout = subprocess.check_output( 308 | ["evalai", "submission", submission_id, "result"] 309 | ).decode("utf-8") 310 | num_tries += 1 311 | 312 | # Raise error if it takes more than 5 minutes. 313 | if num_tries == 30: 314 | raise ConnectionError("Unable to get results from EvalAI within 5 minutes!") 315 | 316 | # Convert result to json. 317 | metrics = json.loads(result_stdout, encoding="utf-8") 318 | 319 | # keys: {"in-domain", "near-domain", "out-domain", "entire"} 320 | # In each of these, keys: {"B1", "B2", "B3", "B4", "METEOR", "ROUGE-L", "CIDEr", "SPICE"} 321 | metrics = { 322 | "in-domain": metrics[0]["in-domain"], 323 | "near-domain": metrics[1]["near-domain"], 324 | "out-domain": metrics[2]["out-domain"], 325 | "entire": metrics[3]["entire"], 326 | } 327 | 328 | # Restructure the metrics dict for better tensorboard logging. 
329 | # keys: {"B1", "B2", "B3", "B4", "METEOR", "ROUGE-L", "CIDEr", "SPICE"} 330 | # In each of these, keys: keys: {"in-domain", "near-domain", "out-domain", "entire"} 331 | flipped_metrics: Dict[str, Dict[str, float]] = defaultdict(dict) 332 | for key, val in metrics.items(): 333 | for subkey, subval in val.items(): 334 | flipped_metrics[subkey][key] = subval 335 | 336 | return flipped_metrics 337 | 338 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/cider/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/cider/cider.py: -------------------------------------------------------------------------------- 1 | # Filename: cider.py 2 | # 3 | # 4 | # Description: Describes the class to compute the CIDEr 5 | # (Consensus-Based Image Description Evaluation) Metric 6 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 7 | # 8 | # Creation Date: Sun Feb 8 14:16:54 2015 9 | # 10 | # Authors: Ramakrishna Vedantam and 11 | # Tsung-Yi Lin 12 | from __future__ import absolute_import 13 | from __future__ import division 14 | from __future__ import print_function 15 | 16 | from .cider_scorer import CiderScorer 17 | 18 | 19 | class Cider: 20 | """ 21 | Main Class to compute the CIDEr metric 22 | 23 | """ 24 | def __init__(self, n=4, df="corpus"): 25 | """ 26 | Initialize the CIDEr scoring function 27 | : param n (int): n-gram size 28 | : param df (string): specifies where to get the IDF values from 29 | takes values 'corpus', 'coco-train' 30 | : return: None 31 | """ 32 | # set cider to sum over 1 to 4-grams 33 | self._n = n 34 | self._df = df 35 | self.cider_scorer = CiderScorer(n=self._n, df_mode=self._df) 36 | 37 | def compute_score(self, gts, res): 38 | """ 39 | Main function to compute CIDEr score 40 | : param gts (dict) : {image:tokenized reference sentence} 41 | : param res (dict) : {image:tokenized candidate sentence} 42 | : return: cider (float) : computed CIDEr score for the corpus 43 | """ 44 | 45 | # clear all the previous hypos and refs 46 | self.cider_scorer.clear() 47 | 48 | for res_id in res: 49 | 50 | hypo = res_id['caption'] 51 | ref = gts[res_id['image_id']] 52 | 53 | # Sanity check. 
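            # (each candidate entry must hold exactly one caption string and
            #  every image needs at least one reference caption)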
54 | assert(type(hypo) is list) 55 | assert(len(hypo) == 1) 56 | assert(type(ref) is list) 57 | assert(len(ref) > 0) 58 | self.cider_scorer += (hypo[0], ref) 59 | 60 | (score, scores) = self.cider_scorer.compute_score() 61 | 62 | return score, scores 63 | 64 | def method(self): 65 | return "CIDEr" 66 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/cider/cider_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import copy 9 | import six 10 | from six.moves import cPickle 11 | from collections import defaultdict 12 | import numpy as np 13 | import math 14 | import os 15 | 16 | def precook(s, n=4, out=False): 17 | """ 18 | Takes a string as input and returns an object that can be given to 19 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 20 | can take string arguments as well. 21 | :param s: string : sentence to be converted into ngrams 22 | :param n: int : number of ngrams for which representation is calculated 23 | :return: term frequency vector for occuring ngrams 24 | """ 25 | words = s.split() 26 | counts = defaultdict(int) 27 | for k in range(1,n+1): 28 | for i in range(len(words)-k+1): 29 | ngram = tuple(words[i:i+k]) 30 | counts[ngram] += 1 31 | return counts 32 | 33 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 34 | '''Takes a list of reference sentences for a single segment 35 | and returns an object that encapsulates everything that BLEU 36 | needs to know about them. 37 | :param refs: list of string : reference sentences for some image 38 | :param n: int : number of ngrams for which (ngram) representation is calculated 39 | :return: result (list of dict) 40 | ''' 41 | return [precook(ref, n) for ref in refs] 42 | 43 | def cook_test(test, n=4): 44 | '''Takes a test sentence and returns an object that 45 | encapsulates everything that BLEU needs to know about it. 46 | :param test: list of string : hypothesis sentence for some image 47 | :param n: int : number of ngrams for which (ngram) representation is calculated 48 | :return: result (dict) 49 | ''' 50 | return precook(test, n, True) 51 | 52 | class CiderScorer(object): 53 | """CIDEr scorer. 
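    Accumulates (hypothesis, reference list) pairs via cook_append/__iadd__ and
    computes corpus-level CIDEr from tf-idf weighted n-gram cosine similarity
    in compute_score.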
54 | """ 55 | 56 | def copy(self): 57 | ''' copy the refs.''' 58 | new = CiderScorer(n=self.n) 59 | new.ctest = copy.copy(self.ctest) 60 | new.crefs = copy.copy(self.crefs) 61 | return new 62 | 63 | def __init__(self, df_mode="corpus", test=None, refs=None, n=4, sigma=6.0): 64 | ''' singular instance ''' 65 | self.n = n 66 | self.sigma = sigma 67 | self.crefs = [] 68 | self.ctest = [] 69 | self.df_mode = df_mode 70 | self.ref_len = None 71 | if self.df_mode != "corpus": 72 | pkl_file = cPickle.load(open(os.path.join('data', df_mode + '.p'),'rb'), **(dict(encoding='latin1') if six.PY3 else {})) 73 | self.ref_len = np.log(float(pkl_file['ref_len'])) 74 | self.document_frequency = pkl_file['document_frequency'] 75 | self.cook_append(test, refs) 76 | 77 | def clear(self): 78 | self.crefs = [] 79 | self.ctest = [] 80 | 81 | def cook_append(self, test, refs): 82 | '''called by constructor and __iadd__ to avoid creating new instances.''' 83 | 84 | if refs is not None: 85 | self.crefs.append(cook_refs(refs)) 86 | if test is not None: 87 | self.ctest.append(cook_test(test)) ## N.B.: -1 88 | else: 89 | self.ctest.append(None) # lens of crefs and ctest have to match 90 | 91 | def size(self): 92 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 93 | return len(self.crefs) 94 | 95 | def __iadd__(self, other): 96 | '''add an instance (e.g., from another sentence).''' 97 | 98 | if type(other) is tuple: 99 | ## avoid creating new CiderScorer instances 100 | self.cook_append(other[0], other[1]) 101 | else: 102 | self.ctest.extend(other.ctest) 103 | self.crefs.extend(other.crefs) 104 | 105 | return self 106 | def compute_doc_freq(self): 107 | ''' 108 | Compute term frequency for reference data. 109 | This will be used to compute idf (inverse document frequency later) 110 | The term frequency is stored in the object 111 | :return: None 112 | ''' 113 | for refs in self.crefs: 114 | # refs, k ref captions of one image 115 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.items()]): 116 | self.document_frequency[ngram] += 1 117 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 118 | 119 | def compute_cider(self): 120 | def counts2vec(cnts): 121 | """ 122 | Function maps counts of ngram to vector of tfidf weights. 123 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 124 | The n-th entry of array denotes length of n-grams. 125 | :param cnts: 126 | :return: vec (array of dict), norm (array of float), length (int) 127 | """ 128 | vec = [defaultdict(float) for _ in range(self.n)] 129 | length = 0 130 | norm = [0.0 for _ in range(self.n)] 131 | for (ngram,term_freq) in cnts.items(): 132 | # give word count 1 if it doesn't appear in reference corpus 133 | df = np.log(max(1.0, self.document_frequency[ngram])) 134 | # ngram index 135 | n = len(ngram)-1 136 | # tf (term_freq) * idf (precomputed idf) for n-grams 137 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 138 | # compute norm for the vector. the norm will be used for 139 | # computing similarity 140 | norm[n] += pow(vec[n][ngram], 2) 141 | 142 | if n == 1: 143 | length += term_freq 144 | norm = [np.sqrt(n) for n in norm] 145 | return vec, norm, length 146 | 147 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 148 | ''' 149 | Compute the cosine similarity of two vectors. 
150 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 151 | :param vec_ref: array of dictionary for vector corresponding to reference 152 | :param norm_hyp: array of float for vector corresponding to hypothesis 153 | :param norm_ref: array of float for vector corresponding to reference 154 | :param length_hyp: int containing length of hypothesis 155 | :param length_ref: int containing length of reference 156 | :return: array of score for each n-grams cosine similarity 157 | ''' 158 | delta = float(length_hyp - length_ref) 159 | # measure consine similarity 160 | val = np.array([0.0 for _ in range(self.n)]) 161 | for n in range(self.n): 162 | # ngram 163 | for (ngram,count) in vec_hyp[n].items(): 164 | val[n] += vec_hyp[n][ngram] * vec_ref[n][ngram] 165 | 166 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 167 | val[n] /= (norm_hyp[n]*norm_ref[n]) 168 | 169 | assert(not math.isnan(val[n])) 170 | return val 171 | 172 | # compute log reference length 173 | if self.df_mode == "corpus": 174 | self.ref_len = np.log(float(len(self.crefs))) 175 | 176 | scores = [] 177 | for test, refs in zip(self.ctest, self.crefs): 178 | # compute vector for test captions 179 | vec, norm, length = counts2vec(test) 180 | # compute vector for ref captions 181 | score = np.array([0.0 for _ in range(self.n)]) 182 | for ref in refs: 183 | vec_ref, norm_ref, length_ref = counts2vec(ref) 184 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 185 | # change by vrama91 - mean of ngram scores, instead of sum 186 | score_avg = np.mean(score) 187 | # divide by number of references 188 | score_avg /= len(refs) 189 | # multiply score by 10 190 | score_avg *= 10.0 191 | # append score of an image to the score list 192 | scores.append(score_avg) 193 | return scores 194 | 195 | def compute_score(self, option=None, verbose=0): 196 | # compute idf 197 | if self.df_mode == "corpus": 198 | self.document_frequency = defaultdict(float) 199 | self.compute_doc_freq() 200 | # assert to check document frequency 201 | assert(len(self.ctest) >= max(self.document_frequency.values())) 202 | # import json for now and write the corresponding files 203 | # compute cider score 204 | score = self.compute_cider() 205 | # debug 206 | # print score 207 | return np.mean(np.array(score)), np.array(score) 208 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/ciderD/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/ciderD/ciderD.py: -------------------------------------------------------------------------------- 1 | # Filename: ciderD.py 2 | # 3 | # Description: Describes the class to compute the CIDEr-D (Consensus-Based Image Description Evaluation) Metric 4 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 5 | # 6 | # Creation Date: Sun Feb 8 14:16:54 2015 7 | # 8 | # Authors: Ramakrishna Vedantam and Tsung-Yi Lin 9 | from __future__ import absolute_import 10 | from __future__ import division 11 | from __future__ import print_function 12 | 13 | from .ciderD_scorer import CiderScorer 14 | import pdb 15 | 16 | class CiderD: 17 | """ 18 | Main Class to compute the CIDEr metric 19 | 20 | """ 21 | def __init__(self, n=4, sigma=6.0, df="corpus"): 22 | # set cider to sum over 1 to 4-grams 23 | self._n = n 24 | # set the 
standard deviation parameter for gaussian penalty 25 | self._sigma = sigma 26 | # set which where to compute document frequencies from 27 | self._df = df 28 | self.cider_scorer = CiderScorer(n=self._n, df_mode=self._df) 29 | 30 | def compute_score(self, gts, res): 31 | """ 32 | Main function to compute CIDEr score 33 | :param hypo_for_image (dict) : dictionary with key and value 34 | ref_for_image (dict) : dictionary with key and value 35 | :return: cider (float) : computed CIDEr score for the corpus 36 | """ 37 | 38 | # clear all the previous hypos and refs 39 | tmp_cider_scorer = self.cider_scorer.copy_empty() 40 | tmp_cider_scorer.clear() 41 | for res_id in res: 42 | 43 | hypo = res_id['caption'] 44 | ref = gts[res_id['image_id']] 45 | 46 | # Sanity check. 47 | assert(type(hypo) is list) 48 | assert(len(hypo) == 1) 49 | assert(type(ref) is list) 50 | assert(len(ref) > 0) 51 | tmp_cider_scorer += (hypo[0], ref) 52 | 53 | (score, scores) = tmp_cider_scorer.compute_score() 54 | 55 | return score, scores 56 | 57 | def method(self): 58 | return "CIDEr-D" 59 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/cider/pyciderevalcap/ciderD/ciderD_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import copy 9 | from collections import defaultdict 10 | import numpy as np 11 | import pdb 12 | import math 13 | import six 14 | from six.moves import cPickle 15 | import os 16 | 17 | def precook(s, n=4, out=False): 18 | """ 19 | Takes a string as input and returns an object that can be given to 20 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 21 | can take string arguments as well. 22 | :param s: string : sentence to be converted into ngrams 23 | :param n: int : number of ngrams for which representation is calculated 24 | :return: term frequency vector for occuring ngrams 25 | """ 26 | words = s.split() 27 | counts = defaultdict(int) 28 | for k in range(1,n+1): 29 | for i in range(len(words)-k+1): 30 | ngram = tuple(words[i:i+k]) 31 | counts[ngram] += 1 32 | return counts 33 | 34 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 35 | '''Takes a list of reference sentences for a single segment 36 | and returns an object that encapsulates everything that BLEU 37 | needs to know about them. 38 | :param refs: list of string : reference sentences for some image 39 | :param n: int : number of ngrams for which (ngram) representation is calculated 40 | :return: result (list of dict) 41 | ''' 42 | return [precook(ref, n) for ref in refs] 43 | 44 | def cook_test(test, n=4): 45 | '''Takes a test sentence and returns an object that 46 | encapsulates everything that BLEU needs to know about it. 47 | :param test: list of string : hypothesis sentence for some image 48 | :param n: int : number of ngrams for which (ngram) representation is calculated 49 | :return: result (dict) 50 | ''' 51 | return precook(test, n, True) 52 | 53 | class CiderScorer(object): 54 | """CIDEr scorer. 
55 | """ 56 | 57 | def copy(self): 58 | ''' copy the refs.''' 59 | new = CiderScorer(n=self.n) 60 | new.ctest = copy.copy(self.ctest) 61 | new.crefs = copy.copy(self.crefs) 62 | return new 63 | 64 | def copy_empty(self): 65 | new = CiderScorer(df_mode="corpus", n=self.n, sigma=self.sigma) 66 | new.df_mode = self.df_mode 67 | new.ref_len = self.ref_len 68 | new.document_frequency = self.document_frequency 69 | return new 70 | 71 | def __init__(self, df_mode="corpus", test=None, refs=None, n=4, sigma=6.0): 72 | ''' singular instance ''' 73 | self.n = n 74 | self.sigma = sigma 75 | self.crefs = [] 76 | self.ctest = [] 77 | self.df_mode = df_mode 78 | self.ref_len = None 79 | if self.df_mode != "corpus": 80 | pkl_file = cPickle.load(open(df_mode,'rb'), **(dict(encoding='latin1') if six.PY3 else {})) 81 | self.ref_len = np.log(float(pkl_file['ref_len'])) 82 | self.document_frequency = pkl_file['document_frequency'] 83 | else: 84 | self.document_frequency = None 85 | self.cook_append(test, refs) 86 | 87 | def clear(self): 88 | self.crefs = [] 89 | self.ctest = [] 90 | 91 | def cook_append(self, test, refs): 92 | '''called by constructor and __iadd__ to avoid creating new instances.''' 93 | 94 | if refs is not None: 95 | self.crefs.append(cook_refs(refs)) 96 | if test is not None: 97 | self.ctest.append(cook_test(test)) ## N.B.: -1 98 | else: 99 | self.ctest.append(None) # lens of crefs and ctest have to match 100 | 101 | def size(self): 102 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 103 | return len(self.crefs) 104 | 105 | def __iadd__(self, other): 106 | '''add an instance (e.g., from another sentence).''' 107 | 108 | if type(other) is tuple: 109 | ## avoid creating new CiderScorer instances 110 | self.cook_append(other[0], other[1]) 111 | else: 112 | self.ctest.extend(other.ctest) 113 | self.crefs.extend(other.crefs) 114 | 115 | return self 116 | def compute_doc_freq(self): 117 | ''' 118 | Compute term frequency for reference data. 119 | This will be used to compute idf (inverse document frequency later) 120 | The term frequency is stored in the object 121 | :return: None 122 | ''' 123 | for refs in self.crefs: 124 | # refs, k ref captions of one image 125 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.items()]): 126 | self.document_frequency[ngram] += 1 127 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 128 | 129 | def compute_cider(self): 130 | def counts2vec(cnts): 131 | """ 132 | Function maps counts of ngram to vector of tfidf weights. 133 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 134 | The n-th entry of array denotes length of n-grams. 135 | :param cnts: 136 | :return: vec (array of dict), norm (array of float), length (int) 137 | """ 138 | vec = [defaultdict(float) for _ in range(self.n)] 139 | length = 0 140 | norm = [0.0 for _ in range(self.n)] 141 | for (ngram,term_freq) in cnts.items(): 142 | # give word count 1 if it doesn't appear in reference corpus 143 | df = np.log(max(1.0, self.document_frequency[ngram])) 144 | # ngram index 145 | n = len(ngram)-1 146 | # tf (term_freq) * idf (precomputed idf) for n-grams 147 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 148 | # compute norm for the vector. 
the norm will be used for computing similarity 149 | norm[n] += pow(vec[n][ngram], 2) 150 | 151 | if n == 1: 152 | length += term_freq 153 | norm = [np.sqrt(n) for n in norm] 154 | return vec, norm, length 155 | 156 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 157 | ''' 158 | Compute the cosine similarity of two vectors. 159 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 160 | :param vec_ref: array of dictionary for vector corresponding to reference 161 | :param norm_hyp: array of float for vector corresponding to hypothesis 162 | :param norm_ref: array of float for vector corresponding to reference 163 | :param length_hyp: int containing length of hypothesis 164 | :param length_ref: int containing length of reference 165 | :return: array of score for each n-grams cosine similarity 166 | ''' 167 | delta = float(length_hyp - length_ref) 168 | # measure consine similarity 169 | val = np.array([0.0 for _ in range(self.n)]) 170 | for n in range(self.n): 171 | # ngram 172 | for (ngram,count) in vec_hyp[n].items(): 173 | # vrama91 : added clipping 174 | val[n] += min(vec_hyp[n][ngram], vec_ref[n][ngram]) * vec_ref[n][ngram] 175 | 176 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 177 | val[n] /= (norm_hyp[n]*norm_ref[n]) 178 | 179 | assert(not math.isnan(val[n])) 180 | # vrama91: added a length based gaussian penalty 181 | val[n] *= np.e**(-(delta**2)/(2*self.sigma**2)) 182 | return val 183 | 184 | # compute log reference length 185 | if self.df_mode == "corpus": 186 | self.ref_len = np.log(float(len(self.crefs))) 187 | #elif self.df_mode == "coco-val-df": 188 | # if coco option selected, use length of coco-val set 189 | # self.ref_len = np.log(float(40504)) 190 | 191 | scores = [] 192 | for test, refs in zip(self.ctest, self.crefs): 193 | # compute vector for test captions 194 | vec, norm, length = counts2vec(test) 195 | # compute vector for ref captions 196 | score = np.array([0.0 for _ in range(self.n)]) 197 | for ref in refs: 198 | vec_ref, norm_ref, length_ref = counts2vec(ref) 199 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 200 | # change by vrama91 - mean of ngram scores, instead of sum 201 | score_avg = np.mean(score) 202 | # divide by number of references 203 | score_avg /= len(refs) 204 | # multiply score by 10 205 | score_avg *= 10.0 206 | # append score of an image to the score list 207 | scores.append(score_avg) 208 | return scores 209 | 210 | def compute_score(self, option=None, verbose=0): 211 | # compute idf 212 | if self.df_mode == "corpus": 213 | self.document_frequency = defaultdict(float) 214 | self.compute_doc_freq() 215 | # assert to check document frequency 216 | assert(len(self.ctest) >= max(self.document_frequency.values())) 217 | # import json for now and write the corresponding files 218 | # compute cider score 219 | score = self.compute_cider() 220 | # debug 221 | # print score 222 | return np.mean(np.array(score)), np.array(score) 223 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/logger.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | import logging 4 | from logging import StreamHandler, Handler, getLevelName 5 | import os 6 | import sys 7 | 8 | 9 | # this class is a copy of logging.FileHandler except we end self.close() 10 | # at the end of each emit. 
While closing file and reopening file after each 11 | # write is not efficient, it allows us to see partial logs when writing to 12 | # fused Azure blobs, which is very convenient 13 | class FileHandler(StreamHandler): 14 | """ 15 | A handler class which writes formatted logging records to disk files. 16 | """ 17 | def __init__(self, filename, mode='a', encoding=None, delay=False): 18 | """ 19 | Open the specified file and use it as the stream for logging. 20 | """ 21 | # Issue #27493: add support for Path objects to be passed in 22 | filename = os.fspath(filename) 23 | #keep the absolute path, otherwise derived classes which use this 24 | #may come a cropper when the current directory changes 25 | self.baseFilename = os.path.abspath(filename) 26 | self.mode = mode 27 | self.encoding = encoding 28 | self.delay = delay 29 | if delay: 30 | #We don't open the stream, but we still need to call the 31 | #Handler constructor to set level, formatter, lock etc. 32 | Handler.__init__(self) 33 | self.stream = None 34 | else: 35 | StreamHandler.__init__(self, self._open()) 36 | 37 | def close(self): 38 | """ 39 | Closes the stream. 40 | """ 41 | self.acquire() 42 | try: 43 | try: 44 | if self.stream: 45 | try: 46 | self.flush() 47 | finally: 48 | stream = self.stream 49 | self.stream = None 50 | if hasattr(stream, "close"): 51 | stream.close() 52 | finally: 53 | # Issue #19523: call unconditionally to 54 | # prevent a handler leak when delay is set 55 | StreamHandler.close(self) 56 | finally: 57 | self.release() 58 | 59 | def _open(self): 60 | """ 61 | Open the current base file with the (original) mode and encoding. 62 | Return the resulting stream. 63 | """ 64 | return open(self.baseFilename, self.mode, encoding=self.encoding) 65 | 66 | def emit(self, record): 67 | """ 68 | Emit a record. 69 | 70 | If the stream was not opened because 'delay' was specified in the 71 | constructor, open it before calling the superclass's emit. 72 | """ 73 | if self.stream is None: 74 | self.stream = self._open() 75 | StreamHandler.emit(self, record) 76 | self.close() 77 | 78 | def __repr__(self): 79 | level = getLevelName(self.level) 80 | return '<%s %s (%s)>' % (self.__class__.__name__, self.baseFilename, level) 81 | 82 | 83 | def setup_logger(name, save_dir, distributed_rank, filename="log.txt"): 84 | logger = logging.getLogger(name) 85 | logger.setLevel(logging.DEBUG) 86 | # don't log results for the non-master process 87 | if distributed_rank > 0: 88 | return logger 89 | ch = logging.StreamHandler(stream=sys.stdout) 90 | ch.setLevel(logging.DEBUG) 91 | formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s") 92 | ch.setFormatter(formatter) 93 | logger.addHandler(ch) 94 | 95 | if save_dir: 96 | fh = FileHandler(os.path.join(save_dir, filename)) 97 | fh.setLevel(logging.DEBUG) 98 | fh.setFormatter(formatter) 99 | logger.addHandler(fh) 100 | 101 | return logger 102 | 103 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/metric_logger.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved. 2 | from collections import defaultdict 3 | from collections import deque 4 | import os 5 | 6 | import torch 7 | 8 | from .misc import is_main_process 9 | 10 | 11 | class SmoothedValue(object): 12 | """Track a series of values and provide access to smoothed values over a 13 | window or the global series average. 
14 | """ 15 | 16 | def __init__(self, window_size=20): 17 | self.deque = deque(maxlen=window_size) 18 | # self.series = [] 19 | self.total = 0.0 20 | self.count = 0 21 | 22 | def update(self, value): 23 | self.deque.append(value) 24 | # self.series.append(value) 25 | self.count += 1 26 | self.total += value 27 | 28 | @property 29 | def median(self): 30 | d = torch.tensor(list(self.deque)) 31 | return d.median().item() 32 | 33 | @property 34 | def avg(self): 35 | d = torch.tensor(list(self.deque)) 36 | return d.mean().item() 37 | 38 | @property 39 | def global_avg(self): 40 | return self.total / self.count 41 | 42 | @property 43 | def last_value(self): 44 | return self.deque[-1] 45 | 46 | 47 | class MetricLogger(object): 48 | def __init__(self, delimiter="\t"): 49 | self.meters = {} 50 | self.params = {} 51 | self.delimiter = delimiter 52 | 53 | def update_params(self, update_dict): 54 | for param_group, group_dict in update_dict.items(): 55 | if param_group not in self.params: 56 | self.params[param_group] = {} 57 | for param_name, param_value in group_dict.items(): 58 | # skipping parameters if they start with '_' 59 | if param_name.startswith('_'): 60 | continue 61 | if isinstance(param_value, torch.Tensor): 62 | param_value = param_value.item() 63 | assert isinstance(param_value, (float, int)) 64 | self.params[param_group][param_name] = param_value 65 | 66 | def update_metrics(self, update_dict): 67 | for metric_group, group_dict in update_dict.items(): 68 | if metric_group not in self.meters: 69 | self.meters[metric_group] = defaultdict(SmoothedValue) 70 | for metric_name, metric_value in group_dict.items(): 71 | # skipping metrics if they start with '_' 72 | if metric_name.startswith('_'): 73 | continue 74 | if isinstance(metric_value, torch.Tensor): 75 | metric_value = metric_value.item() 76 | assert isinstance(metric_value, (float, int)) 77 | self.meters[metric_group][metric_name].update(metric_value) 78 | 79 | def get_logs(self, iteration): 80 | return_str = [] 81 | if len(self.meters) > 0: 82 | offset_m = max([len(group_name) for group_name in self.meters.keys()]) 83 | else: 84 | offset_m = 0 85 | if len(self.params) > 0: 86 | offset_p = max([len(group_name) for group_name in self.params.keys()]) 87 | else: 88 | offset_p = 0 89 | offset = max(offset_m, offset_p) 90 | 91 | for group_name, values in sorted(self.meters.items(), 92 | key=lambda x: x[0]): 93 | loss_str = [] 94 | for name, meter in values.items(): 95 | loss_str.append("{}: {:.4f} ({:.4f})".format( 96 | name, meter.median, meter.global_avg, 97 | )) 98 | return_str.append( 99 | "{:{offset}s} - {}".format( 100 | group_name, self.delimiter.join(loss_str), offset=offset, 101 | ), 102 | ) 103 | for group_name, values in self.params.items(): 104 | loss_str = [] 105 | for name, param in values.items(): 106 | loss_str.append("{}: {:.6f}".format(name, param)) 107 | return_str.append( 108 | "{:{offset}s} - {}".format( 109 | group_name, self.delimiter.join(loss_str), offset=offset, 110 | ), 111 | ) 112 | return "\n ".join(return_str) 113 | 114 | 115 | class TensorboardLogger(MetricLogger): 116 | def __init__(self, 117 | log_dir, 118 | delimiter='\t'): 119 | super(TensorboardLogger, self).__init__(delimiter) 120 | try: 121 | from tensorboardX import SummaryWriter 122 | except ImportError: 123 | raise ImportError( 124 | 'To use tensorboard please install tensorboardX ' 125 | '[ pip install tensorboardx ].' 
126 | ) 127 | self.philly_tb_logger = None 128 | self.philly_tb_logger_avg = None 129 | self.philly_tb_logger_med = None 130 | if is_main_process(): 131 | self.tb_logger = SummaryWriter(log_dir) 132 | self.tb_logger_avg = SummaryWriter(os.path.join(log_dir, 'avg')) 133 | self.tb_logger_med = SummaryWriter(os.path.join(log_dir, 'med')) 134 | else: 135 | self.tb_logger = None 136 | self.tb_logger_avg = None 137 | self.tb_logger_med = None 138 | 139 | def get_logs(self, iteration): 140 | if self.tb_logger: 141 | for group_name, values in self.meters.items(): 142 | for name, meter in values.items(): 143 | self.tb_logger.add_scalar( 144 | '{}/{}'.format(group_name, name), 145 | meter.last_value, iteration, 146 | ) 147 | self.tb_logger_avg.add_scalar( 148 | '{}/{}'.format(group_name, name), 149 | meter.avg, iteration, 150 | ) 151 | self.tb_logger_med.add_scalar( 152 | '{}/{}'.format(group_name, name), 153 | meter.median, iteration, 154 | ) 155 | if self.philly_tb_logger: 156 | self.philly_tb_logger.add_scalar( 157 | '{}/{}'.format(group_name, name), 158 | meter.last_value, iteration, 159 | ) 160 | self.philly_tb_logger_avg.add_scalar( 161 | '{}/{}'.format(group_name, name), 162 | meter.avg, iteration, 163 | ) 164 | self.philly_tb_logger_med.add_scalar( 165 | '{}/{}'.format(group_name, name), 166 | meter.median, iteration, 167 | ) 168 | for group_name, values in self.params.items(): 169 | for name, param in values.items(): 170 | self.tb_logger.add_scalar( 171 | '{}/{}'.format(group_name, name), 172 | param, iteration, 173 | ) 174 | if self.philly_tb_logger: 175 | self.philly_tb_logger.add_scalar( 176 | '{}/{}'.format(group_name, name), 177 | param, iteration, 178 | ) 179 | return super(TensorboardLogger, self).get_logs(iteration) 180 | 181 | def close(self): 182 | if is_main_process(): 183 | self.tb_logger.close() 184 | self.tb_logger_avg.close() 185 | self.tb_logger_med.close() 186 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/misc.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | import errno 4 | import os 5 | import os.path as op 6 | import yaml 7 | import random 8 | import torch 9 | import numpy as np 10 | import torch.distributed as dist 11 | 12 | 13 | def mkdir(path): 14 | # if it is the current folder, skip. 
15 | if path == '': 16 | return 17 | try: 18 | os.makedirs(path) 19 | except OSError as e: 20 | if e.errno != errno.EEXIST: 21 | raise 22 | 23 | 24 | def set_seed(seed, n_gpu): 25 | random.seed(seed) 26 | np.random.seed(seed) 27 | torch.manual_seed(seed) 28 | if n_gpu > 0: 29 | torch.cuda.manual_seed_all(seed) 30 | 31 | 32 | def load_from_yaml_file(yaml_file): 33 | with open(yaml_file, 'r') as fp: 34 | return yaml.load(fp) 35 | 36 | 37 | def find_file_path_in_yaml(fname, root): 38 | if fname is not None: 39 | if op.isfile(fname): 40 | return fname 41 | elif op.isfile(op.join(root, fname)): 42 | return op.join(root, fname) 43 | else: 44 | raise FileNotFoundError( 45 | errno.ENOENT, os.strerror(errno.ENOENT), op.join(root, fname) 46 | ) 47 | 48 | 49 | def get_rank(): 50 | if not dist.is_available(): 51 | return 0 52 | if not dist.is_initialized(): 53 | return 0 54 | return dist.get_rank() 55 | 56 | 57 | def is_main_process(): 58 | return get_rank() == 0 59 | 60 | 61 | def get_world_size(): 62 | if not dist.is_available(): 63 | return 1 64 | if not dist.is_initialized(): 65 | return 1 66 | return dist.get_world_size() 67 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/task_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | from __future__ import absolute_import, division, print_function 4 | 5 | import csv, json 6 | import logging 7 | import os 8 | import sys 9 | from io import open 10 | import _pickle as cPickle 11 | import torch 12 | 13 | logger = logging.getLogger(__name__) 14 | 15 | 16 | class InputInstance(object): 17 | """A single training/test example for simple sequence classification.""" 18 | 19 | def __init__(self, guid, text_a, text_b=None, label=None, score=None, img_key=None, q_id=None): 20 | """Constructs a InputExample. 21 | 22 | Args: 23 | guid: Unique id for the example. 24 | text_a: string. The untokenized text of the first sequence. For single 25 | sequence tasks, only this sequence must be specified. 26 | text_b: (Optional) string. The untokenized text of the second sequence. 27 | Only must be specified for sequence pair tasks. 28 | label: (Optional) string. The label of the example. This should be 29 | specified for train and dev examples, but not for test examples. 
30 |         """
31 | 
32 |         self.guid = guid
33 |         self.text_a = text_a
34 |         self.text_b = text_b
35 |         self.label = label
36 |         self.score = score
37 |         self.img_key = img_key
38 |         self.q_id = q_id
39 | 
40 | 
41 | class InputFeat(object):
42 |     """A single set of features of data."""
43 | 
44 |     def __init__(self, input_ids, input_mask, segment_ids, label_id, score, img_feat):
45 |         self.input_ids = input_ids
46 |         self.input_mask = input_mask
47 |         self.segment_ids = segment_ids
48 |         self.label_id = label_id
49 |         self.score = score
50 |         self.img_feat = img_feat
51 | 
52 | 
53 | class DataProcessor(object):
54 |     """Base class for data converters for sequence classification data sets."""
55 | 
56 |     def get_train_examples(self, data_dir):
57 |         """Gets a collection of `InputExample`s for the train set."""
58 |         raise NotImplementedError()
59 | 
60 |     def get_dev_examples(self, data_dir):
61 |         """Gets a collection of `InputExample`s for the dev set."""
62 |         raise NotImplementedError()
63 | 
64 |     def get_labels(self):
65 |         """Gets the list of labels for this data set."""
66 |         raise NotImplementedError()
67 | 
68 |     @classmethod
69 |     def _read_tsv(cls, input_file, quotechar=None):
70 |         """Reads a tab separated value file."""
71 |         with open(input_file, "r", encoding="utf-8-sig") as f:
72 |             reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
73 |             lines = []
74 |             for line in reader:
75 |                 if sys.version_info[0] == 2:
76 |                     line = list(unicode(cell, 'utf-8') for cell in line)
77 |                 lines.append(line)
78 |             return lines
79 | 
80 | 
81 | class VQATextProcessor(DataProcessor):
82 |     """ Processor for the VQA Text data set. """
83 | 
84 |     def get_train_examples(self, data_dir, file_name='train2014_qla.json'):
85 |         """ See base class."""
86 | 
87 |         lines = json.load(open(os.path.join(data_dir, file_name)))
88 |         declars = open(os.path.join(data_dir, 'train2014_declarative.json'), "r").read().split("\n")
89 |         assert len(lines) == len(declars)
90 |         return self._create_examples(lines, declars, "train")
91 | 
92 |         #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train")
93 | 
94 |     def get_dev_examples(self, data_dir, file_name='val2014_qla.json'):
95 |         """ See base class."""
96 | 
97 |         lines = json.load(open(os.path.join(data_dir, file_name)))
98 |         declars = open(os.path.join(data_dir, 'val2014_declarative.json'), "r").read().split("\n")
99 |         assert len(lines) == len(declars)
100 |         return self._create_examples(lines, declars, "dev")
101 | 
102 |         #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev")
103 | 
104 |     def get_test_examples(self, data_dir, file_name='test2015_qla.json'):
105 |         """ See base class."""
106 |         # _create_examples expects one declaration per question; the declaration file name here is assumed, mirroring the train/dev loaders
107 |         declars = open(os.path.join(data_dir, 'test2015_declarative.json'), "r").read().split("\n")
108 |         lines = json.load(open(os.path.join(data_dir, file_name)))
109 |         return self._create_examples(lines, declars, "test")
110 |     def get_labels(self, label_file):
111 |         """ See base class."""
112 | 
113 |         ans2label = cPickle.load(open(label_file, 'rb'))
114 |         return list(range(len(ans2label)))
115 |         # return list(ans2label.values())
116 |         #return ["entailment", "not_entailment"]
117 | 
118 |     def _create_examples(self, lines, declars, set_type):
119 |         """Creates examples for the training and dev sets."""
120 | 
121 |         examples = []
122 |         for (i, (line, declar)) in enumerate(zip(lines, declars)):
123 |             if set_type!='test' and len(line['an']) == 0: continue
124 | 
125 |             guid = "%s-%s" % (set_type, str(i))
126 |             text_a = line['q']
127 |             text_b = line['o'].replace(';', ' ').strip() #line['o']
128 |             label = None if set_type.startswith('test') else line['an']
129
| score = None if set_type.startswith('test') else line['s'] 130 | img_key = line['img_id'] 131 | q_id = i # int(line['q_id']) if set_type.startswith('test') else 0 132 | 133 | exam = InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, 134 | q_id=q_id) 135 | exam.text_c = declar 136 | examples.append(exam) 137 | return examples 138 | 139 | class VQATextAProcessor(DataProcessor): 140 | """ Processor for the VQA Text data set. """ 141 | 142 | def get_train_examples(self, data_dir, file_name='train2014_qla.json'): 143 | """ See base class.""" 144 | 145 | lines = json.load(open(os.path.join(data_dir, file_name))) 146 | return self._create_examples(lines, "train") 147 | 148 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train") 149 | 150 | def get_dev_examples(self, data_dir, file_name='val2014_qla.json'): 151 | """ See base class.""" 152 | 153 | lines = json.load(open(os.path.join(data_dir, file_name))) 154 | return self._create_examples(lines, "dev") 155 | 156 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev") 157 | 158 | def get_test_examples(self, data_dir, file_name='test2015_qla.json'): 159 | """ See base class.""" 160 | 161 | lines = json.load(open(os.path.join(data_dir, file_name))) 162 | return self._create_examples(lines, "test") 163 | 164 | def get_labels(self, label_file): 165 | """ See base class.""" 166 | 167 | ans2label = cPickle.load(open(label_file, 'rb')) 168 | return list(ans2label.values()) 169 | 170 | def _create_examples(self, lines, set_type): 171 | """Creates examples for the training and dev sets.""" 172 | 173 | examples = [] 174 | for (i, line) in enumerate(lines): 175 | if set_type!='test' and len(line['an']) == 0: continue 176 | 177 | guid = "%s-%s" % (set_type, str(i)) 178 | text_a = line['q'] 179 | text_b = None # line['o'] # or None 180 | label = None if set_type.startswith('test') else line['an'] 181 | score = None if set_type.startswith('test') else line['s'] 182 | img_key = line['img_id'] 183 | q_id = int(line['q_id']) if set_type.startswith('test') else 0 184 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, q_id=q_id)) 185 | return examples 186 | 187 | class GQAProcessor(DataProcessor): 188 | """ Processor for the GQA data set. 
""" 189 | 190 | def get_train_examples(self, data_dir, file_name='train2014_qla.json'): 191 | """ See base class.""" 192 | 193 | lines = json.load(open(os.path.join(data_dir, file_name))) 194 | return self._create_examples(lines, "train") 195 | 196 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train") 197 | 198 | def get_dev_examples(self, data_dir, file_name='val2014_qla.json'): 199 | """ See base class.""" 200 | 201 | lines = json.load(open(os.path.join(data_dir, file_name))) 202 | return self._create_examples(lines, "dev") 203 | 204 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev") 205 | 206 | def get_test_examples(self, data_dir, file_name='test2015_qla.json'): 207 | """ See base class.""" 208 | 209 | lines = json.load(open(os.path.join(data_dir, file_name))) 210 | return self._create_examples(lines, "test") 211 | 212 | def get_labels(self, label_file='trainval_testdev_all_ans2label.pkl'): 213 | """ See base class.""" 214 | 215 | ans2label = cPickle.load(open(label_file, 'rb')) 216 | return list(ans2label.values()) 217 | 218 | def _create_examples(self, lines, set_type): 219 | """Creates examples for the training and dev sets.""" 220 | 221 | examples = [] 222 | for (i, line) in enumerate(lines): 223 | if set_type!='test' and len(line['an']) == 0: continue 224 | 225 | guid = "%s-%s" % (set_type, str(i)) 226 | text_a = line['q'] 227 | text_b = line['o'] # or None 228 | label = None if set_type.startswith('test') else line['an'] 229 | score = 0 230 | img_key = line['img_id'] 231 | q_id = int(line['q_id']) # if set_type.startswith('test') else 0 232 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, q_id=q_id)) 233 | return examples 234 | 235 | class NLVRProcessor(DataProcessor): 236 | """ Processor for the NLVR data set. """ 237 | 238 | def get_train_examples(self, data_dir, use_label_seq=True, file_name='nlvr2_train.json'): 239 | """ See base class.""" 240 | 241 | lines = json.load(open(os.path.join(data_dir, file_name))) 242 | return self._create_examples(lines, "train", use_label_seq) 243 | 244 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "train2014_qla.tsv")), "train") 245 | 246 | def get_dev_examples(self, data_dir, use_label_seq=True, file_name='nlvr2_dev.json'): 247 | """ See base class.""" 248 | 249 | lines = json.load(open(os.path.join(data_dir, file_name))) 250 | return self._create_examples(lines, "dev", use_label_seq) 251 | 252 | #return self._create_examples(self._read_tsv(os.path.join(data_dir, "val2014_qla.tsv")), "dev") 253 | 254 | def get_test_examples(self, data_dir, use_label_seq=True, file_name='nlvr2_test1.json'): 255 | """ See base class.""" 256 | 257 | lines = json.load(open(os.path.join(data_dir, file_name))) 258 | return self._create_examples(lines, "test", use_label_seq) 259 | 260 | def get_labels(self, label_file=None): 261 | """ See base class.""" 262 | 263 | #ans2label = cPickle.load(open(label_file, 'rb')) 264 | #return list(ans2label.values()) 265 | return [0, 1] 266 | 267 | def _create_examples(self, lines, set_type, use_label_seq=True): 268 | """ Creates examples for the training and dev sets. 
""" 269 | 270 | examples = [] 271 | for (i, line) in enumerate(lines): 272 | guid = "%s-%s" % (set_type, str(i)) 273 | text_a = line['q'] 274 | text_b = line['o'] if use_label_seq else None 275 | label = line['label'] #None if set_type.startswith('test') else line['label'] 276 | score = 0 277 | img_key = line['img_id'] #[line['img_left'], line['img_left']] 278 | q_id = 0 #int(line['q_id']) if set_type.startswith('test') else 0 279 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=text_b, label=label, score=score, img_key=img_key, q_id=q_id)) 280 | return examples 281 | 282 | class VCR_Q_A_Processor(DataProcessor): 283 | """ Processor for the VCR (q -> a) (Det) data set. """ 284 | 285 | def get_train_examples(self, data_dir, file_name='vcr_train.json'): 286 | """ See base class.""" 287 | 288 | lines = json.load(open(os.path.join(data_dir, file_name))) 289 | return self._create_examples(lines, "train") 290 | 291 | def get_dev_examples(self, data_dir, file_name='vcr_val.json'): 292 | """ See base class.""" 293 | 294 | lines = json.load(open(os.path.join(data_dir, file_name))) 295 | return self._create_examples(lines, "dev") 296 | 297 | def get_test_examples(self, data_dir, file_name='vcr_test.json'): 298 | """ See base class.""" 299 | 300 | lines = json.load(open(os.path.join(data_dir, file_name))) 301 | return self._create_examples(lines, "test") 302 | 303 | def get_labels(self, label_file=None): 304 | """ See base class.""" 305 | 306 | #ans2label = cPickle.load(open(label_file, 'rb')) 307 | #return list(ans2label.values()) 308 | return [0, 1] 309 | 310 | def _create_examples(self, lines, set_type): 311 | """ Creates examples for the training and dev sets. """ 312 | 313 | examples = [] 314 | for (i, line) in enumerate(lines): 315 | #if set_type!='test': continue 316 | 317 | guid = "%s-%s" % (set_type, str(i)) 318 | text_a = line['q'] # question 319 | choices = line['choices'] 320 | label = None if set_type.startswith('test') else line['label'] 321 | img_key = line['img_id'] 322 | q_id = int(line['annot_id'].split('-')[-1]) #int(line['q_id']) if set_type.startswith('test') else 0 323 | score = line['objects'] if 'objects' in line else None 324 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=score, img_key=img_key, q_id=q_id)) 325 | return examples 326 | 327 | class VCR_QA_R_Processor(DataProcessor): 328 | """ Processor for the VCR (qa -> r) QA_R data set. """ 329 | 330 | def get_train_examples(self, data_dir, file_name='vcr_train.json'): 331 | """ See base class.""" 332 | 333 | lines = json.load(open(os.path.join(data_dir, file_name))) 334 | return self._create_examples(lines, "train") 335 | 336 | def get_dev_examples(self, data_dir, file_name='vcr_val.json'): 337 | """ See base class.""" 338 | 339 | lines = json.load(open(os.path.join(data_dir, file_name))) 340 | return self._create_examples(lines, "dev") 341 | 342 | def get_test_examples(self, data_dir, file_name='vcr_test.json'): 343 | """ See base class.""" 344 | 345 | lines = json.load(open(os.path.join(data_dir, file_name))) 346 | return self._create_examples(lines, "test") 347 | 348 | def get_labels(self, label_file=None): 349 | """ See base class.""" 350 | 351 | #ans2label = cPickle.load(open(label_file, 'rb')) 352 | #return list(ans2label.values()) 353 | return [0, 1] 354 | 355 | def _create_examples(self, lines, set_type): 356 | """ Creates examples for the training and dev sets. 
""" 357 | 358 | examples = [] 359 | for (i, line) in enumerate(lines): 360 | #if set_type!='test': continue 361 | 362 | guid = "%s-%s" % (set_type, str(i)) 363 | text_a = line['q'] + ' ' + line['choices'][line['label']] # question_choice 364 | choices = line['rational_choices'] # rational_choice 365 | label = None if set_type.startswith('test') else line['rational_label'] # rational_label 366 | img_key = line['img_id'] 367 | q_id = int(line['annot_id'].split('-')[-1]) #int(line['q_id']) if set_type.startswith('test') else 0 368 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=None, img_key=img_key, q_id=q_id)) 369 | return examples 370 | 371 | class VCR_QAR_Processor(DataProcessor): 372 | """ Processor for the VCR (q->a, qa->r) data set. """ 373 | 374 | def get_train_examples(self, data_dir, file_name='vcr_train.json'): 375 | """ See base class.""" 376 | 377 | lines = json.load(open(os.path.join(data_dir, file_name))) 378 | return self._create_examples(lines, "train") 379 | 380 | def get_dev_examples(self, data_dir, file_name='vcr_val.json'): 381 | """ See base class.""" 382 | 383 | lines = json.load(open(os.path.join(data_dir, file_name))) 384 | return self._create_examples(lines, "dev") 385 | 386 | def get_test_examples(self, data_dir, file_name='vcr_test.json'): 387 | """ See base class.""" 388 | 389 | lines = json.load(open(os.path.join(data_dir, file_name))) 390 | return self._create_examples(lines, "test") 391 | 392 | def get_labels(self, label_file=None): 393 | """ See base class.""" 394 | 395 | #ans2label = cPickle.load(open(label_file, 'rb')) 396 | #return list(ans2label.values()) 397 | return [0, 1] 398 | 399 | def _create_examples(self, lines, set_type): 400 | """ Creates examples for the training and dev sets. 
""" 401 | 402 | examples = [] 403 | for (i, line) in enumerate(lines): 404 | #if set_type!='test': continue 405 | 406 | guid = "%s-%s-q-a" % (set_type, str(i)) 407 | text_a = line['q'] # question 408 | choices = line['choices'] 409 | label = None if set_type.startswith('test') else line['label'] 410 | img_key = line['img_id'] 411 | q_id = int(line['annot_id'].split('-')[-1]) #int(line['q_id']) if set_type.startswith('test') else 0 412 | score = line['objects'] if 'objects' in line else None 413 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=score, img_key=img_key, q_id=q_id)) 414 | 415 | if set_type == 'train': # qa -> r 416 | guid = "%s-%s-qa-r" % (set_type, str(i)) 417 | text_a = line['q'] + ' ' + line['choices'][line['label']] # question_choice 418 | choices = line['rational_choices'] # rational_choice 419 | label = None if set_type.startswith('test') else line['rational_label'] # rational_label 420 | img_key = line['img_id'] 421 | q_id = int(line['annot_id'].split('-')[-1]) # int(line['q_id']) if set_type.startswith('test') else 0 422 | score = line['objects'] if 'objects' in line else None 423 | examples.append(InputInstance(guid=guid, text_a=text_a, text_b=choices, label=label, score=score, img_key=img_key, q_id=q_id)) 424 | return examples 425 | 426 | 427 | def convert_examples_to_features_vqa(examples, img_feats, label_list, max_img_seq_length, max_seq_length, 428 | tokenizer, output_mode, 429 | cls_token_at_end=False, pad_on_left=False, 430 | cls_token='[CLS]', sep_token='[SEP]', pad_token=0, 431 | sequence_a_segment_id=0, sequence_b_segment_id=1, 432 | cls_token_segment_id=1, pad_token_segment_id=0, 433 | mask_padding_with_zero=True): 434 | """ Loads a data file into a list of `InputBatch`s 435 | `cls_token_at_end` define the location of the CLS token: 436 | - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP] 437 | - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS] 438 | `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet) 439 | """ 440 | 441 | label_map = {label:i for i, label in enumerate(label_list)} 442 | 443 | features = [] 444 | #debug: 445 | debug_size = 500 446 | 447 | for (ex_index, example) in enumerate(examples[0: ]): 448 | if len(example.label) == 0: continue 449 | if ex_index % 10000 == 0: 450 | logger.info("Writing example %d of %d" % (ex_index, len(examples))) 451 | 452 | tokens_a = tokenizer.tokenize(example.text_a) 453 | 454 | tokens_b = None 455 | if example.text_b: 456 | tokens_b = tokenizer.tokenize(example.text_b) 457 | # Modifies `tokens_a` and `tokens_b` in place so that the total 458 | # length is less than the specified length. 459 | # Account for [CLS], [SEP], [SEP] with "- 3" 460 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 461 | else: 462 | # Account for [CLS] and [SEP] with "- 2" 463 | if len(tokens_a) > max_seq_length - 2: 464 | tokens_a = tokens_a[:(max_seq_length - 2)] 465 | 466 | # The convention in BERT is: 467 | # (a) For sequence pairs: 468 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 469 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 470 | # (b) For single sequences: 471 | # tokens: [CLS] the dog is hairy . [SEP] 472 | # type_ids: 0 0 0 0 0 0 0 473 | # 474 | # Where "type_ids" are used to indicate whether this is the first 475 | # sequence or the second sequence. 
The embedding vectors for `type=0` and 476 | # `type=1` were learned during pre-training and are added to the wordpiece 477 | # embedding vector (and position vector). This is not *strictly* necessary 478 | # since the [SEP] token unambiguously separates the sequences, but it makes 479 | # it easier for the model to learn the concept of sequences. 480 | # 481 | # For classification tasks, the first vector (corresponding to [CLS]) is 482 | # used as as the "sentence vector". Note that this only makes sense because 483 | # the entire model is fine-tuned. 484 | tokens = tokens_a + [sep_token] 485 | segment_ids = [sequence_a_segment_id] * len(tokens) 486 | 487 | if tokens_b: 488 | tokens += tokens_b + [sep_token] 489 | segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1) 490 | 491 | if cls_token_at_end: 492 | tokens = tokens + [cls_token] 493 | segment_ids = segment_ids + [cls_token_segment_id] 494 | else: 495 | tokens = [cls_token] + tokens 496 | segment_ids = [cls_token_segment_id] + segment_ids 497 | 498 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 499 | 500 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 501 | # tokens are attended to. 502 | input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids) 503 | 504 | # Zero-pad up to the sequence length. 505 | padding_length = max_seq_length - len(input_ids) 506 | if pad_on_left: 507 | input_ids = ([pad_token] * padding_length) + input_ids 508 | input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask 509 | segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids 510 | else: 511 | input_ids = input_ids + ([pad_token] * padding_length) 512 | input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length) 513 | segment_ids = segment_ids + ([pad_token_segment_id] * padding_length) 514 | 515 | assert len(input_ids) == max_seq_length 516 | assert len(input_mask) == max_seq_length 517 | assert len(segment_ids) == max_seq_length 518 | 519 | # image features 520 | #img_feat = img_feats[example.img_key] # torch 521 | img_feat = img_feats.item().get(example.img_key) # numpy 522 | if img_feat.shape[0] > max_img_seq_length: 523 | img_feat = img_feat[0:max_img_seq_length, ] 524 | if max_img_seq_length > 0: 525 | input_mask = input_mask + [1 if mask_padding_with_zero else 0] * img_feat.shape[0] 526 | #segment_ids += [sequence_b_segment_id] * img_feat.shape[0] 527 | else: 528 | if max_img_seq_length > 0: 529 | input_mask = input_mask + [1 if mask_padding_with_zero else 0] * img_feat.shape[0] 530 | #segment_ids = segment_ids + [sequence_b_segment_id] * img_feat.shape[0] 531 | padding_matrix = torch.zeros((max_img_seq_length - img_feat.shape[0], img_feat.shape[1])) 532 | img_feat = torch.cat((img_feat, padding_matrix), 0) 533 | if max_img_seq_length > 0: 534 | input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_matrix.shape[0]) 535 | #segment_ids = segment_ids + [pad_token_segment_id] * padding_matrix.shape[0] 536 | 537 | if output_mode == "classification": 538 | label_id = [label_map[l] for l in example.label] 539 | score = example.score 540 | elif output_mode == "regression": 541 | label_id = float(example.label) 542 | else: 543 | raise KeyError(output_mode) 544 | 545 | if ex_index < 5: 546 | logger.info("*** Example ***") 547 | logger.info("guid: %s" % (example.guid)) 548 | logger.info("tokens: %s" % " ".join([str(x) for x in tokens])) 549 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 550 | 
logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 551 | logger.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 552 | logger.info("label: %s (id = %s)" % (example.label, label_id)) 553 | logger.info("score: %s (score = %s)" % (example.score, score)) 554 | 555 | features.append(InputFeat(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids, label_id=label_id, score=score, img_feat=img_feat)) 556 | return features 557 | 558 | 559 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 560 | """Truncates a sequence pair in place to the maximum length.""" 561 | 562 | # This is a simple heuristic which will always truncate the longer sequence 563 | # one token at a time. This makes more sense than truncating an equal percent 564 | # of tokens from each, since if one sequence is very short then each token 565 | # that's truncated likely contains more information than a longer sequence. 566 | while True: 567 | total_length = len(tokens_a) + len(tokens_b) 568 | if total_length <= max_length: 569 | break 570 | if len(tokens_a) > len(tokens_b): 571 | tokens_a.pop() 572 | else: 573 | tokens_b.pop() 574 | 575 | 576 | processors = { 577 | "vqa_text": VQATextProcessor, 578 | "vqa_text_a": VQATextAProcessor, 579 | "gqa": GQAProcessor, 580 | "nlvr": NLVRProcessor, 581 | "vcr_q_a": VCR_Q_A_Processor, 582 | "vcr_qa_r": VCR_QA_R_Processor, 583 | "vcr_qar": VCR_QAR_Processor, 584 | } 585 | 586 | output_modes = { 587 | "vqa_text": "classification", 588 | "vqa_text_a": "classification", 589 | "gqa": "classification", 590 | "nlvr": "classification", 591 | "vcr_q_a": "classification", 592 | "vcr_qa_r": "classification", 593 | "vcr_qar": "classification", 594 | } 595 | 596 | GLUE_TASKS_NUM_LABELS = { 597 | "vqa_text": 3129, 598 | "vqa_text_a": 3129, 599 | "gqa": 1853, 600 | "nlvr": 2, 601 | "vcr_q_a": 2, 602 | "vcr_qa_r": 2, 603 | "vcr_qar": 2, 604 | } -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/tsv_file.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 2 | 3 | import logging 4 | import os 5 | import os.path as op 6 | 7 | 8 | def generate_lineidx_file(filein, idxout): 9 | idxout_tmp = idxout + '.tmp' 10 | with open(filein, 'r') as tsvin, open(idxout_tmp,'w') as tsvout: 11 | fsize = os.fstat(tsvin.fileno()).st_size 12 | fpos = 0 13 | while fpos!=fsize: 14 | tsvout.write(str(fpos)+"\n") 15 | tsvin.readline() 16 | fpos = tsvin.tell() 17 | os.rename(idxout_tmp, idxout) 18 | 19 | 20 | class TSVFile(object): 21 | def __init__(self, tsv_file, generate_lineidx=False): 22 | self.tsv_file = tsv_file 23 | self.lineidx = op.splitext(tsv_file)[0] + '.lineidx' 24 | self._fp = None 25 | self._lineidx = None 26 | # the process always keeps the process which opens the file. 27 | # If the pid is not equal to the currrent pid, we will re-open the file. 
28 | self.pid = None 29 | # generate lineidx if not exist 30 | if not op.isfile(self.lineidx) and generate_lineidx: 31 | generate_lineidx_file(self.tsv_file, self.lineidx) 32 | 33 | def __del__(self): 34 | if self._fp: 35 | self._fp.close() 36 | 37 | def __str__(self): 38 | return "TSVFile(tsv_file='{}')".format(self.tsv_file) 39 | 40 | def __repr__(self): 41 | return str(self) 42 | 43 | def num_rows(self): 44 | self._ensure_lineidx_loaded() 45 | return len(self._lineidx) 46 | 47 | def seek(self, idx): 48 | self._ensure_tsv_opened() 49 | self._ensure_lineidx_loaded() 50 | try: 51 | pos = self._lineidx[idx] 52 | except: 53 | logging.info('{}-{}'.format(self.tsv_file, idx)) 54 | raise 55 | self._fp.seek(pos) 56 | return [s.strip() for s in self._fp.readline().split('\t')] 57 | 58 | def seek_first_column(self, idx): 59 | self._ensure_tsv_opened() 60 | self._ensure_lineidx_loaded() 61 | pos = self._lineidx[idx] 62 | self._fp.seek(pos) 63 | return read_to_character(self._fp, '\t') 64 | 65 | def __getitem__(self, index): 66 | return self.seek(index) 67 | 68 | def __len__(self): 69 | return self.num_rows() 70 | 71 | def _ensure_lineidx_loaded(self): 72 | if self._lineidx is None: 73 | logging.info('loading lineidx: {}'.format(self.lineidx)) 74 | with open(self.lineidx, 'r') as fp: 75 | self._lineidx = [int(i.strip()) for i in fp.readlines()] 76 | 77 | def _ensure_tsv_opened(self): 78 | if self._fp is None: 79 | self._fp = open(self.tsv_file, 'r') 80 | self.pid = os.getpid() 81 | 82 | if self.pid != os.getpid(): 83 | logging.info('re-open {} because the process id changed'.format(self.tsv_file)) 84 | self._fp = open(self.tsv_file, 'r') 85 | self.pid = os.getpid() 86 | -------------------------------------------------------------------------------- /VinVL/Oscar/oscar/utils/tsv_file_ops.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2020 Microsoft Corporation. Licensed under the MIT license. 
2 | 3 | import logging 4 | import numpy as np 5 | import os 6 | import os.path as op 7 | import shutil 8 | from .misc import mkdir 9 | from .tsv_file import TSVFile 10 | 11 | 12 | def tsv_writer(values, tsv_file_name, sep='\t'): 13 | mkdir(os.path.dirname(tsv_file_name)) 14 | tsv_file_name_tmp = tsv_file_name + '.tmp' 15 | with open(tsv_file_name_tmp, 'wb') as fp: 16 | assert values is not None 17 | for value in values: 18 | assert value is not None 19 | v = sep.join(map(lambda v: v.decode() if type(v) == bytes else str(v), value)) + '\n' 20 | v = v.encode() 21 | fp.write(v) 22 | os.rename(tsv_file_name_tmp, tsv_file_name) 23 | 24 | 25 | def concat_files(ins, out): 26 | out_tmp = out + '.tmp' 27 | with open(out_tmp, 'wb') as fp_out: 28 | for i, f in enumerate(ins): 29 | with open(f, 'rb') as fp_in: 30 | shutil.copyfileobj(fp_in, fp_out, 1024*1024*10) 31 | os.rename(out_tmp, out) 32 | 33 | 34 | def concat_tsv_files(tsvs, out_tsv, generate_lineidx=False): 35 | concat_files(tsvs, out_tsv) 36 | if generate_lineidx: 37 | sizes = [os.stat(t).st_size for t in tsvs] 38 | sizes = np.cumsum(sizes) 39 | all_idx = [] 40 | for i, t in enumerate(tsvs): 41 | for idx in load_list_file(op.splitext(t)[0] + '.lineidx'): 42 | if i == 0: 43 | all_idx.append(idx) 44 | else: 45 | all_idx.append(str(int(idx) + sizes[i - 1])) 46 | with open(op.splitext(out_tsv)[0] + '.lineidx', 'w') as f: 47 | f.write('\n'.join(all_idx)) 48 | 49 | 50 | def load_list_file(fname): 51 | with open(fname, 'r') as fp: 52 | lines = fp.readlines() 53 | result = [line.strip() for line in lines] 54 | if len(result) > 0 and result[-1] == '': 55 | result = result[:-1] 56 | return result 57 | 58 | 59 | def reorder_tsv_keys(in_tsv_file, ordered_keys, out_tsv_file): 60 | tsv = TSVFile(in_tsv_file, generate_lineidx=True) 61 | keys = [tsv.seek(i)[0] for i in range(len(tsv))] 62 | key_to_idx = {key: i for i, key in enumerate(keys)} 63 | def gen_rows(): 64 | for key in ordered_keys: 65 | idx = key_to_idx[key] 66 | yield tsv.seek(idx) 67 | tsv_writer(gen_rows(), out_tsv_file) 68 | 69 | 70 | def delete_tsv_files(tsvs): 71 | for t in tsvs: 72 | if op.isfile(t): 73 | try_delete(t) 74 | line = op.splitext(t)[0] + '.lineidx' 75 | if op.isfile(line): 76 | try_delete(line) 77 | 78 | 79 | def try_once(func): 80 | def func_wrapper(*args, **kwargs): 81 | try: 82 | return func(*args, **kwargs) 83 | except Exception as e: 84 | logging.info('ignore error \n{}'.format(str(e))) 85 | return func_wrapper 86 | 87 | 88 | @try_once 89 | def try_delete(f): 90 | os.remove(f) 91 | 92 | 93 | -------------------------------------------------------------------------------- /VinVL/Oscar/requirements.txt: -------------------------------------------------------------------------------- 1 | tqdm 2 | pyyaml 3 | matplotlib 4 | requests 5 | scikit-image 6 | anytree 7 | regex 8 | boto3 9 | -------------------------------------------------------------------------------- /VinVL/Oscar/setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | from __future__ import print_function 4 | import os 5 | import sys 6 | import re 7 | import os.path as op 8 | from setuptools import find_packages, setup 9 | 10 | # change directory to this module path 11 | try: 12 | this_file = __file__ 13 | except NameError: 14 | this_file = sys.argv[0] 15 | this_file = os.path.abspath(this_file) 16 | if op.dirname(this_file): 17 | os.chdir(op.dirname(this_file)) 18 | script_dir = os.getcwd() 19 | 20 | def readme(fname): 21 | """Read text out of a file 
in the same directory as setup.py.
22 |     """
23 |     return open(op.join(script_dir, fname)).read()
24 | 
25 | 
26 | def find_version(fname):
27 |     version_file = readme(fname)
28 |     version_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]",
29 |                               version_file, re.M)
30 |     if version_match:
31 |         return version_match.group(1)
32 |     raise RuntimeError("Unable to find version string.")
33 | 
34 | 
35 | setup(
36 |     name="oscar",
37 |     version=find_version("oscar/__init__.py"),
38 |     url='https://github.com/xjli/Oscar',
39 |     description="Oscar for vision and language tasks",
40 |     long_description=readme('README.md'),
41 |     packages=find_packages(),
42 |     classifiers=[
43 |         'Intended Audience :: Developers',
44 |         "Programming Language :: Python",
45 |         'Topic :: Software Development',
46 |     ]
47 | )
48 | 
--------------------------------------------------------------------------------
/VinVL/README.md:
--------------------------------------------------------------------------------
1 | # README
2 | ___
3 | This directory contains the files for training and evaluating the vision-language
4 | pre-trained model, i.e., VinVL. Most of the files are copied from the VinVL
5 | [github repo](https://github.com/microsoft/Oscar/tree/vinvl). We mainly modified or
6 | added the following files:
7 | ```
8 | |- Oscar/
9 |     |- oscar/
10 |         |- run_gqa_prompt_mlm.py
11 |         |- run_gqa_prompt_itm.py
12 |         |- run_gqa_prompt_zero_few.py
13 |         |- run_vqa_prompt_mlm.py
14 |         |- run_vqa_prompt_itm.py
15 |         |- utils/
16 |             |- task_utils.py
17 | ```
18 | 
19 | ## INSTALL
20 | 
21 | + Refer to [INSTALL](https://github.com/microsoft/Oscar/tree/vinvl) for installation.
22 | 
23 | ## Data Preparation
24 | 
25 | ### GQA
26 | Please follow the steps below to configure the data:
27 | 1. Refer to [DOWNLOAD](https://github.com/microsoft/Oscar/blob/vinvl/VinVL_DOWNLOAD.md)
28 | to download the pre-processed GQA dataset.
29 | The downloaded data should contain the following files:
30 | ```
31 | |- [DATA_ROOT]/gqa/
32 |     |- gqa_bal_qla_train.json
33 |     |- gqa_bal_qla_val.json
34 |     |- gqa_all_qla_train.json
35 |     |- gqa_all_qla_val.json
36 |     |- gqa_all_qla_submission.json
37 |     ...
38 | ```
39 | 2. Download the corresponding declaration files and put them in the `gqa/` directory.
40 | The declaration files are downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/vinvl/gqa/*_declarative.json`).
41 | These files contain one declarative sentence per line, which can be used for later data loading
42 | and processing. Please put these `*_declarative.json` files into the `gqa/` directory,
43 | resulting in the following directory tree:
44 | ```text
45 | |- [DATA_ROOT]/gqa/
46 |     |- gqa_bal_qla_train.json
47 |     |- gqa_bal_qla_val.json
48 |     |- gqa_all_qla_train.json
49 |     |- gqa_all_qla_val.json
50 |     |- gqa_all_qla_submission.json
51 |     |- gqa_bal_qla_train_declarative.json         # newly added
52 |     |- gqa_bal_qla_val_declarative.json           # newly added
53 |     |- gqa_all_qla_train_declarative.json         # newly added
54 |     |- gqa_all_qla_val_declarative.json           # newly added
55 |     |- gqa_all_qla_submission_declarative.json    # newly added
56 |     ...
57 | ```
58 | 
59 | ### VQA v2.0
60 | 
61 | Please follow the steps below to configure the data:
62 | 1. Refer to [DOWNLOAD](https://github.com/microsoft/Oscar/blob/vinvl/VinVL_DOWNLOAD.md)
63 | to download the pre-processed VQA v2.0 dataset.
64 | The downloaded data should contain the following files:
65 | ```
66 | |- [DATA_ROOT]/vqa/
67 |     |- train2014_qla_mrcnn.json
68 |     |- val2014_qla_mrcnn.json
69 |     ...
70 | ```
71 | 2.
Download the corresponding declaration files and put them in the `vqa/` directory.
72 | The declaration files are downloaded from [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/vinvl/vqa/*_declarative.json`).
73 | Please put these `*_declarative.json` files into the `vqa/` directory,
74 | resulting in the following directory tree:
75 | ```text
76 | |- [DATA_ROOT]/vqa/
77 |     |- train2014_qla_mrcnn.json
78 |     |- val2014_qla_mrcnn.json
79 |     |- train2014_declarative.json    # newly added
80 |     |- val2014_declarative.json      # newly added
81 |     ...
82 | ```
83 | 
84 | ## Pre-trained Model
85 | 
86 | Please refer to [DOWNLOAD](https://github.com/microsoft/Oscar/blob/vinvl/VinVL_DOWNLOAD.md)
87 | to download the pre-trained VinVL base model (`checkpoint-2000000`). We also provide the
88 | model checkpoint in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/checkpoint-2000000`). Assume that
89 | `checkpoint-2000000` is placed in the directory `[MODEL_ROOT]`, resulting in `[MODEL_ROOT]/checkpoint-2000000/`.
90 | 
91 | ## Training and Validation
92 | 
93 | ### GQA
94 | Please follow the steps below to reproduce the results (we take the balanced
95 | split as an example).
96 | 
97 | We first utilize the adapted **masked language model (MLM) task** for GQA fine-tuning:
98 | 
99 | 1. **Training(MLM)**: Run the following code to train VinVL-DPT(MLM) on the balanced split:
100 | ```bash
101 | python oscar/run_gqa_prompt_mlm.py \
102 |     -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
103 |     --data_dir [DATA_ROOT]/gqa/ \
104 |     --model_type bert \
105 |     --model_name_or_path [MODEL_ROOT]/checkpoint-2000000/ \
106 |     --task_name gqa --do_lower_case --max_seq_length 165 \
107 |     --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
108 |     --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
109 |     --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
110 |     --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
111 |     --eval_data_type bal \
112 |     --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
113 |     --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
114 |     --logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
115 |     --gradient_accumulation_steps 2
116 | ```
117 | If successful, the _overall_ accuracy should reach ~62.7%. We also
118 | provide the fine-tuned model in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vinvl_bal_mlm`).
119 | 2. **Validation(MLM)**: Evaluate on the GQA validation set using the fine-tuned model we
120 | provide (or the model in the output dir `gqa_mlm`):
121 | ```bash
122 | python oscar/run_gqa_prompt_mlm.py \
123 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
124 | --data_dir [DATA_ROOT]/gqa/ \
125 | --model_type bert \
126 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm \
127 | --task_name gqa --do_lower_case --max_seq_length 165 \
128 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
129 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
130 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
131 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
132 | --eval_data_type bal \
133 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
134 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
135 | --logging_steps 4000 --drop_out 0.3 --do_val --weight_decay 0.05 --warmup_steps 0 \
136 | --gradient_accumulation_steps 2
137 | ```
138 | Note that the `--model_name_or_path` and `--do_val` arguments have been changed compared to
139 | the training stage.
140 | 3. **Testing and Submission(MLM)**: Test the fine-tuned model and submit the result file
141 | to the [online evaluation website](https://eval.ai/web/challenges/challenge-page/225/leaderboard/733).
142 | Run the following code:
143 | ```bash
144 | python oscar/run_gqa_prompt_mlm.py \
145 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
146 | --data_dir [DATA_ROOT]/gqa/ \
147 | --model_type bert \
148 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm \
149 | --task_name gqa --do_lower_case --max_seq_length 165 \
150 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
151 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
152 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
153 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
154 | --eval_data_type bal \
155 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
156 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
157 | --logging_steps 4000 --drop_out 0.3 --do_test --weight_decay 0.05 --warmup_steps 0 \
158 | --gradient_accumulation_steps 2
159 | ```
160 | Note that the `--do_test` argument has been changed compared to
161 | the validation stage.
162 | 
163 | Then, we apply the adapted **image-text matching (ITM) task** to the visual question answering problem
164 | (a summary sketch of the two-stage inference is given at the end of this GQA subsection). To achieve this, we need the top-k candidate answers predicted by the MLM task. Specifically,
165 | we pre-generate the prediction results of the MLM task:
166 | + Pre-generate top-k results for training and validation.
167 | ```bash
168 | python oscar/run_gqa_prompt_mlm.py \
169 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
170 | --data_dir [DATA_ROOT]/gqa/ \
171 | --model_type bert \
172 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
173 | --task_name gqa --do_lower_case --max_seq_length 165 \
174 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
175 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
176 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
177 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
178 | --eval_data_type bal \
179 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
180 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
181 | --logging_steps 4000 --drop_out 0.3 --do_train --do_generate --weight_decay 0.05 --warmup_steps 0 \
182 | --gradient_accumulation_steps 2
183 | ```
184 | + Pre-generate top-k results for submission.
185 | ```bash
186 | python oscar/run_gqa_prompt_mlm.py \
187 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
188 | --data_dir [DATA_ROOT]/gqa/ \
189 | --model_type bert \
190 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
191 | --task_name gqa --do_lower_case --max_seq_length 165 \
192 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
193 | --learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
194 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
195 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
196 | --eval_data_type bal \
197 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
198 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
199 | --logging_steps 4000 --drop_out 0.3 --do_test --do_generate --weight_decay 0.05 --warmup_steps 0 \
200 | --gradient_accumulation_steps 2
201 | ```
202 | 
203 | 
204 | Note that the `--do_generate` argument has been added. In this way, there will be three result files
205 | saved in `model_name_or_path`, i.e., `stage1.pkl`, `stage1_eval.pkl`, and `stage1_submission.pkl`. The files have the following
206 | data format:
207 | ```text
208 | {
209 |     "[QID]": (np.ndarray([topk, ], np.int16),    # Topk answer indices
210 |               np.ndarray([topk, ], np.float16),), # Topk answer scores
211 |     ...
212 | }
213 | ```
214 | > We also provide the result files in the fine-tuned checkpoint `vinvl_bal_mlm`.
215 | 
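As a sanity check before the ITM stage, these result files can be inspected with a few lines of Python. This is only a sketch: it assumes the pickles load directly with `pickle.load` and that the `trainval_testdev_all_label2ans.pkl` object maps integer answer indices to answer strings; adapt the paths and the lookup to the actual objects.

```python
import pickle

# Example paths; point them at your own copies of the files.
with open("data/model/vinvl/vinvl_bal_mlm/stage1_eval.pkl", "rb") as f:
    topk_results = pickle.load(f)   # assumed: {qid: (top-k answer indices, top-k answer scores)}

with open("[DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl", "rb") as f:
    label2ans = pickle.load(f)      # assumed: integer answer index -> answer string

# Print the top-k candidates of one question.
qid, (indices, scores) = next(iter(topk_results.items()))
for idx, score in zip(indices, scores):
    print(qid, label2ans[int(idx)], float(score))
```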
216 | 4. **Training(ITM)**: Equipped with the pre-generated top-k answers, we can apply ITM by running the following
217 | code:
218 | ```bash
219 | python oscar/run_gqa_prompt_itm.py \
220 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
221 | --data_dir [DATA_ROOT]/gqa/ \
222 | --model_type bert \
223 | --model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
224 | --task_name gqa --do_lower_case --max_seq_length 165 \
225 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
226 | --learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
227 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
228 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
229 | --eval_data_type bal \
230 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
231 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
232 | --logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
233 | --gradient_accumulation_steps 2
234 | ```
235 | Note that we need to load the checkpoint from the MLM task. We also provide the checkpoint in
236 | [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vinvl_bal_itm/`).
237 | 5. **Validation(ITM)**: Once the model is fine-tuned via ITM, we can validate it
238 | with the following code:
239 | ```bash
240 | python oscar/run_gqa_prompt_itm.py \
241 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
242 | --data_dir [DATA_ROOT]/gqa/ \
243 | --model_type bert \
244 | --model_name_or_path data/model/vinvl/vinvl_bal_itm/ \
245 | --task_name gqa --do_lower_case --max_seq_length 165 \
246 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
247 | --learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
248 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
249 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
250 | --eval_data_type bal \
251 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
252 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
253 | --logging_steps 4000 --drop_out 0.3 --do_val --weight_decay 0.05 --warmup_steps 0 \
254 | --gradient_accumulation_steps 2
255 | ```
256 | Note that the pre-generated result files, i.e., `stage1.pkl`, `stage1_eval.pkl`, and `stage1_submission.pkl`,
257 | should be copied to `data/model/vinvl/vinvl_bal_itm/` so that the code has access to the
258 | MLM results.
259 | 6. **Testing and Submission(ITM)**: (Please make sure that `stage1_submission.pkl` has been
260 | pre-generated or downloaded, and placed in `model_name_or_path`.) Run the following code for testing:
261 | ```bash
262 | python oscar/run_gqa_prompt_itm.py \
263 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
264 | --data_dir [DATA_ROOT]/gqa/ \
265 | --model_type bert \
266 | --model_name_or_path data/model/vinvl/vinvl_bal_itm/ \
267 | --task_name gqa --do_lower_case --max_seq_length 165 \
268 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
269 | --learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
270 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
271 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
272 | --eval_data_type bal \
273 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
274 | --loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
275 | --logging_steps 4000 --drop_out 0.3 --do_test --weight_decay 0.05 --warmup_steps 0 \
276 | --gradient_accumulation_steps 2
277 | ```
278 | 
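Putting the GQA pipeline together: the MLM stage proposes top-k candidate answers, and the ITM stage re-scores them. The sketch below summarizes that two-stage inference with placeholder functions (`mlm_topk`, `itm_score`); selecting the single highest-scoring ITM candidate is an assumption on our part, and the released scripts may combine the two scores differently.

```python
from typing import Callable, List, Tuple

def two_stage_answer(
    qid: str,
    mlm_topk: Callable[[str], List[Tuple[int, float]]],  # placeholder: top-k (answer index, MLM score) pairs, e.g. read from stage1*.pkl
    itm_score: Callable[[str, int], float],               # placeholder: ITM score of one candidate answer for this question/image
) -> int:
    """Sketch of MLM -> ITM inference: MLM proposes candidates, ITM reranks them."""
    candidates = mlm_topk(qid)
    # Assumption: the final prediction is the candidate with the highest ITM score.
    best_answer_idx, _ = max(
        ((ans_idx, itm_score(qid, ans_idx)) for ans_idx, _ in candidates),
        key=lambda pair: pair[1],
    )
    return best_answer_idx
```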
279 | ### VQA v2.0
280 | 
281 | Please follow the steps below to reproduce the results on VQA v2.0.
282 | 
283 | We first utilize the **masked language model (MLM) task** to fine-tune the model:
284 | 1. **Training(MLM)**: Run the following code to train VinVL-DPT(MLM):
285 | ```bash
286 | python oscar/run_vqa_prompt_mlm.py -j 4 \
287 | --img_feature_dim 2054 --max_img_seq_length 50 \
288 | --data_label_type mask --img_feature_type faster_r-cnn \
289 | --data_dir [DATA_ROOT]/vqa --model_type bert \
290 | --model_name_or_path [MODEL_ROOT]/checkpoint-2000000 \
291 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 158 \
292 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
293 | --learning_rate 5e-05 --num_train_epochs 25 \
294 | --output_dir vqa_mlm --label_file [DATA_ROOT]/vqatrainval_ans2label.pkl \
295 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
296 | --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
297 | --img_feat_format pt --classifier linear --cls_hidden_scale 3 \
298 | --txt_data_dir [DATA_ROOT]/vqa
299 | ```
300 | We also provide the checkpoint in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vqa_mlm/`).
301 | Then, we pre-generate the top-k results of the MLM task via the following code:
302 | ```bash
303 | python oscar/run_vqa_prompt_mlm.py -j 4 \
304 | --img_feature_dim 2054 --max_img_seq_length 50 \
305 | --data_label_type mask --img_feature_type faster_r-cnn \
306 | --data_dir [DATA_ROOT]/vqa --model_type bert \
307 | --model_name_or_path data/model/vinvl/vqa_mlm/ \
308 | --task_name vqa_text --do_train --do_generate --do_lower_case --max_seq_length 158 \
309 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
310 | --learning_rate 5e-05 --num_train_epochs 25 \
311 | --output_dir vqa_mlm --label_file [DATA_ROOT]/vqatrainval_ans2label.pkl \
312 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
313 | --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
314 | --img_feat_format pt --classifier linear --cls_hidden_scale 3 \
315 | --txt_data_dir [DATA_ROOT]/vqa
316 | ```
317 | Note that the `--model_name_or_path` and `--do_generate` arguments have been changed. In this way,
318 | two result files are generated and saved in `model_name_or_path`, i.e., `stage1.pkl` and
319 | `stage1_eval.pkl`.
320 | 2. **Training(ITM)**: Run the following code to train the image-text matching (ITM) task for VQA:
321 | ```bash
322 | python oscar/run_vqa_prompt_itm.py -j 4 \
323 | --img_feature_dim 2054 --max_img_seq_length 50 \
324 | --data_label_type mask --img_feature_type faster_r-cnn \
325 | --data_dir [DATA_ROOT]/vqa --model_type bert \
326 | --model_name_or_path data/model/vinvl/vqa_mlm/ \
327 | --task_name vqa_text --do_train --do_lower_case --max_seq_length 158 \
328 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
329 | --learning_rate 5e-05 --num_train_epochs 6 \
330 | --output_dir vqa_itm --label_file [DATA_ROOT]/vqatrainval_ans2label.pkl \
331 | --save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
332 | --drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
333 | --img_feat_format pt --classifier linear --cls_hidden_scale 3 \
334 | --txt_data_dir [DATA_ROOT]/vqa
335 | ```
336 | We also provide the fine-tuned checkpoint in [Baidu Yun (PSW:8888)](https://pan.baidu.com/s/1BnMODk2q92KQ0FPTz33zkA) (`data/model/vinvl/vqa_itm/`).
337 | 
338 | ## Zero-shot and Few-shot Learning
339 | 
340 | In the zero-shot and few-shot settings, zero or only a few samples (1~128) are used to fine-tune the
341 | model. Run the following code to sample a `[K]`-shot training subset for fine-tuning and evaluate on the
342 | whole validation set.
343 | ```bash
344 | python oscar/run_gqa_prompt_zero_few.py \
345 | -j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
346 | --data_dir [DATA_ROOT]/gqa/ \
347 | --model_type bert \
348 | --model_name_or_path [MODEL_ROOT]/checkpoint-2000000/ \
349 | --task_name gqa --do_lower_case --max_seq_length 165 \
350 | --per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 1 \
351 | --learning_rate 5e-05 --num_train_epochs 25 --output_dir gqa_subset \
352 | --label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
353 | --img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
354 | --eval_data_type bal \
355 | --label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
356 | --loss_type xe --save_epoch 10 --seed 88 --evaluate_during_training \
357 | --logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
358 | --gradient_accumulation_steps 1 \
359 | --num_examples [K] --subset_seed 0
360 | ```
361 | 
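For reference, `--num_examples [K]` and `--subset_seed` control how the few-shot subset is drawn from the training data. The repository's exact sampling logic may differ; the snippet below is only a sketch of one deterministic way to pick `K` examples with a fixed seed.

```python
import random
from typing import List

def sample_k_shot(train_examples: List[dict], num_examples: int, subset_seed: int = 0) -> List[dict]:
    """Deterministically draw a K-shot training subset (K = 0 corresponds to the zero-shot setting)."""
    if num_examples <= 0:
        return []  # zero-shot: no fine-tuning data at all
    rng = random.Random(subset_seed)
    return rng.sample(train_examples, min(num_examples, len(train_examples)))
```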
362 | ## Online Results
363 | 
364 | + VinVL Baseline trained on the balanced split:
365 |   + [testdev](https://evalai.s3.amazonaws.com/media/submission_files/submission_176247/bd8113ca-6e55-4559-a508-9ecedbd4a49c.json)
366 |   + [teststd](https://evalai.s3.amazonaws.com/media/submission_files/submission_176248/55c074d6-f381-4458-8413-69333596d8f7.json)
367 | + VinVL-DPT trained on the balanced split:
368 |   + [testdev](https://evalai.s3.amazonaws.com/media/submission_files/submission_176397/6150d2b9-09cf-4ac9-aecc-ae859be3fc79.json)
369 |   + [teststd](https://evalai.s3.amazonaws.com/media/submission_files/submission_176400/e6f063d1-0df9-4c82-9b4c-c1ed130857bc.json)
370 | + VinVL-DPT trained on the all split:
371 |   + [testdev](https://evalai.s3.amazonaws.com/media/submission_files/submission_176134/ec622e48-6b78-47a4-a939-13c466d0622c.json)
372 |   + [teststd](https://evalai.s3.amazonaws.com/media/submission_files/submission_176088/2eee503a-a756-4752-b48e-758a85ebd35b.json)
--------------------------------------------------------------------------------