├── .gitignore ├── LICENSE.txt ├── README.md ├── data ├── download_edgebox_proposals.sh ├── metadata │ └── .keep ├── split │ ├── referit_all_imlist.txt │ ├── referit_test_imlist.txt │ ├── referit_train_imlist.txt │ ├── referit_trainval_imlist.txt │ └── referit_val_imlist.txt ├── training │ └── .keep └── vocabulary.txt ├── datasets ├── ReferIt │ ├── ImageCLEF │ │ └── .keep │ └── ReferitData │ │ └── .keep ├── download_kitchen_dataset.sh └── download_referit_dataset.sh ├── demo ├── demo_data │ ├── 40429.jpg │ └── 40429.txt └── retrieval_demo.ipynb ├── exp-kitchen ├── cache_kitchen_training_batches.py ├── caffemodel │ └── .keep ├── test_scrc_on_kitchen.py └── train_scrc_kitchen.sh ├── exp-referit ├── cache_referit_context_features.py ├── cache_referit_training_batches.py ├── caffemodel │ └── .keep ├── initialize_weights_scrc_full.py ├── initialize_weights_scrc_no_context.py ├── preprocess_dataset.py ├── test_scrc_on_referit.py ├── train_scrc_full_on_referit.sh └── train_scrc_no_context_on_referit.sh ├── external └── download_caffe.sh ├── models └── download_trained_models.sh ├── prototxt ├── VGG_ILSVRC_16_layers_deploy.prototxt ├── coco_pretrained.prototxt ├── scrc_full_vgg_buffer_50.prototxt ├── scrc_full_vgg_solver.prototxt ├── scrc_kitchen_buffer_50.prototxt ├── scrc_kitchen_solver.prototxt ├── scrc_no_context_vgg_buffer_50.prototxt ├── scrc_no_context_vgg_solver.prototxt ├── scrc_word_to_preds_full.prototxt ├── scrc_word_to_preds_no_context.prototxt └── scrc_word_to_preds_no_spatial_no_context.prototxt ├── retriever.py └── util ├── __init__.py └── io.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.mat 2 | *.h5 3 | *.pyc 4 | *-checkpoint.ipynb 5 | *~ 6 | *.swp 7 | 8 | *.zip 9 | *.tar.gz 10 | 11 | datasets/ReferIt/* 12 | datasets/Kitchen/* 13 | 14 | data/* 15 | external/* 16 | 17 | *.caffemodel 18 | *.solverstate 19 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | UC Berkeley's Standard Copyright and Disclaimer Notice: 2 | 3 | Copyright (c) 2016. The Regents of the University of California (Regents). All 4 | Rights Reserved. Permission to use, copy, modify, and distribute this software 5 | and its documentation for educational, research, and not-for-profit purposes, 6 | without fee and without a signed licensing agreement, is hereby granted, 7 | provided that the above copyright notice, this paragraph and the following 8 | two paragraphs appear in all copies, modifications, and distributions. 9 | Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, 10 | Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, for commercial licensing 11 | opportunities. 12 | 13 | Ronghang Hu, University of California, Berkeley. 14 | 15 | IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, 16 | INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF 17 | THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN 18 | ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 19 | 20 | REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, 21 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. 22 | THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS 23 | PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, 24 | UPDATES, ENHANCEMENTS, OR MODIFICATIONS. 
25 | 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Natural Language Object Retrieval 2 | This repository contains the code for the following paper: 3 | 4 | * R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, T. Darrell, *Natural Language Object Retrieval*, in Computer Vision and Pattern Recognition (CVPR), 2016 ([PDF](http://arxiv.org/pdf/1511.04164)) 5 | ``` 6 | @article{hu2016natural, 7 | title={Natural Language Object Retrieval}, 8 | author={Hu, Ronghang and Xu, Huazhe and Rohrbach, Marcus and Feng, Jiashi and Saenko, Kate and Darrell, Trevor}, 9 | journal={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, 10 | year={2016} 11 | } 12 | ``` 13 | 14 | Project Page: http://ronghanghu.com/text_obj_retrieval 15 | 16 | ## Installation 17 | 1. Download this repository or clone it with Git, and then `cd` into the root directory of the repository. 18 | 2. Run `./external/download_caffe.sh` to download the SCRC Caffe version for this experiment. It will be downloaded and unzipped into `external/caffe-natural-language-object-retrieval`. This version is modified from the [Caffe LRCN implementation](http://jeffdonahue.com/lrcn/). 19 | 3. Build the SCRC Caffe version in `external/caffe-natural-language-object-retrieval`, following the [Caffe installation instructions](http://caffe.berkeleyvision.org/installation.html). **Remember to also build pycaffe.** 20 | 21 | ## SCRC demo 22 | 1. Download the pretrained models with `./models/download_trained_models.sh`. 23 | 2. Run the SCRC demo in `./demo/retrieval_demo.ipynb` with [Jupyter Notebook (IPython Notebook)](http://ipython.org/notebook.html). A scripted version of the demo is sketched at the end of this README. 24 | 25 | ![Image](http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/scrc_demo.jpg) 26 | 27 | ## Train and evaluate SCRC model on ReferIt Dataset 28 | 1. Download the ReferIt dataset: `./datasets/download_referit_dataset.sh`. 29 | 2. Download pre-extracted EdgeBox proposals: `./data/download_edgebox_proposals.sh`. 30 | 3. You may need to add the SCRC root directory to Python's module path: `export PYTHONPATH=.:$PYTHONPATH`. 31 | 4. Preprocess the ReferIt dataset to generate metadata needed for training and evaluation: `python ./exp-referit/preprocess_dataset.py`. 32 | 5. Cache the scene-level contextual features to disk: `python ./exp-referit/cache_referit_context_features.py`. 33 | 6. Build training image lists and HDF5 batches: `python ./exp-referit/cache_referit_training_batches.py`. 34 | 7. Initialize the model parameters and train with SGD: `python ./exp-referit/initialize_weights_scrc_full.py && ./exp-referit/train_scrc_full_on_referit.sh`. 35 | 8. Evaluate the trained model: `python ./exp-referit/test_scrc_on_referit.py`. 36 | 37 | Optionally, you may also train an SCRC version without the contextual feature, using `python ./exp-referit/initialize_weights_scrc_no_context.py && ./exp-referit/train_scrc_no_context_on_referit.sh`. 38 | 39 | ## Train and evaluate SCRC model on Kitchen Dataset 40 | 1. Download the Kitchen dataset: `./datasets/download_kitchen_dataset.sh`. 41 | 2. You may need to add the SCRC root directory to Python's module path: `export PYTHONPATH=.:$PYTHONPATH`. 42 | 3. Build training image lists and HDF5 batches: `python exp-kitchen/cache_kitchen_training_batches.py`. 43 | 4. Train with SGD: `./exp-kitchen/train_scrc_kitchen.sh`. 44 | 5. Evaluate the trained model: `python exp-kitchen/test_scrc_on_kitchen.py`.
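## Scripted retrieval example

The demo in `./demo/retrieval_demo.ipynb` boils down to scoring a query sentence against candidate boxes, as done in `exp-referit/test_scrc_on_referit.py`. Below is a minimal sketch pieced together from that script, using the demo image and its precomputed EdgeBox proposals in `./demo/demo_data/`; the query string is illustrative, and it assumes the pretrained models have been downloaded and pycaffe built as described above:

```python
import sys
import numpy as np
import skimage.io
sys.path.append('./external/caffe-natural-language-object-retrieval/python/')
sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/')
import caffe  # pycaffe must be built first

from captioner import Captioner
import retriever

# Load the no-context SCRC model (downloaded by ./models/download_trained_models.sh)
captioner = Captioner('./models/scrc_no_context_vgg.caffemodel',
                      './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt',
                      './prototxt/scrc_word_to_preds_no_context.prototxt',
                      './data/vocabulary.txt', 0)  # last argument is the GPU id
captioner.set_image_batch_size(50)
vocab_dict = retriever.build_vocab_dict_from_captioner(captioner)

# Demo image and its top-100 EdgeBox proposals shipped with the repository
im = skimage.io.imread('./demo/demo_data/40429.jpg')
candidate_boxes = np.loadtxt('./demo/demo_data/40429.txt').astype(int).reshape((-1, 4))

# Local descriptor = per-box visual feature + spatial feature
descriptors = retriever.compute_descriptors_edgebox(captioner, im, candidate_boxes)
spatial_feats = retriever.compute_spatial_feat(candidate_boxes,
                                               np.array([im.shape[1], im.shape[0]]))
descriptors = np.concatenate((descriptors, spatial_feats), axis=1)

# Score every candidate box against the query and keep the highest-scoring one
query = 'picture frame on the wall'  # an illustrative description
scores = retriever.score_descriptors(descriptors, query, captioner, vocab_dict)
print(candidate_boxes[np.argmax(scores)])  # [x_min, y_min, x_max, y_max]
```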
45 | -------------------------------------------------------------------------------- /data/download_edgebox_proposals.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./data/referit_edgeboxes_top100.zip http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/referit_edgeboxes_top100.zip 3 | unzip ./data/referit_edgeboxes_top100.zip -d ./data/ 4 | -------------------------------------------------------------------------------- /data/metadata/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/data/metadata/.keep -------------------------------------------------------------------------------- /data/split/referit_val_imlist.txt: -------------------------------------------------------------------------------- 1 | 26953 2 | 9849 3 | 27582 4 | 7599 5 | 10917 6 | 12784 7 | 6834 8 | 15896 9 | 10784 10 | 11446 11 | 8010 12 | 8900 13 | 18817 14 | 1319 15 | 10398 16 | 15594 17 | 14964 18 | 37838 19 | 8588 20 | 32036 21 | 22291 22 | 51 23 | 20277 24 | 2983 25 | 39011 26 | 9962 27 | 7336 28 | 16305 29 | 37903 30 | 13424 31 | 19975 32 | 22642 33 | 37667 34 | 37167 35 | 19456 36 | 15833 37 | 32716 38 | 9152 39 | 10334 40 | 11837 41 | 39153 42 | 2065 43 | 13066 44 | 31689 45 | 30188 46 | 31155 47 | 37681 48 | 25207 49 | 13295 50 | 30235 51 | 7369 52 | 12012 53 | 32356 54 | 31636 55 | 1957 56 | 39261 57 | 4117 58 | 31085 59 | 20953 60 | 35682 61 | 7544 62 | 8169 63 | 30938 64 | 7136 65 | 26920 66 | 3086 67 | 13854 68 | 8704 69 | 30883 70 | 35578 71 | 6764 72 | 22860 73 | 30192 74 | 17903 75 | 38048 76 | 8834 77 | 37313 78 | 6956 79 | 4993 80 | 18376 81 | 844 82 | 10111 83 | 799 84 | 9636 85 | 37220 86 | 37675 87 | 807 88 | 19128 89 | 15889 90 | 39980 91 | 31974 92 | 9370 93 | 31788 94 | 5003 95 | 38971 96 | 12601 97 | 18080 98 | 4828 99 | 35615 100 | 18329 101 | 7740 102 | 37147 103 | 5011 104 | 10285 105 | 40402 106 | 24027 107 | 39468 108 | 8209 109 | 30314 110 | 32284 111 | 14784 112 | 929 113 | 3593 114 | 25239 115 | 3908 116 | 2948 117 | 16330 118 | 37776 119 | 14840 120 | 23339 121 | 22801 122 | 30378 123 | 7929 124 | 8316 125 | 11063 126 | 40553 127 | 31068 128 | 18113 129 | 18494 130 | 37853 131 | 32540 132 | 37234 133 | 10603 134 | 8384 135 | 1842 136 | 6323 137 | 22994 138 | 30521 139 | 19114 140 | 30968 141 | 19642 142 | 8208 143 | 35863 144 | 27662 145 | 30504 146 | 15700 147 | 27689 148 | 38704 149 | 32583 150 | 2993 151 | 25377 152 | 30073 153 | 6283 154 | 9413 155 | 30234 156 | 32206 157 | 39139 158 | 21145 159 | 38926 160 | 35943 161 | 21328 162 | 14116 163 | 2658 164 | 17537 165 | 30704 166 | 37715 167 | 3269 168 | 4778 169 | 26020 170 | 19813 171 | 33478 172 | 700 173 | 33374 174 | 31806 175 | 12896 176 | 9684 177 | 31262 178 | 798 179 | 22873 180 | 4313 181 | 2389 182 | 1142 183 | 30367 184 | 22158 185 | 7331 186 | 13857 187 | 4168 188 | 58 189 | 30616 190 | 31607 191 | 18883 192 | 8596 193 | 31019 194 | 32114 195 | 16152 196 | 16342 197 | 20902 198 | 5104 199 | 37284 200 | 12778 201 | 7055 202 | 10010 203 | 9588 204 | 13516 205 | 38217 206 | 16674 207 | 15403 208 | 32577 209 | 40585 210 | 31944 211 | 22644 212 | 10974 213 | 38112 214 | 23014 215 | 39051 216 | 37218 217 | 6869 218 | 15779 219 | 1862 220 | 16886 221 | 30565 222 | 16812 223 | 39753 224 | 37286 225 | 40451 226 | 9015 227 | 10233 228 | 11117 229 | 30231 230 | 20761 231 | 37844 232 | 19565 
233 | 13637 234 | 32720 235 | 40101 236 | 19696 237 | 18460 238 | 11345 239 | 12516 240 | 37232 241 | 23730 242 | 19408 243 | 8440 244 | 7008 245 | 3015 246 | 17973 247 | 33420 248 | 4952 249 | 2804 250 | 20163 251 | 11151 252 | 27618 253 | 19698 254 | 8741 255 | 12350 256 | 26025 257 | 31572 258 | 32620 259 | 31147 260 | 40261 261 | 18677 262 | 9749 263 | 4662 264 | 7527 265 | 35841 266 | 39099 267 | 3837 268 | 10158 269 | 6545 270 | 15672 271 | 39740 272 | 8458 273 | 19442 274 | 19779 275 | 6876 276 | 40638 277 | 7161 278 | 37550 279 | 17582 280 | 10890 281 | 12925 282 | 21776 283 | 3277 284 | 7654 285 | 39675 286 | 40428 287 | 9984 288 | 30537 289 | 1149 290 | 32704 291 | 10119 292 | 11709 293 | 40054 294 | 8407 295 | 8845 296 | 10035 297 | 7066 298 | 16355 299 | 32389 300 | 811 301 | 35801 302 | 39669 303 | 21152 304 | 7721 305 | 31508 306 | 37128 307 | 32253 308 | 13017 309 | 19553 310 | 14977 311 | 39074 312 | 31409 313 | 7535 314 | 17047 315 | 59 316 | 10299 317 | 8991 318 | 30644 319 | 14757 320 | 11391 321 | 21319 322 | 32098 323 | 10631 324 | 39949 325 | 32073 326 | 9483 327 | 38690 328 | 17501 329 | 17776 330 | 1394 331 | 16222 332 | 15966 333 | 35885 334 | 2581 335 | 15316 336 | 6406 337 | 6661 338 | 12651 339 | 37868 340 | 38011 341 | 38185 342 | 30779 343 | 9490 344 | 30952 345 | 32080 346 | 10506 347 | 40399 348 | 37441 349 | 10157 350 | 10896 351 | 2386 352 | 37150 353 | 32329 354 | 6981 355 | 16098 356 | 20203 357 | 39909 358 | 38841 359 | 18941 360 | 25191 361 | 30224 362 | 39239 363 | 1003 364 | 40165 365 | 22447 366 | 22911 367 | 30366 368 | 8721 369 | 24277 370 | 14704 371 | 2889 372 | 31498 373 | 8166 374 | 12495 375 | 17026 376 | 5159 377 | 34160 378 | 17725 379 | 19630 380 | 31506 381 | 18753 382 | 14905 383 | 20983 384 | 40461 385 | 32489 386 | 25607 387 | 39971 388 | 14551 389 | 2531 390 | 7039 391 | 4965 392 | 22711 393 | 9251 394 | 10705 395 | 12741 396 | 27460 397 | 40372 398 | 3362 399 | 32203 400 | 11493 401 | 40563 402 | 20595 403 | 19295 404 | 10559 405 | 22605 406 | 10719 407 | 31296 408 | 20864 409 | 970 410 | 21956 411 | 30374 412 | 13416 413 | 30764 414 | 8639 415 | 12388 416 | 5196 417 | 16391 418 | 11401 419 | 11714 420 | 38047 421 | 36042 422 | 12764 423 | 8075 424 | 7188 425 | 11424 426 | 15084 427 | 18895 428 | 37538 429 | 7159 430 | 9969 431 | 32824 432 | 3151 433 | 21606 434 | 6872 435 | 30259 436 | 9528 437 | 38115 438 | 40640 439 | 7581 440 | 7052 441 | 31381 442 | 30552 443 | 31005 444 | 13552 445 | 4642 446 | 9343 447 | 22189 448 | 5160 449 | 17146 450 | 32813 451 | 11582 452 | 32286 453 | 2145 454 | 9477 455 | 40300 456 | 20769 457 | 13186 458 | 2541 459 | 22182 460 | 31769 461 | 7140 462 | 31421 463 | 2485 464 | 27686 465 | 25236 466 | 9498 467 | 40156 468 | 11782 469 | 8566 470 | 12815 471 | 31412 472 | 8978 473 | 32108 474 | 38888 475 | 27679 476 | 35662 477 | 3071 478 | 3066 479 | 10835 480 | 31289 481 | 21941 482 | 1764 483 | 31964 484 | 39216 485 | 24982 486 | 14049 487 | 10421 488 | 9647 489 | 13431 490 | 22997 491 | 32420 492 | 32045 493 | 19772 494 | 11399 495 | 15353 496 | 6598 497 | 23940 498 | 23177 499 | 18167 500 | 19112 501 | 21855 502 | 6281 503 | 894 504 | 4805 505 | 797 506 | 39630 507 | 21180 508 | 6802 509 | 40327 510 | 18043 511 | 15429 512 | 1720 513 | 9027 514 | 3103 515 | 11074 516 | 20664 517 | 39607 518 | 3248 519 | 24914 520 | 8359 521 | 24490 522 | 20481 523 | 35939 524 | 13495 525 | 2794 526 | 37621 527 | 31948 528 | 19039 529 | 8303 530 | 14472 531 | 1234 532 | 25968 533 | 17850 534 | 39764 535 | 9739 536 | 
15541 537 | 6527 538 | 13123 539 | 20259 540 | 7736 541 | 31063 542 | 37723 543 | 31451 544 | 31545 545 | 18287 546 | 31276 547 | 8686 548 | 20141 549 | 37971 550 | 10220 551 | 13451 552 | 17393 553 | 38719 554 | 10550 555 | 21607 556 | 21219 557 | 40121 558 | 27384 559 | 12151 560 | 11685 561 | 15157 562 | 12103 563 | 10843 564 | 32498 565 | 30238 566 | 39976 567 | 33364 568 | 26725 569 | 7986 570 | 9614 571 | 35737 572 | 10107 573 | 31354 574 | 23463 575 | 26277 576 | 17723 577 | 30228 578 | 8344 579 | 922 580 | 26141 581 | 2275 582 | 32461 583 | 10444 584 | 13257 585 | 32722 586 | 10004 587 | 22133 588 | 40362 589 | 14475 590 | 39746 591 | 11950 592 | 3811 593 | 26325 594 | 21637 595 | 11218 596 | 10919 597 | 35964 598 | 19650 599 | 32218 600 | 35665 601 | 38958 602 | 8121 603 | 27006 604 | 9354 605 | 38929 606 | 8680 607 | 18869 608 | 30689 609 | 8217 610 | 40292 611 | 31396 612 | 7353 613 | 2119 614 | 9186 615 | 11112 616 | 865 617 | 18934 618 | 10756 619 | 13328 620 | 1030 621 | 3339 622 | 39027 623 | 39096 624 | 39142 625 | 685 626 | 1354 627 | 20024 628 | 3984 629 | 12519 630 | 18574 631 | 20696 632 | 22586 633 | 9218 634 | 30740 635 | 24258 636 | 23166 637 | 8514 638 | 15695 639 | 31090 640 | 14579 641 | 7007 642 | 8464 643 | 37369 644 | 39368 645 | 25781 646 | 19748 647 | 10689 648 | 39706 649 | 2232 650 | 10911 651 | 37085 652 | 22570 653 | 9159 654 | 7738 655 | 40060 656 | 8092 657 | 26428 658 | 5075 659 | 3992 660 | 19403 661 | 40379 662 | 22334 663 | 19174 664 | 11684 665 | 11334 666 | 26972 667 | 38134 668 | 17761 669 | 37850 670 | 10374 671 | 4953 672 | 37980 673 | 6461 674 | 22093 675 | 37917 676 | 4212 677 | 7678 678 | 9121 679 | 12169 680 | 2100 681 | 14474 682 | 32424 683 | 3659 684 | 14036 685 | 13255 686 | 9850 687 | 7253 688 | 4265 689 | 6593 690 | 30853 691 | 24670 692 | 37390 693 | 27604 694 | 19334 695 | 23438 696 | 1159 697 | 37149 698 | 37629 699 | 30147 700 | 22234 701 | 31685 702 | 27296 703 | 38927 704 | 3786 705 | 23123 706 | 15300 707 | 25504 708 | 22762 709 | 27542 710 | 17997 711 | 8685 712 | 20102 713 | 19757 714 | 30871 715 | 10425 716 | 20799 717 | 30777 718 | 18208 719 | 30700 720 | 22936 721 | 14883 722 | 2701 723 | 19744 724 | 8191 725 | 19001 726 | 19664 727 | 19239 728 | 10217 729 | 15307 730 | 14978 731 | 7940 732 | 37070 733 | 10127 734 | 14590 735 | 30805 736 | 39726 737 | 30985 738 | 1844 739 | 21628 740 | 27238 741 | 8944 742 | 8002 743 | 38918 744 | 31024 745 | 27506 746 | 31366 747 | 2979 748 | 24320 749 | 31916 750 | 4736 751 | 15870 752 | 6553 753 | 7074 754 | 37056 755 | 1348 756 | 8463 757 | 4722 758 | 19461 759 | 943 760 | 11014 761 | 6542 762 | 30882 763 | 31860 764 | 40557 765 | 39893 766 | 17646 767 | 19381 768 | 11602 769 | 24979 770 | 7856 771 | 9702 772 | 37303 773 | 32188 774 | 9902 775 | 19303 776 | 22463 777 | 1375 778 | 19071 779 | 10839 780 | 8893 781 | 35945 782 | 9475 783 | 1835 784 | 19395 785 | 39422 786 | 31311 787 | 8001 788 | 27044 789 | 6299 790 | 6255 791 | 14112 792 | 16839 793 | 24492 794 | 37946 795 | 14171 796 | 30951 797 | 39623 798 | 13311 799 | 32493 800 | 22682 801 | 30852 802 | 26281 803 | 35788 804 | 12405 805 | 2308 806 | 40403 807 | 18269 808 | 8822 809 | 3249 810 | 2628 811 | 6474 812 | 15404 813 | 10054 814 | 13026 815 | 27482 816 | 16372 817 | 35935 818 | 4858 819 | 14441 820 | 26337 821 | 21311 822 | 31113 823 | 11046 824 | 40660 825 | 7481 826 | 37896 827 | 7123 828 | 7037 829 | 37536 830 | 10716 831 | 10324 832 | 37458 833 | 37320 834 | 20507 835 | 3345 836 | 31468 837 | 31475 838 | 12244 839 | 
10929 840 | 17279 841 | 39145 842 | 23250 843 | 12986 844 | 23860 845 | 32784 846 | 16046 847 | 4872 848 | 2441 849 | 30309 850 | 8742 851 | 16078 852 | 12919 853 | 19409 854 | 6322 855 | 24677 856 | 22908 857 | 1343 858 | 16759 859 | 21823 860 | 3618 861 | 22645 862 | 2165 863 | 27633 864 | 40267 865 | 17758 866 | 11076 867 | 14528 868 | 32891 869 | 9606 870 | 37524 871 | 11575 872 | 7674 873 | 25227 874 | 21608 875 | 8160 876 | 31867 877 | 21833 878 | 26497 879 | 16314 880 | 13881 881 | 19952 882 | 30967 883 | 37640 884 | 11145 885 | 30773 886 | 8604 887 | 37172 888 | 7346 889 | 22861 890 | 11071 891 | 15564 892 | 33426 893 | 37717 894 | 20757 895 | 7456 896 | 15679 897 | 16822 898 | 17035 899 | 35579 900 | 17262 901 | 32445 902 | 3856 903 | 3585 904 | 2805 905 | 32805 906 | 15441 907 | 19756 908 | 31188 909 | 10353 910 | 18843 911 | 30424 912 | 7989 913 | 22546 914 | 4005 915 | 34135 916 | 8433 917 | 37806 918 | 13715 919 | 30729 920 | 10009 921 | 18135 922 | 31470 923 | 38098 924 | 3081 925 | 30774 926 | 7082 927 | 30187 928 | 5123 929 | 26336 930 | 2508 931 | 934 932 | 8775 933 | 30866 934 | 17963 935 | 3907 936 | 2194 937 | 22453 938 | 4006 939 | 10720 940 | 31436 941 | 23015 942 | 7569 943 | 30353 944 | 21648 945 | 2934 946 | 33525 947 | 13727 948 | 31758 949 | 15342 950 | 8054 951 | 30793 952 | 32264 953 | 19354 954 | 18036 955 | 2588 956 | 3862 957 | 32453 958 | 39415 959 | 3155 960 | 8591 961 | 18145 962 | 9237 963 | 16762 964 | 10619 965 | 37845 966 | 19564 967 | 18492 968 | 31656 969 | 37573 970 | 19736 971 | 9516 972 | 13511 973 | 18944 974 | 14557 975 | 26805 976 | 15379 977 | 4964 978 | 17187 979 | 17607 980 | 7330 981 | 13856 982 | 6962 983 | 15452 984 | 23186 985 | 39494 986 | 8272 987 | 30811 988 | 19217 989 | 10620 990 | 27619 991 | 13261 992 | 35643 993 | 9074 994 | 14485 995 | 5064 996 | 16854 997 | 18730 998 | 38034 999 | 25552 1000 | 25628 1001 | -------------------------------------------------------------------------------- /data/training/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/data/training/.keep -------------------------------------------------------------------------------- /datasets/ReferIt/ImageCLEF/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/datasets/ReferIt/ImageCLEF/.keep -------------------------------------------------------------------------------- /datasets/ReferIt/ReferitData/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/datasets/ReferIt/ReferitData/.keep -------------------------------------------------------------------------------- /datasets/download_kitchen_dataset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./datasets/Kitchen.tar.gz http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/Kitchen.tar.gz 3 | tar -xzvf ./datasets/Kitchen.tar.gz -C ./datasets/ 4 | cp ./datasets/Kitchen/split/*.txt ./data/split/ 5 | cp ./datasets/Kitchen/annotation/*.json ./data/metadata/ 6 | -------------------------------------------------------------------------------- 
/datasets/download_referit_dataset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./datasets/ReferIt/ReferitData/ReferitData.zip http://tamaraberg.com/referitgame/ReferitData.zip 3 | unzip ./datasets/ReferIt/ReferitData/ReferitData.zip -d ./datasets/ReferIt/ReferitData/ 4 | wget -O ./datasets/ReferIt/ImageCLEF/referitdata.tar.gz http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/referitdata.tar.gz 5 | tar -xzvf ./datasets/ReferIt/ImageCLEF/referitdata.tar.gz -C ./datasets/ReferIt/ImageCLEF/ 6 | -------------------------------------------------------------------------------- /demo/demo_data/40429.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/demo/demo_data/40429.jpg -------------------------------------------------------------------------------- /demo/demo_data/40429.txt: -------------------------------------------------------------------------------- 1 | 1.2600000e+02 2.3400000e+02 2.0000000e+02 3.0500000e+02 2 | 6.9000000e+01 6.9000000e+01 2.5900000e+02 3.3900000e+02 3 | 7.3000000e+01 7.0000000e+01 3.0300000e+02 3.2000000e+02 4 | 0.0000000e+00 6.6000000e+01 4.7800000e+02 3.5800000e+02 5 | 1.9900000e+02 6.0000000e+01 4.7800000e+02 3.1400000e+02 6 | 9.6000000e+01 6.0000000e+01 4.7800000e+02 3.2800000e+02 7 | 7.5000000e+01 2.2000000e+01 2.6000000e+02 3.0600000e+02 8 | 0.0000000e+00 4.8000000e+01 2.8200000e+02 3.2600000e+02 9 | 1.1300000e+02 4.8000000e+01 2.8700000e+02 3.1400000e+02 10 | 1.1400000e+02 1.8000000e+02 2.1400000e+02 3.2500000e+02 11 | 8.1000000e+01 2.2000000e+01 4.1800000e+02 3.2800000e+02 12 | 1.1800000e+02 2.2300000e+02 2.0100000e+02 3.1900000e+02 13 | 9.3000000e+01 1.5400000e+02 2.9600000e+02 3.4000000e+02 14 | 6.3000000e+01 7.4000000e+01 2.1900000e+02 3.2400000e+02 15 | 1.1400000e+02 2.3400000e+02 2.1500000e+02 3.0400000e+02 16 | 1.1100000e+02 5.8000000e+01 3.8400000e+02 3.2800000e+02 17 | 1.5700000e+02 6.0000000e+01 4.3900000e+02 3.2800000e+02 18 | 0.0000000e+00 7.5000000e+01 2.1800000e+02 3.4400000e+02 19 | 0.0000000e+00 1.4600000e+02 4.7800000e+02 3.5800000e+02 20 | 5.3000000e+01 7.0000000e+01 3.0900000e+02 2.7100000e+02 21 | 3.3000000e+01 4.5000000e+01 2.6000000e+02 2.6900000e+02 22 | 6.9000000e+01 7.4000000e+01 2.5900000e+02 2.6300000e+02 23 | 1.1200000e+02 8.1000000e+01 2.1800000e+02 3.0400000e+02 24 | 2.0400000e+02 7.0000000e+01 4.2100000e+02 2.7300000e+02 25 | 0.0000000e+00 1.5000000e+02 2.1800000e+02 3.2600000e+02 26 | 2.0600000e+02 1.1300000e+02 4.7800000e+02 3.0700000e+02 27 | 7.9000000e+01 1.2800000e+02 2.1400000e+02 3.2600000e+02 28 | 3.6000000e+01 7.5000000e+01 3.6400000e+02 3.1300000e+02 29 | 8.0000000e+01 1.1300000e+02 4.7800000e+02 3.3900000e+02 30 | 9.9000000e+01 1.8000000e+02 2.1900000e+02 3.0600000e+02 31 | 3.7000000e+01 1.5400000e+02 2.1300000e+02 3.4000000e+02 32 | 1.0800000e+02 8.3000000e+01 2.3100000e+02 3.3900000e+02 33 | 1.6200000e+02 2.3000000e+01 3.9300000e+02 3.3900000e+02 34 | 9.8000000e+01 1.8000000e+02 2.7600000e+02 3.2800000e+02 35 | 0.0000000e+00 2.4000000e+01 4.7800000e+02 2.8800000e+02 36 | 0.0000000e+00 4.6000000e+01 2.1800000e+02 2.9800000e+02 37 | 1.1100000e+02 1.2900000e+02 2.1300000e+02 3.2500000e+02 38 | 1.5600000e+02 3.1000000e+01 4.7800000e+02 2.8900000e+02 39 | 1.2000000e+02 7.4000000e+01 2.5900000e+02 3.4200000e+02 40 | 3.7000000e+01 1.2000000e+02 2.1300000e+02 3.1800000e+02 41 | 
1.1500000e+02 2.3400000e+02 2.0900000e+02 3.4000000e+02 42 | 0.0000000e+00 1.0800000e+02 2.7800000e+02 3.5200000e+02 43 | 0.0000000e+00 4.7000000e+01 3.9400000e+02 2.9800000e+02 44 | 1.9300000e+02 7.0000000e+01 3.0500000e+02 2.7100000e+02 45 | 7.1000000e+01 1.8000000e+02 2.0100000e+02 3.2100000e+02 46 | 9.6000000e+01 4.8000000e+01 2.4200000e+02 3.2600000e+02 47 | 8.2000000e+01 1.2700000e+02 3.0000000e+02 3.0400000e+02 48 | 9.5000000e+01 2.2300000e+02 1.9500000e+02 3.1900000e+02 49 | 1.1100000e+02 1.5300000e+02 4.3900000e+02 3.4000000e+02 50 | 9.1000000e+01 1.9600000e+02 1.9500000e+02 3.0500000e+02 51 | 2.0400000e+02 7.5000000e+01 2.6000000e+02 2.6300000e+02 52 | 6.8000000e+01 1.5300000e+02 2.1300000e+02 3.0900000e+02 53 | 0.0000000e+00 1.7900000e+02 2.1800000e+02 3.5200000e+02 54 | 8.7000000e+01 1.7800000e+02 2.9900000e+02 3.1000000e+02 55 | 1.2400000e+02 1.8600000e+02 2.1300000e+02 3.0500000e+02 56 | 3.6000000e+01 1.0200000e+02 3.1100000e+02 3.3600000e+02 57 | 9.3000000e+01 1.5500000e+02 2.0600000e+02 3.2000000e+02 58 | 9.8000000e+01 1.3000000e+02 2.5400000e+02 3.4100000e+02 59 | 4.6000000e+01 7.4000000e+01 2.1800000e+02 2.5500000e+02 60 | 6.9000000e+01 1.3000000e+02 3.6400000e+02 3.4500000e+02 61 | 7.2000000e+01 2.2200000e+02 2.0000000e+02 3.0400000e+02 62 | 0.0000000e+00 1.5400000e+02 3.0600000e+02 3.2400000e+02 63 | 1.6400000e+02 6.9000000e+01 2.7200000e+02 3.0100000e+02 64 | 2.4300000e+02 5.7000000e+01 4.4000000e+02 3.2900000e+02 65 | 2.9000000e+01 1.5300000e+02 2.5900000e+02 3.1900000e+02 66 | 1.2400000e+02 1.5100000e+02 2.1400000e+02 3.4000000e+02 67 | 0.0000000e+00 1.1300000e+02 3.8400000e+02 3.3400000e+02 68 | 1.9900000e+02 5.9000000e+01 3.8000000e+02 3.1300000e+02 69 | 2.3200000e+02 1.5300000e+02 4.7800000e+02 3.2800000e+02 70 | 7.3000000e+01 4.7000000e+01 4.7800000e+02 2.6900000e+02 71 | 1.8900000e+02 5.5000000e+01 3.0400000e+02 3.1400000e+02 72 | 2.8000000e+01 1.9100000e+02 2.1400000e+02 3.2400000e+02 73 | 1.1300000e+02 1.7900000e+02 3.6400000e+02 3.4100000e+02 74 | 1.2000000e+02 2.3400000e+02 1.9500000e+02 2.8600000e+02 75 | 1.1400000e+02 1.1100000e+02 3.9200000e+02 3.3900000e+02 76 | 5.6000000e+01 2.2300000e+02 2.0200000e+02 3.1900000e+02 77 | 2.4300000e+02 1.5300000e+02 4.7800000e+02 2.8800000e+02 78 | 1.8000000e+02 3.2000000e+01 3.8500000e+02 2.7700000e+02 79 | 1.3300000e+02 9.7000000e+01 4.7800000e+02 3.0400000e+02 80 | 1.9700000e+02 6.0000000e+01 2.8700000e+02 3.0100000e+02 81 | 1.1300000e+02 1.3000000e+02 4.1900000e+02 3.1300000e+02 82 | 1.5300000e+02 6.0000000e+01 3.0500000e+02 3.0100000e+02 83 | 1.2000000e+02 1.5200000e+02 2.6000000e+02 3.4100000e+02 84 | 1.1300000e+02 1.9100000e+02 4.3100000e+02 3.2900000e+02 85 | 2.5000000e+02 6.0000000e+01 4.6200000e+02 2.7300000e+02 86 | 9.3000000e+01 2.1900000e+02 2.9800000e+02 3.1900000e+02 87 | 1.2300000e+02 1.9600000e+02 2.0000000e+02 3.1300000e+02 88 | 1.1400000e+02 2.2400000e+02 2.9800000e+02 3.4100000e+02 89 | 1.5800000e+02 3.2000000e+01 2.8200000e+02 2.8800000e+02 90 | 1.8900000e+02 1.1300000e+02 3.9200000e+02 3.3000000e+02 91 | 0.0000000e+00 2.2200000e+02 2.1200000e+02 3.5200000e+02 92 | 1.8300000e+02 2.4000000e+01 3.2800000e+02 3.0100000e+02 93 | 2.5000000e+02 1.9100000e+02 3.6600000e+02 3.2900000e+02 94 | 1.1500000e+02 2.5400000e+02 2.0100000e+02 3.0400000e+02 95 | 2.0400000e+02 5.9000000e+01 4.7800000e+02 2.3700000e+02 96 | 2.0600000e+02 1.1300000e+02 2.5600000e+02 2.6200000e+02 97 | 1.2000000e+02 2.0900000e+02 2.1500000e+02 3.1300000e+02 98 | 0.0000000e+00 6.6000000e+01 2.1800000e+02 2.4900000e+02 99 | 0.0000000e+00 
1.5200000e+02 1.3800000e+02 3.5300000e+02 100 | 1.3800000e+02 7.2000000e+01 2.6000000e+02 2.6900000e+02 101 | -------------------------------------------------------------------------------- /exp-kitchen/cache_kitchen_training_batches.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | import os 4 | import numpy as np 5 | 6 | import util 7 | import retriever 8 | 9 | trn_imlist_file = './data/split/kitchen_trainval_imlist.txt' 10 | 11 | image_dir = './datasets/Kitchen/images/Kitchen/' 12 | query_file = './data/metadata/kitchen_query_dict.json' 13 | vocab_file = './data/vocabulary.txt' 14 | 15 | N_batch = 50 # batch size during training 16 | T = 20 # unroll timestep of LSTM 17 | 18 | save_image_list_file = './data/kitchen_train_image_list.txt' 19 | save_hdf5_list_file = './data/kitchen_train_hdf5_list.txt' 20 | save_hdf5_dir = './data/kitchen_hdf5_50/' 21 | 22 | imset = set(util.io.load_str_list(trn_imlist_file)) 23 | vocab_dict = retriever.build_vocab_dict_from_file(vocab_file) 24 | query_dict = util.io.load_json(query_file) 25 | 26 | train_pairs = [] 27 | for imname, des in query_dict.iteritems(): 28 | if imname not in imset: 29 | continue 30 | train_pairs += [(imname, d) for d in des] 31 | 32 | # random shuffle training pairs 33 | np.random.seed(3) 34 | perm_idx = np.random.permutation(np.arange(len(train_pairs))) 35 | train_pairs = [train_pairs[n] for n in perm_idx] 36 | 37 | num_train_pairs = len(train_pairs) 38 | num_train_pairs = num_train_pairs - num_train_pairs % N_batch 39 | train_pairs = train_pairs[:num_train_pairs] 40 | num_batch = int(num_train_pairs // N_batch) 41 | 42 | image_list = [] 43 | hdf5_list = [] 44 | 45 | # generate hdf5 files 46 | if not os.path.isdir(save_hdf5_dir): 47 | os.mkdir(save_hdf5_dir) 48 | for n_batch in range(num_batch): 49 | if (n_batch+1) % 10 == 0: 50 | print('writing batch %d / %d' % (n_batch+1, num_batch)) 51 | begin = n_batch * N_batch 52 | end = (n_batch + 1) * N_batch 53 | cont_sentences = np.zeros([T, N_batch], dtype=np.float32) 54 | input_sentences = np.zeros([T, N_batch], dtype=np.float32) 55 | target_sentences = np.zeros([T, N_batch], dtype=np.float32) 56 | for n_pair in range(begin, end): 57 | # Append 0 as dummy label 58 | image_path = image_dir + train_pairs[n_pair][0] + '.JPEG 0' # 0 as dummy label 59 | image_list.append(image_path) 60 | 61 | stream = retriever.sentence2vocab_indices(train_pairs[n_pair][1], vocab_dict) 62 | if len(stream) > T-1: 63 | stream = stream[:T-1] 64 | pad = T - 1 - len(stream) 65 | cont_sentences[:, n_pair-begin] = [0] + [1] * len(stream) + [0] * pad 66 | input_sentences[:, n_pair-begin] = [0] + stream + [-1] * pad 67 | target_sentences[:, n_pair-begin] = stream + [0] + [-1] * pad 68 | h5_filename = save_hdf5_dir + '%d_to_%d.h5' % (begin, end) 69 | retriever.write_batch_to_hdf5(h5_filename, cont_sentences, input_sentences, target_sentences) 70 | hdf5_list.append(h5_filename) 71 | 72 | util.io.save_str_list(image_list, save_image_list_file) 73 | util.io.save_str_list(hdf5_list, save_hdf5_list_file) 74 | -------------------------------------------------------------------------------- /exp-kitchen/caffemodel/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/exp-kitchen/caffemodel/.keep -------------------------------------------------------------------------------- 
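The caching script above packs each (image, description) pair into three T-by-N_batch arrays in the LRCN-style HDF5 layout: `cont_sentences` is 0 at the first timestep of a sequence and 1 while it continues, `input_sentences` is the word-index stream shifted right by one (with index 0 serving as the begin/end-of-sentence token), and `target_sentences` is the stream followed by 0, with -1 filling the padded tail so those timesteps are ignored by the loss. A minimal sketch of one batch column, with a hypothetical 3-word description and T = 8 (the scripts use T = 20):

```python
import numpy as np

T = 8                      # unroll length (illustration only; the scripts use T = 20)
stream = [4, 17, 9]        # hypothetical vocabulary indices of a 3-word description
pad = T - 1 - len(stream)

cont   = [0] + [1] * len(stream) + [0] * pad   # sequence-continuation gate
inputs = [0] + stream + [-1] * pad             # right-shifted words, 0 = begin-of-sentence
target = stream + [0] + [-1] * pad             # next-word labels, 0 = end, -1 = ignore

print(np.array([cont, inputs, target]))
# [[ 0  1  1  1  0  0  0  0]
#  [ 0  4 17  9 -1 -1 -1 -1]
#  [ 4 17  9  0 -1 -1 -1 -1]]
```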
/exp-kitchen/test_scrc_on_kitchen.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | import skimage.io 6 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 7 | sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/') 8 | import caffe 9 | 10 | import util 11 | from captioner import Captioner 12 | import retriever 13 | 14 | ################################################################################ 15 | # Test Parameters 16 | 17 | # distractor_set can be either "kitchen" or "imagenet" 18 | # For the "kitchen" experiment, the distractors are sampled from the test set itself 19 | # For the "imagenet" experiment, the distractors are sampled from ImageNet distractor images 20 | distractor_set = "kitchen" 21 | # Number of distractors sampled for each object 22 | distractor_per_object = 10 23 | 24 | pretrained_weights_path = './models/scrc_kitchen.caffemodel' 25 | 26 | gpu_id = 0 # the GPU to test the SCRC model 27 | 28 | tst_imlist_file = './data/split/kitchen_test_imlist.txt' 29 | ################################################################################ 30 | 31 | image_dir = './datasets/Kitchen/images/Kitchen/' 32 | 33 | if distractor_set == "kitchen": 34 | distractor_dir = image_dir 35 | distractor_imlist_file = tst_imlist_file 36 | else: 37 | distractor_dir = './datasets/Kitchen/images/ImageNET/' 38 | distractor_imlist_file = './data/split/kitchen_imagenet_imlist.txt' 39 | 40 | query_file = './data/metadata/kitchen_query_dict.json' 41 | vocab_file = './data/vocabulary.txt' 42 | 43 | # utilize the captioner module from LRCN 44 | lstm_net_proto = './prototxt/scrc_word_to_preds_no_spatial_no_context.prototxt' 45 | image_net_proto = './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt' 46 | captioner = Captioner(pretrained_weights_path, image_net_proto, lstm_net_proto, 47 | vocab_file, gpu_id) 48 | captioner.set_image_batch_size(50) 49 | vocab_dict = retriever.build_vocab_dict_from_captioner(captioner) 50 | 51 | # Load image and caption list 52 | imlist = util.io.load_str_list(tst_imlist_file) 53 | num_im = len(imlist) 54 | query_dict = util.io.load_json(query_file) 55 | 56 | # Load distractors 57 | distractor_list = util.io.load_str_list(distractor_imlist_file) 58 | num_distractors = len(distractor_list) 59 | 60 | # Sample distractor images for each test image 61 | distractor_ids_per_im = {} 62 | np.random.seed(3) # fix random seed for test repeatability 63 | for imname in imlist: 64 | # Sample distractor_per_object*2 distractors to make sure the test image 65 | # itself is not among the distractors 66 | distractor_ids = np.random.choice(num_distractors, 67 | distractor_per_object*2, replace=False) 68 | distractor_names = [distractor_list[n] for n in distractor_ids[:distractor_per_object]] 69 | # Use the second half if the imname is among the first half 70 | if imname not in distractor_names: 71 | distractor_ids_per_im[imname] = distractor_ids[:distractor_per_object] 72 | else: 73 | distractor_ids_per_im[imname] = distractor_ids[distractor_per_object:] 74 | 75 | # Compute descriptors for both object images and distractor images 76 | image_path_list = [image_dir+imname+'.JPEG' for imname in imlist] 77 | distractor_path_list = [distractor_dir+imname+'.JPEG' for imname in distractor_list] 78 | 79 | obj_descriptors = captioner.compute_descriptors(image_path_list) 80 | dis_descriptors =
captioner.compute_descriptors(distractor_path_list) 81 | 82 | ################################################################################ 83 | # Test top-1 precision 84 | correct_num = 0 85 | total_num = 0 86 | for n_im in range(num_im): 87 | print('testing image %d / %d' % (n_im, num_im)) 88 | imname = imlist[n_im] 89 | for sentence in query_dict[imname]: 90 | # compute test image (target object) score given the description sentence 91 | obj_score = retriever.score_descriptors(obj_descriptors[n_im:n_im+1, :], 92 | sentence, captioner, vocab_dict)[0] 93 | # compute distractor scores given the description sentence 94 | dis_idx = distractor_ids_per_im[imname] 95 | dis_scores = retriever.score_descriptors(dis_descriptors[dis_idx, :], 96 | sentence, captioner, vocab_dict) 97 | 98 | # for a retrieval to be correct, the object image must score higher than 99 | # all distractor images 100 | correct_num += np.all(obj_score > dis_scores) 101 | total_num += 1 102 | 103 | print('Top-1 precision on the whole test set: %f' % (correct_num/total_num)) 104 | ################################################################################ 105 | -------------------------------------------------------------------------------- /exp-kitchen/train_scrc_kitchen.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | GPU_ID=0 3 | WEIGHTS=./models/coco_pretrained_iter_100000.caffemodel 4 | 5 | caffe train \ 6 | -solver ./prototxt/scrc_kitchen_solver.prototxt \ 7 | -weights $WEIGHTS \ 8 | -gpu $GPU_ID 2>&1 9 | -------------------------------------------------------------------------------- /exp-referit/cache_referit_context_features.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | import sys 4 | import os 5 | import numpy as np 6 | import skimage.io 7 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 8 | sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/') 9 | import caffe 10 | 11 | import util 12 | from captioner import Captioner 13 | 14 | 15 | vgg_weights_path = './models/VGG_ILSVRC_16_layers.caffemodel' 16 | gpu_id = 0 17 | 18 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 19 | cached_context_features_dir = './data/referit_context_features/' 20 | 21 | 22 | image_net_proto = './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt' 23 | lstm_net_proto = './prototxt/scrc_word_to_preds_full.prototxt' 24 | vocab_file = './data/vocabulary.txt' 25 | 26 | captioner = Captioner(vgg_weights_path, image_net_proto, lstm_net_proto, vocab_file, gpu_id) 27 | batch_size = 100 28 | captioner.set_image_batch_size(batch_size) 29 | 30 | imlist = util.io.load_str_list('./data/split/referit_all_imlist.txt') 31 | num_im = len(imlist) 32 | 33 | # Load all images into memory 34 | loaded_images = [] 35 | for n_im in range(num_im): 36 | if n_im % 200 == 0: 37 | print('loading image %d / %d into memory' % (n_im, num_im)) 38 | 39 | im = skimage.io.imread(image_dir + imlist[n_im] + '.jpg') 40 | # Gray scale to RGB 41 | if im.ndim == 2: 42 | im = np.tile(im[..., np.newaxis], (1, 1, 3)) 43 | # RGBA to RGB 44 | im = im[:, :, :3] 45 | loaded_images.append(im) 46 | 47 | # Compute fc7 feature from loaded images, as whole image contextual feature 48 | descriptors = captioner.compute_descriptors(loaded_images, output_name='fc7') 49 | 50 | # Save computed contextual features 51 | if not os.path.isdir(cached_context_features_dir): 52 | 
os.mkdir(cached_context_features_dir) 53 | for n_im in range(num_im): 54 | if n_im % 200 == 0: 55 | print('saving contextual features %d / %d' % (n_im, num_im)) 56 | save_path = cached_context_features_dir + imlist[n_im] + '_fc7.npy' 57 | np.save(save_path, descriptors[n_im, :]) 58 | -------------------------------------------------------------------------------- /exp-referit/cache_referit_training_batches.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | import os 4 | import numpy as np 5 | 6 | import util 7 | import retriever 8 | 9 | trn_imlist_file = './data/split/referit_trainval_imlist.txt' 10 | 11 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 12 | resized_imcrop_dir = './data/resized_imcrop/' 13 | cached_context_features_dir = './data/referit_context_features/' 14 | 15 | imcrop_dict_file = './data/metadata/referit_imcrop_dict.json' 16 | imcrop_bbox_dict_file = './data/metadata/referit_imcrop_bbox_dict.json' 17 | imsize_dict_file = './data/metadata/referit_imsize_dict.json' 18 | query_file = './data/metadata/referit_query_dict.json' 19 | vocab_file = './data/vocabulary.txt' 20 | 21 | N_batch = 50 # batch size during training 22 | T = 20 # unroll timestep of LSTM 23 | 24 | save_imcrop_list_file = './data/training/train_bbox_context_imcrop_list.txt' 25 | save_wholeim_list_file = './data/training/train_bbox_context_wholeim_list.txt' 26 | save_hdf5_text_list_file = './data/training/train_bbox_context_hdf5_text_list.txt' 27 | save_hdf5_bbox_list_file = './data/training/train_bbox_context_hdf5_bbox_list.txt' 28 | save_hdf5_dir = './data/training/hdf5_50_bbox_context/' 29 | 30 | imset = set(util.io.load_str_list(trn_imlist_file)) 31 | vocab_dict = retriever.build_vocab_dict_from_file(vocab_file) 32 | query_dict = util.io.load_json(query_file) 33 | imsize_dict = util.io.load_json(imsize_dict_file) 34 | imcrop_bbox_dict = util.io.load_json(imcrop_bbox_dict_file) 35 | 36 | train_pairs = [] 37 | for imcrop_name, des in query_dict.iteritems(): 38 | imname = imcrop_name.split('_', 1)[0] 39 | if imname not in imset: 40 | continue 41 | imsize = np.array(imsize_dict[imname]) 42 | bbox = np.array(imcrop_bbox_dict[imcrop_name]) 43 | bbox_feat = retriever.compute_spatial_feat(bbox, imsize) 44 | context_feature = np.load(cached_context_features_dir + imname + '_fc7.npy') 45 | train_pairs += [(imcrop_name, d, bbox_feat, imname, context_feature) for d in des] 46 | 47 | # random shuffle training pairs 48 | np.random.seed(3) 49 | perm_idx = np.random.permutation(np.arange(len(train_pairs))) 50 | train_pairs = [train_pairs[n] for n in perm_idx] 51 | 52 | num_train_pairs = len(train_pairs) 53 | num_train_pairs = num_train_pairs - num_train_pairs % N_batch 54 | train_pairs = train_pairs[:num_train_pairs] 55 | num_batch = int(num_train_pairs // N_batch) 56 | 57 | imcrop_list = [] 58 | wholeim_list = [] 59 | hdf5_text_list = [] 60 | hdf5_bbox_list = [] 61 | 62 | # generate hdf5 files 63 | if not os.path.isdir(save_hdf5_dir): 64 | os.mkdir(save_hdf5_dir) 65 | for n_batch in range(num_batch): 66 | if (n_batch+1) % 100 == 0: 67 | print('writing batch %d / %d' % (n_batch+1, num_batch)) 68 | begin = n_batch * N_batch 69 | end = (n_batch + 1) * N_batch 70 | cont_sentences = np.zeros([T, N_batch], dtype=np.float32) 71 | input_sentences = np.zeros([T, N_batch], dtype=np.float32) 72 | target_sentences = np.zeros([T, N_batch], dtype=np.float32) 73 | bbox_coordinates = np.zeros([N_batch, 8], dtype=np.float32) 74 | 
fc7_context = np.zeros([N_batch, 4096], dtype=np.float32) 75 | for n_pair in range(begin, end): 76 | # Append 0 as dummy label 77 | imcrop_path = resized_imcrop_dir + train_pairs[n_pair][0] + '.png 0' 78 | imcrop_list.append(imcrop_path) 79 | # Append 0 as dummy label 80 | wholeim_path = image_dir + train_pairs[n_pair][3] + '.jpg 0' 81 | wholeim_list.append(wholeim_path) 82 | stream = retriever.sentence2vocab_indices(train_pairs[n_pair][1], 83 | vocab_dict) 84 | if len(stream) > T-1: 85 | stream = stream[:T-1] 86 | pad = T - 1 - len(stream) 87 | cont_sentences[:, n_pair-begin] = [0] + [1] * len(stream) + [0] * pad 88 | input_sentences[:, n_pair-begin] = [0] + stream + [-1] * pad 89 | target_sentences[:, n_pair-begin] = stream + [0] + [-1] * pad 90 | bbox_coordinates[n_pair-begin, :] = np.squeeze(train_pairs[n_pair][2]) 91 | fc7_context[n_pair-begin, :] = train_pairs[n_pair][4] 92 | h5_text_filename = save_hdf5_dir + 'text_%d_to_%d.h5' % (begin, end) 93 | h5_bbox_filename = save_hdf5_dir + 'bbox_context_%d_to_%d.h5' % (begin, end) 94 | retriever.write_batch_to_hdf5(h5_text_filename, cont_sentences, 95 | input_sentences, target_sentences) 96 | retriever.write_bbox_context_to_hdf5(h5_bbox_filename, bbox_coordinates, 97 | fc7_context) 98 | hdf5_text_list.append(h5_text_filename) 99 | hdf5_bbox_list.append(h5_bbox_filename) 100 | 101 | util.io.save_str_list(imcrop_list, save_imcrop_list_file) 102 | util.io.save_str_list(wholeim_list, save_wholeim_list_file) 103 | util.io.save_str_list(hdf5_text_list, save_hdf5_text_list_file) 104 | util.io.save_str_list(hdf5_bbox_list, save_hdf5_bbox_list_file) 105 | -------------------------------------------------------------------------------- /exp-referit/caffemodel/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/exp-referit/caffemodel/.keep -------------------------------------------------------------------------------- /exp-referit/initialize_weights_scrc_full.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 6 | import caffe 7 | 8 | old_prototxt = './prototxt/coco_pretrained.prototxt' 9 | old_caffemodel = './models/coco_pretrained_iter_100000.caffemodel' 10 | new_prototxt = './prototxt/scrc_full_vgg_buffer_50.prototxt' 11 | new_caffemodel = './exp-referit/caffemodel/scrc_full_vgg_init.caffemodel' 12 | old_net = caffe.Net(old_prototxt, old_caffemodel, caffe.TRAIN) 13 | new_net = caffe.Net(new_prototxt, old_caffemodel, caffe.TRAIN) 14 | 15 | new_net.params['fc8_context'][0].data[...] = old_net.params['fc8'][0].data[...] 16 | new_net.params['fc8_context'][1].data[...] = old_net.params['fc8'][1].data[...] 17 | 18 | new_net.params['lstm2-extended'][0].data[...] = old_net.params['lstm2'][0].data[...] 19 | new_net.params['lstm2-extended'][1].data[...] = old_net.params['lstm2'][1].data[...] 20 | new_net.params['lstm2-extended'][2].data[:, :1000] = old_net.params['lstm2'][2].data[...] 21 | new_net.params['lstm2-extended'][2].data[:, 1000:] = 0 22 | new_net.params['lstm2-extended'][3].data[...] = old_net.params['lstm2'][3].data[...] 23 | 24 | new_net.params['lstm2_context'][0].data[...] = old_net.params['lstm2'][0].data[...] 25 | new_net.params['lstm2_context'][1].data[...] 
= old_net.params['lstm2'][1].data[...] 26 | new_net.params['lstm2_context'][2].data[...] = old_net.params['lstm2'][2].data[...] 27 | new_net.params['lstm2_context'][3].data[...] = old_net.params['lstm2'][3].data[...] 28 | 29 | new_net.save(new_caffemodel) 30 | -------------------------------------------------------------------------------- /exp-referit/initialize_weights_scrc_no_context.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 6 | import caffe 7 | 8 | old_prototxt = './prototxt/coco_pretrained.prototxt' 9 | old_caffemodel = './models/coco_pretrained_iter_100000.caffemodel' 10 | new_prototxt = './prototxt/scrc_no_context_vgg_buffer_50.prototxt' 11 | new_caffemodel = './exp-referit/caffemodel/scrc_no_context_vgg_init.caffemodel' 12 | old_net = caffe.Net(old_prototxt, old_caffemodel, caffe.TRAIN) 13 | new_net = caffe.Net(new_prototxt, old_caffemodel, caffe.TRAIN) 14 | 15 | new_net.params['lstm2-extended'][0].data[...] = old_net.params['lstm2'][0].data[...] 16 | new_net.params['lstm2-extended'][1].data[...] = old_net.params['lstm2'][1].data[...] 17 | new_net.params['lstm2-extended'][2].data[:, :1000] = old_net.params['lstm2'][2].data[...] 18 | new_net.params['lstm2-extended'][2].data[:, 1000:] = 0 19 | new_net.params['lstm2-extended'][3].data[...] = old_net.params['lstm2'][3].data[...] 20 | 21 | new_net.save(new_caffemodel) 22 | -------------------------------------------------------------------------------- /exp-referit/preprocess_dataset.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import os 4 | import numpy as np 5 | import scipy.io as sio 6 | import skimage 7 | import skimage.io 8 | import skimage.transform 9 | 10 | import util 11 | 12 | 13 | def load_imcrop(imlist, mask_dir): 14 | imcrop_dict = {im_name: [] for im_name in imlist} 15 | imcroplist = [] 16 | masklist = os.listdir(mask_dir) 17 | for mask_name in masklist: 18 | imcrop_name = mask_name.split('.', 1)[0] 19 | imcroplist.append(imcrop_name) 20 | im_name = imcrop_name.split('_', 1)[0] 21 | imcrop_dict[im_name].append(imcrop_name) 22 | return imcroplist, imcrop_dict 23 | 24 | 25 | def load_image_size(imlist, image_dir): 26 | num_im = len(imlist) 27 | imsize_dict = {} 28 | for n_im in range(num_im): 29 | if n_im % 200 == 0: 30 | print('processing image %d / %d' % (n_im, num_im)) 31 | im = skimage.io.imread(image_dir + imlist[n_im] + '.jpg') 32 | imsize_dict[imlist[n_im]] = [im.shape[1], im.shape[0]] # [width, height] 33 | return imsize_dict 34 | 35 | 36 | def load_referit_annotation(imcroplist, annotation_file): 37 | print('loading ReferIt dataset annotations...') 38 | query_dict = {imcrop_name: [] for imcrop_name in imcroplist} 39 | with open(annotation_file) as f: 40 | raw_annotation = f.readlines() 41 | for s in raw_annotation: 42 | # example annotation line: 43 | # 8756_2.jpg~sunray at very top~.33919597989949750~.023411371237458192 44 | splits = s.strip().split('~', 2) 45 | # example: 8756_2 (segmentation regions) 46 | imcrop_name = splits[0].split('.', 1)[0] 47 | # example: 'sunray at very top' 48 | description = splits[1] 49 | # construct imcrop_name - description list dictionary 50 | # an image crop can have zero or multiple annotations 51 | query_dict[imcrop_name].append(description) 52 | return query_dict 53 |
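# Illustration (values taken from the example line quoted above):
#   s.strip().split('~', 2)  ->  ['8756_2.jpg', 'sunray at very top',
#                                 '.33919597989949750~.023411371237458192']
# splits[2] is discarded; imcrop_name becomes '8756_2', and the description
# 'sunray at very top' is appended to query_dict['8756_2'].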
54 | 55 | def load_and_resize_imcrop(mask_dir, image_dir, resized_imcrop_dir): 56 | print('loading image crop bounding boxes...') 57 | imcrop_bbox_dict = {} 58 | masklist = os.listdir(mask_dir) 59 | if not os.path.isdir(resized_imcrop_dir): 60 | os.mkdir(resized_imcrop_dir) 61 | for n in range(len(masklist)): 62 | if n % 200 == 0: 63 | print('processing image crop %d / %d' % (n, len(masklist))) 64 | mask_name = masklist[n] 65 | mask = sio.loadmat(mask_dir + mask_name)['segimg_t'] 66 | idx = np.nonzero(mask == 0) 67 | x_min, x_max = np.min(idx[1]), np.max(idx[1]) 68 | y_min, y_max = np.min(idx[0]), np.max(idx[0]) 69 | bbox = [x_min, y_min, x_max, y_max] 70 | imcrop_name = mask_name.split('.', 1)[0] 71 | imcrop_bbox_dict[imcrop_name] = bbox 72 | 73 | # resize the image crops 74 | imname = imcrop_name.split('_', 1)[0] + '.jpg' 75 | image_path = image_dir + imname 76 | im = skimage.io.imread(image_path) 77 | # Gray scale to RGB 78 | if im.ndim == 2: 79 | im = np.tile(im[..., np.newaxis], (1, 1, 3)) 80 | # RGBA to RGB 81 | im = im[:, :, :3] 82 | resized_im = skimage.transform.resize(im[y_min:y_max+1, 83 | x_min:x_max+1, :], [224, 224]) 84 | save_path = resized_imcrop_dir + imcrop_name + '.png' 85 | skimage.io.imsave(save_path, resized_im) 86 | return imcrop_bbox_dict 87 | 88 | 89 | def main(): 90 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 91 | mask_dir = './datasets/ReferIt/ImageCLEF/mask/' 92 | annotation_file = './datasets/ReferIt/ReferitData/RealGames.txt' 93 | imlist_file = './data/split/referit_all_imlist.txt' 94 | metadata_dir = './data/metadata/' 95 | resized_imcrop_dir = './data/resized_imcrop/' 96 | 97 | imlist = util.io.load_str_list(imlist_file) 98 | imsize_dict = load_image_size(imlist, image_dir) 99 | imcroplist, imcrop_dict = load_imcrop(imlist, mask_dir) 100 | query_dict = load_referit_annotation(imcroplist, annotation_file) 101 | imcrop_bbox_dict = load_and_resize_imcrop(mask_dir, image_dir, 102 | resized_imcrop_dir) 103 | 104 | util.io.save_json(imsize_dict, metadata_dir + 'referit_imsize_dict.json') 105 | util.io.save_json(imcrop_dict, metadata_dir + 'referit_imcrop_dict.json') 106 | util.io.save_json(query_dict, metadata_dir + 'referit_query_dict.json') 107 | util.io.save_json(imcrop_bbox_dict, metadata_dir + 'referit_imcrop_bbox_dict.json') 108 | 109 | if __name__ == '__main__': 110 | main() 111 | -------------------------------------------------------------------------------- /exp-referit/test_scrc_on_referit.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | import skimage.io 6 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 7 | sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/') 8 | import caffe 9 | 10 | import util 11 | from captioner import Captioner 12 | import retriever 13 | 14 | ################################################################################ 15 | # Test Parameters 16 | 17 | # Test on either all annotated regions, or top-100 EdgeBox proposals 18 | # See Section 4.1 in the paper for details 19 | candidate_regions = 'proposal_regions' 20 | # candidate_regions = 'annotated_regions' 21 | 22 | # Whether or not scene-level context are used in predictions 23 | use_context = True 24 | 25 | if use_context: 26 | lstm_net_proto = './prototxt/scrc_word_to_preds_full.prototxt' 27 | pretrained_weights_path = './models/scrc_full_vgg.caffemodel' 28 | else: 
29 | lstm_net_proto = './prototxt/scrc_word_to_preds_no_context.prototxt' 30 | pretrained_weights_path = './models/scrc_no_context_vgg.caffemodel' 31 | 32 | gpu_id = 0 # the GPU to test the SCRC model 33 | correct_IoU_threshold = 0.5 34 | 35 | tst_imlist_file = './data/split/referit_test_imlist.txt' 36 | ################################################################################ 37 | 38 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 39 | proposal_dir = './data/referit_edgeboxes_top100/' 40 | cached_context_features_dir = './data/referit_context_features/' 41 | 42 | imcrop_dict_file = './data/metadata/referit_imcrop_dict.json' 43 | imcrop_bbox_dict_file = './data/metadata/referit_imcrop_bbox_dict.json' 44 | query_file = './data/metadata/referit_query_dict.json' 45 | vocab_file = './data/vocabulary.txt' 46 | 47 | # utilize the captioner module from LRCN 48 | image_net_proto = './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt' 49 | captioner = Captioner(pretrained_weights_path, image_net_proto, lstm_net_proto, 50 | vocab_file, gpu_id) 51 | captioner.set_image_batch_size(50) 52 | vocab_dict = retriever.build_vocab_dict_from_captioner(captioner) 53 | 54 | # Load image and caption list 55 | imlist = util.io.load_str_list(tst_imlist_file) 56 | num_im = len(imlist) 57 | query_dict = util.io.load_json(query_file) 58 | imcrop_dict = util.io.load_json(imcrop_dict_file) 59 | imcrop_bbox_dict = util.io.load_json(imcrop_bbox_dict_file) 60 | 61 | # Load candidate regions (bounding boxes) 62 | load_proposal = (candidate_regions == 'proposal_regions') 63 | candidate_boxes_dict = {imname: None for imname in imlist} 64 | for n_im in range(num_im): 65 | if n_im % 1000 == 0: 66 | print('loading candidate regions %d / %d' % (n_im, num_im)) 67 | imname = imlist[n_im] 68 | if load_proposal: 69 | proposal_file_name = imname + '.txt' 70 | boxes = np.loadtxt(proposal_dir + proposal_file_name) 71 | boxes = boxes.astype(int).reshape((-1, 4)) 72 | else: 73 | boxes = [imcrop_bbox_dict[imcrop_name] 74 | for imcrop_name in imcrop_dict[imname]] 75 | boxes = np.array(boxes).astype(int).reshape((-1, 4)) 76 | candidate_boxes_dict[imname] = boxes 77 | 78 | 79 | # Load cached whole-image contextual features 80 | if use_context: 81 | context_features_dict = {imname: None for imname in imlist} 82 | for n_im in range(num_im): 83 | if n_im % 1000 == 0: 84 | print('loading contextual features %d / %d' % (n_im, num_im)) 85 | imname = imlist[n_im] 86 | cached_context_features_file = cached_context_features_dir + imname + '_fc7.npy' 87 | context_features_dict[imname] = np.load(cached_context_features_file).reshape((1, 4096)) 88 | 89 | ################################################################################ 90 | # Test recall 91 | K = 100 # evaluate recall at 1, 2, ..., K 92 | topK_correct_num = np.zeros(K, dtype=np.float32) 93 | total_num = 0 94 | for n_im in range(num_im): 95 | print('testing image %d / %d' % (n_im, num_im)) 96 | imname = imlist[n_im] 97 | imcrop_names = imcrop_dict[imname] 98 | candidate_boxes = candidate_boxes_dict[imname] 99 | 100 | im = skimage.io.imread(image_dir + imname + '.jpg') 101 | imsize = np.array([im.shape[1], im.shape[0]]) # [width, height] 102 | 103 | # Compute local descriptors (local image feature + spatial feature) 104 | descriptors = retriever.compute_descriptors_edgebox(captioner, im, 105 | candidate_boxes) 106 | spatial_feats = retriever.compute_spatial_feat(candidate_boxes, imsize) 107 | descriptors = np.concatenate((descriptors, spatial_feats), axis=1) 108 | 109 | num_imcrop = 
len(imcrop_names) 110 | num_proposal = candidate_boxes.shape[0] 111 | for n_imcrop in range(num_imcrop): 112 | imcrop_name = imcrop_names[n_imcrop] 113 | if imcrop_name not in query_dict: 114 | continue 115 | gt_bbox = np.array(imcrop_bbox_dict[imcrop_name]) 116 | IoUs = retriever.compute_iou(candidate_boxes, gt_bbox) 117 | for n_sentence in range(len(query_dict[imcrop_name])): 118 | sentence = query_dict[imcrop_name][n_sentence] 119 | # Scores for each candidate region 120 | if use_context: 121 | scores = retriever.score_descriptors_context(descriptors, sentence, 122 | context_features_dict[imname], captioner, vocab_dict) 123 | else: 124 | scores = retriever.score_descriptors(descriptors, sentence, 125 | captioner, vocab_dict) 126 | 127 | # Evaluate the correctness of top K predictions 128 | topK_ids = np.argsort(-scores)[:K] 129 | topK_IoUs = IoUs[topK_ids] 130 | # whether the K-th (ranking from high to low) candidate is correct 131 | topK_is_correct = np.zeros(K, dtype=bool) 132 | topK_is_correct[:len(topK_ids)] = (topK_IoUs >= correct_IoU_threshold) 133 | # whether at least one of the top K candidates is correct 134 | topK_any_correct = (np.cumsum(topK_is_correct) > 0) 135 | topK_correct_num += topK_any_correct 136 | total_num += 1 137 | 138 | # print intermediate results during testing 139 | if (n_im+1) % 1000 == 0: 140 | print('Recall on first %d test images' % (n_im+1)) 141 | for k in [0, 10-1]: 142 | print('\trecall @ %d = %f' % (k+1, topK_correct_num[k]/total_num)) 143 | 144 | print('Final recall on the whole test set') 145 | for k in [0, 10-1]: 146 | print('\trecall @ %d = %f' % (k+1, topK_correct_num[k]/total_num)) 147 | ################################################################################ 148 | -------------------------------------------------------------------------------- /exp-referit/train_scrc_full_on_referit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | GPU_ID=0 3 | WEIGHTS=./exp-referit/caffemodel/scrc_full_vgg_init.caffemodel 4 | 5 | caffe train \ 6 | -solver ./prototxt/scrc_full_vgg_solver.prototxt \ 7 | -weights $WEIGHTS \ 8 | -gpu $GPU_ID 2>&1 9 | -------------------------------------------------------------------------------- /exp-referit/train_scrc_no_context_on_referit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | GPU_ID=0 3 | WEIGHTS=./exp-referit/caffemodel/scrc_no_context_vgg_init.caffemodel 4 | 5 | caffe train \ 6 | -solver ./prototxt/scrc_no_context_vgg_solver.prototxt \ 7 | -weights $WEIGHTS \ 8 | -gpu $GPU_ID 2>&1 9 | -------------------------------------------------------------------------------- /external/download_caffe.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./external/caffe-natural-language-object-retrieval.zip http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/caffe-natural-language-object-retrieval.zip 3 | unzip ./external/caffe-natural-language-object-retrieval.zip -d ./external 4 | -------------------------------------------------------------------------------- /models/download_trained_models.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./models/VGG_ILSVRC_16_layers.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/VGG_ILSVRC_16_layers.caffemodel 3 | wget -O ./models/coco_pretrained_iter_100000.caffemodel 
http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/coco_pretrained_iter_100000.caffemodel 4 | wget -O ./models/scrc_full_vgg.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/scrc_full_vgg.caffemodel 5 | wget -O ./models/scrc_no_context_vgg.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/scrc_no_context_vgg.caffemodel 6 | 7 | wget -O ./models/scrc_kitchen.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/scrc_kitchen.caffemodel 8 | -------------------------------------------------------------------------------- /prototxt/VGG_ILSVRC_16_layers_deploy.prototxt: -------------------------------------------------------------------------------- 1 | name: "VGG_ILSVRC_16_layers" 2 | input: "data" 3 | input_dim: 10 4 | input_dim: 3 5 | input_dim: 224 6 | input_dim: 224 7 | layer { 8 | name: "conv1_1" 9 | type: "Convolution" 10 | bottom: "data" 11 | top: "conv1_1" 12 | convolution_param { 13 | num_output: 64 14 | pad: 1 15 | kernel_size: 3 16 | } 17 | } 18 | layer { 19 | name: "relu1_1" 20 | type: "ReLU" 21 | bottom: "conv1_1" 22 | top: "conv1_1" 23 | } 24 | layer { 25 | name: "conv1_2" 26 | type: "Convolution" 27 | bottom: "conv1_1" 28 | top: "conv1_2" 29 | convolution_param { 30 | num_output: 64 31 | pad: 1 32 | kernel_size: 3 33 | } 34 | } 35 | layer { 36 | name: "relu1_2" 37 | type: "ReLU" 38 | bottom: "conv1_2" 39 | top: "conv1_2" 40 | } 41 | layer { 42 | name: "pool1" 43 | type: "Pooling" 44 | bottom: "conv1_2" 45 | top: "pool1" 46 | pooling_param { 47 | pool: MAX 48 | kernel_size: 2 49 | stride: 2 50 | } 51 | } 52 | layer { 53 | name: "conv2_1" 54 | type: "Convolution" 55 | bottom: "pool1" 56 | top: "conv2_1" 57 | convolution_param { 58 | num_output: 128 59 | pad: 1 60 | kernel_size: 3 61 | } 62 | } 63 | layer { 64 | name: "relu2_1" 65 | type: "ReLU" 66 | bottom: "conv2_1" 67 | top: "conv2_1" 68 | } 69 | layer { 70 | name: "conv2_2" 71 | type: "Convolution" 72 | bottom: "conv2_1" 73 | top: "conv2_2" 74 | convolution_param { 75 | num_output: 128 76 | pad: 1 77 | kernel_size: 3 78 | } 79 | } 80 | layer { 81 | name: "relu2_2" 82 | type: "ReLU" 83 | bottom: "conv2_2" 84 | top: "conv2_2" 85 | } 86 | layer { 87 | name: "pool2" 88 | type: "Pooling" 89 | bottom: "conv2_2" 90 | top: "pool2" 91 | pooling_param { 92 | pool: MAX 93 | kernel_size: 2 94 | stride: 2 95 | } 96 | } 97 | layer { 98 | name: "conv3_1" 99 | type: "Convolution" 100 | bottom: "pool2" 101 | top: "conv3_1" 102 | convolution_param { 103 | num_output: 256 104 | pad: 1 105 | kernel_size: 3 106 | } 107 | } 108 | layer { 109 | name: "relu3_1" 110 | type: "ReLU" 111 | bottom: "conv3_1" 112 | top: "conv3_1" 113 | } 114 | layer { 115 | name: "conv3_2" 116 | type: "Convolution" 117 | bottom: "conv3_1" 118 | top: "conv3_2" 119 | convolution_param { 120 | num_output: 256 121 | pad: 1 122 | kernel_size: 3 123 | } 124 | } 125 | layer { 126 | name: "relu3_2" 127 | type: "ReLU" 128 | bottom: "conv3_2" 129 | top: "conv3_2" 130 | } 131 | layer { 132 | name: "conv3_3" 133 | type: "Convolution" 134 | bottom: "conv3_2" 135 | top: "conv3_3" 136 | convolution_param { 137 | num_output: 256 138 | pad: 1 139 | kernel_size: 3 140 | } 141 | } 142 | layer { 143 | name: "relu3_3" 144 | type: "ReLU" 145 | bottom: "conv3_3" 146 | top: "conv3_3" 147 | } 148 | layer { 149 | name: "pool3" 150 | type: "Pooling" 151 | bottom: "conv3_3" 152 | top: "pool3" 153 | pooling_param { 154 | pool: MAX 155 | kernel_size: 2 
156 | stride: 2 157 | } 158 | } 159 | layer { 160 | name: "conv4_1" 161 | type: "Convolution" 162 | bottom: "pool3" 163 | top: "conv4_1" 164 | convolution_param { 165 | num_output: 512 166 | pad: 1 167 | kernel_size: 3 168 | } 169 | } 170 | layer { 171 | name: "relu4_1" 172 | type: "ReLU" 173 | bottom: "conv4_1" 174 | top: "conv4_1" 175 | } 176 | layer { 177 | name: "conv4_2" 178 | type: "Convolution" 179 | bottom: "conv4_1" 180 | top: "conv4_2" 181 | convolution_param { 182 | num_output: 512 183 | pad: 1 184 | kernel_size: 3 185 | } 186 | } 187 | layer { 188 | name: "relu4_2" 189 | type: "ReLU" 190 | bottom: "conv4_2" 191 | top: "conv4_2" 192 | } 193 | layer { 194 | name: "conv4_3" 195 | type: "Convolution" 196 | bottom: "conv4_2" 197 | top: "conv4_3" 198 | convolution_param { 199 | num_output: 512 200 | pad: 1 201 | kernel_size: 3 202 | } 203 | } 204 | layer { 205 | name: "relu4_3" 206 | type: "ReLU" 207 | bottom: "conv4_3" 208 | top: "conv4_3" 209 | } 210 | layer { 211 | name: "pool4" 212 | type: "Pooling" 213 | bottom: "conv4_3" 214 | top: "pool4" 215 | pooling_param { 216 | pool: MAX 217 | kernel_size: 2 218 | stride: 2 219 | } 220 | } 221 | layer { 222 | name: "conv5_1" 223 | type: "Convolution" 224 | bottom: "pool4" 225 | top: "conv5_1" 226 | convolution_param { 227 | num_output: 512 228 | pad: 1 229 | kernel_size: 3 230 | } 231 | } 232 | layer { 233 | name: "relu5_1" 234 | type: "ReLU" 235 | bottom: "conv5_1" 236 | top: "conv5_1" 237 | } 238 | layer { 239 | name: "conv5_2" 240 | type: "Convolution" 241 | bottom: "conv5_1" 242 | top: "conv5_2" 243 | convolution_param { 244 | num_output: 512 245 | pad: 1 246 | kernel_size: 3 247 | } 248 | } 249 | layer { 250 | name: "relu5_2" 251 | type: "ReLU" 252 | bottom: "conv5_2" 253 | top: "conv5_2" 254 | } 255 | layer { 256 | name: "conv5_3" 257 | type: "Convolution" 258 | bottom: "conv5_2" 259 | top: "conv5_3" 260 | convolution_param { 261 | num_output: 512 262 | pad: 1 263 | kernel_size: 3 264 | } 265 | } 266 | layer { 267 | name: "relu5_3" 268 | type: "ReLU" 269 | bottom: "conv5_3" 270 | top: "conv5_3" 271 | } 272 | layer { 273 | name: "pool5" 274 | type: "Pooling" 275 | bottom: "conv5_3" 276 | top: "pool5" 277 | pooling_param { 278 | pool: MAX 279 | kernel_size: 2 280 | stride: 2 281 | } 282 | } 283 | layer { 284 | name: "fc6" 285 | type: "InnerProduct" 286 | bottom: "pool5" 287 | top: "fc6" 288 | inner_product_param { 289 | num_output: 4096 290 | } 291 | } 292 | layer { 293 | name: "relu6" 294 | type: "ReLU" 295 | bottom: "fc6" 296 | top: "fc6" 297 | } 298 | layer { 299 | name: "drop6" 300 | type: "Dropout" 301 | bottom: "fc6" 302 | top: "fc6" 303 | dropout_param { 304 | dropout_ratio: 0.5 305 | } 306 | } 307 | layer { 308 | name: "fc7" 309 | type: "InnerProduct" 310 | bottom: "fc6" 311 | top: "fc7" 312 | inner_product_param { 313 | num_output: 4096 314 | } 315 | } 316 | layer { 317 | name: "relu7" 318 | type: "ReLU" 319 | bottom: "fc7" 320 | top: "fc7" 321 | } 322 | layer { 323 | name: "drop7" 324 | type: "Dropout" 325 | bottom: "fc7" 326 | top: "fc7" 327 | dropout_param { 328 | dropout_ratio: 0.5 329 | } 330 | } 331 | layer { 332 | name: "fc8" 333 | type: "InnerProduct" 334 | bottom: "fc7" 335 | top: "fc8" 336 | inner_product_param { 337 | num_output: 1000 338 | } 339 | } 340 | layer { 341 | name: "prob" 342 | type: "Softmax" 343 | bottom: "fc8" 344 | top: "prob" 345 | } 346 | -------------------------------------------------------------------------------- /prototxt/coco_pretrained.prototxt: 
-------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | input: "data" 7 | input_shape { dim: 1, dim: 3, dim: 224, dim: 224 } 8 | input: "cont_sentence" 9 | input_shape { dim: 20, dim: 1 } 10 | input: "input_sentence" 11 | input_shape { dim: 20, dim: 1 } 12 | 13 | layer { 14 | name: "conv1_1" 15 | type: "Convolution" 16 | bottom: "data" 17 | top: "conv1_1" 18 | param { lr_mult: 0 } 19 | param { lr_mult: 0 decay_mult: 0 } 20 | include { stage: "freeze-convnet" } 21 | convolution_param { 22 | num_output: 64 23 | pad: 1 24 | kernel_size: 3 25 | } 26 | } 27 | layer { 28 | name: "conv1_1" 29 | type: "Convolution" 30 | bottom: "data" 31 | top: "conv1_1" 32 | param { lr_mult: 0.1 } 33 | param { lr_mult: 0.2 decay_mult: 0} 34 | exclude { stage: "freeze-convnet" } 35 | convolution_param { 36 | num_output: 64 37 | pad: 1 38 | kernel_size: 3 39 | } 40 | } 41 | layer { 42 | name: "relu1_1" 43 | type: "ReLU" 44 | bottom: "conv1_1" 45 | top: "conv1_1" 46 | } 47 | layer { 48 | name: "conv1_2" 49 | type: "Convolution" 50 | bottom: "conv1_1" 51 | top: "conv1_2" 52 | param { lr_mult: 0 } 53 | param { lr_mult: 0 decay_mult: 0 } 54 | include { stage: "freeze-convnet" } 55 | convolution_param { 56 | num_output: 64 57 | pad: 1 58 | kernel_size: 3 59 | } 60 | } 61 | layer { 62 | name: "conv1_2" 63 | type: "Convolution" 64 | bottom: "conv1_1" 65 | top: "conv1_2" 66 | param { lr_mult: 0.1 } 67 | param { lr_mult: 0.2 decay_mult: 0} 68 | exclude { stage: "freeze-convnet" } 69 | convolution_param { 70 | num_output: 64 71 | pad: 1 72 | kernel_size: 3 73 | } 74 | } 75 | layer { 76 | name: "relu1_2" 77 | type: "ReLU" 78 | bottom: "conv1_2" 79 | top: "conv1_2" 80 | } 81 | layer { 82 | name: "pool1" 83 | type: "Pooling" 84 | bottom: "conv1_2" 85 | top: "pool1" 86 | pooling_param { 87 | pool: MAX 88 | kernel_size: 2 89 | stride: 2 90 | } 91 | } 92 | layer { 93 | name: "conv2_1" 94 | type: "Convolution" 95 | bottom: "pool1" 96 | top: "conv2_1" 97 | param { lr_mult: 0 } 98 | param { lr_mult: 0 decay_mult: 0 } 99 | include { stage: "freeze-convnet" } 100 | convolution_param { 101 | num_output: 128 102 | pad: 1 103 | kernel_size: 3 104 | } 105 | } 106 | layer { 107 | name: "conv2_1" 108 | type: "Convolution" 109 | bottom: "pool1" 110 | top: "conv2_1" 111 | param { lr_mult: 0.1 } 112 | param { lr_mult: 0.2 decay_mult: 0} 113 | exclude { stage: "freeze-convnet" } 114 | convolution_param { 115 | num_output: 128 116 | pad: 1 117 | kernel_size: 3 118 | } 119 | } 120 | layer { 121 | name: "relu2_1" 122 | type: "ReLU" 123 | bottom: "conv2_1" 124 | top: "conv2_1" 125 | } 126 | layer { 127 | name: "conv2_2" 128 | type: "Convolution" 129 | bottom: "conv2_1" 130 | top: "conv2_2" 131 | param { lr_mult: 0 } 132 | param { lr_mult: 0 decay_mult: 0 } 133 | include { stage: "freeze-convnet" } 134 | convolution_param { 135 | num_output: 128 136 | pad: 1 137 | kernel_size: 3 138 | } 139 | } 140 | layer { 141 | name: "conv2_2" 142 | type: "Convolution" 143 | bottom: "conv2_1" 144 | top: "conv2_2" 145 | param { lr_mult: 0.1 } 146 | param { lr_mult: 0.2 decay_mult: 0} 147 | exclude { stage: "freeze-convnet" } 148 | convolution_param { 149 | num_output: 128 150 | pad: 1 151 | kernel_size: 3 152 | } 153 | } 154 | layer { 155 | name: "relu2_2" 156 | type: "ReLU" 157 | bottom: "conv2_2" 158 | top: "conv2_2" 159 | } 160 | layer { 161 | name: "pool2" 162 | type: "Pooling" 163 | bottom: "conv2_2" 164 | top: "pool2" 165 
| pooling_param { 166 | pool: MAX 167 | kernel_size: 2 168 | stride: 2 169 | } 170 | } 171 | layer { 172 | name: "conv3_1" 173 | type: "Convolution" 174 | bottom: "pool2" 175 | top: "conv3_1" 176 | param { lr_mult: 0 } 177 | param { lr_mult: 0 decay_mult: 0 } 178 | include { stage: "freeze-convnet" } 179 | convolution_param { 180 | num_output: 256 181 | pad: 1 182 | kernel_size: 3 183 | } 184 | } 185 | layer { 186 | name: "conv3_1" 187 | type: "Convolution" 188 | bottom: "pool2" 189 | top: "conv3_1" 190 | param { lr_mult: 0.1 } 191 | param { lr_mult: 0.2 decay_mult: 0} 192 | exclude { stage: "freeze-convnet" } 193 | convolution_param { 194 | num_output: 256 195 | pad: 1 196 | kernel_size: 3 197 | } 198 | } 199 | layer { 200 | name: "relu3_1" 201 | type: "ReLU" 202 | bottom: "conv3_1" 203 | top: "conv3_1" 204 | } 205 | layer { 206 | name: "conv3_2" 207 | type: "Convolution" 208 | bottom: "conv3_1" 209 | top: "conv3_2" 210 | param { lr_mult: 0 } 211 | param { lr_mult: 0 decay_mult: 0 } 212 | include { stage: "freeze-convnet" } 213 | convolution_param { 214 | num_output: 256 215 | pad: 1 216 | kernel_size: 3 217 | } 218 | } 219 | layer { 220 | name: "conv3_2" 221 | type: "Convolution" 222 | bottom: "conv3_1" 223 | top: "conv3_2" 224 | param { lr_mult: 0.1 } 225 | param { lr_mult: 0.2 decay_mult: 0} 226 | exclude { stage: "freeze-convnet" } 227 | convolution_param { 228 | num_output: 256 229 | pad: 1 230 | kernel_size: 3 231 | } 232 | } 233 | layer { 234 | name: "relu3_2" 235 | type: "ReLU" 236 | bottom: "conv3_2" 237 | top: "conv3_2" 238 | } 239 | layer { 240 | name: "conv3_3" 241 | type: "Convolution" 242 | bottom: "conv3_2" 243 | top: "conv3_3" 244 | param { lr_mult: 0 } 245 | param { lr_mult: 0 decay_mult: 0 } 246 | include { stage: "freeze-convnet" } 247 | convolution_param { 248 | num_output: 256 249 | pad: 1 250 | kernel_size: 3 251 | } 252 | } 253 | layer { 254 | name: "conv3_3" 255 | type: "Convolution" 256 | bottom: "conv3_2" 257 | top: "conv3_3" 258 | param { lr_mult: 0.1 } 259 | param { lr_mult: 0.2 decay_mult: 0} 260 | exclude { stage: "freeze-convnet" } 261 | convolution_param { 262 | num_output: 256 263 | pad: 1 264 | kernel_size: 3 265 | } 266 | } 267 | layer { 268 | name: "relu3_3" 269 | type: "ReLU" 270 | bottom: "conv3_3" 271 | top: "conv3_3" 272 | } 273 | layer { 274 | name: "pool3" 275 | type: "Pooling" 276 | bottom: "conv3_3" 277 | top: "pool3" 278 | pooling_param { 279 | pool: MAX 280 | kernel_size: 2 281 | stride: 2 282 | } 283 | } 284 | layer { 285 | name: "conv4_1" 286 | type: "Convolution" 287 | bottom: "pool3" 288 | top: "conv4_1" 289 | param { lr_mult: 0 } 290 | param { lr_mult: 0 decay_mult: 0 } 291 | include { stage: "freeze-convnet" } 292 | convolution_param { 293 | num_output: 512 294 | pad: 1 295 | kernel_size: 3 296 | } 297 | } 298 | layer { 299 | name: "conv4_1" 300 | type: "Convolution" 301 | bottom: "pool3" 302 | top: "conv4_1" 303 | param { lr_mult: 0.1 } 304 | param { lr_mult: 0.2 decay_mult: 0} 305 | exclude { stage: "freeze-convnet" } 306 | convolution_param { 307 | num_output: 512 308 | pad: 1 309 | kernel_size: 3 310 | } 311 | } 312 | layer { 313 | name: "relu4_1" 314 | type: "ReLU" 315 | bottom: "conv4_1" 316 | top: "conv4_1" 317 | } 318 | layer { 319 | name: "conv4_2" 320 | type: "Convolution" 321 | bottom: "conv4_1" 322 | top: "conv4_2" 323 | param { lr_mult: 0 } 324 | param { lr_mult: 0 decay_mult: 0 } 325 | include { stage: "freeze-convnet" } 326 | convolution_param { 327 | num_output: 512 328 | pad: 1 329 | kernel_size: 3 330 | } 331 | } 332 | 
layer { 333 | name: "conv4_2" 334 | type: "Convolution" 335 | bottom: "conv4_1" 336 | top: "conv4_2" 337 | param { lr_mult: 0.1 } 338 | param { lr_mult: 0.2 decay_mult: 0} 339 | exclude { stage: "freeze-convnet" } 340 | convolution_param { 341 | num_output: 512 342 | pad: 1 343 | kernel_size: 3 344 | } 345 | } 346 | layer { 347 | name: "relu4_2" 348 | type: "ReLU" 349 | bottom: "conv4_2" 350 | top: "conv4_2" 351 | } 352 | layer { 353 | name: "conv4_3" 354 | type: "Convolution" 355 | bottom: "conv4_2" 356 | top: "conv4_3" 357 | param { lr_mult: 0 } 358 | param { lr_mult: 0 decay_mult: 0 } 359 | include { stage: "freeze-convnet" } 360 | convolution_param { 361 | num_output: 512 362 | pad: 1 363 | kernel_size: 3 364 | } 365 | } 366 | layer { 367 | name: "conv4_3" 368 | type: "Convolution" 369 | bottom: "conv4_2" 370 | top: "conv4_3" 371 | param { lr_mult: 0.1 } 372 | param { lr_mult: 0.2 decay_mult: 0} 373 | exclude { stage: "freeze-convnet" } 374 | convolution_param { 375 | num_output: 512 376 | pad: 1 377 | kernel_size: 3 378 | } 379 | } 380 | layer { 381 | name: "relu4_3" 382 | type: "ReLU" 383 | bottom: "conv4_3" 384 | top: "conv4_3" 385 | } 386 | layer { 387 | name: "pool4" 388 | type: "Pooling" 389 | bottom: "conv4_3" 390 | top: "pool4" 391 | pooling_param { 392 | pool: MAX 393 | kernel_size: 2 394 | stride: 2 395 | } 396 | } 397 | layer { 398 | name: "conv5_1" 399 | type: "Convolution" 400 | bottom: "pool4" 401 | top: "conv5_1" 402 | param { lr_mult: 0 } 403 | param { lr_mult: 0 decay_mult: 0 } 404 | include { stage: "freeze-convnet" } 405 | convolution_param { 406 | num_output: 512 407 | pad: 1 408 | kernel_size: 3 409 | } 410 | } 411 | layer { 412 | name: "conv5_1" 413 | type: "Convolution" 414 | bottom: "pool4" 415 | top: "conv5_1" 416 | param { lr_mult: 0.1 } 417 | param { lr_mult: 0.2 decay_mult: 0} 418 | exclude { stage: "freeze-convnet" } 419 | convolution_param { 420 | num_output: 512 421 | pad: 1 422 | kernel_size: 3 423 | } 424 | } 425 | layer { 426 | name: "relu5_1" 427 | type: "ReLU" 428 | bottom: "conv5_1" 429 | top: "conv5_1" 430 | } 431 | layer { 432 | name: "conv5_2" 433 | type: "Convolution" 434 | bottom: "conv5_1" 435 | top: "conv5_2" 436 | param { lr_mult: 0 } 437 | param { lr_mult: 0 decay_mult: 0 } 438 | include { stage: "freeze-convnet" } 439 | convolution_param { 440 | num_output: 512 441 | pad: 1 442 | kernel_size: 3 443 | } 444 | } 445 | layer { 446 | name: "conv5_2" 447 | type: "Convolution" 448 | bottom: "conv5_1" 449 | top: "conv5_2" 450 | param { lr_mult: 0.1 } 451 | param { lr_mult: 0.2 decay_mult: 0} 452 | exclude { stage: "freeze-convnet" } 453 | convolution_param { 454 | num_output: 512 455 | pad: 1 456 | kernel_size: 3 457 | } 458 | } 459 | layer { 460 | name: "relu5_2" 461 | type: "ReLU" 462 | bottom: "conv5_2" 463 | top: "conv5_2" 464 | } 465 | layer { 466 | name: "conv5_3" 467 | type: "Convolution" 468 | bottom: "conv5_2" 469 | top: "conv5_3" 470 | param { lr_mult: 0 } 471 | param { lr_mult: 0 decay_mult: 0 } 472 | include { stage: "freeze-convnet" } 473 | convolution_param { 474 | num_output: 512 475 | pad: 1 476 | kernel_size: 3 477 | } 478 | } 479 | layer { 480 | name: "conv5_3" 481 | type: "Convolution" 482 | bottom: "conv5_2" 483 | top: "conv5_3" 484 | param { lr_mult: 0.1 } 485 | param { lr_mult: 0.2 decay_mult: 0} 486 | exclude { stage: "freeze-convnet" } 487 | convolution_param { 488 | num_output: 512 489 | pad: 1 490 | kernel_size: 3 491 | } 492 | } 493 | layer { 494 | name: "relu5_3" 495 | type: "ReLU" 496 | bottom: "conv5_3" 497 | top: 
"conv5_3" 498 | } 499 | layer { 500 | name: "pool5" 501 | type: "Pooling" 502 | bottom: "conv5_3" 503 | top: "pool5" 504 | pooling_param { 505 | pool: MAX 506 | kernel_size: 2 507 | stride: 2 508 | } 509 | } 510 | layer { 511 | name: "fc6" 512 | type: "InnerProduct" 513 | bottom: "pool5" 514 | top: "fc6" 515 | param { lr_mult: 0 } 516 | param { lr_mult: 0 decay_mult: 0 } 517 | include { stage: "freeze-convnet" } 518 | inner_product_param { 519 | num_output: 4096 520 | } 521 | } 522 | layer { 523 | name: "fc6" 524 | type: "InnerProduct" 525 | bottom: "pool5" 526 | top: "fc6" 527 | param { lr_mult: 0.1 } 528 | param { lr_mult: 0.2 decay_mult: 0} 529 | exclude { stage: "freeze-convnet" } 530 | inner_product_param { 531 | num_output: 4096 532 | } 533 | } 534 | layer { 535 | name: "relu6" 536 | type: "ReLU" 537 | bottom: "fc6" 538 | top: "fc6" 539 | } 540 | layer { 541 | name: "drop6" 542 | type: "Dropout" 543 | bottom: "fc6" 544 | top: "fc6" 545 | dropout_param { 546 | dropout_ratio: 0.5 547 | } 548 | } 549 | layer { 550 | name: "fc7" 551 | type: "InnerProduct" 552 | bottom: "fc6" 553 | top: "fc7" 554 | param { lr_mult: 0 } 555 | param { lr_mult: 0 decay_mult: 0 } 556 | include { stage: "freeze-convnet" } 557 | inner_product_param { 558 | num_output: 4096 559 | } 560 | } 561 | layer { 562 | name: "fc7" 563 | type: "InnerProduct" 564 | bottom: "fc6" 565 | top: "fc7" 566 | param { lr_mult: 0.1 } 567 | param { lr_mult: 0.2 decay_mult: 0} 568 | exclude { stage: "freeze-convnet" } 569 | inner_product_param { 570 | num_output: 4096 571 | } 572 | } 573 | layer { 574 | name: "relu7" 575 | type: "ReLU" 576 | bottom: "fc7" 577 | top: "fc7" 578 | } 579 | layer { 580 | name: "drop7" 581 | type: "Dropout" 582 | bottom: "fc7" 583 | top: "fc7" 584 | dropout_param { 585 | dropout_ratio: 0.5 586 | } 587 | } 588 | layer { 589 | name: "fc8" 590 | type: "InnerProduct" 591 | bottom: "fc7" 592 | top: "fc8" 593 | param { 594 | lr_mult: 0.1 595 | decay_mult: 1 596 | } 597 | param { 598 | lr_mult: 0.2 599 | decay_mult: 0 600 | } 601 | inner_product_param { 602 | num_output: 1000 603 | } 604 | } 605 | layer { 606 | name: "embedding" 607 | type: "Embed" 608 | bottom: "input_sentence" 609 | top: "embedded_input_sentence" 610 | param { 611 | lr_mult: 1 612 | } 613 | embed_param { 614 | bias_term: false 615 | input_dim: 8801 616 | num_output: 1000 617 | weight_filler { 618 | type: "uniform" 619 | min: -0.08 620 | max: 0.08 621 | } 622 | } 623 | } 624 | layer { 625 | name: "lstm1" 626 | type: "LSTM" 627 | bottom: "embedded_input_sentence" 628 | bottom: "cont_sentence" 629 | bottom: "fc8" 630 | top: "lstm1" 631 | include { stage: "unfactored" } 632 | recurrent_param { 633 | num_output: 1000 634 | weight_filler { 635 | type: "uniform" 636 | min: -0.08 637 | max: 0.08 638 | } 639 | bias_filler { 640 | type: "constant" 641 | value: 0 642 | } 643 | } 644 | } 645 | layer { 646 | name: "lstm2" 647 | type: "LSTM" 648 | bottom: "lstm1" 649 | bottom: "cont_sentence" 650 | top: "lstm2" 651 | include { 652 | stage: "unfactored" 653 | stage: "2-layer" 654 | } 655 | recurrent_param { 656 | num_output: 1000 657 | weight_filler { 658 | type: "uniform" 659 | min: -0.08 660 | max: 0.08 661 | } 662 | bias_filler { 663 | type: "constant" 664 | value: 0 665 | } 666 | } 667 | } 668 | layer { 669 | name: "lstm1" 670 | type: "LSTM" 671 | bottom: "embedded_input_sentence" 672 | bottom: "cont_sentence" 673 | top: "lstm1" 674 | include { stage: "factored" } 675 | recurrent_param { 676 | num_output: 1000 677 | weight_filler { 678 | type: "uniform" 679 
| min: -0.08 680 | max: 0.08 681 | } 682 | bias_filler { 683 | type: "constant" 684 | value: 0 685 | } 686 | } 687 | } 688 | layer { 689 | name: "lstm2" 690 | type: "LSTM" 691 | bottom: "lstm1" 692 | bottom: "cont_sentence" 693 | bottom: "fc8" 694 | top: "lstm2" 695 | include { stage: "factored" } 696 | recurrent_param { 697 | num_output: 1000 698 | weight_filler { 699 | type: "uniform" 700 | min: -0.08 701 | max: 0.08 702 | } 703 | bias_filler { 704 | type: "constant" 705 | value: 0 706 | } 707 | } 708 | } 709 | layer { 710 | name: "predict" 711 | type: "InnerProduct" 712 | bottom: "lstm1" 713 | top: "predict" 714 | param { 715 | lr_mult: 1 716 | decay_mult: 1 717 | } 718 | param { 719 | lr_mult: 2 720 | decay_mult: 0 721 | } 722 | exclude { stage: "2-layer" } 723 | inner_product_param { 724 | num_output: 8801 725 | weight_filler { 726 | type: "uniform" 727 | min: -0.08 728 | max: 0.08 729 | } 730 | bias_filler { 731 | type: "constant" 732 | value: 0 733 | } 734 | axis: 2 735 | } 736 | } 737 | layer { 738 | name: "predict" 739 | type: "InnerProduct" 740 | bottom: "lstm2" 741 | top: "predict" 742 | param { 743 | lr_mult: 1 744 | decay_mult: 1 745 | } 746 | param { 747 | lr_mult: 2 748 | decay_mult: 0 749 | } 750 | include { stage: "2-layer" } 751 | inner_product_param { 752 | num_output: 8801 753 | weight_filler { 754 | type: "uniform" 755 | min: -0.08 756 | max: 0.08 757 | } 758 | bias_filler { 759 | type: "constant" 760 | value: 0 761 | } 762 | axis: 2 763 | } 764 | } 765 | -------------------------------------------------------------------------------- /prototxt/scrc_full_vgg_buffer_50.prototxt: -------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | # train data layers 7 | layer { 8 | name: "data" 9 | type: "ImageData" 10 | top: "data" 11 | top: "label" 12 | transform_param { 13 | crop_size: 224 14 | mean_value: 104 15 | mean_value: 117 16 | mean_value: 123 17 | } 18 | image_data_param { 19 | source: "./data/training/train_bbox_context_imcrop_list.txt" 20 | batch_size: 50 21 | } 22 | } 23 | layer { 24 | name: "data" 25 | type: "HDF5Data" 26 | top: "cont_sentence" 27 | top: "input_sentence" 28 | top: "target_sentence" 29 | hdf5_data_param { 30 | source: "./data/training/train_bbox_context_hdf5_text_list.txt" 31 | batch_size: 20 32 | } 33 | } 34 | layer { 35 | name: "data" 36 | type: "HDF5Data" 37 | top: "bbox_coordinate" 38 | top: "fc7_context" 39 | hdf5_data_param { 40 | source: "./data/training/train_bbox_context_hdf5_bbox_list.txt" 41 | batch_size: 50 42 | } 43 | } 44 | 45 | layer { 46 | name: "silence" 47 | type: "Silence" 48 | bottom: "label" 49 | } 50 | layer { 51 | name: "conv1_1" 52 | type: "Convolution" 53 | bottom: "data" 54 | top: "conv1_1" 55 | param { lr_mult: 0 } 56 | param { lr_mult: 0 decay_mult: 0 } 57 | include { stage: "freeze-convnet" } 58 | convolution_param { 59 | num_output: 64 60 | pad: 1 61 | kernel_size: 3 62 | } 63 | } 64 | layer { 65 | name: "conv1_1" 66 | type: "Convolution" 67 | bottom: "data" 68 | top: "conv1_1" 69 | param { lr_mult: 0.1 } 70 | param { lr_mult: 0.2 decay_mult: 0} 71 | exclude { stage: "freeze-convnet" } 72 | convolution_param { 73 | num_output: 64 74 | pad: 1 75 | kernel_size: 3 76 | } 77 | } 78 | layer { 79 | name: "relu1_1" 80 | type: "ReLU" 81 | bottom: "conv1_1" 82 | top: "conv1_1" 83 | } 84 | layer { 85 | name: "conv1_2" 86 | type: "Convolution" 87 | bottom: "conv1_1" 88 | top: "conv1_2" 89 | 
param { lr_mult: 0 } 90 | param { lr_mult: 0 decay_mult: 0 } 91 | include { stage: "freeze-convnet" } 92 | convolution_param { 93 | num_output: 64 94 | pad: 1 95 | kernel_size: 3 96 | } 97 | } 98 | layer { 99 | name: "conv1_2" 100 | type: "Convolution" 101 | bottom: "conv1_1" 102 | top: "conv1_2" 103 | param { lr_mult: 0.1 } 104 | param { lr_mult: 0.2 decay_mult: 0} 105 | exclude { stage: "freeze-convnet" } 106 | convolution_param { 107 | num_output: 64 108 | pad: 1 109 | kernel_size: 3 110 | } 111 | } 112 | layer { 113 | name: "relu1_2" 114 | type: "ReLU" 115 | bottom: "conv1_2" 116 | top: "conv1_2" 117 | } 118 | layer { 119 | name: "pool1" 120 | type: "Pooling" 121 | bottom: "conv1_2" 122 | top: "pool1" 123 | pooling_param { 124 | pool: MAX 125 | kernel_size: 2 126 | stride: 2 127 | } 128 | } 129 | layer { 130 | name: "conv2_1" 131 | type: "Convolution" 132 | bottom: "pool1" 133 | top: "conv2_1" 134 | param { lr_mult: 0 } 135 | param { lr_mult: 0 decay_mult: 0 } 136 | include { stage: "freeze-convnet" } 137 | convolution_param { 138 | num_output: 128 139 | pad: 1 140 | kernel_size: 3 141 | } 142 | } 143 | layer { 144 | name: "conv2_1" 145 | type: "Convolution" 146 | bottom: "pool1" 147 | top: "conv2_1" 148 | param { lr_mult: 0.1 } 149 | param { lr_mult: 0.2 decay_mult: 0} 150 | exclude { stage: "freeze-convnet" } 151 | convolution_param { 152 | num_output: 128 153 | pad: 1 154 | kernel_size: 3 155 | } 156 | } 157 | layer { 158 | name: "relu2_1" 159 | type: "ReLU" 160 | bottom: "conv2_1" 161 | top: "conv2_1" 162 | } 163 | layer { 164 | name: "conv2_2" 165 | type: "Convolution" 166 | bottom: "conv2_1" 167 | top: "conv2_2" 168 | param { lr_mult: 0 } 169 | param { lr_mult: 0 decay_mult: 0 } 170 | include { stage: "freeze-convnet" } 171 | convolution_param { 172 | num_output: 128 173 | pad: 1 174 | kernel_size: 3 175 | } 176 | } 177 | layer { 178 | name: "conv2_2" 179 | type: "Convolution" 180 | bottom: "conv2_1" 181 | top: "conv2_2" 182 | param { lr_mult: 0.1 } 183 | param { lr_mult: 0.2 decay_mult: 0} 184 | exclude { stage: "freeze-convnet" } 185 | convolution_param { 186 | num_output: 128 187 | pad: 1 188 | kernel_size: 3 189 | } 190 | } 191 | layer { 192 | name: "relu2_2" 193 | type: "ReLU" 194 | bottom: "conv2_2" 195 | top: "conv2_2" 196 | } 197 | layer { 198 | name: "pool2" 199 | type: "Pooling" 200 | bottom: "conv2_2" 201 | top: "pool2" 202 | pooling_param { 203 | pool: MAX 204 | kernel_size: 2 205 | stride: 2 206 | } 207 | } 208 | layer { 209 | name: "conv3_1" 210 | type: "Convolution" 211 | bottom: "pool2" 212 | top: "conv3_1" 213 | param { lr_mult: 0 } 214 | param { lr_mult: 0 decay_mult: 0 } 215 | include { stage: "freeze-convnet" } 216 | convolution_param { 217 | num_output: 256 218 | pad: 1 219 | kernel_size: 3 220 | } 221 | } 222 | layer { 223 | name: "conv3_1" 224 | type: "Convolution" 225 | bottom: "pool2" 226 | top: "conv3_1" 227 | param { lr_mult: 0.1 } 228 | param { lr_mult: 0.2 decay_mult: 0} 229 | exclude { stage: "freeze-convnet" } 230 | convolution_param { 231 | num_output: 256 232 | pad: 1 233 | kernel_size: 3 234 | } 235 | } 236 | layer { 237 | name: "relu3_1" 238 | type: "ReLU" 239 | bottom: "conv3_1" 240 | top: "conv3_1" 241 | } 242 | layer { 243 | name: "conv3_2" 244 | type: "Convolution" 245 | bottom: "conv3_1" 246 | top: "conv3_2" 247 | param { lr_mult: 0 } 248 | param { lr_mult: 0 decay_mult: 0 } 249 | include { stage: "freeze-convnet" } 250 | convolution_param { 251 | num_output: 256 252 | pad: 1 253 | kernel_size: 3 254 | } 255 | } 256 | layer { 257 | name: 
"conv3_2" 258 | type: "Convolution" 259 | bottom: "conv3_1" 260 | top: "conv3_2" 261 | param { lr_mult: 0.1 } 262 | param { lr_mult: 0.2 decay_mult: 0} 263 | exclude { stage: "freeze-convnet" } 264 | convolution_param { 265 | num_output: 256 266 | pad: 1 267 | kernel_size: 3 268 | } 269 | } 270 | layer { 271 | name: "relu3_2" 272 | type: "ReLU" 273 | bottom: "conv3_2" 274 | top: "conv3_2" 275 | } 276 | layer { 277 | name: "conv3_3" 278 | type: "Convolution" 279 | bottom: "conv3_2" 280 | top: "conv3_3" 281 | param { lr_mult: 0 } 282 | param { lr_mult: 0 decay_mult: 0 } 283 | include { stage: "freeze-convnet" } 284 | convolution_param { 285 | num_output: 256 286 | pad: 1 287 | kernel_size: 3 288 | } 289 | } 290 | layer { 291 | name: "conv3_3" 292 | type: "Convolution" 293 | bottom: "conv3_2" 294 | top: "conv3_3" 295 | param { lr_mult: 0.1 } 296 | param { lr_mult: 0.2 decay_mult: 0} 297 | exclude { stage: "freeze-convnet" } 298 | convolution_param { 299 | num_output: 256 300 | pad: 1 301 | kernel_size: 3 302 | } 303 | } 304 | layer { 305 | name: "relu3_3" 306 | type: "ReLU" 307 | bottom: "conv3_3" 308 | top: "conv3_3" 309 | } 310 | layer { 311 | name: "pool3" 312 | type: "Pooling" 313 | bottom: "conv3_3" 314 | top: "pool3" 315 | pooling_param { 316 | pool: MAX 317 | kernel_size: 2 318 | stride: 2 319 | } 320 | } 321 | layer { 322 | name: "conv4_1" 323 | type: "Convolution" 324 | bottom: "pool3" 325 | top: "conv4_1" 326 | param { lr_mult: 0 } 327 | param { lr_mult: 0 decay_mult: 0 } 328 | include { stage: "freeze-convnet" } 329 | convolution_param { 330 | num_output: 512 331 | pad: 1 332 | kernel_size: 3 333 | } 334 | } 335 | layer { 336 | name: "conv4_1" 337 | type: "Convolution" 338 | bottom: "pool3" 339 | top: "conv4_1" 340 | param { lr_mult: 0.1 } 341 | param { lr_mult: 0.2 decay_mult: 0} 342 | exclude { stage: "freeze-convnet" } 343 | convolution_param { 344 | num_output: 512 345 | pad: 1 346 | kernel_size: 3 347 | } 348 | } 349 | layer { 350 | name: "relu4_1" 351 | type: "ReLU" 352 | bottom: "conv4_1" 353 | top: "conv4_1" 354 | } 355 | layer { 356 | name: "conv4_2" 357 | type: "Convolution" 358 | bottom: "conv4_1" 359 | top: "conv4_2" 360 | param { lr_mult: 0 } 361 | param { lr_mult: 0 decay_mult: 0 } 362 | include { stage: "freeze-convnet" } 363 | convolution_param { 364 | num_output: 512 365 | pad: 1 366 | kernel_size: 3 367 | } 368 | } 369 | layer { 370 | name: "conv4_2" 371 | type: "Convolution" 372 | bottom: "conv4_1" 373 | top: "conv4_2" 374 | param { lr_mult: 0.1 } 375 | param { lr_mult: 0.2 decay_mult: 0} 376 | exclude { stage: "freeze-convnet" } 377 | convolution_param { 378 | num_output: 512 379 | pad: 1 380 | kernel_size: 3 381 | } 382 | } 383 | layer { 384 | name: "relu4_2" 385 | type: "ReLU" 386 | bottom: "conv4_2" 387 | top: "conv4_2" 388 | } 389 | layer { 390 | name: "conv4_3" 391 | type: "Convolution" 392 | bottom: "conv4_2" 393 | top: "conv4_3" 394 | param { lr_mult: 0 } 395 | param { lr_mult: 0 decay_mult: 0 } 396 | include { stage: "freeze-convnet" } 397 | convolution_param { 398 | num_output: 512 399 | pad: 1 400 | kernel_size: 3 401 | } 402 | } 403 | layer { 404 | name: "conv4_3" 405 | type: "Convolution" 406 | bottom: "conv4_2" 407 | top: "conv4_3" 408 | param { lr_mult: 0.1 } 409 | param { lr_mult: 0.2 decay_mult: 0} 410 | exclude { stage: "freeze-convnet" } 411 | convolution_param { 412 | num_output: 512 413 | pad: 1 414 | kernel_size: 3 415 | } 416 | } 417 | layer { 418 | name: "relu4_3" 419 | type: "ReLU" 420 | bottom: "conv4_3" 421 | top: "conv4_3" 422 | } 423 
| layer { 424 | name: "pool4" 425 | type: "Pooling" 426 | bottom: "conv4_3" 427 | top: "pool4" 428 | pooling_param { 429 | pool: MAX 430 | kernel_size: 2 431 | stride: 2 432 | } 433 | } 434 | layer { 435 | name: "conv5_1" 436 | type: "Convolution" 437 | bottom: "pool4" 438 | top: "conv5_1" 439 | param { lr_mult: 0 } 440 | param { lr_mult: 0 decay_mult: 0 } 441 | include { stage: "freeze-convnet" } 442 | convolution_param { 443 | num_output: 512 444 | pad: 1 445 | kernel_size: 3 446 | } 447 | } 448 | layer { 449 | name: "conv5_1" 450 | type: "Convolution" 451 | bottom: "pool4" 452 | top: "conv5_1" 453 | param { lr_mult: 0.1 } 454 | param { lr_mult: 0.2 decay_mult: 0} 455 | exclude { stage: "freeze-convnet" } 456 | convolution_param { 457 | num_output: 512 458 | pad: 1 459 | kernel_size: 3 460 | } 461 | } 462 | layer { 463 | name: "relu5_1" 464 | type: "ReLU" 465 | bottom: "conv5_1" 466 | top: "conv5_1" 467 | } 468 | layer { 469 | name: "conv5_2" 470 | type: "Convolution" 471 | bottom: "conv5_1" 472 | top: "conv5_2" 473 | param { lr_mult: 0 } 474 | param { lr_mult: 0 decay_mult: 0 } 475 | include { stage: "freeze-convnet" } 476 | convolution_param { 477 | num_output: 512 478 | pad: 1 479 | kernel_size: 3 480 | } 481 | } 482 | layer { 483 | name: "conv5_2" 484 | type: "Convolution" 485 | bottom: "conv5_1" 486 | top: "conv5_2" 487 | param { lr_mult: 0.1 } 488 | param { lr_mult: 0.2 decay_mult: 0} 489 | exclude { stage: "freeze-convnet" } 490 | convolution_param { 491 | num_output: 512 492 | pad: 1 493 | kernel_size: 3 494 | } 495 | } 496 | layer { 497 | name: "relu5_2" 498 | type: "ReLU" 499 | bottom: "conv5_2" 500 | top: "conv5_2" 501 | } 502 | layer { 503 | name: "conv5_3" 504 | type: "Convolution" 505 | bottom: "conv5_2" 506 | top: "conv5_3" 507 | param { lr_mult: 0 } 508 | param { lr_mult: 0 decay_mult: 0 } 509 | include { stage: "freeze-convnet" } 510 | convolution_param { 511 | num_output: 512 512 | pad: 1 513 | kernel_size: 3 514 | } 515 | } 516 | layer { 517 | name: "conv5_3" 518 | type: "Convolution" 519 | bottom: "conv5_2" 520 | top: "conv5_3" 521 | param { lr_mult: 0.1 } 522 | param { lr_mult: 0.2 decay_mult: 0} 523 | exclude { stage: "freeze-convnet" } 524 | convolution_param { 525 | num_output: 512 526 | pad: 1 527 | kernel_size: 3 528 | } 529 | } 530 | layer { 531 | name: "relu5_3" 532 | type: "ReLU" 533 | bottom: "conv5_3" 534 | top: "conv5_3" 535 | } 536 | layer { 537 | name: "pool5" 538 | type: "Pooling" 539 | bottom: "conv5_3" 540 | top: "pool5" 541 | pooling_param { 542 | pool: MAX 543 | kernel_size: 2 544 | stride: 2 545 | } 546 | } 547 | layer { 548 | name: "fc6" 549 | type: "InnerProduct" 550 | bottom: "pool5" 551 | top: "fc6" 552 | param { lr_mult: 0 } 553 | param { lr_mult: 0 decay_mult: 0 } 554 | include { stage: "freeze-convnet" } 555 | inner_product_param { 556 | num_output: 4096 557 | } 558 | } 559 | layer { 560 | name: "fc6" 561 | type: "InnerProduct" 562 | bottom: "pool5" 563 | top: "fc6" 564 | param { lr_mult: 0.1 } 565 | param { lr_mult: 0.2 decay_mult: 0} 566 | exclude { stage: "freeze-convnet" } 567 | inner_product_param { 568 | num_output: 4096 569 | } 570 | } 571 | layer { 572 | name: "relu6" 573 | type: "ReLU" 574 | bottom: "fc6" 575 | top: "fc6" 576 | } 577 | layer { 578 | name: "drop6" 579 | type: "Dropout" 580 | bottom: "fc6" 581 | top: "fc6" 582 | dropout_param { 583 | dropout_ratio: 0.5 584 | } 585 | } 586 | layer { 587 | name: "fc7" 588 | type: "InnerProduct" 589 | bottom: "fc6" 590 | top: "fc7" 591 | param { lr_mult: 0 } 592 | param { lr_mult: 0 
decay_mult: 0 } 593 | include { stage: "freeze-convnet" } 594 | inner_product_param { 595 | num_output: 4096 596 | } 597 | } 598 | layer { 599 | name: "fc7" 600 | type: "InnerProduct" 601 | bottom: "fc6" 602 | top: "fc7" 603 | param { lr_mult: 0.1 } 604 | param { lr_mult: 0.2 decay_mult: 0} 605 | exclude { stage: "freeze-convnet" } 606 | inner_product_param { 607 | num_output: 4096 608 | } 609 | } 610 | layer { 611 | name: "relu7" 612 | type: "ReLU" 613 | bottom: "fc7" 614 | top: "fc7" 615 | } 616 | layer { 617 | name: "drop7" 618 | type: "Dropout" 619 | bottom: "fc7" 620 | top: "fc7" 621 | dropout_param { 622 | dropout_ratio: 0.5 623 | } 624 | } 625 | layer { 626 | name: "fc8" 627 | type: "InnerProduct" 628 | bottom: "fc7" 629 | top: "fc8" 630 | param { 631 | lr_mult: 0.1 632 | decay_mult: 1 633 | } 634 | param { 635 | lr_mult: 0.2 636 | decay_mult: 0 637 | } 638 | inner_product_param { 639 | num_output: 1000 640 | } 641 | } 642 | layer { 643 | name: "local_features" 644 | type: "Concat" 645 | bottom: "fc8" 646 | bottom: "bbox_coordinate" 647 | top: "local_features" 648 | } 649 | 650 | layer { 651 | name: "embedding" 652 | type: "Embed" 653 | bottom: "input_sentence" 654 | top: "embedded_input_sentence" 655 | param { 656 | lr_mult: 1 657 | } 658 | embed_param { 659 | bias_term: false 660 | input_dim: 8801 661 | num_output: 1000 662 | weight_filler { 663 | type: "uniform" 664 | min: -0.08 665 | max: 0.08 666 | } 667 | } 668 | } 669 | layer { 670 | name: "lstm1" 671 | type: "LSTM" 672 | bottom: "embedded_input_sentence" 673 | bottom: "cont_sentence" 674 | bottom: "local_features" 675 | top: "lstm1" 676 | include { stage: "unfactored" } 677 | recurrent_param { 678 | num_output: 1000 679 | weight_filler { 680 | type: "uniform" 681 | min: -0.08 682 | max: 0.08 683 | } 684 | bias_filler { 685 | type: "constant" 686 | value: 0 687 | } 688 | } 689 | } 690 | layer { 691 | name: "lstm2" 692 | type: "LSTM" 693 | bottom: "lstm1" 694 | bottom: "cont_sentence" 695 | top: "lstm2" 696 | include { 697 | stage: "unfactored" 698 | stage: "2-layer" 699 | } 700 | recurrent_param { 701 | num_output: 1000 702 | weight_filler { 703 | type: "uniform" 704 | min: -0.08 705 | max: 0.08 706 | } 707 | bias_filler { 708 | type: "constant" 709 | value: 0 710 | } 711 | } 712 | } 713 | layer { 714 | name: "lstm1" 715 | type: "LSTM" 716 | bottom: "embedded_input_sentence" 717 | bottom: "cont_sentence" 718 | top: "lstm1" 719 | include { stage: "factored" } 720 | recurrent_param { 721 | num_output: 1000 722 | weight_filler { 723 | type: "uniform" 724 | min: -0.08 725 | max: 0.08 726 | } 727 | bias_filler { 728 | type: "constant" 729 | value: 0 730 | } 731 | } 732 | } 733 | layer { 734 | name: "lstm2-extended" 735 | type: "LSTM" 736 | bottom: "lstm1" 737 | bottom: "cont_sentence" 738 | bottom: "local_features" 739 | top: "lstm2" 740 | include { stage: "factored" } 741 | recurrent_param { 742 | num_output: 1000 743 | weight_filler { 744 | type: "uniform" 745 | min: -0.08 746 | max: 0.08 747 | } 748 | bias_filler { 749 | type: "constant" 750 | value: 0 751 | } 752 | } 753 | } 754 | layer { 755 | name: "predict" 756 | type: "InnerProduct" 757 | bottom: "lstm1" 758 | top: "predict" 759 | param { 760 | lr_mult: 1 761 | decay_mult: 1 762 | } 763 | param { 764 | lr_mult: 2 765 | decay_mult: 0 766 | } 767 | exclude { stage: "2-layer" } 768 | inner_product_param { 769 | num_output: 8801 770 | weight_filler { 771 | type: "uniform" 772 | min: -0.08 773 | max: 0.08 774 | } 775 | bias_filler { 776 | type: "constant" 777 | value: 0 778 | } 
779 | axis: 2 780 | } 781 | } 782 | layer { 783 | name: "predict" 784 | type: "InnerProduct" 785 | bottom: "lstm2" 786 | top: "predict" 787 | param { 788 | lr_mult: 1 789 | decay_mult: 1 790 | } 791 | param { 792 | lr_mult: 2 793 | decay_mult: 0 794 | } 795 | include { stage: "2-layer" } 796 | inner_product_param { 797 | num_output: 8801 798 | weight_filler { 799 | type: "uniform" 800 | min: -0.08 801 | max: 0.08 802 | } 803 | bias_filler { 804 | type: "constant" 805 | value: 0 806 | } 807 | axis: 2 808 | } 809 | } 810 | 811 | # Context LSTM 812 | layer { 813 | name: "fc8_context" 814 | type: "InnerProduct" 815 | bottom: "fc7_context" 816 | top: "fc8_context" 817 | param { 818 | lr_mult: 0.1 819 | decay_mult: 1 820 | } 821 | param { 822 | lr_mult: 0.2 823 | decay_mult: 0 824 | } 825 | inner_product_param { 826 | num_output: 1000 827 | } 828 | } 829 | layer { 830 | name: "lstm2_context" 831 | type: "LSTM" 832 | bottom: "lstm1" 833 | bottom: "cont_sentence" 834 | bottom: "fc8_context" 835 | top: "lstm2_context" 836 | include { stage: "factored" } 837 | recurrent_param { 838 | num_output: 1000 839 | weight_filler { 840 | type: "uniform" 841 | min: -0.08 842 | max: 0.08 843 | } 844 | bias_filler { 845 | type: "constant" 846 | value: 0 847 | } 848 | } 849 | } 850 | layer { 851 | name: "predict_context" 852 | type: "InnerProduct" 853 | bottom: "lstm2_context" 854 | top: "predict_context" 855 | param { 856 | lr_mult: 1 857 | decay_mult: 1 858 | } 859 | param { 860 | lr_mult: 2 861 | decay_mult: 0 862 | } 863 | include { stage: "2-layer" } 864 | inner_product_param { 865 | num_output: 8801 866 | weight_filler { 867 | type: "constant" 868 | value: 0 869 | } 870 | bias_filler { 871 | type: "constant" 872 | value: 0 873 | } 874 | axis: 2 875 | } 876 | } 877 | layer { 878 | name: "predict_combined" 879 | type: "Eltwise" 880 | bottom: "predict" 881 | bottom: "predict_context" 882 | top: "predict_combined" 883 | eltwise_param { operation: SUM } 884 | } 885 | 886 | layer { 887 | name: "cross_entropy_loss" 888 | type: "SoftmaxWithLoss" 889 | bottom: "predict_combined" 890 | bottom: "target_sentence" 891 | top: "cross_entropy_loss" 892 | loss_weight: 20 893 | loss_param { 894 | ignore_label: -1 895 | } 896 | softmax_param { 897 | axis: 2 898 | } 899 | } 900 | layer { 901 | name: "accuracy" 902 | type: "Accuracy" 903 | bottom: "predict" 904 | bottom: "target_sentence" 905 | top: "accuracy" 906 | include { phase: TEST } 907 | accuracy_param { 908 | axis: 2 909 | ignore_label: -1 910 | } 911 | } 912 | -------------------------------------------------------------------------------- /prototxt/scrc_full_vgg_solver.prototxt: -------------------------------------------------------------------------------- 1 | net: "./prototxt/scrc_full_vgg_buffer_50.prototxt" 2 | 3 | train_state: { stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' } 4 | base_lr: 0.001 5 | lr_policy: "step" 6 | gamma: 0.5 7 | stepsize: 30000 8 | display: 1 9 | max_iter: 90000 10 | momentum: 0.9 11 | weight_decay: 0.0000 12 | snapshot: 10000 13 | snapshot_prefix: "./exp-referit/caffemodel/scrc_full_vgg" 14 | solver_mode: GPU 15 | random_seed: 1701 16 | average_loss: 100 17 | clip_gradients: 10 18 | -------------------------------------------------------------------------------- /prototxt/scrc_kitchen_buffer_50.prototxt: -------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | # train data layers 7 | 
layer { 8 | name: "data" 9 | type: "ImageData" 10 | top: "data" 11 | top: "label" 12 | transform_param { 13 | mirror: true 14 | crop_size: 224 15 | mean_value: 104 16 | mean_value: 117 17 | mean_value: 123 18 | } 19 | image_data_param { 20 | source: "./data/kitchen_train_image_list.txt" 21 | batch_size: 50 22 | new_height: 256 23 | new_width: 256 24 | } 25 | } 26 | layer { 27 | name: "data" 28 | type: "HDF5Data" 29 | top: "cont_sentence" 30 | top: "input_sentence" 31 | top: "target_sentence" 32 | hdf5_data_param { 33 | source: "./data/kitchen_train_hdf5_list.txt" 34 | batch_size: 20 35 | } 36 | } 37 | 38 | layer { 39 | name: "silence" 40 | type: "Silence" 41 | bottom: "label" 42 | } 43 | layer { 44 | name: "conv1_1" 45 | type: "Convolution" 46 | bottom: "data" 47 | top: "conv1_1" 48 | param { lr_mult: 0 } 49 | param { lr_mult: 0 decay_mult: 0 } 50 | include { stage: "freeze-convnet" } 51 | convolution_param { 52 | num_output: 64 53 | pad: 1 54 | kernel_size: 3 55 | } 56 | } 57 | layer { 58 | name: "conv1_1" 59 | type: "Convolution" 60 | bottom: "data" 61 | top: "conv1_1" 62 | param { lr_mult: 0.1 } 63 | param { lr_mult: 0.2 decay_mult: 0} 64 | exclude { stage: "freeze-convnet" } 65 | convolution_param { 66 | num_output: 64 67 | pad: 1 68 | kernel_size: 3 69 | } 70 | } 71 | layer { 72 | name: "relu1_1" 73 | type: "ReLU" 74 | bottom: "conv1_1" 75 | top: "conv1_1" 76 | } 77 | layer { 78 | name: "conv1_2" 79 | type: "Convolution" 80 | bottom: "conv1_1" 81 | top: "conv1_2" 82 | param { lr_mult: 0 } 83 | param { lr_mult: 0 decay_mult: 0 } 84 | include { stage: "freeze-convnet" } 85 | convolution_param { 86 | num_output: 64 87 | pad: 1 88 | kernel_size: 3 89 | } 90 | } 91 | layer { 92 | name: "conv1_2" 93 | type: "Convolution" 94 | bottom: "conv1_1" 95 | top: "conv1_2" 96 | param { lr_mult: 0.1 } 97 | param { lr_mult: 0.2 decay_mult: 0} 98 | exclude { stage: "freeze-convnet" } 99 | convolution_param { 100 | num_output: 64 101 | pad: 1 102 | kernel_size: 3 103 | } 104 | } 105 | layer { 106 | name: "relu1_2" 107 | type: "ReLU" 108 | bottom: "conv1_2" 109 | top: "conv1_2" 110 | } 111 | layer { 112 | name: "pool1" 113 | type: "Pooling" 114 | bottom: "conv1_2" 115 | top: "pool1" 116 | pooling_param { 117 | pool: MAX 118 | kernel_size: 2 119 | stride: 2 120 | } 121 | } 122 | layer { 123 | name: "conv2_1" 124 | type: "Convolution" 125 | bottom: "pool1" 126 | top: "conv2_1" 127 | param { lr_mult: 0 } 128 | param { lr_mult: 0 decay_mult: 0 } 129 | include { stage: "freeze-convnet" } 130 | convolution_param { 131 | num_output: 128 132 | pad: 1 133 | kernel_size: 3 134 | } 135 | } 136 | layer { 137 | name: "conv2_1" 138 | type: "Convolution" 139 | bottom: "pool1" 140 | top: "conv2_1" 141 | param { lr_mult: 0.1 } 142 | param { lr_mult: 0.2 decay_mult: 0} 143 | exclude { stage: "freeze-convnet" } 144 | convolution_param { 145 | num_output: 128 146 | pad: 1 147 | kernel_size: 3 148 | } 149 | } 150 | layer { 151 | name: "relu2_1" 152 | type: "ReLU" 153 | bottom: "conv2_1" 154 | top: "conv2_1" 155 | } 156 | layer { 157 | name: "conv2_2" 158 | type: "Convolution" 159 | bottom: "conv2_1" 160 | top: "conv2_2" 161 | param { lr_mult: 0 } 162 | param { lr_mult: 0 decay_mult: 0 } 163 | include { stage: "freeze-convnet" } 164 | convolution_param { 165 | num_output: 128 166 | pad: 1 167 | kernel_size: 3 168 | } 169 | } 170 | layer { 171 | name: "conv2_2" 172 | type: "Convolution" 173 | bottom: "conv2_1" 174 | top: "conv2_2" 175 | param { lr_mult: 0.1 } 176 | param { lr_mult: 0.2 decay_mult: 0} 177 | exclude { stage: 
"freeze-convnet" } 178 | convolution_param { 179 | num_output: 128 180 | pad: 1 181 | kernel_size: 3 182 | } 183 | } 184 | layer { 185 | name: "relu2_2" 186 | type: "ReLU" 187 | bottom: "conv2_2" 188 | top: "conv2_2" 189 | } 190 | layer { 191 | name: "pool2" 192 | type: "Pooling" 193 | bottom: "conv2_2" 194 | top: "pool2" 195 | pooling_param { 196 | pool: MAX 197 | kernel_size: 2 198 | stride: 2 199 | } 200 | } 201 | layer { 202 | name: "conv3_1" 203 | type: "Convolution" 204 | bottom: "pool2" 205 | top: "conv3_1" 206 | param { lr_mult: 0 } 207 | param { lr_mult: 0 decay_mult: 0 } 208 | include { stage: "freeze-convnet" } 209 | convolution_param { 210 | num_output: 256 211 | pad: 1 212 | kernel_size: 3 213 | } 214 | } 215 | layer { 216 | name: "conv3_1" 217 | type: "Convolution" 218 | bottom: "pool2" 219 | top: "conv3_1" 220 | param { lr_mult: 0.1 } 221 | param { lr_mult: 0.2 decay_mult: 0} 222 | exclude { stage: "freeze-convnet" } 223 | convolution_param { 224 | num_output: 256 225 | pad: 1 226 | kernel_size: 3 227 | } 228 | } 229 | layer { 230 | name: "relu3_1" 231 | type: "ReLU" 232 | bottom: "conv3_1" 233 | top: "conv3_1" 234 | } 235 | layer { 236 | name: "conv3_2" 237 | type: "Convolution" 238 | bottom: "conv3_1" 239 | top: "conv3_2" 240 | param { lr_mult: 0 } 241 | param { lr_mult: 0 decay_mult: 0 } 242 | include { stage: "freeze-convnet" } 243 | convolution_param { 244 | num_output: 256 245 | pad: 1 246 | kernel_size: 3 247 | } 248 | } 249 | layer { 250 | name: "conv3_2" 251 | type: "Convolution" 252 | bottom: "conv3_1" 253 | top: "conv3_2" 254 | param { lr_mult: 0.1 } 255 | param { lr_mult: 0.2 decay_mult: 0} 256 | exclude { stage: "freeze-convnet" } 257 | convolution_param { 258 | num_output: 256 259 | pad: 1 260 | kernel_size: 3 261 | } 262 | } 263 | layer { 264 | name: "relu3_2" 265 | type: "ReLU" 266 | bottom: "conv3_2" 267 | top: "conv3_2" 268 | } 269 | layer { 270 | name: "conv3_3" 271 | type: "Convolution" 272 | bottom: "conv3_2" 273 | top: "conv3_3" 274 | param { lr_mult: 0 } 275 | param { lr_mult: 0 decay_mult: 0 } 276 | include { stage: "freeze-convnet" } 277 | convolution_param { 278 | num_output: 256 279 | pad: 1 280 | kernel_size: 3 281 | } 282 | } 283 | layer { 284 | name: "conv3_3" 285 | type: "Convolution" 286 | bottom: "conv3_2" 287 | top: "conv3_3" 288 | param { lr_mult: 0.1 } 289 | param { lr_mult: 0.2 decay_mult: 0} 290 | exclude { stage: "freeze-convnet" } 291 | convolution_param { 292 | num_output: 256 293 | pad: 1 294 | kernel_size: 3 295 | } 296 | } 297 | layer { 298 | name: "relu3_3" 299 | type: "ReLU" 300 | bottom: "conv3_3" 301 | top: "conv3_3" 302 | } 303 | layer { 304 | name: "pool3" 305 | type: "Pooling" 306 | bottom: "conv3_3" 307 | top: "pool3" 308 | pooling_param { 309 | pool: MAX 310 | kernel_size: 2 311 | stride: 2 312 | } 313 | } 314 | layer { 315 | name: "conv4_1" 316 | type: "Convolution" 317 | bottom: "pool3" 318 | top: "conv4_1" 319 | param { lr_mult: 0 } 320 | param { lr_mult: 0 decay_mult: 0 } 321 | include { stage: "freeze-convnet" } 322 | convolution_param { 323 | num_output: 512 324 | pad: 1 325 | kernel_size: 3 326 | } 327 | } 328 | layer { 329 | name: "conv4_1" 330 | type: "Convolution" 331 | bottom: "pool3" 332 | top: "conv4_1" 333 | param { lr_mult: 0.1 } 334 | param { lr_mult: 0.2 decay_mult: 0} 335 | exclude { stage: "freeze-convnet" } 336 | convolution_param { 337 | num_output: 512 338 | pad: 1 339 | kernel_size: 3 340 | } 341 | } 342 | layer { 343 | name: "relu4_1" 344 | type: "ReLU" 345 | bottom: "conv4_1" 346 | top: "conv4_1" 
347 | } 348 | layer { 349 | name: "conv4_2" 350 | type: "Convolution" 351 | bottom: "conv4_1" 352 | top: "conv4_2" 353 | param { lr_mult: 0 } 354 | param { lr_mult: 0 decay_mult: 0 } 355 | include { stage: "freeze-convnet" } 356 | convolution_param { 357 | num_output: 512 358 | pad: 1 359 | kernel_size: 3 360 | } 361 | } 362 | layer { 363 | name: "conv4_2" 364 | type: "Convolution" 365 | bottom: "conv4_1" 366 | top: "conv4_2" 367 | param { lr_mult: 0.1 } 368 | param { lr_mult: 0.2 decay_mult: 0} 369 | exclude { stage: "freeze-convnet" } 370 | convolution_param { 371 | num_output: 512 372 | pad: 1 373 | kernel_size: 3 374 | } 375 | } 376 | layer { 377 | name: "relu4_2" 378 | type: "ReLU" 379 | bottom: "conv4_2" 380 | top: "conv4_2" 381 | } 382 | layer { 383 | name: "conv4_3" 384 | type: "Convolution" 385 | bottom: "conv4_2" 386 | top: "conv4_3" 387 | param { lr_mult: 0 } 388 | param { lr_mult: 0 decay_mult: 0 } 389 | include { stage: "freeze-convnet" } 390 | convolution_param { 391 | num_output: 512 392 | pad: 1 393 | kernel_size: 3 394 | } 395 | } 396 | layer { 397 | name: "conv4_3" 398 | type: "Convolution" 399 | bottom: "conv4_2" 400 | top: "conv4_3" 401 | param { lr_mult: 0.1 } 402 | param { lr_mult: 0.2 decay_mult: 0} 403 | exclude { stage: "freeze-convnet" } 404 | convolution_param { 405 | num_output: 512 406 | pad: 1 407 | kernel_size: 3 408 | } 409 | } 410 | layer { 411 | name: "relu4_3" 412 | type: "ReLU" 413 | bottom: "conv4_3" 414 | top: "conv4_3" 415 | } 416 | layer { 417 | name: "pool4" 418 | type: "Pooling" 419 | bottom: "conv4_3" 420 | top: "pool4" 421 | pooling_param { 422 | pool: MAX 423 | kernel_size: 2 424 | stride: 2 425 | } 426 | } 427 | layer { 428 | name: "conv5_1" 429 | type: "Convolution" 430 | bottom: "pool4" 431 | top: "conv5_1" 432 | param { lr_mult: 0 } 433 | param { lr_mult: 0 decay_mult: 0 } 434 | include { stage: "freeze-convnet" } 435 | convolution_param { 436 | num_output: 512 437 | pad: 1 438 | kernel_size: 3 439 | } 440 | } 441 | layer { 442 | name: "conv5_1" 443 | type: "Convolution" 444 | bottom: "pool4" 445 | top: "conv5_1" 446 | param { lr_mult: 0.1 } 447 | param { lr_mult: 0.2 decay_mult: 0} 448 | exclude { stage: "freeze-convnet" } 449 | convolution_param { 450 | num_output: 512 451 | pad: 1 452 | kernel_size: 3 453 | } 454 | } 455 | layer { 456 | name: "relu5_1" 457 | type: "ReLU" 458 | bottom: "conv5_1" 459 | top: "conv5_1" 460 | } 461 | layer { 462 | name: "conv5_2" 463 | type: "Convolution" 464 | bottom: "conv5_1" 465 | top: "conv5_2" 466 | param { lr_mult: 0 } 467 | param { lr_mult: 0 decay_mult: 0 } 468 | include { stage: "freeze-convnet" } 469 | convolution_param { 470 | num_output: 512 471 | pad: 1 472 | kernel_size: 3 473 | } 474 | } 475 | layer { 476 | name: "conv5_2" 477 | type: "Convolution" 478 | bottom: "conv5_1" 479 | top: "conv5_2" 480 | param { lr_mult: 0.1 } 481 | param { lr_mult: 0.2 decay_mult: 0} 482 | exclude { stage: "freeze-convnet" } 483 | convolution_param { 484 | num_output: 512 485 | pad: 1 486 | kernel_size: 3 487 | } 488 | } 489 | layer { 490 | name: "relu5_2" 491 | type: "ReLU" 492 | bottom: "conv5_2" 493 | top: "conv5_2" 494 | } 495 | layer { 496 | name: "conv5_3" 497 | type: "Convolution" 498 | bottom: "conv5_2" 499 | top: "conv5_3" 500 | param { lr_mult: 0 } 501 | param { lr_mult: 0 decay_mult: 0 } 502 | include { stage: "freeze-convnet" } 503 | convolution_param { 504 | num_output: 512 505 | pad: 1 506 | kernel_size: 3 507 | } 508 | } 509 | layer { 510 | name: "conv5_3" 511 | type: "Convolution" 512 | bottom: 
"conv5_2" 513 | top: "conv5_3" 514 | param { lr_mult: 0.1 } 515 | param { lr_mult: 0.2 decay_mult: 0} 516 | exclude { stage: "freeze-convnet" } 517 | convolution_param { 518 | num_output: 512 519 | pad: 1 520 | kernel_size: 3 521 | } 522 | } 523 | layer { 524 | name: "relu5_3" 525 | type: "ReLU" 526 | bottom: "conv5_3" 527 | top: "conv5_3" 528 | } 529 | layer { 530 | name: "pool5" 531 | type: "Pooling" 532 | bottom: "conv5_3" 533 | top: "pool5" 534 | pooling_param { 535 | pool: MAX 536 | kernel_size: 2 537 | stride: 2 538 | } 539 | } 540 | layer { 541 | name: "fc6" 542 | type: "InnerProduct" 543 | bottom: "pool5" 544 | top: "fc6" 545 | param { lr_mult: 0 } 546 | param { lr_mult: 0 decay_mult: 0 } 547 | include { stage: "freeze-convnet" } 548 | inner_product_param { 549 | num_output: 4096 550 | } 551 | } 552 | layer { 553 | name: "fc6" 554 | type: "InnerProduct" 555 | bottom: "pool5" 556 | top: "fc6" 557 | param { lr_mult: 0.1 } 558 | param { lr_mult: 0.2 decay_mult: 0} 559 | exclude { stage: "freeze-convnet" } 560 | inner_product_param { 561 | num_output: 4096 562 | } 563 | } 564 | layer { 565 | name: "relu6" 566 | type: "ReLU" 567 | bottom: "fc6" 568 | top: "fc6" 569 | } 570 | layer { 571 | name: "drop6" 572 | type: "Dropout" 573 | bottom: "fc6" 574 | top: "fc6" 575 | dropout_param { 576 | dropout_ratio: 0.5 577 | } 578 | } 579 | layer { 580 | name: "fc7" 581 | type: "InnerProduct" 582 | bottom: "fc6" 583 | top: "fc7" 584 | param { lr_mult: 0 } 585 | param { lr_mult: 0 decay_mult: 0 } 586 | include { stage: "freeze-convnet" } 587 | inner_product_param { 588 | num_output: 4096 589 | } 590 | } 591 | layer { 592 | name: "fc7" 593 | type: "InnerProduct" 594 | bottom: "fc6" 595 | top: "fc7" 596 | param { lr_mult: 0.1 } 597 | param { lr_mult: 0.2 decay_mult: 0} 598 | exclude { stage: "freeze-convnet" } 599 | inner_product_param { 600 | num_output: 4096 601 | } 602 | } 603 | layer { 604 | name: "relu7" 605 | type: "ReLU" 606 | bottom: "fc7" 607 | top: "fc7" 608 | } 609 | layer { 610 | name: "drop7" 611 | type: "Dropout" 612 | bottom: "fc7" 613 | top: "fc7" 614 | dropout_param { 615 | dropout_ratio: 0.5 616 | } 617 | } 618 | layer { 619 | name: "fc8" 620 | type: "InnerProduct" 621 | bottom: "fc7" 622 | top: "fc8" 623 | param { 624 | lr_mult: 0.1 625 | decay_mult: 1 626 | } 627 | param { 628 | lr_mult: 0.2 629 | decay_mult: 0 630 | } 631 | inner_product_param { 632 | num_output: 1000 633 | } 634 | } 635 | layer { 636 | name: "embedding" 637 | type: "Embed" 638 | bottom: "input_sentence" 639 | top: "embedded_input_sentence" 640 | param { 641 | lr_mult: 1 642 | } 643 | embed_param { 644 | bias_term: false 645 | input_dim: 8801 646 | num_output: 1000 647 | weight_filler { 648 | type: "uniform" 649 | min: -0.08 650 | max: 0.08 651 | } 652 | } 653 | } 654 | layer { 655 | name: "lstm1" 656 | type: "LSTM" 657 | bottom: "embedded_input_sentence" 658 | bottom: "cont_sentence" 659 | bottom: "fc8" 660 | top: "lstm1" 661 | include { stage: "unfactored" } 662 | recurrent_param { 663 | num_output: 1000 664 | weight_filler { 665 | type: "uniform" 666 | min: -0.08 667 | max: 0.08 668 | } 669 | bias_filler { 670 | type: "constant" 671 | value: 0 672 | } 673 | } 674 | } 675 | layer { 676 | name: "lstm2" 677 | type: "LSTM" 678 | bottom: "lstm1" 679 | bottom: "cont_sentence" 680 | top: "lstm2" 681 | include { 682 | stage: "unfactored" 683 | stage: "2-layer" 684 | } 685 | recurrent_param { 686 | num_output: 1000 687 | weight_filler { 688 | type: "uniform" 689 | min: -0.08 690 | max: 0.08 691 | } 692 | bias_filler { 
693 | type: "constant" 694 | value: 0 695 | } 696 | } 697 | } 698 | layer { 699 | name: "lstm1" 700 | type: "LSTM" 701 | bottom: "embedded_input_sentence" 702 | bottom: "cont_sentence" 703 | top: "lstm1" 704 | include { stage: "factored" } 705 | recurrent_param { 706 | num_output: 1000 707 | weight_filler { 708 | type: "uniform" 709 | min: -0.08 710 | max: 0.08 711 | } 712 | bias_filler { 713 | type: "constant" 714 | value: 0 715 | } 716 | } 717 | } 718 | layer { 719 | name: "lstm2" 720 | type: "LSTM" 721 | bottom: "lstm1" 722 | bottom: "cont_sentence" 723 | bottom: "fc8" 724 | top: "lstm2" 725 | include { stage: "factored" } 726 | recurrent_param { 727 | num_output: 1000 728 | weight_filler { 729 | type: "uniform" 730 | min: -0.08 731 | max: 0.08 732 | } 733 | bias_filler { 734 | type: "constant" 735 | value: 0 736 | } 737 | } 738 | } 739 | layer { 740 | name: "predict" 741 | type: "InnerProduct" 742 | bottom: "lstm1" 743 | top: "predict" 744 | param { 745 | lr_mult: 1 746 | decay_mult: 1 747 | } 748 | param { 749 | lr_mult: 2 750 | decay_mult: 0 751 | } 752 | exclude { stage: "2-layer" } 753 | inner_product_param { 754 | num_output: 8801 755 | weight_filler { 756 | type: "uniform" 757 | min: -0.08 758 | max: 0.08 759 | } 760 | bias_filler { 761 | type: "constant" 762 | value: 0 763 | } 764 | axis: 2 765 | } 766 | } 767 | layer { 768 | name: "predict" 769 | type: "InnerProduct" 770 | bottom: "lstm2" 771 | top: "predict" 772 | param { 773 | lr_mult: 1 774 | decay_mult: 1 775 | } 776 | param { 777 | lr_mult: 2 778 | decay_mult: 0 779 | } 780 | include { stage: "2-layer" } 781 | inner_product_param { 782 | num_output: 8801 783 | weight_filler { 784 | type: "uniform" 785 | min: -0.08 786 | max: 0.08 787 | } 788 | bias_filler { 789 | type: "constant" 790 | value: 0 791 | } 792 | axis: 2 793 | } 794 | } 795 | layer { 796 | name: "cross_entropy_loss" 797 | type: "SoftmaxWithLoss" 798 | bottom: "predict" 799 | bottom: "target_sentence" 800 | top: "cross_entropy_loss" 801 | loss_weight: 20 802 | loss_param { 803 | ignore_label: -1 804 | } 805 | softmax_param { 806 | axis: 2 807 | } 808 | } 809 | layer { 810 | name: "accuracy" 811 | type: "Accuracy" 812 | bottom: "predict" 813 | bottom: "target_sentence" 814 | top: "accuracy" 815 | include { phase: TEST } 816 | accuracy_param { 817 | axis: 2 818 | ignore_label: -1 819 | } 820 | } 821 | -------------------------------------------------------------------------------- /prototxt/scrc_kitchen_solver.prototxt: -------------------------------------------------------------------------------- 1 | net: "prototxt/scrc_kitchen_buffer_50.prototxt" 2 | 3 | train_state: { stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' } 4 | base_lr: 0.001 5 | lr_policy: "step" 6 | gamma: 0.5 7 | stepsize: 1000 8 | display: 1 9 | max_iter: 3000 10 | momentum: 0.9 11 | weight_decay: 0.0000 12 | snapshot: 3000 13 | snapshot_prefix: "./exp-kitchen/caffemodel/scrc_kitchen" 14 | solver_mode: GPU 15 | random_seed: 1701 16 | average_loss: 100 17 | clip_gradients: 10 18 | -------------------------------------------------------------------------------- /prototxt/scrc_no_context_vgg_buffer_50.prototxt: -------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | # train data layers 7 | layer { 8 | name: "data" 9 | type: "ImageData" 10 | top: "data" 11 | top: "label" 12 | transform_param { 13 | crop_size: 224 14 | mean_value: 104 15 | mean_value: 
117 16 | mean_value: 123 17 | } 18 | image_data_param { 19 | source: "./data/training/train_bbox_context_imcrop_list.txt" 20 | batch_size: 50 21 | } 22 | } 23 | layer { 24 | name: "data" 25 | type: "HDF5Data" 26 | top: "cont_sentence" 27 | top: "input_sentence" 28 | top: "target_sentence" 29 | hdf5_data_param { 30 | source: "./data/training/train_bbox_context_hdf5_text_list.txt" 31 | batch_size: 20 32 | } 33 | } 34 | layer { 35 | name: "data" 36 | type: "HDF5Data" 37 | top: "bbox_coordinate" 38 | hdf5_data_param { 39 | source: "./data/training/train_bbox_context_hdf5_bbox_list.txt" 40 | batch_size: 50 41 | } 42 | } 43 | 44 | layer { 45 | name: "silence" 46 | type: "Silence" 47 | bottom: "label" 48 | } 49 | layer { 50 | name: "conv1_1" 51 | type: "Convolution" 52 | bottom: "data" 53 | top: "conv1_1" 54 | param { lr_mult: 0 } 55 | param { lr_mult: 0 decay_mult: 0 } 56 | include { stage: "freeze-convnet" } 57 | convolution_param { 58 | num_output: 64 59 | pad: 1 60 | kernel_size: 3 61 | } 62 | } 63 | layer { 64 | name: "conv1_1" 65 | type: "Convolution" 66 | bottom: "data" 67 | top: "conv1_1" 68 | param { lr_mult: 0.1 } 69 | param { lr_mult: 0.2 decay_mult: 0} 70 | exclude { stage: "freeze-convnet" } 71 | convolution_param { 72 | num_output: 64 73 | pad: 1 74 | kernel_size: 3 75 | } 76 | } 77 | layer { 78 | name: "relu1_1" 79 | type: "ReLU" 80 | bottom: "conv1_1" 81 | top: "conv1_1" 82 | } 83 | layer { 84 | name: "conv1_2" 85 | type: "Convolution" 86 | bottom: "conv1_1" 87 | top: "conv1_2" 88 | param { lr_mult: 0 } 89 | param { lr_mult: 0 decay_mult: 0 } 90 | include { stage: "freeze-convnet" } 91 | convolution_param { 92 | num_output: 64 93 | pad: 1 94 | kernel_size: 3 95 | } 96 | } 97 | layer { 98 | name: "conv1_2" 99 | type: "Convolution" 100 | bottom: "conv1_1" 101 | top: "conv1_2" 102 | param { lr_mult: 0.1 } 103 | param { lr_mult: 0.2 decay_mult: 0} 104 | exclude { stage: "freeze-convnet" } 105 | convolution_param { 106 | num_output: 64 107 | pad: 1 108 | kernel_size: 3 109 | } 110 | } 111 | layer { 112 | name: "relu1_2" 113 | type: "ReLU" 114 | bottom: "conv1_2" 115 | top: "conv1_2" 116 | } 117 | layer { 118 | name: "pool1" 119 | type: "Pooling" 120 | bottom: "conv1_2" 121 | top: "pool1" 122 | pooling_param { 123 | pool: MAX 124 | kernel_size: 2 125 | stride: 2 126 | } 127 | } 128 | layer { 129 | name: "conv2_1" 130 | type: "Convolution" 131 | bottom: "pool1" 132 | top: "conv2_1" 133 | param { lr_mult: 0 } 134 | param { lr_mult: 0 decay_mult: 0 } 135 | include { stage: "freeze-convnet" } 136 | convolution_param { 137 | num_output: 128 138 | pad: 1 139 | kernel_size: 3 140 | } 141 | } 142 | layer { 143 | name: "conv2_1" 144 | type: "Convolution" 145 | bottom: "pool1" 146 | top: "conv2_1" 147 | param { lr_mult: 0.1 } 148 | param { lr_mult: 0.2 decay_mult: 0} 149 | exclude { stage: "freeze-convnet" } 150 | convolution_param { 151 | num_output: 128 152 | pad: 1 153 | kernel_size: 3 154 | } 155 | } 156 | layer { 157 | name: "relu2_1" 158 | type: "ReLU" 159 | bottom: "conv2_1" 160 | top: "conv2_1" 161 | } 162 | layer { 163 | name: "conv2_2" 164 | type: "Convolution" 165 | bottom: "conv2_1" 166 | top: "conv2_2" 167 | param { lr_mult: 0 } 168 | param { lr_mult: 0 decay_mult: 0 } 169 | include { stage: "freeze-convnet" } 170 | convolution_param { 171 | num_output: 128 172 | pad: 1 173 | kernel_size: 3 174 | } 175 | } 176 | layer { 177 | name: "conv2_2" 178 | type: "Convolution" 179 | bottom: "conv2_1" 180 | top: "conv2_2" 181 | param { lr_mult: 0.1 } 182 | param { lr_mult: 0.2 decay_mult: 0} 
183 | exclude { stage: "freeze-convnet" } 184 | convolution_param { 185 | num_output: 128 186 | pad: 1 187 | kernel_size: 3 188 | } 189 | } 190 | layer { 191 | name: "relu2_2" 192 | type: "ReLU" 193 | bottom: "conv2_2" 194 | top: "conv2_2" 195 | } 196 | layer { 197 | name: "pool2" 198 | type: "Pooling" 199 | bottom: "conv2_2" 200 | top: "pool2" 201 | pooling_param { 202 | pool: MAX 203 | kernel_size: 2 204 | stride: 2 205 | } 206 | } 207 | layer { 208 | name: "conv3_1" 209 | type: "Convolution" 210 | bottom: "pool2" 211 | top: "conv3_1" 212 | param { lr_mult: 0 } 213 | param { lr_mult: 0 decay_mult: 0 } 214 | include { stage: "freeze-convnet" } 215 | convolution_param { 216 | num_output: 256 217 | pad: 1 218 | kernel_size: 3 219 | } 220 | } 221 | layer { 222 | name: "conv3_1" 223 | type: "Convolution" 224 | bottom: "pool2" 225 | top: "conv3_1" 226 | param { lr_mult: 0.1 } 227 | param { lr_mult: 0.2 decay_mult: 0} 228 | exclude { stage: "freeze-convnet" } 229 | convolution_param { 230 | num_output: 256 231 | pad: 1 232 | kernel_size: 3 233 | } 234 | } 235 | layer { 236 | name: "relu3_1" 237 | type: "ReLU" 238 | bottom: "conv3_1" 239 | top: "conv3_1" 240 | } 241 | layer { 242 | name: "conv3_2" 243 | type: "Convolution" 244 | bottom: "conv3_1" 245 | top: "conv3_2" 246 | param { lr_mult: 0 } 247 | param { lr_mult: 0 decay_mult: 0 } 248 | include { stage: "freeze-convnet" } 249 | convolution_param { 250 | num_output: 256 251 | pad: 1 252 | kernel_size: 3 253 | } 254 | } 255 | layer { 256 | name: "conv3_2" 257 | type: "Convolution" 258 | bottom: "conv3_1" 259 | top: "conv3_2" 260 | param { lr_mult: 0.1 } 261 | param { lr_mult: 0.2 decay_mult: 0} 262 | exclude { stage: "freeze-convnet" } 263 | convolution_param { 264 | num_output: 256 265 | pad: 1 266 | kernel_size: 3 267 | } 268 | } 269 | layer { 270 | name: "relu3_2" 271 | type: "ReLU" 272 | bottom: "conv3_2" 273 | top: "conv3_2" 274 | } 275 | layer { 276 | name: "conv3_3" 277 | type: "Convolution" 278 | bottom: "conv3_2" 279 | top: "conv3_3" 280 | param { lr_mult: 0 } 281 | param { lr_mult: 0 decay_mult: 0 } 282 | include { stage: "freeze-convnet" } 283 | convolution_param { 284 | num_output: 256 285 | pad: 1 286 | kernel_size: 3 287 | } 288 | } 289 | layer { 290 | name: "conv3_3" 291 | type: "Convolution" 292 | bottom: "conv3_2" 293 | top: "conv3_3" 294 | param { lr_mult: 0.1 } 295 | param { lr_mult: 0.2 decay_mult: 0} 296 | exclude { stage: "freeze-convnet" } 297 | convolution_param { 298 | num_output: 256 299 | pad: 1 300 | kernel_size: 3 301 | } 302 | } 303 | layer { 304 | name: "relu3_3" 305 | type: "ReLU" 306 | bottom: "conv3_3" 307 | top: "conv3_3" 308 | } 309 | layer { 310 | name: "pool3" 311 | type: "Pooling" 312 | bottom: "conv3_3" 313 | top: "pool3" 314 | pooling_param { 315 | pool: MAX 316 | kernel_size: 2 317 | stride: 2 318 | } 319 | } 320 | layer { 321 | name: "conv4_1" 322 | type: "Convolution" 323 | bottom: "pool3" 324 | top: "conv4_1" 325 | param { lr_mult: 0 } 326 | param { lr_mult: 0 decay_mult: 0 } 327 | include { stage: "freeze-convnet" } 328 | convolution_param { 329 | num_output: 512 330 | pad: 1 331 | kernel_size: 3 332 | } 333 | } 334 | layer { 335 | name: "conv4_1" 336 | type: "Convolution" 337 | bottom: "pool3" 338 | top: "conv4_1" 339 | param { lr_mult: 0.1 } 340 | param { lr_mult: 0.2 decay_mult: 0} 341 | exclude { stage: "freeze-convnet" } 342 | convolution_param { 343 | num_output: 512 344 | pad: 1 345 | kernel_size: 3 346 | } 347 | } 348 | layer { 349 | name: "relu4_1" 350 | type: "ReLU" 351 | bottom: "conv4_1" 
352 | top: "conv4_1" 353 | } 354 | layer { 355 | name: "conv4_2" 356 | type: "Convolution" 357 | bottom: "conv4_1" 358 | top: "conv4_2" 359 | param { lr_mult: 0 } 360 | param { lr_mult: 0 decay_mult: 0 } 361 | include { stage: "freeze-convnet" } 362 | convolution_param { 363 | num_output: 512 364 | pad: 1 365 | kernel_size: 3 366 | } 367 | } 368 | layer { 369 | name: "conv4_2" 370 | type: "Convolution" 371 | bottom: "conv4_1" 372 | top: "conv4_2" 373 | param { lr_mult: 0.1 } 374 | param { lr_mult: 0.2 decay_mult: 0} 375 | exclude { stage: "freeze-convnet" } 376 | convolution_param { 377 | num_output: 512 378 | pad: 1 379 | kernel_size: 3 380 | } 381 | } 382 | layer { 383 | name: "relu4_2" 384 | type: "ReLU" 385 | bottom: "conv4_2" 386 | top: "conv4_2" 387 | } 388 | layer { 389 | name: "conv4_3" 390 | type: "Convolution" 391 | bottom: "conv4_2" 392 | top: "conv4_3" 393 | param { lr_mult: 0 } 394 | param { lr_mult: 0 decay_mult: 0 } 395 | include { stage: "freeze-convnet" } 396 | convolution_param { 397 | num_output: 512 398 | pad: 1 399 | kernel_size: 3 400 | } 401 | } 402 | layer { 403 | name: "conv4_3" 404 | type: "Convolution" 405 | bottom: "conv4_2" 406 | top: "conv4_3" 407 | param { lr_mult: 0.1 } 408 | param { lr_mult: 0.2 decay_mult: 0} 409 | exclude { stage: "freeze-convnet" } 410 | convolution_param { 411 | num_output: 512 412 | pad: 1 413 | kernel_size: 3 414 | } 415 | } 416 | layer { 417 | name: "relu4_3" 418 | type: "ReLU" 419 | bottom: "conv4_3" 420 | top: "conv4_3" 421 | } 422 | layer { 423 | name: "pool4" 424 | type: "Pooling" 425 | bottom: "conv4_3" 426 | top: "pool4" 427 | pooling_param { 428 | pool: MAX 429 | kernel_size: 2 430 | stride: 2 431 | } 432 | } 433 | layer { 434 | name: "conv5_1" 435 | type: "Convolution" 436 | bottom: "pool4" 437 | top: "conv5_1" 438 | param { lr_mult: 0 } 439 | param { lr_mult: 0 decay_mult: 0 } 440 | include { stage: "freeze-convnet" } 441 | convolution_param { 442 | num_output: 512 443 | pad: 1 444 | kernel_size: 3 445 | } 446 | } 447 | layer { 448 | name: "conv5_1" 449 | type: "Convolution" 450 | bottom: "pool4" 451 | top: "conv5_1" 452 | param { lr_mult: 0.1 } 453 | param { lr_mult: 0.2 decay_mult: 0} 454 | exclude { stage: "freeze-convnet" } 455 | convolution_param { 456 | num_output: 512 457 | pad: 1 458 | kernel_size: 3 459 | } 460 | } 461 | layer { 462 | name: "relu5_1" 463 | type: "ReLU" 464 | bottom: "conv5_1" 465 | top: "conv5_1" 466 | } 467 | layer { 468 | name: "conv5_2" 469 | type: "Convolution" 470 | bottom: "conv5_1" 471 | top: "conv5_2" 472 | param { lr_mult: 0 } 473 | param { lr_mult: 0 decay_mult: 0 } 474 | include { stage: "freeze-convnet" } 475 | convolution_param { 476 | num_output: 512 477 | pad: 1 478 | kernel_size: 3 479 | } 480 | } 481 | layer { 482 | name: "conv5_2" 483 | type: "Convolution" 484 | bottom: "conv5_1" 485 | top: "conv5_2" 486 | param { lr_mult: 0.1 } 487 | param { lr_mult: 0.2 decay_mult: 0} 488 | exclude { stage: "freeze-convnet" } 489 | convolution_param { 490 | num_output: 512 491 | pad: 1 492 | kernel_size: 3 493 | } 494 | } 495 | layer { 496 | name: "relu5_2" 497 | type: "ReLU" 498 | bottom: "conv5_2" 499 | top: "conv5_2" 500 | } 501 | layer { 502 | name: "conv5_3" 503 | type: "Convolution" 504 | bottom: "conv5_2" 505 | top: "conv5_3" 506 | param { lr_mult: 0 } 507 | param { lr_mult: 0 decay_mult: 0 } 508 | include { stage: "freeze-convnet" } 509 | convolution_param { 510 | num_output: 512 511 | pad: 1 512 | kernel_size: 3 513 | } 514 | } 515 | layer { 516 | name: "conv5_3" 517 | type: 
"Convolution" 518 | bottom: "conv5_2" 519 | top: "conv5_3" 520 | param { lr_mult: 0.1 } 521 | param { lr_mult: 0.2 decay_mult: 0} 522 | exclude { stage: "freeze-convnet" } 523 | convolution_param { 524 | num_output: 512 525 | pad: 1 526 | kernel_size: 3 527 | } 528 | } 529 | layer { 530 | name: "relu5_3" 531 | type: "ReLU" 532 | bottom: "conv5_3" 533 | top: "conv5_3" 534 | } 535 | layer { 536 | name: "pool5" 537 | type: "Pooling" 538 | bottom: "conv5_3" 539 | top: "pool5" 540 | pooling_param { 541 | pool: MAX 542 | kernel_size: 2 543 | stride: 2 544 | } 545 | } 546 | layer { 547 | name: "fc6" 548 | type: "InnerProduct" 549 | bottom: "pool5" 550 | top: "fc6" 551 | param { lr_mult: 0 } 552 | param { lr_mult: 0 decay_mult: 0 } 553 | include { stage: "freeze-convnet" } 554 | inner_product_param { 555 | num_output: 4096 556 | } 557 | } 558 | layer { 559 | name: "fc6" 560 | type: "InnerProduct" 561 | bottom: "pool5" 562 | top: "fc6" 563 | param { lr_mult: 0.1 } 564 | param { lr_mult: 0.2 decay_mult: 0} 565 | exclude { stage: "freeze-convnet" } 566 | inner_product_param { 567 | num_output: 4096 568 | } 569 | } 570 | layer { 571 | name: "relu6" 572 | type: "ReLU" 573 | bottom: "fc6" 574 | top: "fc6" 575 | } 576 | layer { 577 | name: "drop6" 578 | type: "Dropout" 579 | bottom: "fc6" 580 | top: "fc6" 581 | dropout_param { 582 | dropout_ratio: 0.5 583 | } 584 | } 585 | layer { 586 | name: "fc7" 587 | type: "InnerProduct" 588 | bottom: "fc6" 589 | top: "fc7" 590 | param { lr_mult: 0 } 591 | param { lr_mult: 0 decay_mult: 0 } 592 | include { stage: "freeze-convnet" } 593 | inner_product_param { 594 | num_output: 4096 595 | } 596 | } 597 | layer { 598 | name: "fc7" 599 | type: "InnerProduct" 600 | bottom: "fc6" 601 | top: "fc7" 602 | param { lr_mult: 0.1 } 603 | param { lr_mult: 0.2 decay_mult: 0} 604 | exclude { stage: "freeze-convnet" } 605 | inner_product_param { 606 | num_output: 4096 607 | } 608 | } 609 | layer { 610 | name: "relu7" 611 | type: "ReLU" 612 | bottom: "fc7" 613 | top: "fc7" 614 | } 615 | layer { 616 | name: "drop7" 617 | type: "Dropout" 618 | bottom: "fc7" 619 | top: "fc7" 620 | dropout_param { 621 | dropout_ratio: 0.5 622 | } 623 | } 624 | layer { 625 | name: "fc8" 626 | type: "InnerProduct" 627 | bottom: "fc7" 628 | top: "fc8" 629 | param { 630 | lr_mult: 0.1 631 | decay_mult: 1 632 | } 633 | param { 634 | lr_mult: 0.2 635 | decay_mult: 0 636 | } 637 | inner_product_param { 638 | num_output: 1000 639 | } 640 | } 641 | layer { 642 | name: "local_features" 643 | type: "Concat" 644 | bottom: "fc8" 645 | bottom: "bbox_coordinate" 646 | top: "local_features" 647 | } 648 | 649 | layer { 650 | name: "embedding" 651 | type: "Embed" 652 | bottom: "input_sentence" 653 | top: "embedded_input_sentence" 654 | param { 655 | lr_mult: 1 656 | } 657 | embed_param { 658 | bias_term: false 659 | input_dim: 8801 660 | num_output: 1000 661 | weight_filler { 662 | type: "uniform" 663 | min: -0.08 664 | max: 0.08 665 | } 666 | } 667 | } 668 | layer { 669 | name: "lstm1" 670 | type: "LSTM" 671 | bottom: "embedded_input_sentence" 672 | bottom: "cont_sentence" 673 | bottom: "local_features" 674 | top: "lstm1" 675 | include { stage: "unfactored" } 676 | recurrent_param { 677 | num_output: 1000 678 | weight_filler { 679 | type: "uniform" 680 | min: -0.08 681 | max: 0.08 682 | } 683 | bias_filler { 684 | type: "constant" 685 | value: 0 686 | } 687 | } 688 | } 689 | layer { 690 | name: "lstm2" 691 | type: "LSTM" 692 | bottom: "lstm1" 693 | bottom: "cont_sentence" 694 | top: "lstm2" 695 | include { 696 | stage: 
"unfactored" 697 | stage: "2-layer" 698 | } 699 | recurrent_param { 700 | num_output: 1000 701 | weight_filler { 702 | type: "uniform" 703 | min: -0.08 704 | max: 0.08 705 | } 706 | bias_filler { 707 | type: "constant" 708 | value: 0 709 | } 710 | } 711 | } 712 | layer { 713 | name: "lstm1" 714 | type: "LSTM" 715 | bottom: "embedded_input_sentence" 716 | bottom: "cont_sentence" 717 | top: "lstm1" 718 | include { stage: "factored" } 719 | recurrent_param { 720 | num_output: 1000 721 | weight_filler { 722 | type: "uniform" 723 | min: -0.08 724 | max: 0.08 725 | } 726 | bias_filler { 727 | type: "constant" 728 | value: 0 729 | } 730 | } 731 | } 732 | layer { 733 | name: "lstm2-extended" 734 | type: "LSTM" 735 | bottom: "lstm1" 736 | bottom: "cont_sentence" 737 | bottom: "local_features" 738 | top: "lstm2" 739 | include { stage: "factored" } 740 | recurrent_param { 741 | num_output: 1000 742 | weight_filler { 743 | type: "uniform" 744 | min: -0.08 745 | max: 0.08 746 | } 747 | bias_filler { 748 | type: "constant" 749 | value: 0 750 | } 751 | } 752 | } 753 | layer { 754 | name: "predict" 755 | type: "InnerProduct" 756 | bottom: "lstm1" 757 | top: "predict" 758 | param { 759 | lr_mult: 1 760 | decay_mult: 1 761 | } 762 | param { 763 | lr_mult: 2 764 | decay_mult: 0 765 | } 766 | exclude { stage: "2-layer" } 767 | inner_product_param { 768 | num_output: 8801 769 | weight_filler { 770 | type: "uniform" 771 | min: -0.08 772 | max: 0.08 773 | } 774 | bias_filler { 775 | type: "constant" 776 | value: 0 777 | } 778 | axis: 2 779 | } 780 | } 781 | layer { 782 | name: "predict" 783 | type: "InnerProduct" 784 | bottom: "lstm2" 785 | top: "predict" 786 | param { 787 | lr_mult: 1 788 | decay_mult: 1 789 | } 790 | param { 791 | lr_mult: 2 792 | decay_mult: 0 793 | } 794 | include { stage: "2-layer" } 795 | inner_product_param { 796 | num_output: 8801 797 | weight_filler { 798 | type: "uniform" 799 | min: -0.08 800 | max: 0.08 801 | } 802 | bias_filler { 803 | type: "constant" 804 | value: 0 805 | } 806 | axis: 2 807 | } 808 | } 809 | layer { 810 | name: "cross_entropy_loss" 811 | type: "SoftmaxWithLoss" 812 | bottom: "predict" 813 | bottom: "target_sentence" 814 | top: "cross_entropy_loss" 815 | loss_weight: 20 816 | loss_param { 817 | ignore_label: -1 818 | } 819 | softmax_param { 820 | axis: 2 821 | } 822 | } 823 | layer { 824 | name: "accuracy" 825 | type: "Accuracy" 826 | bottom: "predict" 827 | bottom: "target_sentence" 828 | top: "accuracy" 829 | include { phase: TEST } 830 | accuracy_param { 831 | axis: 2 832 | ignore_label: -1 833 | } 834 | } 835 | -------------------------------------------------------------------------------- /prototxt/scrc_no_context_vgg_solver.prototxt: -------------------------------------------------------------------------------- 1 | net: "./prototxt/scrc_no_context_vgg_buffer_50.prototxt" 2 | 3 | train_state: { stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' } 4 | base_lr: 0.001 5 | lr_policy: "step" 6 | gamma: 0.5 7 | stepsize: 30000 8 | display: 1 9 | max_iter: 90000 10 | momentum: 0.9 11 | weight_decay: 0.0000 12 | snapshot: 10000 13 | snapshot_prefix: "./exp-referit/caffemodel/scrc_no_context_vgg" 14 | solver_mode: GPU 15 | random_seed: 1701 16 | average_loss: 100 17 | clip_gradients: 10 18 | -------------------------------------------------------------------------------- /prototxt/scrc_word_to_preds_full.prototxt: -------------------------------------------------------------------------------- 1 | input: "cont_sentence" 2 | input_shape { dim: 20 dim: 100 } 3 
| 4 | input: "input_sentence" 5 | input_shape { dim: 20 dim: 100 } 6 | 7 | input: "image_features" 8 | input_shape { dim: 100 dim: 1008 } 9 | 10 | input: "fc7_context" 11 | input_shape { dim: 100 dim: 4096 } 12 | 13 | layer { 14 | name: "embedding" 15 | type: "Embed" 16 | bottom: "input_sentence" 17 | top: "embedded_input_sentence" 18 | embed_param { 19 | input_dim: 8801 20 | num_output: 1000 21 | bias_term: false 22 | } 23 | } 24 | layer { 25 | name: "lstm1" 26 | type: "LSTM" 27 | bottom: "embedded_input_sentence" 28 | bottom: "cont_sentence" 29 | top: "lstm1" 30 | recurrent_param { num_output: 1000 } 31 | } 32 | layer { 33 | name: "lstm2-extended" 34 | type: "LSTM" 35 | bottom: "lstm1" 36 | bottom: "cont_sentence" 37 | bottom: "image_features" 38 | top: "lstm2" 39 | recurrent_param { num_output: 1000 } 40 | } 41 | layer { 42 | name: "predict" 43 | type: "InnerProduct" 44 | bottom: "lstm2" 45 | top: "predict" 46 | inner_product_param { 47 | axis: 2 48 | num_output: 8801 49 | } 50 | } 51 | 52 | # Context LSTM 53 | layer { 54 | name: "fc8_context" 55 | type: "InnerProduct" 56 | bottom: "fc7_context" 57 | top: "fc8_context" 58 | inner_product_param { num_output: 1000 } 59 | } 60 | layer { 61 | name: "lstm2_context" 62 | type: "LSTM" 63 | bottom: "lstm1" 64 | bottom: "cont_sentence" 65 | bottom: "fc8_context" 66 | top: "lstm2_context" 67 | recurrent_param { num_output: 1000 } 68 | } 69 | layer { 70 | name: "predict_context" 71 | type: "InnerProduct" 72 | bottom: "lstm2_context" 73 | top: "predict_context" 74 | inner_product_param { 75 | num_output: 8801 76 | axis: 2 77 | } 78 | } 79 | layer { 80 | name: "predict_combined" 81 | type: "Eltwise" 82 | bottom: "predict" 83 | bottom: "predict_context" 84 | top: "predict_combined" 85 | eltwise_param { operation: SUM } 86 | } 87 | layer { 88 | name: "probs" 89 | type: "Softmax" 90 | bottom: "predict_combined" 91 | top: "probs" 92 | softmax_param { axis: 2 } 93 | } 94 | -------------------------------------------------------------------------------- /prototxt/scrc_word_to_preds_no_context.prototxt: -------------------------------------------------------------------------------- 1 | input: "cont_sentence" 2 | input_shape { dim: 20 dim: 100 } 3 | 4 | input: "input_sentence" 5 | input_shape { dim: 20 dim: 100 } 6 | 7 | input: "image_features" 8 | input_shape { dim: 100 dim: 1008 } 9 | 10 | layer { 11 | name: "embedding" 12 | type: "Embed" 13 | bottom: "input_sentence" 14 | top: "embedded_input_sentence" 15 | embed_param { 16 | input_dim: 8801 17 | num_output: 1000 18 | bias_term: false 19 | } 20 | } 21 | layer { 22 | name: "lstm1" 23 | type: "LSTM" 24 | bottom: "embedded_input_sentence" 25 | bottom: "cont_sentence" 26 | top: "lstm1" 27 | recurrent_param { num_output: 1000 } 28 | } 29 | layer { 30 | name: "lstm2-extended" 31 | type: "LSTM" 32 | bottom: "lstm1" 33 | bottom: "cont_sentence" 34 | bottom: "image_features" 35 | top: "lstm2" 36 | recurrent_param { num_output: 1000 } 37 | } 38 | layer { 39 | name: "predict" 40 | type: "InnerProduct" 41 | bottom: "lstm2" 42 | top: "predict" 43 | inner_product_param { 44 | axis: 2 45 | num_output: 8801 46 | } 47 | } 48 | layer { 49 | name: "probs" 50 | type: "Softmax" 51 | bottom: "predict" 52 | top: "probs" 53 | softmax_param { axis: 2 } 54 | } 55 | -------------------------------------------------------------------------------- /prototxt/scrc_word_to_preds_no_spatial_no_context.prototxt: -------------------------------------------------------------------------------- 1 | input: "cont_sentence" 2 | input_shape { 
dim: 20 dim: 1000 } 3 | 4 | input: "input_sentence" 5 | input_shape { dim: 20 dim: 1000 } 6 | 7 | input: "image_features" 8 | input_shape { dim: 1000 dim: 1000 } 9 | 10 | layer { 11 | name: "embedding" 12 | type: "Embed" 13 | bottom: "input_sentence" 14 | top: "embedded_input_sentence" 15 | embed_param { 16 | input_dim: 8801 17 | num_output: 1000 18 | bias_term: false 19 | } 20 | } 21 | layer { 22 | name: "lstm1" 23 | type: "LSTM" 24 | bottom: "embedded_input_sentence" 25 | bottom: "cont_sentence" 26 | top: "lstm1" 27 | recurrent_param { num_output: 1000 } 28 | } 29 | layer { 30 | name: "lstm2" 31 | type: "LSTM" 32 | bottom: "lstm1" 33 | bottom: "cont_sentence" 34 | bottom: "image_features" 35 | top: "lstm2" 36 | recurrent_param { num_output: 1000 } 37 | } 38 | layer { 39 | name: "predict" 40 | type: "InnerProduct" 41 | bottom: "lstm2" 42 | top: "predict" 43 | inner_product_param { 44 | axis: 2 45 | num_output: 8801 46 | } 47 | } 48 | layer { 49 | name: "probs" 50 | type: "Softmax" 51 | bottom: "predict" 52 | top: "probs" 53 | softmax_param { axis: 2 } 54 | } 55 | -------------------------------------------------------------------------------- /retriever.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import os 4 | import re 5 | import numpy as np 6 | import h5py 7 | import skimage.io 8 | 9 | # Compute vocabulary indices from sentence 10 | MAX_WORDS = 20 11 | UNK_IDENTIFIER = '<unk>' # <unk> is the word used to identify unknown words 12 | SENTENCE_SPLIT_REGEX = re.compile(r'(\W+)') 13 | def sentence2vocab_indices(raw_sentence, vocab_dict): 14 | splits = SENTENCE_SPLIT_REGEX.split(raw_sentence.strip()) 15 | sentence = [ s.lower() for s in splits if len(s.strip()) > 0 ] 16 | # remove trailing '.' 
17 | if sentence[-1] == '.': 18 | sentence = sentence[:-1] 19 | vocab_indices = [ (vocab_dict[s] if s in vocab_dict else vocab_dict[UNK_IDENTIFIER]) 20 | for s in sentence ] 21 | if len(vocab_indices) > MAX_WORDS: 22 | vocab_indices = vocab_indices[:MAX_WORDS] 23 | return vocab_indices 24 | 25 | # Build vocabulary dictionary from file 26 | def build_vocab_dict_from_file(vocab_file): 27 | vocab = ['<EOS>'] 28 | with open(vocab_file, 'r') as f: 29 | lines = f.readlines() 30 | vocab += [ word.strip() for word in lines ] 31 | vocab_dict = { vocab[n] : n for n in range(len(vocab)) } 32 | return vocab_dict 33 | 34 | # Build vocabulary dictionary from captioner 35 | def build_vocab_dict_from_captioner(captioner): 36 | vocab_dict = {captioner.vocab[n] : n for n in range(len(captioner.vocab))} 37 | return vocab_dict 38 | 39 | def score_descriptors(descriptors, raw_sentence, captioner, vocab_dict): 40 | vocab_indices = sentence2vocab_indices(raw_sentence, vocab_dict) 41 | num_descriptors = descriptors.shape[0] 42 | scores = np.zeros(num_descriptors) 43 | 44 | net = captioner.lstm_net 45 | 46 | T = len(vocab_indices) 47 | N = descriptors.shape[0] 48 | # reshape only when necessary 49 | if list(net.blobs['cont_sentence'].shape) != [MAX_WORDS, N]: 50 | net.blobs['cont_sentence'].reshape(MAX_WORDS, N) 51 | net.blobs['input_sentence'].reshape(MAX_WORDS, N) 52 | net.blobs['image_features'].reshape(N, *net.blobs['image_features'].data.shape[1:]) 53 | # print('LSTM net reshape to ' + str([MAX_WORDS, N])) 54 | 55 | cont_sentence = np.array([0] + [1 for v in vocab_indices[:-1] ]).reshape((-1, 1)) 56 | input_sentence = np.array([0] + vocab_indices[:-1] ).reshape((-1, 1)) 57 | 58 | net.blobs['cont_sentence'].data[:T, :] = cont_sentence 59 | net.blobs['input_sentence'].data[:T, :] = input_sentence 60 | net.blobs['image_features'].data[...] = descriptors 61 | net.forward() 62 | 63 | probs = net.blobs['probs'].data[:T, :, :] 64 | for t in range(T): 65 | scores += np.log(probs[t, :, vocab_indices[t] ]) 66 | return scores 67 | 68 | def score_descriptors_context(descriptors, raw_sentence, fc7_context, captioner, vocab_dict): 69 | vocab_indices = sentence2vocab_indices(raw_sentence, vocab_dict) 70 | num_descriptors = descriptors.shape[0] 71 | scores = np.zeros(num_descriptors) 72 | 73 | net = captioner.lstm_net 74 | 75 | T = len(vocab_indices) 76 | N = descriptors.shape[0] 77 | # reshape only when necessary 78 | if list(net.blobs['cont_sentence'].shape) != [MAX_WORDS, N]: 79 | net.blobs['cont_sentence'].reshape(MAX_WORDS, N) 80 | net.blobs['input_sentence'].reshape(MAX_WORDS, N) 81 | net.blobs['image_features'].reshape(N, *net.blobs['image_features'].data.shape[1:]) 82 | net.blobs['fc7_context'].reshape(N, *net.blobs['fc7_context'].data.shape[1:]) 83 | # print('LSTM net reshape to ' + str([MAX_WORDS, N])) 84 | 85 | cont_sentence = np.array([0] + [1 for v in vocab_indices[:-1] ]).reshape((-1, 1)) 86 | input_sentence = np.array([0] + vocab_indices[:-1] ).reshape((-1, 1)) 87 | 88 | net.blobs['cont_sentence'].data[:T, :] = cont_sentence 89 | net.blobs['input_sentence'].data[:T, :] = input_sentence 90 | net.blobs['image_features'].data[...] = descriptors 91 | net.blobs['fc7_context'].data[...] 
= fc7_context 92 | net.forward() 93 | 94 | probs = net.blobs['probs'].data[:T, :, :] 95 | for t in range(T): 96 | scores += np.log(probs[t, :, vocab_indices[t] ]) 97 | return scores 98 | 99 | 100 | # all boxes are [xmin, ymin, xmax, ymax] format, 0-indexed, including xmax and ymax 101 | def compute_iou(boxes, target): 102 | assert(target.ndim == 1 and boxes.ndim == 2) 103 | A_boxes = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1) 104 | A_target = (target[2] - target[0] + 1) * (target[3] - target[1] + 1) 105 | assert(np.all(A_boxes >= 0)) 106 | assert(np.all(A_target >= 0)) 107 | I_x1 = np.maximum(boxes[:, 0], target[0]) 108 | I_y1 = np.maximum(boxes[:, 1], target[1]) 109 | I_x2 = np.minimum(boxes[:, 2], target[2]) 110 | I_y2 = np.minimum(boxes[:, 3], target[3]) 111 | A_I = np.maximum(I_x2 - I_x1 + 1, 0) * np.maximum(I_y2 - I_y1 + 1, 0) 112 | IoUs = A_I / (A_boxes + A_target - A_I) 113 | assert(np.all(0 <= IoUs) and np.all(IoUs <= 1)) 114 | return IoUs 115 | 116 | def crop_edge_boxes(image, edge_boxes): 117 | # load the image if a file path is given 118 | if type(image) in (str, unicode): 119 | image = skimage.io.imread(image) 120 | if image.dtype == np.float32: 121 | image *= 255 122 | image = image.astype(np.uint8) 123 | # Gray scale to RGB 124 | if image.ndim == 2: 125 | image = np.tile(image[..., np.newaxis], (1, 1, 3)) 126 | # RGBA to RGB 127 | image = image[:, :, :3] 128 | x1, y1, x2, y2 = edge_boxes[:, 0], edge_boxes[:, 1], edge_boxes[:, 2], edge_boxes[:, 3] 129 | crops = [image[y1[n]:y2[n]+1, x1[n]:x2[n]+1, :] for n in range(edge_boxes.shape[0])] 130 | return crops 131 | 132 | def compute_descriptors_edgebox(captioner, image, edge_boxes, output_name='fc8'): 133 | crops = crop_edge_boxes(image, edge_boxes) 134 | return compute_descriptors(captioner, crops, output_name) 135 | 136 | def preprocess_image(captioner, image, verbose=False): 137 | if type(image) in (str, unicode): 138 | image = skimage.io.imread(image) 139 | if image.dtype == np.float32: 140 | image *= 255 141 | image = image.astype(np.uint8) 142 | # Gray scale to RGB 143 | if image.ndim == 2: 144 | image = np.tile(image[..., np.newaxis], (1, 1, 3)) 145 | # RGBA to RGB 146 | image = image[:, :, :3] 147 | preprocessed_image = captioner.transformer.preprocess('data', image) 148 | return preprocessed_image 149 | 150 | def compute_descriptors(captioner, image_list, output_name='fc8'): 151 | batch = np.zeros_like(captioner.image_net.blobs['data'].data) 152 | batch_shape = batch.shape 153 | batch_size = batch_shape[0] 154 | descriptors_shape = (len(image_list), ) + \ 155 | captioner.image_net.blobs[output_name].data.shape[1:] 156 | descriptors = np.zeros(descriptors_shape) 157 | for batch_start_index in range(0, len(image_list), batch_size): 158 | batch_list = image_list[batch_start_index:(batch_start_index + batch_size)] 159 | for batch_index, image_path in enumerate(batch_list): 160 | batch[batch_index:(batch_index + 1)] = preprocess_image(captioner, image_path) 161 | current_batch_size = min(batch_size, len(image_list) - batch_start_index) 162 | captioner.image_net.forward(data=batch) 163 | descriptors[batch_start_index:(batch_start_index + current_batch_size)] = \ 164 | captioner.image_net.blobs[output_name].data[:current_batch_size] 165 | return descriptors 166 | 167 | # normalize bounding box coordinates into an 8-D spatial feature 168 | def compute_spatial_feat(bboxes, image_size): 169 | if bboxes.ndim == 1: 170 | bboxes = bboxes.reshape((1, 4)) 171 | im_w = image_size[0] 172 | im_h = image_size[1] 173 | assert(np.all(bboxes[:, 0] < im_w) and 
np.all(bboxes[:, 2] < im_w)) 174 | assert(np.all(bboxes[:, 1] < im_h) and np.all(bboxes[:, 3] < im_h)) 175 | 176 | feats = np.zeros((bboxes.shape[0], 8)) 177 | feats[:, 0] = bboxes[:, 0] * 2.0 / im_w - 1 # x1 178 | feats[:, 1] = bboxes[:, 1] * 2.0 / im_h - 1 # y1 179 | feats[:, 2] = bboxes[:, 2] * 2.0 / im_w - 1 # x2 180 | feats[:, 3] = bboxes[:, 3] * 2.0 / im_h - 1 # y2 181 | feats[:, 4] = (feats[:, 0] + feats[:, 2]) / 2 # x0 182 | feats[:, 5] = (feats[:, 1] + feats[:, 3]) / 2 # y0 183 | feats[:, 6] = feats[:, 2] - feats[:, 0] # w 184 | feats[:, 7] = feats[:, 3] - feats[:, 1] # h 185 | return feats 186 | 187 | # Write a batch of sentences to HDF5 188 | def write_batch_to_hdf5(filename, cont_sentences, input_sentences, 189 | target_sentences, dtype=np.float32): 190 | h5file = h5py.File(filename, 'w') 191 | dataset = h5file.create_dataset('cont_sentence', 192 | shape=cont_sentences.shape, dtype=dtype) 193 | dataset[:] = cont_sentences 194 | dataset = h5file.create_dataset('input_sentence', 195 | shape=input_sentences.shape, dtype=dtype) 196 | dataset[:] = input_sentences 197 | dataset = h5file.create_dataset('target_sentence', 198 | shape=target_sentences.shape, dtype=dtype) 199 | dataset[:] = target_sentences 200 | h5file.close() 201 | 202 | # Write a batch of bounding box coordinates to HDF5 203 | def write_bbox_to_hdf5(filename, bbox_coordinates, dtype=np.float32): 204 | h5file = h5py.File(filename, 'w') 205 | dataset = h5file.create_dataset('bbox_coordinate', 206 | shape=bbox_coordinates.shape, dtype=dtype) 207 | dataset[:] = bbox_coordinates 208 | h5file.close() 209 | 210 | # Write a batch of bounding box coordinates and context features to HDF5 211 | def write_bbox_context_to_hdf5(filename, bbox_coordinates, fc7_context, dtype=np.float32): 212 | h5file = h5py.File(filename, 'w') 213 | dataset = h5file.create_dataset('bbox_coordinate', 214 | shape=bbox_coordinates.shape, dtype=dtype) 215 | dataset[:] = bbox_coordinates 216 | dataset = h5file.create_dataset('fc7_context', 217 | shape=fc7_context.shape, dtype=dtype) 218 | dataset[:] = fc7_context 219 | h5file.close() 220 | -------------------------------------------------------------------------------- /util/__init__.py: -------------------------------------------------------------------------------- 1 | from . import io 2 | -------------------------------------------------------------------------------- /util/io.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | def load_str_list(filename): 4 | with open(filename, 'r') as f: 5 | str_list = f.readlines() 6 | str_list = [s[:-1] for s in str_list] 7 | return str_list 8 | 9 | def save_str_list(str_list, filename): 10 | str_list = [s+'\n' for s in str_list] 11 | with open(filename, 'w') as f: 12 | f.writelines(str_list) 13 | 14 | def load_json(filename): 15 | with open(filename, 'r') as f: 16 | return json.load(f) 17 | 18 | def save_json(json_obj, filename): 19 | with open(filename, 'w') as f: 20 | json.dump(json_obj, f, separators=(',\n', ':\n')) 21 | --------------------------------------------------------------------------------
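The text-handling utilities in `retriever.py` are plain NumPy/regex code and can be sanity-checked in isolation. Below is a minimal sketch of how `sentence2vocab_indices` processes a query: it lower-cases the tokens, strips a trailing period, and maps out-of-vocabulary words to `<unk>`. The five-entry `vocab_dict` here is hypothetical and for illustration only; in the real pipeline it comes from `build_vocab_dict_from_file`, which reserves index 0 for `<EOS>` and reads the remaining words one per line from the vocabulary file (e.g. `data/vocabulary.txt`).

```
# Toy vocabulary for illustration only; the real one comes from
# build_vocab_dict_from_file on the repository's vocabulary file.
from retriever import sentence2vocab_indices

vocab_dict = {'<EOS>': 0, '<unk>': 1, 'the': 2, 'white': 3, 'car': 4}

print(sentence2vocab_indices('The white car.', vocab_dict))  # [2, 3, 4]
print(sentence2vocab_indices('The shiny car.', vocab_dict))  # [2, 1, 4] ('shiny' -> <unk>)
```

`score_descriptors` then feeds these indices into the LSTM shifted right by one step (index 0, the `<EOS>` token, serves as the start marker) and sums the per-timestep log-probabilities, so each candidate box is scored by the log-likelihood of the query sentence given that box's features.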
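The box utilities follow the convention noted in the comment above `compute_iou`: boxes are `[xmin, ymin, xmax, ymax]`, 0-indexed, with inclusive corners, and `compute_spatial_feat` takes `image_size` as `(width, height)`. A quick check with made-up boxes, chosen only to illustrate these conventions:

```
import numpy as np
from retriever import compute_iou, compute_spatial_feat

boxes = np.array([[ 0,  0,  99,  99],   # 100x100 box at the origin
                  [50, 50, 149, 149]])  # same size, shifted by (50, 50)
target = np.array([0, 0, 99, 99])

# First box matches exactly (IoU 1.0); the second overlaps in a 50x50
# region, giving IoU 2500 / (10000 + 10000 - 2500) = 1/7.
print(compute_iou(boxes, target))

# 8-D feature for a 200x100 (width x height) image: the four corners
# scaled to [-1, 1], plus the box center, width, and height.
print(compute_spatial_feat(target, (200, 100)))
```

This 8-D spatial feature is what gets concatenated with the 1000-D `fc8` output to form the 1008-D `image_features` input declared in the deploy-time `scrc_word_to_preds_*` prototxt files above.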